MapR 5.0 Documentation : Run Spark Jobs with Oozie

You can use Oozie 4.1.0 or greater to run Spark jobs. Complete the following steps to configure Oozie to run Spark jobs:

Update the Spark Shared Libraries (optional)

By default, Oozie ships with shared libraries for a specific Spark version. To update the shared libraries with the version of Spark that you are running, complete the following steps: 

  1. Stop Oozie.

    maprcli node services -name oozie -action stop -nodes <space delimited list of nodes>
  2. Remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar from the /opt/mapr/oozie/oozie-<version>/share2/lib/spark directory. 

  3. As of Oozie 4.2.0-1510, also remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar from the /opt/mapr/oozie/oozie-<version>/share1/lib/spark directory.

  4. Copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share2/lib/spark/ directory.

    cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share2/lib/spark/
  5. As of Oozie 4.2.0-1510, also copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share1/lib/spark/ directory.

    cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share1/lib/spark/
  6. Start Oozie.

    maprcli node services -name oozie -action start -nodes <space delimited list of nodes>

Configure a Spark Action

You can use Oozie 4.1.0 or greater to run a Spark job. To run a Spark job, add a Spark action to the workflow.xml file of the workflow that should run the job.

When you configure a Spark action in the workflow.xml file, set the master element according to the mode in which the Spark job runs:

  • For Spark standalone mode, specify the Spark Master URL in the master element.
    For example, if your Spark Master URL is spark://ubuntu2:7077, you would replace <master>local[*]</master> in the example below with <master>spark://ubuntu2:7077</master>.
     
  • For Spark on YARN mode, specify yarn-client or yarn-cluster in the master element. For example, for yarn-cluster mode, you would replace <master>local[*]</master> with <master>yarn-cluster</master>.
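
As a quick reference, the master values described above might look like the following fragment. This is only a sketch: a Spark action carries a single master element, so you would keep exactly one of these lines, and the host name ubuntu2 and port 7077 are simply the example URL used above, not values your cluster is guaranteed to use.

```xml
<!-- Spark standalone mode: use the Spark Master URL of your cluster
     (spark://ubuntu2:7077 is the example URL from the text above) -->
<master>spark://ubuntu2:7077</master>

<!-- Spark on YARN, cluster mode -->
<master>yarn-cluster</master>

<!-- Spark on YARN, client mode -->
<master>yarn-client</master>
```

The rest of the Spark action (job-tracker, name-node, name, class, jar, and arg elements) stays the same regardless of which master value you choose.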

Here is an example of a Spark action within a workflow.xml file: 

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>local[*]</master>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>