Run Spark Jobs with Oozie

Complete the following steps to configure Oozie to run Spark jobs:
  1. Update the Spark shared libraries. By default, Oozie ships with shared libraries for a specific Spark version. To update the shared libraries with the version of Spark that you are running, complete the following steps for Oozie 4.2.0-1609 and earlier:
    1. Stop Oozie:
      maprcli node services -name oozie -action stop -nodes <space delimited list of nodes>
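      For example, with two hypothetical nodes named node1 and node2, the node list is space delimited:
      maprcli node services -name oozie -action stop -nodes node1 node2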
    2. In the /opt/mapr/oozie/oozie-<version>/share2/lib/spark directory, remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar.
    3. In the /opt/mapr/oozie/oozie-<version>/share1/lib/spark directory, also remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar.
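    For example, a shell sketch that removes every other jar from both directories in one pass (GNU find; the paths use the same <version> placeholder as the steps above):
      find /opt/mapr/oozie/oozie-<version>/share2/lib/spark/ \
           /opt/mapr/oozie/oozie-<version>/share1/lib/spark/ \
           -name '*.jar' ! -name 'oozie-sharelib-spark-*-mapr.jar' -delete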
    4. Copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share2/lib/spark/ directory:
      cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share2/lib/spark/
    5. Copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share1/lib/spark/ directory:
      cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share1/lib/spark/
    6. Start Oozie:
      maprcli node services -name oozie -action start -nodes <space delimited list of nodes>
    7. If needed, update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
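    The shared-library refresh itself can typically be triggered from the Oozie client with the sharelibupdate admin command; a sketch assuming the Oozie server listens on its default port 11000:
      oozie admin -oozie http://<oozie-node>:11000/oozie -sharelibupdate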
  2. Configure a Spark action:

    For MEP 1.x or an earlier Oozie version:

    1. Add Spark configuration files to the Spark action:
      1. To add spark-defaults.conf to the Spark action, copy it to the /opt/mapr/oozie/oozie-<version>/conf/spark-conf/ directory:
        mkdir /opt/mapr/oozie/oozie-<version>/conf/spark-conf
        cp /opt/mapr/spark/spark-<version>/conf/spark-defaults.conf /opt/mapr/oozie/oozie-<version>/conf/spark-conf/
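      The copied file carries whatever Spark defaults your actions should inherit; the property values below are illustrative assumptions, not requirements:
        spark.executor.memory 2g
        spark.eventLog.enabled true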
      2. Restart Oozie:
        maprcli node services -name oozie -action restart -nodes <space-delimited list of nodes>

    For MEP 2.x and later:

      1. To add Spark configuration files (spark-defaults.conf, hive-site.xml, etc.) to a Spark action, copy the files to the {OOZIE_HOME}/share2/lib/spark/ directory.
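        For example, using the {SPARK_HOME} and {OOZIE_HOME} placeholders from this page ({HIVE_HOME} is an assumed stand-in for your Hive configuration directory):
        cp {SPARK_HOME}/conf/spark-defaults.conf {OOZIE_HOME}/share2/lib/spark/
        cp {HIVE_HOME}/conf/hive-site.xml {OOZIE_HOME}/share2/lib/spark/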
      2. If needed, update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
      3. Run pySpark using a Spark action:
        1. Copy the pyspark and py4j zip files to the sharelib:
          cp {SPARK_HOME}/python/lib/pyspark*.zip {OOZIE_HOME}/share2/lib/spark/
          cp {SPARK_HOME}/python/lib/py4j*src.zip {OOZIE_HOME}/share2/lib/spark/
        2. Update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
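        Once the shared libraries are updated, the Spark action references the Python script directly: the <jar> element of the Spark action takes the path to the .py file. A minimal sketch (the script name and path are hypothetical):
          <spark xmlns="uri:oozie:spark-action:0.1">
              <!-- job-tracker, name-node, master, and mode as in the example below -->
              <name>PySpark-Example</name>
              <jar>${nameNode}/user/${wf:user()}/apps/pyspark/my_script.py</jar>
          </spark>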
    When you configure a Spark action in the workflow.xml, specify the master and mode elements of the Spark job:
    • For Spark standalone mode, specify the Spark master URL in the master element. For example, if your Spark master URL is spark://ubuntu2:7077, you would replace <master>[SPARK MASTER URL]</master> in the example below with <master>spark://ubuntu2:7077</master>.
    • If you are using Oozie 4.2.0-1609 or earlier, you must perform Step 1 (Update the Spark shared libraries) before you can run the Spark action in Spark on YARN mode.
    • For Spark on YARN mode, specify yarn in the master element and client or cluster in the mode element. For example, for YARN cluster mode, you would replace <master>[SPARK MASTER URL]</master> with <master>yarn</master> and <mode>[SPARK MODE]</mode> with <mode>cluster</mode>.

      Here is an example of a Spark action within a workflow.xml file:

      <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
          <start to='spark-node' />
          <action name='spark-node'>
              <spark xmlns="uri:oozie:spark-action:0.1">
                  <job-tracker>${jobTracker}</job-tracker>
                  <name-node>${nameNode}</name-node>
                <master>[SPARK MASTER URL]</master>
                <mode>[SPARK MODE]</mode>
                  <name>Spark-FileCopy</name>
                  <class>org.apache.oozie.example.SparkFileCopy</class>
                  <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
                  <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
                  <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
              </spark>
              <ok to="end" />
              <error to="fail" />
          </action>
          <kill name="fail">
            <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
          </kill>
          <end name='end' />
      </workflow-app>
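
      Submitting the workflow also requires a job.properties file that defines the parameters the workflow references. A minimal sketch with assumed values (the maprfs URI, ResourceManager address, and paths are examples, not requirements):

      nameNode=maprfs:///
      jobTracker=<resource-manager-node>:8032
      examplesRoot=examples
      oozie.use.system.libpath=true
      oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

      You can then submit the job with the Oozie client:

      oozie job -oozie http://<oozie-node>:11000/oozie -config job.properties -run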