Run Spark Jobs with Oozie

Complete the following steps to configure Oozie to run Spark jobs:

Configure a Spark action:

  1. To run a Spark action through Oozie, you must be able to connect to Hive on a secure cluster. Make sure that the hive-site.xml file used by Oozie has the following property set:
    <property>
      <name>hive.metastore.sasl.enabled</name>
      <value>true</value>
    </property>        
  2. To add Spark configuration files (such as spark-defaults.conf and hive-site.xml) to a Spark action, copy the files to the {OOZIE_HOME}/share/lib/spark/ directory.
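    For example, assuming the Spark client configuration lives in /etc/spark/conf (a common but not universal location; adjust the source paths for your installation):
      cp /etc/spark/conf/spark-defaults.conf {OOZIE_HOME}/share/lib/spark/
      cp /etc/spark/conf/hive-site.xml {OOZIE_HOME}/share/lib/spark/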
  3. If needed, update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
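    One common way to do this is with the Oozie sharelibupdate admin command; a minimal sketch, assuming the Oozie server URL is http://oozie-host:11000/oozie (replace with your own):
      oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate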
  4. Run pySpark using a Spark action:
    1. Copy the pyspark and py4j zip files to the sharelib:
      cp {SPARK_HOME}/python/lib/pyspark*.zip {OOZIE_HOME}/share/lib/spark/
      cp {SPARK_HOME}/python/lib/py4j*src.zip {OOZIE_HOME}/share/lib/spark/
    2. Update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
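    With the zip files in the sharelib, a pySpark script can be referenced in the jar element of the Spark action. A minimal sketch, assuming a hypothetical script named pi.py that has been uploaded to the workflow application's lib/ directory:
      <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>Spark-Pi</name>
        <jar>pi.py</jar>
      </spark>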
  5. When you configure a Spark action in the workflow.xml file, specify the master and mode elements of the Spark job:
    • For Spark standalone mode, specify the Spark Master URL in the master element. For example, if your Spark Master URL is spark://ubuntu2:7077, you would replace <master>[SPARK MASTER URL]</master> in the example below with <master>spark://ubuntu2:7077</master>.
    • For Spark on YARN mode, specify yarn in the master element and client or cluster in the mode element. For example, for yarn-cluster mode, you would replace <master>[SPARK MASTER URL]</master> with <master>yarn</master> and <mode>[SPARK MODE]</mode> with <mode>cluster</mode>.

      Here is an example of a Spark action within a workflow.xml file:

      <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
        <start to='spark-node' />
        <action name='spark-node'>
          <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>[SPARK MASTER URL]</master>
            <mode>[SPARK MODE]</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
          </spark>
          <ok to="end" />
          <error to="fail" />
        </action>
        <kill name="fail">
          <message>Workflow failed, error
                   message[${wf:errorMessage(wf:lastErrorNode())}]
          </message>
        </kill>
        <end name='end' />
      </workflow-app>
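
      After the workflow is in place, you can submit it with the Oozie command-line client. A minimal sketch of the accompanying job.properties file (host names and ports are placeholders; set them to match your cluster):

        nameNode=hdfs://namenode-host:8020
        jobTracker=resourcemanager-host:8032
        examplesRoot=examples
        oozie.use.system.libpath=true
        oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

      Setting oozie.use.system.libpath=true lets the workflow pick up the Spark sharelib. You can then run the workflow (again, the Oozie server URL is a placeholder):

        oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run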