MapR 5.0 Documentation: Integrate Spark

You can integrate Spark with other ecosystem components. This section includes the following topics:

Integrate Spark-SQL with Hive 

Integrate Spark-SQL with Hive when you want to run Spark-SQL queries on Hive tables. Spark 1.5.2 is built using Hive 1.2 artifacts; however, you can configure Spark-SQL to work with Hive 0.13 and Hive 1.0. Spark 1.3.1 and Spark 1.4.1 are built using Hive 0.13 artifacts; other versions of Hive are not supported with Spark-SQL. For additional details on Spark-SQL and Hive support, see Spark Feature Support.

If you installed Spark with the MapR Installer, the following steps are not required.  
  1. Copy the hive-site.xml file into the SPARK_HOME/conf directory so that Spark and Spark-SQL recognize the Hive metastore configuration. 

  2. Configure the Hive version in the /opt/mapr/spark/spark-<version>/mapr-util/compatibility.version file: 

    hive_versions=<version>
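
    For example, if your cluster runs Hive 1.2.0, you might set:

    hive_versions=1.2.0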
  3. If you are running Spark 1.5.2, add the following additional properties to the /opt/mapr/spark/spark-<version>/conf/spark-defaults.conf file (a combined example of these properties appears after these steps):

    Property: spark.yarn.dist.files
    Configuration requirements: Use one of the following options.

      Option 1: For Spark on YARN, specify the locations of hive-site.xml and the datanucleus JARs:

      /opt/mapr/hive/hive-<version>/conf/hive-site.xml,/opt/mapr/hive/hive-<version>/lib/datanucleus-api-jdo-<version>.jar,/opt/mapr/hive/hive-<version>/lib/datanucleus-core-<version>.jar,/opt/mapr/hive/hive-<version>/lib/datanucleus-rdbms-<version>.jar

      Option 2: For Spark on YARN, store hive-site.xml and the datanucleus JARs on MapR-FS and use the following syntax:

      maprfs://<path to hive-site.xml>,maprfs://<path to datanucleus jar files>

    Property: spark.sql.hive.metastore.version
    Configuration requirements: Specify the Hive version that you are using:

      - For Hive 1.2.0, set the value to 1.2.0
      - For Hive 0.13, set the value to 0.13
      - For Hive 1.0, set the value to 1.0.0

    Property: spark.sql.hive.metastore.jars
    Configuration requirements: Specify the classpath to the JARs for Hive, Hive dependencies, and Hadoop. These files must be available on the node from which you submit Spark jobs. For example:

    /opt/mapr/hadoop/hadoop-<hadoop-version>/etc/hadoop:/opt/mapr/hadoop/hadoop-<hadoop-version>/share/hadoop/common/lib/*:<rest of hadoop classpath>:/opt/mapr/hive/hive-<version>/lib/accumulo-core-<version>.jar:/opt/mapr/hive/hive-<version>/lib/hive-contrib-<version>.jar:<rest of hive classpath>

    For example, if you run Spark 1.5.2 with Hive 1.2, you can set the following classpath: 

    /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hive/hive-1.2/lib/accumulo-core-1.6.0.jar:/opt/mapr/hive/hive-1.2/lib/hive-contrib-1.2.0-mapr-1508.jar:/opt/mapr/hive/hive-1.2/lib/*

    For more information, see the Apache Spark documentation.

  4. To verify the integration, run the following command as the mapr user or as a user that mapr impersonates:

    MASTER=<master-url> <spark-home>/bin/run-example sql.hive.HiveFromSpark

    The master URL for the cluster is either spark://<host>:7077, yarn-client, or yarn-cluster.
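
Putting the step 3 properties together, an illustrative spark-defaults.conf fragment for Spark 1.5.2 with Hive 1.2 on YARN might look like the following; the hive-1.2 and hadoop-2.7.0 paths and the datanucleus JAR versions are assumptions that you must adjust to match your cluster:

    spark.yarn.dist.files /opt/mapr/hive/hive-1.2/conf/hive-site.xml,/opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-<version>.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-core-<version>.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-<version>.jar
    spark.sql.hive.metastore.version 1.2.0
    spark.sql.hive.metastore.jars /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hive/hive-1.2/lib/accumulo-core-1.6.0.jar:/opt/mapr/hive/hive-1.2/lib/hive-contrib-1.2.0-mapr-1508.jar:/opt/mapr/hive/hive-1.2/lib/*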


The default port for both HiveServer2 and the Spark Thrift server is 10000. Before you start the Spark Thrift server on a node where HiveServer2 is running, verify that there is no port conflict.
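
As an additional manual check, the following minimal Scala sketch, run inside spark-shell (<spark-home>/bin/spark-shell --master <master-url>), queries the Hive metastore through a HiveContext; what it lists depends on the tables defined in your metastore:

    // Minimal sketch for spark-shell on Spark 1.5.2; `sc` is the
    // SparkContext that spark-shell creates for you.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()   // lists tables from the Hive metastore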

Integrate Spark with HBase 

Integrate Spark with HBase or MapR-DB when you want to run Spark jobs on HBase or MapR-DB tables.

If you installed Spark with the MapR Installer, these steps are not required.  
  1. Add the following classpath entry to the spark.executor.extraClassPath property in the /opt/mapr/spark/spark-<version>/conf/spark-defaults.conf file:

    /opt/mapr/hbase/hbase-<version>/lib/*
  2. Configure the HBase version in the /opt/mapr/spark/spark-<version>/mapr-util/compatibility.version file:

    hbase_versions=<version>
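
    For example, assuming your cluster runs HBase 0.98 (use the version that is actually installed on your cluster):

    hbase_versions=0.98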
  3. To verify the integration, complete the following steps:
    1. Create an HBase or MapR-DB table and populate it with some data.
    2. Run the following command as the mapr user or as a user that mapr impersonates: 

      MASTER=<master-url> <spark-home>/bin/run-example HBaseTest <table-name>  

      The master URL for the cluster is either spark://<host>:7077, yarn-client, or yarn-cluster.
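
For a programmatic version of the same check, here is a minimal Scala sketch in the spirit of the packaged HBaseTest example; run it in spark-shell, and replace <table-name> with the table you created in step 3:

    // Count the rows of an HBase or MapR-DB table from Spark.
    // Requires the HBase classpath configured in step 1.
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "<table-name>")  // placeholder table name

    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println("Row count: " + rdd.count())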

Integrate Spark with R

As of Spark 1.5.2, you can integrate Spark with R. Integrate Spark with R when you want to run R programs as Spark jobs.

  1. Install R 3.2.2 or greater on each node that will submit Spark jobs:
    • On Ubuntu:

       apt-get install r-base-dev
    • On CentOS/Red Hat:

      yum install R

      For more information on installing R, see the R documentation.

  2. To verify the integration, run the following commands as the mapr user or as a user that mapr impersonates:

    1. Start SparkR:

      /opt/mapr/spark/spark-1.5.2/bin/sparkR --master <master-url>
    2. Run the following command to create a DataFrame using sample data:

      people <- read.df(sqlContext, "file:///opt/mapr/spark/spark-1.5.2/examples/src/main/resources/people.json", "json")
    3. Run the following command to display the data from the DataFrame that you just created:

      head(people)

Integrate Spark-SQL with Avro

Integrate Spark-SQL with Avro when you want to read and write Avro data. As of Spark 1.5.2, you must complete the following steps to integrate the two; previous versions of Spark do not require these steps.
  1. Download the Avro 1.7.7 JAR file to the Spark lib directory (/opt/mapr/spark/spark-<version>/lib). 
    You can download the file from the Maven repository: http://mvnrepository.com/artifact/org.apache.avro/avro/1.7.7
  2. Use one of the following methods to add the Avro 1.7.7 JAR file to the classpath:

    • Prepend the Avro 1.7.7 JAR file to the spark.executor.extraClassPath and spark.driver.extraClassPath properties in the spark-defaults.conf file (/opt/mapr/spark/spark-<version>/conf/spark-defaults.conf):

      spark.executor.extraClassPath  /opt/mapr/spark/spark-1.5.2/lib/avro-1.7.7.jar:<rest_of_path>
      spark.driver.extraClassPath  /opt/mapr/spark/spark-1.5.2/lib/avro-1.7.7.jar:<rest_of_path>
    • Specify the Avro 1.7.7 JAR file with command-line arguments when you start the Spark shell: 

      /opt/mapr/spark/spark-<version>/bin/spark-shell \
      --packages com.databricks:spark-avro_2.10:2.0.1 \
      --driver-class-path /opt/mapr/spark/spark-<version>/lib/avro-1.7.7.jar \
      --conf spark.executor.extraClassPath=/opt/mapr/spark/spark-<version>/lib/avro-1.7.7.jar --master <master-url>

      In this case, the master URL for the cluster is either spark://<host>:7077 or yarn-client, because yarn-cluster is not supported by the Spark shell.
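
With the classpath in place, reading and writing Avro data through the spark-avro package looks like the following minimal Scala sketch from the Spark shell; <input-path> and <output-path> are placeholders for your own MapR-FS or local paths:

      // Minimal sketch, assuming spark-shell was started with the
      // --packages com.databricks:spark-avro_2.10:2.0.1 option shown above;
      // `sqlContext` is the SQLContext that spark-shell creates for you.
      val df = sqlContext.read.format("com.databricks.spark.avro").load("<input-path>")
      df.show()   // inspect a few records
      df.write.format("com.databricks.spark.avro").save("<output-path>")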