Integrate Spark with HBase

Integrate Spark with HBase or MapR-DB when you want Spark jobs to read from or write to HBase or MapR-DB tables.

If you installed Spark with the MapR Installer, these steps are not required.
  1. Configure the HBase version in the /opt/mapr/spark/spark-<version>/mapr-util/compatibility.version file:
    hbase_versions=<version>
  2. If you want to create HBase tables with Spark, add the following property to hbase-site.xml:
    <property>
      <name>hbase.table.sanity.checks</name>
      <value>false</value>
    </property>
  3. On each Spark node, copy the hbase-site.xml file to the SPARK_HOME/conf/ directory.
  4. Specify the hbase-site.xml file in the SPARK_HOME/conf/spark-defaults.conf file:
    spark.yarn.dist.files SPARK_HOME/conf/hbase-site.xml
  5. To verify the integration, complete the following steps:
    1. Create an HBase or MapR-DB table, and populate it with some data.
    2. Run the following command as the mapr user or as a user that mapr impersonates:
      • On Spark 2.0.1 and later:
        <spark-home>/bin/run-example --master <master> [--deploy-mode <deploy-mode>] HBaseTest <table-name>

        The master URL for the cluster is either spark://<host>:7077 or yarn. The deploy-mode is either client or cluster.

      • On Spark 1.6.1:
        MASTER=<master-url> <spark-home>/bin/run-example HBaseTest <table-name>

        The master URL for the cluster is either spark://<host>:7077, yarn-client, or yarn-cluster.
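As a sketch, the verification in step 5 might look like the following on a Spark 2.0.1 cluster. The Spark version in the path, the table name, the master URL, and the deploy mode are all examples; substitute the values for your cluster and run the commands as the mapr user.

```shell
# Example only: version, paths, table name, master URL, and deploy mode
# are assumptions -- substitute the values for your cluster.

# Step 5.1: create a small table and add one row. A MapR-DB table path
# is shown; for Apache HBase, use a plain table name such as 'testtable'.
hbase shell <<'EOF'
create '/tmp/spark_hbase_test', 'cf'
put '/tmp/spark_hbase_test', 'row1', 'cf:col1', 'value1'
EOF

# Step 5.2: run the HBaseTest example against the table (Spark 2.0.1 and
# later syntax), which scans the table with a Spark job.
/opt/mapr/spark/spark-2.0.1/bin/run-example \
  --master yarn --deploy-mode client \
  HBaseTest /tmp/spark_hbase_test
```

If the job completes without errors, Spark can read the table through the copied hbase-site.xml configuration.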