MapR-DB OJAI Connector for Apache Spark

The MapR-DB OJAI Connector for Apache Spark makes it easier to build real-time or batch pipelines between your data and MapR-DB, and to use Spark within those pipelines.

The connector includes a set of APIs that enable MapR users to write applications that consume MapR-DB JSON tables and use them in Spark. The MapR-DB OJAI Connector for Apache Spark is a companion to the MapR-DB Binary Connector for Apache Spark, which enables users to write applications that consume HBase binary tables and use them in Spark. For more information about the MapR-DB Binary Connector for Apache Spark, see MapR-DB Binary Connector for Apache Spark.

Batch Data Transformation with MapR-DB as a Source and Destination for Spark

The MapR-DB OJAI Connector for Apache Spark can be used with batch data. In this diagram, data from MapR-DB or MapR-FS is extracted and transformed using Spark and then loaded into MapR-DB JSON:

MapR-DB OJAI Connector for Apache Spark Features

Principal features of the MapR-DB OJAI Connector for Apache Spark include:
  • Two new APIs that allow you to load data from a MapR-DB JSON table to a Spark RDD or save a Spark RDD to a MapR-DB JSON table:
    • def loadFromMapRDB[T](table: String): RDD[T]
      • The loadFromMapRDB API also supports SELECT and WHERE clauses. These clauses push work down to MapR-DB: SELECT projects a subset of fields and WHERE filters out documents, so less data is transferred to Spark and performance improves.
    • def saveToMapRDB(tablename: String, createTable: Boolean, bulkInsert: Boolean, idFieldPath: String): Unit
      • With the saveToMapRDB API, you can take advantage of normal or bulk insert options.
  • Support for Scala bean classes. You can load OJAI documents as an RDD of Scala bean classes.
  • A custom partitioner that allows you to partition data for better performance. For more information, see Using the Custom Partitioner.
  • Data locality: When the connector reads data from MapR-DB, it uses the data locality feature of MapR-DB to spawn the Spark executors.
The following features are not supported:
  • Java is not supported. The MapR-DB OJAI Connector for Apache Spark is currently only supported in Scala.
  • MapR-DB binary tables are not supported. Only MapR-DB JSON tables are supported; access to MapR-DB binary tables is provided through the MapR-DB Binary Connector for Apache Spark.
  • The DataFrame and Dataset APIs are not currently supported. Only RDDs are supported.
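
The load and save APIs listed above can be sketched as follows. This is a minimal illustration rather than a definitive implementation. It assumes that the connector's implicits and the field helper are exposed through the com.mapr.db.spark package, and that the SELECT and WHERE clauses are expressed as select and where calls on the loaded RDD. The table paths and field names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Assumption: this import brings loadFromMapRDB, saveToMapRDB, and field into scope.
import com.mapr.db.spark._

object OjaiConnectorSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("ojai-connector-sketch"))

    // Load a JSON table as an RDD of OJAI documents.
    // "/tmp/user_profiles" is a hypothetical table path.
    val adults = sc.loadFromMapRDB("/tmp/user_profiles")
      .where(field("age") >= 21)            // assumed syntax for the WHERE clause
      .select("_id", "first_name", "age")   // assumed syntax for the SELECT clause

    // Save the result to another JSON table, using the parameters
    // from the saveToMapRDB signature shown above.
    adults.saveToMapRDB(
      "/tmp/adult_profiles",   // hypothetical destination table
      createTable = true,      // create the table if it does not exist
      bulkInsert = false,      // use normal inserts rather than bulk insert
      idFieldPath = "_id"      // field that supplies each document's ID
    )

    sc.stop()
  }
}
```

Applying the filter and projection at load time, rather than with RDD filter and map calls afterwards, is what allows the work to be pushed down to MapR-DB.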

Supported Product Versions and System Requirements

To use the MapR-DB OJAI Connector for Apache Spark, you must have the following minimum software versions:
  • MapR release: 5.2.1 or later
  • MEP 3.0 or later
  • Spark 2.1.0 or later
  • Languages: Scala 2.11 or later

OJAI API

The MapR-DB OJAI Connector for Apache Spark uses the OJAI API internally to talk to MapR-DB JSON tables. A JSON table is a collection of OJAI documents stored in an optimized format in MapR-DB.
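
To give a sense of what an OJAI document looks like in code, the following sketch builds one from a JSON string with the open source OJAI library and reads a nested field by its path. It assumes the OJAI core library is on the classpath. The document contents are made up for illustration, and this is not necessarily how the connector itself constructs documents.

```scala
import org.ojai.Document
import org.ojai.json.Json

object OjaiDocumentSketch {
  def main(args: Array[String]): Unit = {
    // Build an OJAI document from a JSON string (hypothetical contents).
    val doc: Document = Json.newDocument(
      """{"_id": "user001", "first_name": "Sam", "address": {"city": "milpitas"}}""")

    // Nested fields are addressed with dotted field paths.
    println(doc.getString("address.city"))

    // Documents are mutable: set a new string field, then print the JSON form.
    doc.set("last_name", "Lee")
    println(doc.asJsonString())
  }
}
```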