Configure Spark to Consume MapR Streams Messages

Using the Kafka 0.9 API, you can configure a Spark application to query MapR Streams for new messages at a given interval.

  1. Add the following dependency:
    groupId = org.apache.spark
    artifactId = spark-streaming-kafka-v09_2.10
    version = <spark_version>-mapr-<mapr_eco_version>
    Note: If you are using Spark 1.6.1-1607, specify the version as 1.6.1-mapr-1607.
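    For example, the coordinates above can be expressed as a Maven dependency. This is a sketch: the version shown assumes Spark 1.6.1-1607; substitute the Spark and MapR ecosystem versions you are actually using.

      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-v09_2.10</artifactId>
        <version>1.6.1-mapr-1607</version>
      </dependency>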
  2. Consider the following when you write the Spark application:
    1. Verify that it meets the following requirements:
      • Imports and uses classes from org.apache.spark.streaming.kafka.v09.
      • Defines key and value deserializers in the kafkaParams map.
      • Does not configure a broker address or a ZooKeeper connection, as MapR Streams requires neither.
    2. Optionally, define a value for spark.kafka.poll.time in the kafkaParams map.
      Note: As of Spark 1.6.1-1607, if you do not configure spark.kafka.poll.time, the default is 1000 milliseconds. For previous Spark versions, the default is 100 milliseconds.
    val kafkaParams = Map[String, String](
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> offsetReset,
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
      "spark.kafka.poll.time" -> pollTimeout)
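    The requirements above can be sketched as a complete application. This is a minimal example, not a definitive implementation: the application name, consumer group, stream path /sample-stream, and topic name pings are hypothetical, and it assumes the createDirectStream signature of the MapR v09 connector, which returns a DStream of (key, value) pairs. It must be compiled against the dependency from step 1 and submitted to a cluster with access to MapR Streams.

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    // Note: the MapR v09 package, not org.apache.spark.streaming.kafka
    import org.apache.spark.streaming.kafka.v09.KafkaUtils

    object StreamConsumer {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("mapr-streams-consumer")
        // Query MapR Streams for new messages every 2 seconds (batch interval)
        val ssc = new StreamingContext(conf, Seconds(2))

        val kafkaParams = Map[String, String](
          ConsumerConfig.GROUP_ID_CONFIG -> "sample-group",
          ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG ->
            "org.apache.kafka.common.serialization.StringDeserializer",
          ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG ->
            "org.apache.kafka.common.serialization.StringDeserializer",
          ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
          ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
          "spark.kafka.poll.time" -> "1000")

        // A MapR Streams topic is addressed as <stream-path>:<topic-name>;
        // no broker or ZooKeeper address is configured.
        val topics = Set("/sample-stream:pings")
        val messages = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topics)

        // Print the value of each message in every batch
        messages.map(_._2).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }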