9 min read
MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform. MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.
Although MapR Streams generally uses the Apache Kafka programming model, there are a few key differences. For instance, there is a new kind of object in the MapR file-system called, appropriately enough, a stream. Each stream can handle a huge number of topics, and you can have many streams in a single cluster. Policies such as time-to-live or ACEs (Access Control Expressions) can be set at the stream level for convenient management of many topics together. You can find out more about streaming architectures using Kafka and MapR Streams in the new short book Streaming Architectures: New Designs Using Apache Kafka and MapR Streams, available as a free download from the MapR website.
If you already have Kafka applications, it’s easy to migrate them over to MapR Streams. You can find out more in the MapR documentation at https://mapr.com/docs/51/ - MapR_Streams/migrating_kafka_applications_to_mapr_streams.html
In this current blog we describe how to run a simple application we originally wrote for Kafka using MapR Streams instead.
As mentioned above, MapR Streams uses Kafka API 0.9.0, which means it is possible to reuse the same application with minor changes. Before diving into a concrete example, let’s take a look at what has to be changed:
topic-name" to "
/stream-name:topic-name" as MapR organizes the topics in streams for management reasons (security, TTL, etc.).
You can find a complete application on the Sample Programs for MapR Streams page. It’s a simple copy that includes minor changes of the Sample Programs for Kafka 0.9 API project. This Kafka project has been documented in this article.
You will need basic Java programming skills as well as access to:
A stream is a collection of topics that you can manage together by:
You can find more information about MapR Streams concepts in the documentation.
Run the following command, as
mapr user, on your MapR cluster:
$ maprcli stream create -path /sample-stream
By default, the produce and consume topic permissions are defaulted to the creator of the streams—the unix user you are using to run the maprcli command. It is possible to configure the permission by editing the streams. For example, to make all of the topics available to anybody (public permission), you can run the following command:
$ maprcli stream edit -path /sample-stream -produceperm p -consumeperm p -topicperm p
We need two topics for the example program, which we can be created using
$ maprcli stream topic create -path /sample-stream -topic fast-messages $ maprcli stream topic create -path /sample-stream -topic summary-markers
These topics can be listed using the following command:
$ maprcli stream topic list -path /sample-stream topic partitions logicalsize consumers maxlag physicalsize fast-messages 1 0 0 0 0 summary-markers 1 0 0 0 0
Note that the program will automatically create the topic if it does not already exist. For your applications you should decide whether it is better to allow programs to automatically create topics simply by virtue of having mentioning them or whether it is better to strictly control which topics exist.
Go back to the directory where you have the example programs and build the example programs.
$ cd .. $ mvn package ...
The project creates a jar with all external dependencies (
Note that you can build the project with the Apache Kafka dependencies as long as you do not package them into your application when you run and deploy it. This example has a dependency on the MapR Streams client instead which can be found in the
mapr.com maven repository.
<repositories> <repository> <id>mapr-maven</id> <url>http://repository.mapr.com/maven</url> <releases><enabled>true</enabled></releases> <snapshots><enabled>false</enabled></snapshots> </repository> </repositories> ... <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka-clients</artifactId> <version>0.9.0.0-mapr-1602</version> <scope>provided</scope> </dependency> ...
You can install the MapR Client and run the application locally, or copy the jar file onto your cluster (any node). If you are installing the MapR Client be sure you also install the MapR Kafka package using the following command on CentOS/RHEL :
`yum install mapr-kafka`
$ scp ./target/mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar mapr@<YOUR_MAPR_CLUSTER>:/home/mapr
The producer will send a large number of messages to
/sample-stream:fast-messages along with occasional messages to
/sample-stream:summary-markers. Since there isn't any consumer running yet, nobody will receive the messages.
If you compare this with the Kafka example used to build this application, the topic name is the only change to the code.
Any MapR Streams application will need the MapR Client libraries. One way to make these libraries available to add them to the application classpath using the
/opt/mapr/bin/mapr classpath command. For example:
$ java -cp $(mapr classpath):./mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar com.mapr.examples.Run producer Sent msg number 0 Sent msg number 1000 ... Sent msg number 998000 Sent msg number 999000
The only important difference here between an Apache Kafka application and MapR Streams application is that the client libraries are different. This causes the MapR Producer to connect to the MapR cluster to post the messages, and not to a Kafka broker.
In another window, you can run the consumer using the following command:
$ java -cp $(mapr classpath):./mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar com.mapr.examples.Run consumer 1 messages received in period, latency(min, max, avg, 99%) = 20352, 20479, 20416.0, 20479 (ms) 1 messages received overall, latency(min, max, avg, 99%) = 20352, 20479, 20416.0, 20479 (ms) 1000 messages received in period, latency(min, max, avg, 99%) = 19840, 20095, 19968.3, 20095 (ms) 1001 messages received overall, latency(min, max, avg, 99%) = 19840, 20479, 19968.7, 20095 (ms) ... 1000 messages received in period, latency(min, max, avg, 99%) = 12032, 12159, 12119.4, 12159 (ms) <998001 1000="" 12095="" 19583="" 999001="" messages="" received="" overall,="" latency(min,="" max,="" avg,="" 99%)="12032," 20479,="" 15073.9,="" (ms)="" in="" period,="" 12095,="" 12064.0,="" 15070.9,="" (ms)<="" pre=""> Note that there is a latency listed in the summaries for the message batches. This is because the consumer wasn't running when the messages were sent to MapR Streams, and thus it is only getting them much later, long after they were sent. ## Monitoring your topics At any time you can use the maprcli tool to get some information about the topic. For example:
$ maprcli stream topic info -path /sample-stream -topic fast-messages -json
The `-json` option is used to get the topic information as a JSON document. ## Cleaning up When you are done playing, you can delete the stream and all associated topics using the following command:
$ maprcli stream delete -path /sample-stream
## Conclusion Using this example built from an Apache Kafka application, you have learned how to write, deploy, and run your first MapR Streams application. As you can see, the application code is really similar, and only a few changes need to be made (such as changing the topic names). This means it is possible to easily deploy your Kafka applications on MapR and reap the benefits of all the features of MapR Streams, such as advanced security, geographically distributed deployment, very large number of topics, and many more. This also means that you can immediately use all of your Apache Kafka skills on a MapR deployment. If you have any questions about running a MapR Streams application, please ask them in the comments section below.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.