Extending Your Stream of Record with MapR 6.0.1 and MEP 5.0


Today we released MapR 6.0.1 and MEP 5.0, ushering in new capabilities and enhancements across the product, from the core platform to MapR Database, Data Science Refinery, and Drill. The area that saw the most action was MapR Event Store, which gained several enhancements that improve developer productivity, application compatibility, and its ability to serve as a system of record.

"Stream of record" is a design pattern that we have seen take off in enterprises since MapR-ES (now called MapR Event Store) debuted two years ago. One of our MapR customers, Liaison Technologies, provides a compelling case study of both the technical and business benefits of this approach. In this blog, I'll walk through how we're making these architectures even more flexible.

Spark Structured Streaming

Spark Structured Streaming is a new stream processing API that makes creating real-time analytic applications easier than ever on the MapR Data Platform. Compared to the DStreams API in previous versions of Apache Spark, Structured Streaming presents a functional, SQL-like API that hides the gory details of how Spark works under the hood. At the same time, it introduces a rich library of useful analytic functions, like event-time windowing and aggregations, as well as a new interface for data sources and sinks. For example, with Spark Structured Streaming, users can write a few lines of code to build a data pipeline that consumes from MapR Event Store, performs transformations, and inserts the results into a MapR Database JSON table.
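
As a rough sketch of such a pipeline, assume a hypothetical stream at /apps/events with a clicks topic and a JSON table at /apps/tables/clicks; the event schema, sink format name, and option names below are illustrative, so check the MapR-DB Spark connector documentation for the exact names in your version:

```python
def topic_path(stream, topic):
    """MapR Event Store topics are addressed as '<stream path>:<topic>'."""
    return "%s:%s" % (stream, topic)

def start_pipeline():
    # pyspark imports are deferred so the sketch can be read (and the pure
    # helper above tested) without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, DoubleType)

    spark = SparkSession.builder.appName("clicks-to-maprdb").getOrCreate()

    # Hypothetical JSON event schema for this sketch.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("country", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Consume from MapR Event Store through the standard Kafka source.
    events = (spark.readStream
              .format("kafka")
              .option("subscribe", topic_path("/apps/events", "clicks"))
              .load())

    # Parse the message payload and keep the typed columns.
    parsed = (events
              .select(from_json(col("value").cast("string"),
                                schema).alias("e"))
              .select("e.*"))

    # Insert results into a MapR Database JSON table (the sink format
    # string and options here are illustrative, not authoritative).
    return (parsed.writeStream
            .format("com.mapr.db.spark.streaming")
            .option("tablePath", "/apps/tables/clicks")
            .option("idFieldPath", "event_id")
            .option("checkpointLocation", "/apps/checkpoints/clicks")
            .start())
```

Note the MapR convention of addressing a topic as `<stream path>:<topic>`, which is what lets the standard Kafka source API find the right stream.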

Stream of record users will notice that they can leverage the same API as traditional batch-oriented Spark (Spark SQL and Datasets) to do real-time processing. With only a few small code changes, an application can either process the full history of data that has moved through a stream or begin continuous processing on new data as it comes in. This capability also comes in handy for machine learning, as models can be trained using Spark ML on historical message data, then deployed into production using Structured Streaming.
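
That batch/streaming symmetry can be sketched like this, where `enrich` stands in for whatever Dataset transformations the job applies and the stream path is hypothetical:

```python
def starting_offsets(replay_history):
    """'earliest' replays the full stream of record;
    'latest' processes only data that arrives from now on."""
    return "earliest" if replay_history else "latest"

def run(spark, enrich, replay_history=False, streaming=True):
    # The only differences between modes are which reader is used and
    # where it starts; the transformation logic (enrich) is shared.
    reader = spark.readStream if streaming else spark.read
    df = (reader.format("kafka")
          .option("subscribe", "/apps/events:clicks")  # hypothetical topic
          .option("startingOffsets", starting_offsets(replay_history))
          .load())
    return enrich(df)
```

Running `run(spark, enrich, replay_history=True, streaming=False)` reprocesses the full history as a batch job, while flipping `streaming=True` turns the same logic into a continuous query.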

SQL on Event History with Apache Drill (Preview)

When your stream is your system of record, you will often want to query your historical event data. Some examples include:

  • Ad hoc exploration: "How many events have we seen from Europe?" or "What was the schema of this topic again?"
  • Prototyping: testing out an analytic query on historical data before writing a real-time app.
  • Monitoring: "How many events are currently stored in each topic partition?"

Apache Drill 1.13, released in MEP 5.0, includes a storage plugin that talks directly to MapR Event Store through the Kafka API, enabling the use cases above and more.
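
As a sketch, such queries might look like the following, assuming the Kafka storage plugin is registered as `kafka` and using illustrative column names (the `kafkaPartitionId` and `kafkaMsgTimestamp` metadata columns follow the plugin's naming, but verify them against your Drill version); the helper computes the millisecond bounds a time-windowed query needs:

```python
from datetime import datetime, timedelta, timezone

def day_window_ms(days_back_start, days_back_end, now=None):
    """Epoch-millisecond bounds for events between `days_back_start`
    and `days_back_end` days ago (e.g. 180 and 150)."""
    now = now or datetime.now(timezone.utc)
    lo = int((now - timedelta(days=days_back_start)).timestamp() * 1000)
    hi = int((now - timedelta(days=days_back_end)).timestamp() * 1000)
    return lo, hi

# Monitoring: how many events are stored in each topic partition?
COUNT_BY_PARTITION = """
SELECT kafkaPartitionId, COUNT(*) AS events
FROM kafka.`/apps/events:clicks`
GROUP BY kafkaPartitionId
"""

def events_from_europe_query(lo_ms, hi_ms):
    # Ad hoc exploration over a time window; the topic path and the
    # `region` column are hypothetical.
    return """
SELECT COUNT(*) AS events
FROM kafka.`/apps/events:clicks`
WHERE region = 'EU' AND kafkaMsgTimestamp BETWEEN %d AND %d
""" % (lo_ms, hi_ms)
```

The same window helper also covers compliance-style pulls, such as all events between 150 and 180 days ago.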

Given the newness of this feature, we're marking it "preview" for the time being, and we don't yet recommend using it for production applications. Take it for a spin and let us know what you think!

Event Timestamps and Time-Seek API

We've made several enhancements to the core MapR Event Store API to make applications and analytics more robust and to make new use cases possible. Those improvements include:

  • Event-time timestamps that allow MapR Event Store producers to mark each event with the timestamp at which the event occurred. This allows downstream applications and analytics to properly bucket the event with others in the same time range.
  • Time-seek API that allows MapR Event Store consumers to seek to the first event at or after a particular timestamp. This saves applications the time and compute previously spent scanning through messages for the right point in time.
  • Message headers that allow applications to separate each event's data from its metadata, making applications easier to write.
  • Message interceptors that allow standardized processing to be inserted before a producer sends a message to MapR Event Store and before a consumer processes a message from MapR Event Store. This standardized processing may include logic for masking or encryption, monitoring, auditing, or filtering.
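
To make the timestamp, time-seek, and header features concrete, here is a sketch in the shape of the Kafka 1.0 API, shown with the open-source kafka-python client names; the stream path, topic, and header keys are hypothetical, and network calls are isolated in their own functions:

```python
import time

def ms_timestamp(days_ago):
    """Epoch-millisecond timestamp `days_ago` days in the past."""
    return int((time.time() - days_ago * 86400) * 1000)

def to_headers(meta):
    """Encode a dict of string metadata as Kafka-style message headers
    (a list of (key, bytes-value) pairs)."""
    return [(k, v.encode("utf-8")) for k, v in sorted(meta.items())]

def seek_to_days_ago(consumer, partitions, days_ago):
    # Time-seek: ask for the earliest offset whose event timestamp is at
    # or after the target time, then position each partition there.
    target = ms_timestamp(days_ago)
    offsets = consumer.offsets_for_times({tp: target for tp in partitions})
    for tp, ot in offsets.items():
        if ot is not None:  # None when no message is that recent
            consumer.seek(tp, ot.offset)

def produce_example():
    from kafka import KafkaProducer  # deferred: needs a cluster to use
    producer = KafkaProducer()
    # Event-time timestamp and headers travel with the message; the
    # payload and header key here are illustrative.
    producer.send("/apps/events:clicks",
                  value=b'{"event_id": "e1"}',
                  headers=to_headers({"schema": "click-v1"}),
                  timestamp_ms=ms_timestamp(0))
    producer.flush()
```

With `seek_to_days_ago(consumer, assigned_partitions, 180)`, a consumer jumps straight to six-month-old data without scanning everything before it.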

The above represent the most significant new API changes to MapR Event Store as part of its adoption of the Kafka 1.0 API. There are a couple of Kafka 1.0 features not yet supported on MapR Event Store, so when in doubt, check our documentation to see what's supported.

Putting It Together

When you put all of these new capabilities together with a stream of record architecture, you can easily imagine a user who sits down with a cup of coffee and does the following:

  • Testing a new idea for a streaming analytic by quickly running a Drill query on historical data to see if her assumption holds.
  • Once confirmed, writing a Spark Structured Streaming application that runs the logic in real time, first running it against the historical data to work out bugs before deploying it on live data.
  • Fulfilling a request from the compliance department to pull all events that occurred between 150 and 180 days ago for user "Will Smith."

This release represents just a small taste of what we have planned for MapR Event Store in 2018. I can't wait to report back in a few months with some of the other stuff we've been working on.

This blog post was published April 03, 2018.