MapR Event Store Under the Hood – Whiteboard Walkthrough


In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how MapR built the capabilities of MapR Event Store for Apache Kafka (formerly called MapR Streams) that differentiate it from similar products on the market.

Here's the transcription:

Hi, I'm Will Ochandarena, Director of Product Management at MapR. When we launched MapR Event Store, a lot of people who were familiar with Kafka asked us how we achieved some of the differentiators that we claimed. During this Whiteboard Walkthrough, I want to go through the different things we achieved and some basics of how we pulled them off.

Let's start with the list. First, we have a single cluster for multiple data services: files, tables, and streams. Second, we are able to do event persistence, meaning once an event is published into the system, it's available forever for any analytics after the fact; you never have to worry about aging it out. Third, we support millions of producers and consumers, true IoT scale. Fourth, we support arbitrary global topologies with global failover. Let's jump into how we do some of these things.

First, how do we share a single cluster? Or rather, why would you want to share a single cluster? Multiple clusters means multiple data silos. It means redundant infrastructure; it means data duplication and data movement. One thing to realize about MapR and our data platform is that it's not based on a Linux file system; we've actually implemented it from the ground up on the concept of containers.

Within a cluster, we have many containers, where each container has multiple replicas. That's how we handle replication, failover, et cetera, in a consistent way across multiple data services. The first thing we do in order to share is store multiple different types of data structures in the same container, so that they sit next to each other in the namespace.

Files have chunks; tables have tablets; streams have partitionlets. All of these sit next to each other. All of them share the underlying primitives of containers. Ted Dunning does a great job of explaining containers and how they work in his Whiteboard Walkthrough.

Now that we're sharing a cluster between files, tables, and streams, let's talk a little bit about how we persist data once it goes into the system. Within a stream, there are data structures called "partitionlets." Each partitionlet stores a portion of the data for a stream. Unlike other messaging platforms, which store messages in files, we store the messages in a partitionlet. As messages are written into this data structure, the partitionlet grows until it reaches a size of about four gigabytes. At that point we close it, open a new one in another container, which can exist on another node depending on how data is balanced in the cluster, and start writing to the new partitionlet.

That can happen again and again and again, forming a linked list of partitionlets across the cluster. What this means is that as a producer writes data to a partition, you never have to worry about an individual node filling up its storage; you only have to look at the overall storage capacity of the cluster. This is how we are able to persist messages as long as you want. You don't have to keep everything forever, of course; we also support "time to live." But this is a good way of having a streaming system of record, much like Liaison Technologies is doing.
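The growth pattern described above can be sketched as a toy model: a partition as a linked list of fixed-size partitionlets, each placed in some container by a placement policy. The class names, the placement function, and the scaled-down size cap are all invented for illustration; this is not MapR's implementation, just the shape of the idea.

```python
import itertools

class Partitionlet:
    """One segment of a partition, living in a single container."""
    def __init__(self, container_id, max_bytes):
        self.container_id = container_id
        self.max_bytes = max_bytes
        self.messages = []
        self.size = 0
        self.next = None  # link to the successor partitionlet

    def has_room(self, msg):
        return self.size + len(msg) <= self.max_bytes

class Partition:
    """A partition is a linked list of partitionlets across containers."""
    def __init__(self, place_container, max_bytes=4 * 1024**3):  # ~4 GB in the talk
        self.place_container = place_container  # stands in for cluster balancing
        self.max_bytes = max_bytes
        self.head = Partitionlet(place_container(), max_bytes)
        self.tail = self.head

    def append(self, msg):
        if not self.tail.has_room(msg):
            # Close the full partitionlet and open a new one in a
            # (possibly different) container on another node.
            new = Partitionlet(self.place_container(), self.max_bytes)
            self.tail.next = new
            self.tail = new
        self.tail.messages.append(msg)
        self.tail.size += len(msg)

# Round-robin placement stands in for the cluster's data balancing;
# a tiny 64-byte cap makes the rollover visible.
containers = itertools.cycle(["container-1", "container-2", "container-3"])
p = Partition(lambda: next(containers), max_bytes=64)
for i in range(20):
    p.append(f"message-{i:04d}".encode())
```

With 12-byte messages and a 64-byte cap, each partitionlet holds five messages before a new one is opened, so the twenty writes spread across four linked partitionlets with no single "node" ever forced to hold more than its share.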

Next, let's talk about how we achieve scalability to millions of topics and millions of consumers. Let's dive a little deeper into what the data structure for a partition looks like. Within a partitionlet, one thing you'll notice is that we store two things other than the messages themselves: topic information and consumer information, particularly consumer offsets.

What that means is that we don't rely on any external system like ZooKeeper to keep track of topics or consumers. That means we can scale to millions and millions, because the same data path that stores the messages stores all this other information as well. That's what scales out to IoT scale.
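The key point, that consumer state lives alongside the message log rather than in an external coordinator, can be shown with a minimal sketch. All names here are invented; the only thing taken from the talk is the co-location of messages and consumer offsets in one data structure.

```python
class PartitionletState:
    """Toy partitionlet holding both the log and consumer offsets."""
    def __init__(self):
        self.messages = []          # the message log itself
        self.consumer_offsets = {}  # consumer-group id -> committed offset

    def publish(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def poll(self, group):
        # Read from the group's committed offset onward.
        start = self.consumer_offsets.get(group, 0)
        return self.messages[start:]

    def commit(self, group, offset):
        # The commit travels down the same data path as the messages,
        # so it replicates and fails over exactly like they do.
        self.consumer_offsets[group] = offset

s = PartitionletState()
for i in range(3):
    s.publish(f"m{i}")
batch = s.poll("sensors")        # all three messages on first poll
s.commit("sensors", len(batch))  # next poll starts after them
```

Because the offsets are just more data in the partitionlet, adding a millionth consumer group adds one more map entry on the same data path, not one more session on a separate coordination service.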

Let's talk about global. For global, we're able to support an arbitrary topology of up to thousands of clusters, with failover of producers and consumers between those clusters. Let's take a really simple example of three clusters, San Jose, New York, and London, in a topology where each is just replicating in a chain from one to the next. The first thing you'll notice here is that we have a loop. Loops are often a bad thing in messaging systems, because they mean a message goes from San Jose to New York to London, back to San Jose, back to New York, back to London, infinitely.

One thing that we do to be smart here, because we built replication directly into the underlying data system, is include a loop-breaking, or loop-detection, mechanism. As a message goes from London to San Jose to New York, the actual path it has taken gets appended to it. When the message reaches New York and is deciding whether to go back to London, it sees that London is already on the replication chain, so it doesn't. That's how we break loops.

This is a simple example of three clusters. This could be many, many, up to thousands of clusters, where each individual cluster replicates to up to sixty-four destination clusters. You can form multi-layer topologies if you want to reach true global IoT scale.

Now, let's talk about failover. Think of an IoT-like application, say a smart city where a car is driving around, and that car needs to connect to multiple clusters so it has low-latency, instant access to messages. A car may want to fail over from listening to one cluster to another. Because these partitionlet data structures are copied in a consistent manner, all of the data is transferred: all of the messages, all of the offsets for those messages, and all of the consumer cursors are replicated. When a car needs to fail over from one cluster to the other, not only are the messages consistent with the same offsets in both clusters, but the destination cluster has already received information about that specific car. When the car fails over, the cluster is expecting it and can feed it messages right where it left off.
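That failover behavior can be sketched with two toy clusters whose messages and consumer cursors are mirrored together. The class, the `mirror_cluster` helper, and the car ID are all invented for illustration; the point taken from the talk is that replicating cursors along with messages makes offsets mean the same thing on both sides.

```python
class Cluster:
    """Toy cluster holding a message log plus per-consumer cursors."""
    def __init__(self, name):
        self.name = name
        self.messages = []
        self.cursors = {}  # consumer id -> next offset to read

    def read_next(self, consumer):
        off = self.cursors.get(consumer, 0)
        if off >= len(self.messages):
            return None
        self.cursors[consumer] = off + 1
        return self.messages[off]

def mirror_cluster(src, dst):
    # Consistent replication: copy messages *and* consumer cursors,
    # so an offset refers to the same message on both clusters.
    dst.messages = list(src.messages)
    dst.cursors = dict(src.cursors)

san_jose, new_york = Cluster("san-jose"), Cluster("new-york")
san_jose.messages = ["m0", "m1", "m2", "m3"]

car = "car-117"
first = san_jose.read_next(car)   # car reads two messages in San Jose
second = san_jose.read_next(car)
mirror_cluster(san_jose, new_york)

# The car fails over: New York already knows its cursor,
# so it resumes with the third message, not the first.
third = new_york.read_next(car)
```

If only the messages had been copied, New York would restart the car at offset zero; copying the cursor is what makes the handoff seamless.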

That should cover our major differentiators and how we built them. If you want to get started for free, MapR Event Store is in our Community Edition. We also have a stand-alone streams edition, so you can get started with just streams; you don't need to buy the Converged Edition, although that's also available. Please feel free to comment on the video, and stay engaged with us. Thanks!

This blog post was published December 17, 2015.
