Streaming Data with MapR


In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Event Store, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.

Here's the unedited transcription:

Hi, this is Mansi. I’m an engineer at MapR and today, I’m going to talk about streaming with MapR. When we talk about streaming, typically what we’re referring to is, when lots of producers are producing some form of order data, and that needs to get transferred over to consumers for some kind of analytics. If you’re familiar with Kafka, you might be familiar with some of the terms on the board here right now, but let me just go through each one of them.

A topic is typically where all related messages go. A producer puts related messages into one topic, and a producer can be producing to one or more topics. Your consumers can be listening to one or more topics and consuming the messages within them. For administrative purposes, MapR wraps another, bigger pipe around these topics, which we call the stream.
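The topic/producer/consumer relationship described above can be sketched as a toy in-memory broker. This is purely an illustration of the model, not the MapR or Kafka API; all names here are made up:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: each topic is an append-only list of messages."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        # A producer appends a message to one of its topics.
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # A consumer reads a topic starting from a given position.
        return self.topics[topic][offset:]

broker = Broker()
broker.produce("clicks", "page1")
broker.produce("clicks", "page2")
broker.produce("sensors", "temp=21")   # one producer can write to many topics

print(broker.consume("clicks"))   # ['page1', 'page2']
```

Note that consumers only read; the topic itself keeps the messages, which is what makes it possible for several consumers to read the same topic independently.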

As you can see here, a stream is actually a first-class entity in the MapR Distributed File and Object Store. It lives within the MapR namespace and sits right next to a file or a table. By virtue of that, it gets all the features that are typical of the MapR platform: reliability, security, and performance. That's the big-picture concept: producers producing into these queues, consumers consuming from the topics, and the stream wrapping it all.

Having said that, let me jump quickly into scale. How do we do scaling? Let's say there is one topic in your cluster and you have four nodes; how would you use all four nodes in that case? You could partition your topic into multiple partitions. Here we have two partitions. Each partition starts out by creating a chunk on one node, so P1 is on N1 and P2 is on N2. As a chunk fills up, at about four gigabytes, we pick another node and start writing the next chunk there. That makes sure your cluster is always used evenly: the data is distributed evenly, and the cluster is used at its full capacity.
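A rough sketch of that placement behavior, assuming for illustration that each new 4 GB chunk simply goes to the least-loaded node (the transcript only says "another node" is picked, so the exact policy here is an assumption):

```python
CHUNK_SIZE_GB = 4

class Cluster:
    def __init__(self, nodes):
        # GB of stream data currently stored on each node
        self.load = {n: 0 for n in nodes}

    def place_chunk(self):
        # Put the next 4 GB chunk on the least-loaded node so the
        # cluster fills evenly (placement heuristic assumed here).
        node = min(self.load, key=self.load.get)
        self.load[node] += CHUNK_SIZE_GB
        return node

cluster = Cluster(["N1", "N2", "N3", "N4"])
# Partitions writing chunks as they fill up:
placements = [cluster.place_chunk() for _ in range(8)]
print(placements)    # ['N1', 'N2', 'N3', 'N4', 'N1', 'N2', 'N3', 'N4']
print(cluster.load)  # {'N1': 8, 'N2': 8, 'N3': 8, 'N4': 8}
```

However the policy is implemented, the effect is the one described in the talk: no single node accumulates a whole partition, so storage and write load spread across the cluster.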

Now, let's talk about cross-cluster replication, a key feature of the MapR platform that comes for free with streaming as well, because streaming is built on the MapR platform. First, I'd like to cover the different topologies we could use for cross-cluster replication. One is where you have, say, two sites, and you want to be able to fail over between them. You have producers producing and consumers consuming at, say, a San Francisco site, and you keep a backup DR site in San Jose, so all your data is getting replicated there. If San Francisco fails, you can fail over to San Jose and continue from there.

That's one way of looking at it. Another reason people do replication is aggregation. Suppose you have a lot of small data-collection sites all across the world. Here we see a bunch of sites across the US, while the actual analytics work happens in a big data center in San Francisco, so you want to move all your data into that central location and do the analytics there. That's also possible using replication, with replication links from the different sites into the central cluster.

The third possibility is a fully connected or partially connected chain or mesh topology, where, say, you have a site in San Francisco, one in Sydney, one in Tokyo, and one in London. You have producers at all these sites, and you may want to consume and process data at any one of the sites for all the producers across the world. To do that, you would connect all the sites using replicas. It's a chain topology, as you can see, but the application itself is smart enough to understand where the chain ends. That is one way of connecting your clusters.
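One way a chain or mesh can know "where the chain ends" is to track which sites have already seen a message, so nothing is re-delivered forever around a loop. The transcript doesn't specify MapR's actual mechanism, so this toy sketch simply assumes that scheme; site names and the message shape are illustrative:

```python
def forward(message, current_site, links):
    """Forward a replicated message onward, skipping sites that have
    already seen it, so a chain or mesh never loops (loop-prevention
    scheme assumed for illustration)."""
    for nxt in links.get(current_site, []):
        if nxt not in message["seen"]:
            message["seen"].add(nxt)
            forward(message, nxt, links)
    return message["seen"]

# Chain SF -> Sydney -> Tokyo -> London, plus a back-link London -> SF
links = {"SF": ["Sydney"], "Sydney": ["Tokyo"],
         "Tokyo": ["London"], "London": ["SF"]}

msg = {"payload": "tick", "seen": {"SF"}}   # produced at SF
forward(msg, "SF", links)
print(sorted(msg["seen"]))   # every site received it exactly once
```

Even with the cycle back to San Francisco, the message stops once every site has a copy.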

Now that we've talked about the different topologies, let's get into the details of how replication actually works. The administrative unit in MapR Event Store is a stream, so let's look at this example where I have set up replication between two sites. Say financial transactions are happening in New York and in London, and I want all the data from both sites to be available at both sites. So I created a stream in the New York MapR cluster called MapR New York transactions, and a stream called MapR London transactions in the London cluster. Then I set up a replica between these two streams.

We have two topics in each of the streams. The blue topic is where producers produce in New York, and that data gets replicated into the same topic on the other side, so let's call it the New York topic; this one here is also the New York topic, it just sits in the London cluster. The other topic is the London topic: its producers produce in London, and that topic gets replicated to the New York site.

Basically, the replication between the streams is two-way, but for any given topic, you only have producers at one of the sites. For the New York topic, your producers are only in New York, and that data just gets replicated to London. For the London topic, your producers are only in London, and the data gets replicated to New York.
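That per-topic producer locality can be expressed as a simple invariant: each topic has one "home" site where its producers live, and the stream replica carries the data to the other site. A hypothetical sketch (topic and site names are made up):

```python
HOME_SITE = {"new-york-topic": "NY", "london-topic": "London"}

def produce(store, topic, site, message):
    """Accept a write only at the topic's home site (sketch of the
    one-site-per-topic producer rule described in the talk)."""
    if HOME_SITE[topic] != site:
        raise ValueError(f"{topic} accepts producers only in {HOME_SITE[topic]}")
    store.setdefault(topic, []).append(message)   # replication mirrors this

store = {}
produce(store, "new-york-topic", "NY", "txn-1")        # accepted
try:
    produce(store, "new-york-topic", "London", "txn-2")
except ValueError as err:
    print(err)   # producing to the NY topic from London is rejected
```

Keeping producers for a topic on a single site is what lets two-way stream replication work without conflicting writes to the same topic.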

At any moment, if one of the sites fails, you can seamlessly fail both your producers and consumers over to the other site. The producers from New York can start writing to the copy of the New York topic sitting in London, and the consumers that were reading the New York topic at the New York site can start reading that data from London. There is a little more smarts in the system: for every consumer, the system remembers up to what point that consumer has consumed.

Let's say a consumer here had read messages M1 and M2. Now the site goes down. When that consumer fails over and starts reading the same topic from London, it will not read M1 and M2 again. It knows it needs to start from M3, and the system will hand out messages only from M3 onward. That makes for a very seamless fail-over solution.
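The M1/M2/M3 behavior above can be sketched with a topic replicated to two sites plus a per-consumer cursor that both sites share. This is a minimal illustration of the idea, assuming the cursor itself is replicated along with the data (class and method names are invented):

```python
class ReplicatedTopic:
    """Two replicas of one topic; consumer cursors are shared across
    sites, so a fail-over resumes where the consumer left off."""
    def __init__(self, messages):
        self.sites = {"NY": list(messages), "London": list(messages)}
        self.cursors = {}   # consumer -> next offset, visible at both sites

    def read(self, consumer, site, n=1):
        start = self.cursors.get(consumer, 0)
        batch = self.sites[site][start:start + n]
        self.cursors[consumer] = start + len(batch)
        return batch

topic = ReplicatedTopic(["M1", "M2", "M3", "M4"])
print(topic.read("c1", "NY", n=2))       # ['M1', 'M2']
# New York goes down; the consumer fails over to London:
print(topic.read("c1", "London", n=2))   # ['M3', 'M4'] -- no replay of M1/M2
```

Because the cursor lives in the system rather than in the consumer, the London replica hands out M3 first, exactly as described.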

Having talked about fail-over, the other important thing is that we maintain the order of the messages. Every message gets a sequence number, which streaming systems call an offset. Let's say a message got offset one, and the next ones got two, three, and four; those offsets are maintained on the other side as well. That also helps make fail-over seamless.

The most important and most crucial piece of this replication is that it is 100% reliable. Now, how do we do that? Let's just talk about the New York topic getting replicated to its copy on the London side. When data is produced into the New York topic, it gets written into a write-ahead log, which we call a bucket. Replication is done one bucket at a time: unless the other side acknowledges that it has the data, and that the data has been stored securely in the MapR cluster at the destination site, the source will not get rid of the bucket.

Basically, if the London site fails, we hold on to the bucket. If the network goes down, we hold on to the bucket. Unless the other side confirms that it has the data, this data is not going to go away anywhere. That is how we made this replication 100% reliable, which is a fairly distinguishing point between MapR Event Store and other messaging systems. That is the overview of how replication works.
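The hold-until-acknowledged rule can be sketched as follows. The replication is simulated with an in-process list standing in for the destination cluster, and the class and method names are illustrative, not MapR APIs:

```python
from collections import deque

class Replicator:
    """Write-ahead 'buckets' are kept at the source until the
    destination durably stores them (reliability rule from the talk)."""
    def __init__(self):
        self.pending = deque()   # buckets not yet acknowledged
        self.destination = []    # what the remote site has durably stored

    def write_bucket(self, bucket):
        self.pending.append(bucket)

    def try_replicate(self, link_up=True):
        # Ship buckets one at a time; discard each only after the
        # destination has it.
        while self.pending and link_up:
            bucket = self.pending[0]
            self.destination.append(bucket)   # remote site stores it...
            self.pending.popleft()            # ...then, and only then, discard

r = Replicator()
r.write_bucket(["M1", "M2"])
r.try_replicate(link_up=False)   # network down: bucket is held, not lost
print(len(r.pending))            # 1
r.try_replicate(link_up=True)    # link restored: bucket ships and is freed
print(r.destination)             # [['M1', 'M2']]
```

The key property is that a bucket exists in at least one cluster at all times; an outage only delays replication, it never drops data.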

Now, if we go back to these topologies, you can see that each of them uses one of these models: here you have only one-way replication; here you have two-way replication; and here you have multi-way replication, where every stream in San Francisco was getting replicated to a stream in Sydney, to the same stream in Tokyo, and to the same stream in London. That fits into the bigger picture of how replication happens and how these topologies work. Thanks for watching the video. I hope you liked it, and do let us know in the comments if you have any questions or would like us to cover more topics in detail next time. Thank you!

This blog post was published December 10, 2015.
