MapR: Big Data in the Cloud | Whiteboard Walkthrough

December 06, 2016 | BY Ted Dunning


In this Whiteboard Walkthrough, MapR Chief Application Architect Ted Dunning explains how special capabilities such as mirroring, bi-directional stream and table replication, and control of data locality make MapR particularly effective in cloud computing, whether you use cloud-to-cloud clusters or a hybrid of cloud and on-premise. Ted also explains how cloud bursting is a useful strategy for elastic workloads.

Here is the full video transcription:

Hi, I'm Ted Dunning. I work for MapR, and I'd like to talk about running MapR clusters in the cloud. We get that question a lot. People ask us, "Can you do it? Can you run MapR in the cloud?" They actually mean about four different things. They might mean, can you just do it? They might mean, will we do it? They might mean things like cloudbursting and hybrid clouds. Let's talk about each one of those.

The first one's the easiest. Can you do it? Can you run MapR in cloud-based clusters? The answer there is absolutely. For years now, MapR has been the exclusive partner of Amazon for providing the technology that runs EMR. The majority of the options for configurations in EMR are based on MapR's distribution. We also were the exclusive launch partner of Google when they brought out the Google Compute Engine. We were the big data Hadoop choice there. We also partner with Azure. All of these options provide very nice ways to run MapR clusters in the cloud. You can do other things as well: you can bring your own licenses. Amazon Marketplace works. Lots of options. That's the easiest answer of all.

Second question has to do with whether we will actually manage a cluster for a customer. Now, plausibly we do that under professional services, but we have partners. You can contact us, we'll set you up. Again, there's no real problem there.

The third and fourth questions about bursting and about hybrid usage have much more meat to them. Let's talk about those, starting with hybrid for a moment. If you have an on-premise data center that has a MapR cluster in it, you can establish a cloud cluster. It's obviously in the cloud there. Because MapR provides key technologies for data replication, you can synchronize these two clusters. This works just the same way as if you had two on-premise data centers.

Effectively this is exactly the same thing that they do in the Aadhaar project, that's the Unique Identification Authority of India. They have four data centers, two conjoined in the north, two conjoined in the south. You've got synchronous replication, synchronous replication, and then asynchronous replication between north and south. I've got east-west here, but that's the same idea here. You can have multiple data centers either on-premise or in the cloud act effectively as one. They survive if one goes away. You could do a split brain between them. Clients that are touching either one can choose which one they want to touch. You can do it also cloud to cloud, that's a fine option as well.

To do that, MapR provides key technologies. The first of these, of course, is HA. If you're going to run in the cloud, an instance could disappear at any time for any reason. The cloud provider reserves the right to take a machine down to do any kind of maintenance at all. You have to expect machines to churn a bit. In fact, it's good practice for you to do it yourself. HA is absolutely critical. High availability has to be there if you're going to run in the cloud for any extended period. Extended, by the way, means more than an hour.

You also get mirroring with MapR. That means an on-premise dataset can be mirrored to the cloud, and results can be mirrored back. That's for large-scale volume transfer. Table replication is also supported in MapR. That means two tables can be linked by replication. You get multi-master, bi-directional replication. Inserts here appear there, and inserts there appear here. We deal with conflicts via timestamp resolution, and you get the sort of multi-center replication you need, again, for any of these configurations. Streams are subject to the same kind of asynchronous but very fast replication. If I insert into a topic over here, it appears in the same topic over there.
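To make the timestamp resolution idea concrete, here is a minimal sketch in Python of two tables that accept writes independently and then exchange them. It is an illustration only, not MapR's implementation or API: the `Cell`, `Table`, and `replicate` names are hypothetical, timestamps are plain integers, and ties are broken by comparing values, which is an assumption made here so that both sides always converge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # hypothetical write time, e.g. milliseconds


class Table:
    """A toy key-value table that keeps the newest cell per key."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, key, value, timestamp):
        """A local write; it goes through the same rule as replicated writes."""
        self.apply(key, Cell(value, timestamp))

    def apply(self, key, cell):
        """Keep the incoming cell only if it is newer than what we hold.

        Ties on timestamp are broken by value (an assumption of this sketch)
        so that both replicas make the same choice.
        """
        current = self.rows.get(key)
        if current is None or (cell.timestamp, cell.value) > (
            current.timestamp,
            current.value,
        ):
            self.rows[key] = cell


def replicate(source, target):
    """Ship every cell from source to target; the target resolves conflicts."""
    for key, cell in source.rows.items():
        target.apply(key, cell)


# Both sides accept writes (multi-master), including a conflicting one.
onprem = Table("on-premise")
cloud = Table("cloud")
onprem.put("user1", "written on-premise", 100)
cloud.put("user1", "written in cloud", 200)
cloud.put("user2", "only in cloud", 150)

# Bi-directional replication: each side sends its cells to the other.
replicate(onprem, cloud)
replicate(cloud, onprem)
```

After both directions have run, the two tables hold the same rows, and the conflicting key keeps the write with the later timestamp.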

That isn't just like copying messages from one stream to another. It maintains all aspects, consumer offsets, message offsets. If I have two topics, one which contains the data, another which is an index containing references to that data, the index can be replicated and it still works exactly on the far side. If I have a consumer that is touching one data center, reading from a topic here and remembering offsets there, perhaps committing them to the stream, then if it starts using the other data center, all of the offsets, all of the indexes, all of the consumer offsets are maintained.
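The difference between copying messages and replicating a stream can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the MapR Streams API: `Topic`, `Stream`, and `replicate` are hypothetical names, and the replication here is a simple one-way catch-up. The point it illustrates is that messages keep their offsets on the far side, so an index topic that stores offsets and a consumer's committed position both remain valid after moving to the other data center.

```python
class Topic:
    """A toy topic: messages are addressed by a monotonically growing offset."""

    def __init__(self):
        self.messages = {}  # offset -> payload
        self.next_offset = 0

    def append(self, payload):
        offset = self.next_offset
        self.messages[offset] = payload
        self.next_offset += 1
        return offset

    def read_from(self, offset):
        return [(o, self.messages[o]) for o in range(offset, self.next_offset)]


class Stream:
    """A toy stream: a set of topics plus committed consumer positions."""

    def __init__(self):
        self.topics = {}
        self.committed = {}  # (consumer group, topic name) -> next offset to read

    def topic(self, name):
        return self.topics.setdefault(name, Topic())

    def commit(self, group, topic_name, offset):
        self.committed[(group, topic_name)] = offset


def replicate(source, target):
    """Copy new messages at their ORIGINAL offsets, plus consumer positions."""
    for name, src in source.topics.items():
        dst = target.topic(name)
        for offset in range(dst.next_offset, src.next_offset):
            dst.messages[offset] = src.messages[offset]
        dst.next_offset = max(dst.next_offset, src.next_offset)
    target.committed.update(source.committed)


site_a, site_b = Stream(), Stream()

# A data topic, and an index topic whose entries are offsets into the data.
for reading in ["t=20.1", "t=20.4", "t=35.0", "t=20.2"]:
    offset = site_a.topic("data").append(reading)
    if reading == "t=35.0":
        site_a.topic("index").append(offset)

# A consumer at site A has processed the first two messages.
site_a.commit("dashboard", "data", 2)

replicate(site_a, site_b)

# The same consumer fails over to site B and resumes where it left off.
resume_at = site_b.committed[("dashboard", "data")]
remaining = site_b.topic("data").read_from(resume_at)

# The replicated index still points at the right message on site B.
indexed_offset = site_b.topic("index").read_from(0)[0][1]
indexed_message = site_b.topic("data").messages[indexed_offset]
```

If the copy had simply re-appended messages to a fresh topic, offsets on the far side could differ, and both the index entries and the committed position would point at the wrong messages.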

You can even insert into multiple data centers. Two different topics can be inserted in the same stream simultaneously… bi-directional replication. A single topic at one time can be inserted in either data center. These are amazing technologies that are provided with MapR to make multi-data center deployments, including cloud deployments, very, very easy. That makes this sort of hybrid on-premise, in-cloud system very easy to build and very easy to maintain, because it's an administrative control to replicate and synchronize data.

There's more. Sounds like you're going to get steak knives, but this time what we're talking about is cloudbursting. With cloudbursting, burst compute, a core cluster is running in the cloud, and then temporarily we add additional resources to it. It bursts to a larger size, burst in the sense of short term, and then shortly you return those resources to the cloud, and you come back to your core. Again, the ability of a MapR cluster to locate data in a particular location in a cluster, in this case in the core, means that when the burst is finished and you return those resources back to the cloud, the core results will be in the core cluster.

Since data has metaphorical mass and is expensive and hard to move, if we had large results that were in the burst part of the cluster, that would be bad. That would mean it would take a long time to let go of the burst part of the cluster. If instead the results are constrained to live in the core, then the burst can be evanescent. It can disappear without any loss. That is also a key capability of MapR clusters.
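The bursting argument can be sketched as a small Python model. It is a simplified illustration, not MapR's placement mechanism: the `Cluster` class, the node names, and the "least-loaded core node" rule are all assumptions made for the example. What it shows is the principle from the talk: burst nodes contribute compute, results are constrained to core nodes, and so releasing the burst nodes requires no data movement.

```python
class Cluster:
    """A toy cluster in which stored results are constrained to core nodes."""

    def __init__(self, core_nodes):
        self.roles = {name: "core" for name in core_nodes}
        self.data = {name: [] for name in core_nodes}

    def burst(self, names):
        """Temporarily add compute nodes; they start out holding no data."""
        for name in names:
            self.roles[name] = "burst"
            self.data[name] = []

    def write_result(self, result):
        """Place a result on the least-loaded CORE node, never on a burst node."""
        core = [n for n, role in self.roles.items() if role == "core"]
        target = min(core, key=lambda n: len(self.data[n]))
        self.data[target].append(result)
        return target

    def run_job(self, tasks):
        """Spread the compute over every node, but store results in the core."""
        workers = list(self.roles)
        for i, task in enumerate(tasks):
            worker = workers[i % len(workers)]
            self.write_result(f"{task} computed on {worker}")

    def release_burst(self):
        """Return burst nodes; report how many results would have been lost."""
        lost = 0
        for name in [n for n, role in self.roles.items() if role == "burst"]:
            lost += len(self.data.pop(name))
            del self.roles[name]
        return lost


cluster = Cluster(["core-1", "core-2"])
cluster.burst(["burst-1", "burst-2", "burst-3"])
cluster.run_job([f"task-{i}" for i in range(10)])

lost = cluster.release_burst()
stored = sum(len(results) for results in cluster.data.values())
```

Because nothing was ever written to the burst nodes, `release_burst` has nothing to drain, and all ten results are still in the core afterwards.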

What we have, then: cloud hosting, check. Managed clusters, check, via partners. Hybrid with cloud and on-premise, check. Bursting, check, and of course any combination of this. An on-premise cluster, a core cluster, another local cluster might need to be in a particular country to meet data sovereignty laws. Some data's replicated either direction, some data may not be. Some core cluster may need to be burst to a larger size. All of these options can be combined with MapR technology.

It's really quite a story there. You don't have to start with the on-premise; you could start with the cloud and move on-premise, depending on what your loads are, how the costs trade off. Cloud is excellent for elastic loads, for loads you didn't know you were going to have. On-premise is excellent for, put a guard around it and really, really lock it down. You own it, you know where it is, the costs are controlled. They trade off different characteristics. You can start in the cloud, you can start on-premise.

This is also the basis of really, really large IoT, Internet of Things, sort of installations. You might want local presence in many places, but you want your core summary of the data, the collection of all that data, to be in a central location. In those cases, bursting, and hybrids, and data motion are all key elements of your strategy. That's MapR in the clouds. Thanks very much.