MapR: Converged Advantages in the Cloud | Whiteboard Walkthrough

In this week’s Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, describes the advantages of the MapR Data Platform and how they work in the cloud. With files, tables, and streams engineered into the same technology, MapR has particular advantages for multi-tenancy in the cloud, including common pathnames and common security.

Here is the full video transcription:

Hi. I'm Ted Dunning. I work at MapR. I'd like to talk today about how the convergence features of the MapR converged platform really enhance operating in the cloud. We have a couple of things here. I've talked in other places about how you can run clusters in the cloud, how you can run hybrid clusters, and how you can burst them to a larger size. I've talked elsewhere about what convergence means. But a question we've had is: are the converged features still viable in the cloud? Does the cloud capability of MapR mesh with convergence, or are the two unrelated? Do you lose the advantages of convergence when you're in the cloud? The answer is that the convergence features are very, very helpful in the cloud. Let's talk.

Hybrid is when you have on-premise and cloud. Of course you want to synchronize them. Bursting is when the cloud cluster gets bigger, generally temporarily, and then shrinks back to a core. You can also have hybrid and cloud-cloud configurations. That's the basics of running clusters in the cloud. Of course MapR does that very well because of data replication capabilities.

Convergence. Convergence, at its heart, is where a data platform doesn't just have files. Tables, streams, and files are all first-class. First-class is a computer science term for whether something participates fully in the ecosystem, with all of the rights, privileges, and so on that any other object has. In particular, we want to have common permissions, common security, common pathnames. I want to be able to say this stream is in my home directory. The way I can say it's in my home directory is that its pathname sits under my home directory, right alongside the files in my home directory.

You could do that and still have all kinds of separate systems. We really want to share the resources. We want to have multi-tenancy. We want it to be the same system, even the same code, supporting it all. We even want it to have the same heritage. We want to share all of that, all of the history of security, reliability, HA. These are what convergence is made of: you get all of these things in the same platform. You refer to them and talk to them with the same names. Of course you use different APIs because they're different kinds of things, but they share resources, share tenancy, share security, and all of that. That's what convergence is.

One of these things deserves a little note here: shared resources. That's one that most people don't have really clear in their heads. Why is that good? Let's talk about that. Suppose that we draw the usage of resources over time. Time is passing, and in the first moment, we're going to do one kind of thing. We might be ingesting a lot of data, so we might have a very heavy streaming load. It goes down, but it doesn't go away, so we can't stop streaming. After we ingest the data, we have a sudden peak of compute. In this drawing the two peaks are the same height; who knows whether they would be in practice. In fact, the resources consumed might even be different kinds of resources, but the peaks here, as I've drawn them, are the same.

That means we need a certain size resource for streaming and a certain size resource for compute. With separate clusters, the total resource is the sum of the peaks. If it's shared, then the total resource usage is kind of like this. It fits over the top of those, and because the peaks are not at the same time, and they never are really at the same time, the total resource we need in a converged cluster is about the same size as either one of the streaming and compute clusters we needed before. By sharing those resources, where one needs them and then the other needs them, we wind up with a much smaller set of resources than we would otherwise need.
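The arithmetic behind the drawing can be sketched in a few lines. The load numbers below are made up for illustration (they are not measurements from any cluster), and the function names are hypothetical. Siloed clusters must each be sized for their own peak, so their total is the sum of the peaks; a shared cluster only has to cover the peak of the combined load.

```python
# Hypothetical load profiles in arbitrary units over six time steps.
# Streaming is heavy during ingest, then drops to a low but nonzero floor.
streaming = [10, 9, 3, 2, 2, 3]
# Compute peaks only after the data has been ingested.
compute = [1, 2, 8, 10, 4, 1]


def siloed_capacity(*profiles):
    """Each workload gets its own cluster, sized for that workload's peak."""
    return sum(max(profile) for profile in profiles)


def shared_capacity(*profiles):
    """One shared cluster, sized for the peak of the combined load."""
    return max(sum(step) for step in zip(*profiles))


print(siloed_capacity(streaming, compute))  # 20 (10 + 10)
print(shared_capacity(streaming, compute))  # 12 (largest combined step is 2 + 10)
```

With these assumed numbers the shared cluster needs 12 units against 20 for the two silos, close to the size of a single peak. The saving disappears if the peaks coincide, which is why the diversity in time matters.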

This doesn't just apply in that one case with one application. Multi-tenancy really applies when lots of people are running lots of applications. In MapR clusters, you can wind up with hundreds and hundreds of applications in the same cluster, all co-existing, all providing this diversity in time and diversity in kind of resource usage. That means that in a heavily multi-tenant, multi-application situation, the total size of a MapR cluster is much smaller than the combined size of the heavily siloed and firewalled set of clusters that you might have if you didn't have convergence.

Convergence carries with it both a responsibility for shared resources and a solution to sharing them. It only works, of course, if you have multi-tenancy. Multi-tenancy is a key feature of convergence. That's an odd note, because most people don't see that implication. Then, the other advantage in a MapR system is that the streaming, the tables, and the files all use the same technology. They run on the same system. It's the same code at its core. The core structure in a MapR system for files, for directories, for streams, and for tables is an underlying B-tree based on our container technology. Because that's common, you get some of these other advantages: common pathnames, common security, and so on.

When we put these together, when we combine cloud and convergence, we actually get more than the sum. It's very cool. Because we have convergence, and because we have these data motion capabilities of MapR clusters, you can have streaming and files out here. For instance, data in many telco situations is ingested via FTP. That means files, file transfer protocol. That means files land in that cluster. Then those files can be converted automatically into streaming by a little watcher program. The stream can be replicated, using MapR data motion, into here. It can be processed in a bursty style, and then the results could be tables, say as summaries, which can now replicate back into our on-premise cluster.
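The "little watcher program" mentioned above could look something like the following sketch. It is not MapR code: the function `publish_new_files`, the `publish` callback, the file names, and the stream and topic name are all hypothetical stand-ins. In a real deployment the callback would wrap whatever stream producer client the cluster provides. The sketch shows a single polling pass that turns each line of every newly landed file into one stream message.

```python
import tempfile
from pathlib import Path
from typing import Callable, Set


def publish_new_files(
    landing_dir: Path,
    seen: Set[str],
    publish: Callable[[str, bytes], None],
    topic: str,
) -> int:
    """One polling pass: send each line of every not-yet-seen file as a message."""
    sent_count = 0
    for path in sorted(landing_dir.iterdir()):
        if not path.is_file() or path.name in seen:
            continue
        with path.open("rb") as handle:
            for line in handle:
                publish(topic, line.rstrip(b"\n"))
                sent_count += 1
        seen.add(path.name)  # remember the file so later passes skip it
    return sent_count


# Demo with an in-memory list standing in for the stream (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)
    (landing / "records_0001.txt").write_bytes(b"record-a\nrecord-b\n")

    sent = []
    seen_files: Set[str] = set()
    topic_name = "/hypothetical/ingest-stream:records"  # made-up name

    first_pass = publish_new_files(
        landing, seen_files, lambda t, m: sent.append((t, m)), topic_name
    )
    second_pass = publish_new_files(
        landing, seen_files, lambda t, m: sent.append((t, m)), topic_name
    )

print(first_pass, second_pass)
```

A production watcher would also need to avoid reading files that are still being uploaded and to persist the set of seen files across restarts; both are left out here to keep the sketch short.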

The key features that allow MapR to operate very, very well in a cloud infrastructure, combined with the key features of convergence, leave you with a system that can do amazing things. You can build massive IoT systems. You can handle massive amounts of data from instruments. You can deal with very, very large user populations. You can write simpler systems. It's just amazing.

Cool, cool stuff. Converged advantages in the cloud. You can build converged applications. You can put them in the cloud. You can put them on-premise, and it's really fun. Thanks very much.

This blog post was published December 06, 2016.