Dataware for data-driven transformation

A New Kind of Data Platform: MapR | Whiteboard Walkthrough


Editor’s Note: In this Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, explains how the MapR Data Platform implements the concepts of dataware by providing an unusual combination: it is directly accessible by conventional programs, big data programs, and AI/machine learning programs, with all of these different workloads running together on the same data platform. You can learn more about the MapR Data Platform here.

Hi, I'm Ted Dunning from MapR Technologies, and I want to talk to you today about a new kind of data platform. It's a very interesting system because it uses new technology designed for the current and future age, integrated into a single code base. As a fully distributed system, it stores and moves data in ways that are accessible through many different APIs to many different workloads. That means these workloads can co-reside in a single cluster, and that allows you to run many, many workloads side by side. So, I'd like to take you today through a new kind of platform. In fact, I'd like to walk you through the MapR Data Platform and how it works, but in particular, how it looks to you as you're using it.

Now, we could look at this in a couple of different ways. We could take it from the point of view of how it looks to you, your applications, and your developers. A second approach would be to dig into the underlying technology. Now, that's fun for me, but I don't think we'll do that today. A third approach would be to look at how the business actually works when it's running on the system. But let's stick with the first approach today and talk about how this system looks to you, and in particular, how it looks as you run various kinds of workloads.

The very first workload you're going to have is the stuff you already have -- legacy and conventional programs. Now, these are important. We've had a half century of UNIX and Linux development teaching us how to build these systems, and they already run on a MapR cluster. That's because you can access data in the MapR cluster, in this new kind of data platform, using traditional POSIX file APIs. The things that work on your laptop will work pretty much exactly the same for legacy programs running on this new kind of platform, and they will work in a totally distributed way. The applications work the same way they always have, but they can now run anywhere in a cluster.
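To make the POSIX point concrete, here is a minimal sketch. The file calls below are the same ones a legacy program would make; on a real MapR cluster the base path would be a cluster mount such as `/mapr/<cluster>/...` (that path is an assumption for illustration), but to keep this runnable anywhere the example uses a temporary local directory instead.

```python
import os
import tempfile

# Hypothetical cluster mount would look like: base = "/mapr/my.cluster/projects"
# For a self-contained demo, use a local temp directory instead.
base = tempfile.mkdtemp()
path = os.path.join(base, "readings.csv")

# Plain POSIX-style file I/O: exactly what an unmodified legacy program does.
with open(path, "w") as f:
    f.write("sensor,value\n")
    f.write("s1,42\n")

with open(path) as f:
    lines = f.read().splitlines()

print(lines[-1])  # -> s1,42
```

The point is that nothing in the program changes: swap the base path for a cluster mount and the same calls read and write distributed storage.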

That leads us to the second, really important kind of workload. That's this new trend, a system called Kubernetes. This is an orchestration system that controls when and how different programs run within a cluster, but it does not provide a data platform. And so, the combination of Kubernetes plus a data platform provides you with a really new and exciting way to run these conventional programs.

But it also gives you a new way to run big data systems, like Spark or Hive. These systems can now run on exactly the same files via different APIs -- big data APIs, which are somewhat idiosyncratic to those tools -- but they work on exactly the same data, where that's appropriate. Now, one of the conventional programs you might have is a fancy visualization program. These can be beautiful, but none of them were really designed with big data in mind. So you normally have the problem of copying data from one place to another in order to get these two to talk.

That also occurs because big data systems are often used to prepare the training data for machine learning and AI systems. Unless those two systems run in the same cluster and can access the same bits by their own traditional methods, you wind up copying the data again from the big data system to the machine learning system. Well, with this new kind of platform, all of the data can live on the same platform and be accessed by conventional programs, by big data programs, by machine learning and AI programs, or even by high performance computing (HPC) programs. All of them can access the same bits, using their own favored application programming interfaces (APIs). They'll get to the same bits; they'll just get there through the front door, the back door, whatever the appropriate door is. Now, that ability for these systems to run together means you could be running very large HPC simulations of finance or structures or aerodynamics, integrating that with conventional visualization, and breaking down data with big data systems, all together. And that's a big deal.
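The "same bits, different doors" idea can be sketched with a small, self-contained analogy using only the Python standard library (the file and field names are made up for illustration; on a real cluster the readers would be a legacy tool, Spark, and an ML framework rather than these stand-ins):

```python
import csv
import io
import os
import tempfile

# Stand-in for shared cluster storage: one file, several access paths.
base = tempfile.mkdtemp()
path = os.path.join(base, "training.csv")

with open(path, "w") as f:
    f.write("feature,label\n0.5,1\n0.9,0\n")

# Door 1: a conventional program reads the raw bytes.
with open(path, "rb") as f:
    raw = f.read()

# Door 2: an analytics-style reader parses the very same bytes as records.
rows = list(csv.DictReader(io.StringIO(raw.decode())))

# Door 3: an ML-style consumer turns those records into numeric training pairs.
examples = [(float(r["feature"]), int(r["label"])) for r in rows]

print(examples)  # -> [(0.5, 1), (0.9, 0)]
```

No copy step appears anywhere: each consumer interprets the one stored copy through its own API.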

Another big trend that's coming is the idea that you can connect microservices running asynchronously together in streaming architectures. Connecting these things with streams makes a big difference to the simplicity of operating such a system, so it's a really cool way to architect these very large systems. It leads to much simpler systems, and it also leads to data that comes to you as quickly as it can. All of these systems can work together. They share the names of the data objects, and they even share the bits themselves. They can all access them.
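The decoupling that streams provide can be illustrated with a tiny standard-library sketch. This is an analogy only: a real deployment would use a persistent stream (for example, a Kafka-compatible topic) rather than an in-process queue, and the service names here are invented.

```python
import queue
import threading

# Stand-in for a persistent stream/topic connecting two microservices.
stream = queue.Queue()
results = []

def producer():
    # One microservice publishes events as soon as they occur.
    for i in range(3):
        stream.put({"event": i})
    stream.put(None)  # sentinel marking end-of-stream for this demo

def consumer():
    # Another microservice consumes asynchronously, at its own pace,
    # without knowing anything about the producer.
    while True:
        msg = stream.get()
        if msg is None:
            break
        results.append(msg["event"] * 10)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

print(results)  # -> [0, 10, 20]
```

Because the two services share only the stream, either one can be restarted, scaled, or replaced without the other noticing, which is the operational simplicity the paragraph above describes.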

So, the result is that you get scale of enormous magnitude. We have systems at the exabyte level, and some that are well above the exabyte level. You get reliability: many customers have years of operation with zero downtime. That's reliability that is practically unheard of, and certainly unheard of at this scale or while supporting these different kinds of workloads. And you get many, many different kinds of workloads, all living together and working together in ways that, again, are really exciting.

This is new technology, developed over the last 10 years and available now to run these systems. It's an exciting development. It allows us to work, if you think about it, from the shortest timescale of business, around a millisecond, to the longest timescale of business, around a gigasecond (roughly 30 years). Traditionally, at each different timescale, roughly every power of ten or so, you had to change technologies entirely. You had to move data to an entirely different system. But here, for the first time, we have a system that really meets enterprise needs, that scales to global extent, and that can span that entire time range from milliseconds to gigaseconds. And that's something new. That's the MapR Data Platform.

For MapR Technologies, this is Ted Dunning, and I've enjoyed talking to you today. Thank you very much.

This blog post was published February 14, 2019.
