Our Cofounder's Perspective on the Data Platform

In this week's Whiteboard Walkthrough, M.C. Srivas, MapR Co-Founder, walks you through the MapR Data Platform, which has been in the making for the last six years and is now finally complete with MapR Streams (now called MapR Event Store for Apache Kafka).

Here's the transcription:

Hi, I'm M.C. Srivas, Co-founder of MapR Technologies. I'm really excited to talk to you about our Data Platform. Now this is something we've been building for the last six years, ever since we started our company in 2009. We've been building this step-by-step, component by component, and now it finally comes together. This is a huge announcement that we're making, and I'm very excited to walk you through the details of this Data Platform.

To understand what a converged data platform is, let's look at where big data is going. If you look at what's happening in the data center, there's a once-in-30-years re-platforming. In the 1980s, we had this monolithic platform with databases running on Unix, and typically the app was written in Visual Basic or PowerBuilder. In the 2000s, Visual Basic and PowerBuilder gave way to the three-tier architecture. This is where you had a web tier, with Enterprise JavaBeans or JBoss or something like that as the middle tier, where the application logic was, and databases held the data. Unix had become Linux. But pretty much what remained was structured data.

Now, since the advent of the smartphone and other smart devices, the back-end of the data centers has completely changed, with lots of unstructured and semi-structured data in addition to structured data. This data is arriving so fast that operational and analytics workloads have to be done on the same image. These big trends favor our vision and value.

Batch has given way to micro-batch and then to real time. Simple apps have become heavily interconnected applications. They have moved from a few sources of data to IoT sensor scale. All of this requires that the back end be more than just 24-by-7 IT grade; it has to become utility grade, like power, water, or other things like that. It's really a schema-free life, and there's no concept of foreign keys.

For example, the other day I was boarding a plane and my cellphone had my boarding pass. If my email didn't work right when I was at the security gate, I wouldn't be able to board the plane. This is what I mean by a schema-free life, utility grade, and real time.

So while the technology has gone from batch to micro-batch streams and operational analytics, when you go into production, the same questions come up again. Does it scale? Is it reliable? Can I trust the data here? What is my latency like? Can I put multiple tenants on this?

What we are really talking about as an aspirational goal is a new system of record, with operational intelligence on the same image of data. Today, the strategy with all the open source projects has been a “do-it-yourself” strategy. That is, you go in with Hadoop and make it reliable on your own. Maybe you want to store some data. It has to be more reliable, so you buy NAS storage. Perhaps you want some kind of NoSQL cluster, say a Mongo or a Cassandra. You integrate Kafka to get streaming, and by the way, you want to do full text search, so go and get Elasticsearch or Solr.

Each of these is a different system with its own latencies, its own way of administration, and its own way of handling those latencies, and so on. It's not pre-integrated, and it's a real nightmare for any reasonable IT shop. So what we're talking about is actually giving you a complete big data platform. These are examples of complete platforms: AWS, .NET, iOS, Android. When something is complete, it is very powerful.

So we're creating the industry's best big data platform. So what do we mean by that? Well, big data is, first, big storage. Big data is also big processing. The big processing is beyond just Hadoop. There's NoSQL, there's SQL, there's search, there's messaging, and so on. Hadoop from MapR has the best storage system in the world, with a global namespace, full multi-data center support, data protection at unlimited scale, and a fully unified security model.

On this storage, we've added all the Apache Hadoop projects. Our Hadoop is the most comprehensive Hadoop you can get from anyone. It includes all the Hadoop projects, Impala, Hive, Drill, Spark—everything. The dirty little secret about open source is that all open source vendors like open source, as long as it's their open source and not somebody else's. So you get things like HBase dislikes Cassandra. Cassandra dislikes Mongo. Mongo dislikes Couch. Couch dislikes Riak, and so on. In the Hadoop world, it's very similar, too. Impala dislikes Tez. Tez dislikes Spark. Spark dislikes somebody else.

We at MapR don't have any axe to grind. We watch out for the customer, and we provide you with the most comprehensive Hadoop platform, no matter what the origin of the open source. Nobody else does this. This is only from MapR. By the way, this Hadoop is exactly the same Hadoop you would get from anybody else. There are no changes here, and any changes MapR has made have been contributed back to the open source.

Beyond just Hadoop, our big data platform has a fully functional POSIX file system, and a fully functional JSON database that can store and retrieve article records of arbitrary size. We've integrated search—both Elasticsearch and Solr—with MapR. Now we are proud to introduce MapR Event Store, a real-time, global messaging, worldwide event streaming system.

To tie all of this together, we built Apache Drill. ANSI SQL is the universal language for data access, and Apache Drill implements ANSI SQL regardless of the data format. That is, Drill can tie together databases in Hive with data in the file system, in JSON, inside search, or inside MapR Event Store. You can do joins across all of these seamlessly.
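As a rough illustration of what such a cross-source query might look like, the sketch below builds a Drill SQL statement that joins a Hive table with a JSON file and wraps it in the JSON payload accepted by Drill's REST query endpoint (`POST /query.json`). The storage plugin names (`hive`, `dfs`), the table, the file path, the column names, and the host and port are assumptions made for this example, not names from the talk.

```python
import json
import urllib.request

# Hypothetical sources: a Hive table and a JSON file on the file system.
# Plugin names, paths, and columns are assumptions for illustration only.
QUERY = """
SELECT c.name, SUM(o.amount) AS total
FROM hive.`orders` AS o
JOIN dfs.`/data/customers.json` AS c
  ON o.customer_id = c.id
GROUP BY c.name
"""


def build_drill_request(sql, host="localhost", port=8047):
    """Build (but do not send) an HTTP request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql.strip()})
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, host="localhost", port=8047):
    """Send the query to a running Drillbit and return the decoded JSON response."""
    with urllib.request.urlopen(build_drill_request(sql, host, port)) as resp:
        return json.load(resp)


# Building the request needs no server; run_query() does.
request = build_drill_request(QUERY)
```

Nothing is sent until `run_query(QUERY)` is called against a reachable Drillbit. The point of the sketch is only that one SQL statement names two different kinds of source side by side.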

What MapR has done is introduce the concept of a global namespace. What that means is you can run clusters worldwide, and these clusters are named using the DNS convention, as you see in this picture. Client programs and clients logically don't belong to any cluster. They belong outside the cluster. They can access data in any cluster seamlessly, exactly the same way as they would access the native cluster.
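To make the naming idea concrete, here is a minimal sketch of how a client might address the same file in two different clusters. It assumes the clusters are exposed to the client under a common `/mapr/<cluster-name>/` mount point; the cluster names and the file path are made up for this example.

```python
from pathlib import PurePosixPath

# Assumed root of the global namespace as seen from a client machine.
GLOBAL_ROOT = PurePosixPath("/mapr")


def global_path(cluster, path):
    """Return the cluster-qualified path for `path` inside `cluster`."""
    # Strip the leading slash so the path is joined under the cluster name
    # instead of replacing the whole prefix.
    return GLOBAL_ROOT / cluster / path.lstrip("/")


# Hypothetical DNS-style cluster names.
local = global_path("nyc.example.com", "/sales/2015/orders.json")
remote = global_path("london.example.com", "/sales/2015/orders.json")

print(local)   # /mapr/nyc.example.com/sales/2015/orders.json
print(remote)  # /mapr/london.example.com/sales/2015/orders.json
```

The only thing that differs between the two paths is the cluster component, which is what lets a client that sits outside every cluster read from any of them in the same way.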

Across clusters, we have implemented all kinds of different failover mechanisms and mirroring mechanisms. So for example, from the yellow cluster to the green cluster, we can mirror data in various forms, whether it's in files, folders, tables, or streams, and this mirroring can be synchronous, asynchronous, continuous, snapshot-based, multi-master, or eventually consistent. We support all of these. Apache Drill really can reach out and perform joins across clusters and across the data types, because we have named everything using a global namespace.

Beyond just providing you with a converged platform that has all of these, we also natively support SAP, Vertica, SAS, email, document databases, high performance computing, and any kind of custom application. So while the red line shows what belongs in the MapR product, these other products are natively integrated with MapR storage. That is, they can directly read and write data into MapR.

The beauty of this is that you don't create silos of your data for each one of these things. That is, data in Hadoop can flow back and forth with data in SAS or Vertica seamlessly. There is no new silo of Hadoop being created either. How does it all fit together?

The MapR Distributed File and Object Store, MapR Event Store, and database are all built on the MapR storage, and we have a real-time data transport here that automatically integrates to remote MapR Database tables and streams. Using that same technology, we have integrated reliably, out of the box, with Elasticsearch, Storm, and Spark Streaming. In addition to that, we have added Myriad. Myriad is a project that marries Mesos and YARN together to provide you true multi-tenancy across Hadoop and non-Hadoop workloads. All of this is available only from MapR.

Now we've been building this platform step-by-step, from the ground up, ever since we started the company. So in 2011, we released our first Hadoop distribution. It was the fastest and most reliable Hadoop on the planet. In 2013, we released MapR Database, which first introduced JSON to Hadoop. In 2014, we announced Apache Drill, which advanced the state of the art by being able to read schema on the fly. This year, we are really happy to announce MapR Event Store, which is global, real-time messaging that can tie together thousands of clusters worldwide. These clusters can disconnect and reconnect seamlessly.

Next year, we are giving you the most trusted big data there is, with unified security for all big data, across all the different kinds of applications and data that you just saw there. That is what our Data Platform is all about. We support all the open source compute engines and tools, and enterprise applications like SAP, Vertica, SAP HANA, and so on.

We provide a very rich set of platform services, with global high availability, global namespace, and unified data protection, where we can roll forward or roll back your data at any time. It’s fully real-time, fully multi-tenant, with fantastic management and monitoring.

We have built the MapR Data Platform step by step over the last six years, and it's fantastic. It scales to exabytes of data with thousands of servers. I urge you to try it out. Download it at mapr.com. You'll be very pleasantly surprised how well it will fit your needs. Thank you.

This blog post was published December 08, 2015.
