The MapR Data Platform Release – Real time, Reliable, Results

On behalf of the entire MapR community, I’m happy to report that MapR 5.1 is now generally available.

For the first time ever, you can deploy a single platform that includes the key pieces of a big data deployment: Hadoop, Spark, Distributed File and Object Store, NoSQL with JSON, and event streaming. Check out the release notes for more detailed information.

Take the product for a test drive by downloading the MapR Data Platform.

A trip down memory lane

In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0 along with the GA version of Apache Drill in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR: the converged data platform release (also known as MapR 5.1).

Frankly, the MapR Data Platform release has been in the making for over six years. The founding vision of MapR was to build a single platform that runs batch, interactive, streaming, and real-time applications simultaneously, at a level of scale and reliability that had never been possible before (see the slide below, from over three years ago). MapR 5.1 fulfills this promise and establishes a new foundation that customers can use to build their innovative business-leading applications.

MapR Data Platform - 2013

This required systematic, disciplined execution as we built the foundational pieces and then implemented industry-leading services on top of it.

  • 2009-2011: Mostly in stealth, the team built a POSIX-compliant file system for Hadoop that provides extreme scalability (trillions of files, thousands of nodes), data protection (point-in-time consistent snapshots), disaster recovery (block-level Mirroring), and logical data and policy management (Volumes).
  • 2012-2013: Having solved foundational problems in a distributed read-write file system, it was time to surgically remove the duct tape around HBase that was built to run a read/write operational database on top of an append-only HDFS. This led to the creation of our integrated NoSQL database, MapR Database, which provides extreme reliability, massive scale, and consistently high performance.
  • 2014-2015: The SQL-on-Hadoop wave started in earnest and led to the incubation of Apache Drill, the industry’s first SQL engine for big data with native support for flexible schemas. This was also the time when MapR engineers started to build out a real-time reliable transport layer that enabled multi-master table replication for MapR Database and a framework that would be used for future innovations.
  • 2015-2016: As more and more customers used the MapR Platform for mission-critical operational and analytics applications, the need to treat JSON as a first-class citizen and to bring real-time event streams into the platform became universally evident. This led to the addition of document database capabilities in MapR Database, and a global publish-subscribe system, MapR Event Store for Apache Kafka.

Today, the picture has changed, but the product promise has stayed the same.

MapR Data Platform

Why did MapR build a Data Platform?

Now you can run real-time operational applications and analytics that can use files, real-time events, and database records together in a secure, multi-tenant environment. There are two important points here. First, by avoiding the delays of moving data across silos, your users can respond to new information immediately and gain a competitive advantage. Second, you can guarantee the isolation of distinct, sensitive data sets while still sharing other data sets to promote information agility in a big data environment.

The choice of computational engines you run on top of the MapR Platform is entirely up to you. MapR distributes all popular Hadoop engines including the good old standby MapReduce, but also Apache Spark, Apache Drill, and many other Hadoop ecosystem projects.

Having a converged data platform is useful for both developers and cluster administrators. For developers, it is a way to build simpler, secure apps at a faster pace. You can now develop applications on a converged sandbox and easily move them into a converged production cluster, which shortens development cycles. The applications themselves are simpler because the platform eliminates much of the application-level “duct tape” that is required to integrate data and processing services across different clusters. Finally, a single model of authentication, authorization, encryption, and auditing means that the data can be secured and governed end-to-end.

The platform enables administrators to provide real-time access and a unified tool kit. Having Hadoop, NoSQL, and event stream data available in real time for analytics in a single platform reduces the complexity of data movement. Having the data that just came in (event streams, NoSQL) alongside historical data in Hadoop allows building applications that can shorten the information-to-action cycle. And finally, reducing cluster sprawl and complex data movement processes lowers administration and hardware costs, and with them the overall total cost of ownership.

MapR 5.1 release highlights

  • A new documentation platform that makes it easier for you to search and navigate to the areas of your interest.
  • General availability of MapR Event Store - Global publish-subscribe event streaming system for big data. In addition to reliably delivering messages to applications within a single data center, MapR Event Store can continuously replicate data between multiple clusters, delivering messages globally. Like other MapR services, MapR Event Store has a distributed, scale-out design, allowing it to scale to billions of messages per second, millions of topics, and millions of producer and consumer applications.
  • General availability of MapR Database document database capabilities, including the OJAI API - Making MapR Database the first and only in-Hadoop wide column and JSON database. More importantly, the advantages of the JSON implementation in MapR Database (with regard to fine-grained security, concurrency, performance, scale, etc.) open up the database for a host of use cases not possible in other NoSQL technologies.
  • General availability of Apache Myriad (incubating) 0.1 - Apache Myriad enables the co-existence of Apache Hadoop and Apache Mesos on the same physical infrastructure. By running Hadoop YARN as a Mesos framework, YARN applications and Mesos frameworks can run side-by-side while dynamically sharing cluster resources.
  • Security enhancements, including:
    • Access Control Expressions (ACEs) for files and streams (in addition to MapR Database) allow use of Boolean expressions when setting permissions on files, directories, and whole volumes. ACEs reduce complexity compared to access control lists (ACLs) in a big data environment, where the setup of granular roles leads to hard-to-read ACLs with a higher chance of administrative error.
    • Additionally, whole-volume ACEs offer administrators a volume-level "filter" over file and directory permissions, guaranteeing that the data in a given volume is only accessible to specific individuals or groups of individuals. Whole-volume ACEs are especially critical in multi-tenant environments.
  • Selective auditing
    • Auditing every action in a big data cluster will eventually lead to the audit data being bigger than the data being audited. In order to alleviate this issue, MapR provides facilities like per-volume/directory audit capabilities, and a “coalesce interval” feature. With 5.1, MapR now supports selective auditing of certain file system and table operations, allowing users to include or exclude those operations explicitly from the cluster’s audit logs.
  • SSD optimization for MapR Database - For MapR Database workloads on high-end servers, MapR has made several enhancements to increase its already industry-leading performance. This feature is automatically enabled with a fresh install on any servers with SSD devices configured for MapR storage.
  • The optimized MapR POSIX Client runs as a userspace process to connect to one or more MapR clusters and allows app servers, web servers, and applications to read and write data directly and securely to the MapR clusters (like a Linux filesystem) at unprecedented speeds. An emerging use case for this is to provide seamless persistent storage to applications in Docker containers that are frequently redeployed across your cluster.
  • The MapR 5.1 release includes the latest versions of interoperable Hadoop ecosystem projects, including Apache Spark 1.6 (Development Preview), Apache Spark 1.5.2 (GA), Apache Drill 1.4, and others.
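
To make the Access Control Expression idea from the security highlights more concrete, here is a minimal Python sketch of how Boolean expressions over users and groups can express permissions, and how a whole-volume expression can act as a "filter" over file-level permissions. This is an illustration only: the helper names (`user`, `group`, `any_of`, `all_of`, `negate`, `can_access`) and the sample users and groups are hypothetical, and the code does not use MapR's actual ACE syntax or API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Principal:
    """A hypothetical identity: one user name plus the groups it belongs to."""
    user: str
    groups: FrozenSet[str] = frozenset()


# An expression is any function that maps a principal to allow (True) or deny (False).
Expr = Callable[[Principal], bool]


def user(name: str) -> Expr:
    return lambda p: p.user == name


def group(name: str) -> Expr:
    return lambda p: name in p.groups


def any_of(*exprs: Expr) -> Expr:
    # Boolean OR over sub-expressions.
    return lambda p: any(e(p) for e in exprs)


def all_of(*exprs: Expr) -> Expr:
    # Boolean AND over sub-expressions.
    return lambda p: all(e(p) for e in exprs)


def negate(expr: Expr) -> Expr:
    # Boolean NOT of a sub-expression.
    return lambda p: not expr(p)


def can_access(p: Principal, volume_expr: Expr, file_expr: Expr) -> bool:
    # The volume-level expression filters first; the file-level
    # expression only matters for principals that pass the volume check.
    return volume_expr(p) and file_expr(p)


# Hypothetical policy: the volume belongs to tenant_a; the file is readable by
# analysts or by alice, but never by mallory.
volume_read = group("tenant_a")
file_read = all_of(any_of(group("analysts"), user("alice")), negate(user("mallory")))

alice = Principal("alice", frozenset({"tenant_a"}))
bob = Principal("bob", frozenset({"analysts", "tenant_b"}))
mallory = Principal("mallory", frozenset({"analysts", "tenant_a"}))

alice_ok = can_access(alice, volume_read, file_read)      # passes both checks
bob_ok = can_access(bob, volume_read, file_read)          # blocked by the volume filter
mallory_ok = can_access(mallory, volume_read, file_read)  # blocked by the file-level NOT
```

A single expression such as "analysts or alice, but not mallory" would otherwise need several separate list entries, which is the kind of complexity the ACE approach is meant to reduce.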

For the complete list, please check out the MapR 5.1 Release Notes.

5.1 Resources

Don’t miss out on all of the 5.1 release and product information coming your way in the new MapR Community. The MapR Data Platform Release 5.1 Resources document has been created as a single directory of all available resources and it will grow over time. Log into the community and then follow it in your community Inbox to get notified when new information is added.

At this point, you might be thinking, “That’s a lot, and may be way more than I need.” Keep in mind that the MapR Data Platform actually makes things easier for you, even from the start. You want to get as much value from your data as possible, and the MapR Platform can help get you there. By choosing a limited alternative solution, you end up restricting the business advantage you can obtain. You also don’t need to deploy an extensive, all-encompassing big data solution that covers numerous use cases in round one. Rather, start with your initial use cases, identify the key requirements, and you’ll see how the MapR Platform addresses those needs. And please let us know if you have any questions; we’re here to help.

This blog post was published March 08, 2016.
