It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.
YARN's promise is to enable multiple execution frameworks to run on top of Hadoop, thereby expanding the Hadoop use cases beyond batch into interactive, real-time and others. At its core, YARN is a resource allocation framework that allows execution frameworks, both classical MapReduce and newer ones such as interactive SQL-on-Hadoop and streaming, to ask for and receive CPU and memory resources on the cluster for a period of time. YARN’s power is in making the resource allocation of a Hadoop cluster a more streamlined and centralized decision, thereby allowing for more efficient cluster use and, more importantly, opening it up for emerging use cases.
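To make the idea of centralized allocation concrete, here is a minimal toy model in Python. It is not the YARN API: the `Request` and `Cluster` classes, the framework names, and the capacity figures are all hypothetical, chosen only to show frameworks asking one central allocator for CPU and memory and handing them back when done.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """A framework's request for resources (all names here are illustrative)."""
    framework: str
    vcores: int
    memory_mb: int


@dataclass
class Cluster:
    """Toy central allocator tracking free CPU and memory for the whole cluster."""
    free_vcores: int
    free_memory_mb: int
    granted: list = field(default_factory=list)

    def allocate(self, req: Request) -> bool:
        """Grant the request only if both CPU and memory are available."""
        if req.vcores <= self.free_vcores and req.memory_mb <= self.free_memory_mb:
            self.free_vcores -= req.vcores
            self.free_memory_mb -= req.memory_mb
            self.granted.append(req)
            return True
        return False

    def release(self, req: Request) -> None:
        """Return resources to the pool when the framework is done with them."""
        self.granted.remove(req)
        self.free_vcores += req.vcores
        self.free_memory_mb += req.memory_mb


# Hypothetical cluster capacity and three competing frameworks.
cluster = Cluster(free_vcores=16, free_memory_mb=65536)
batch = Request("mapreduce", vcores=8, memory_mb=32768)
sql = Request("interactive-sql", vcores=6, memory_mb=24576)
stream = Request("streaming", vcores=4, memory_mb=8192)

# The streaming request is refused at first: too few vcores remain.
results = [cluster.allocate(r) for r in (batch, sql, stream)]

# Once the batch job releases its resources, the streaming request fits.
cluster.release(batch)
retry = cluster.allocate(stream)
```

Because every framework goes through the same allocator, capacity freed by one workload becomes immediately available to another, which is the efficiency argument made above.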
MapR and YARN: Customer First
As always, MapR starts with a customer-first philosophy. We reached out to our customers to gain insights into how they wanted to leverage YARN, and a few things became apparent very quickly.
First, customers have heard of YARN and are excited about its promise. However, many were not ready to roll it out into production, mostly because of concerns about YARN's production readiness, including ResourceManager HA. Note that many of MapR's customers run mission-critical production workloads and expect the cluster to be as resilient as their RDBMSs.
Second, those customers who are ready to experiment want to start with a small staged environment to understand the implications of running YARN. They want the ability to discover “gotchas” that are typically not found in marketing launch materials, slideware, and documentation. For example, if logs are written to the local file system and only flushed to HDFS when a process terminates, how would this work for streaming frameworks that are supposed to run forever?
Third, many of MapR's customers run tens of thousands of MapReduce v1 jobs continuously and want to completely characterize MapReduce v2 performance and impact before migrating jobs at their own convenience and on their own schedule.
As a result of these discussions, we came up with the principal design consideration for YARN in MapR: _Do not regress on the high availability, maintenance and performance characteristics of MapR, and put the customers in charge of the schedule to migrate to YARN._
MapR Design Goals
These discussions translated to three main design goals:
1. Offer the ability to run the MapReduce v1 framework alongside YARN and MapReduce v2 so that customers can migrate their jobs based on their schedules, not ours.
2. Ensure that the superior high availability that MapR customers are now accustomed to is not compromised as a side-effect of migrating to YARN.
3. Provide the ability to run non-YARN applications alongside YARN applications to provide true enterprise-grade manageability and maintainability of the cluster.
Before we look at the benefits, it’s a good idea to take a step back and make a few observations that often get lost in YARN discussions:
Benefits of Using YARN on MapR
· MapR provides a pluggable services framework that allows for co-operative co-existence of frameworks using YARN, as well as those not using YARN.
· MapReduce v1 and YARN-based MapReduce v2 can co-exist in the same cluster and even the same node, thereby providing full flexibility and degrees of freedom to the customers.
· High availability, performance, and multi-tenancy capabilities such as label-based scheduling will be available with YARN on MapR in order to ensure that there is no regression in key cluster capabilities as a side-effect of migrating to YARN.
The Most Important Benefit of All
Running non-MapReduce frameworks on Hadoop requires more than a shared resource allocator. To support the broadest variety of use cases, these frameworks also need a suitable data platform underneath them. This data platform must support random reads and writes with a standard POSIX interface; simultaneous reads and writes (so Storm can feed directly from the underlying distributed file system); a NoSQL store with an HBase API but without the HBase limitations (such as RegionServer issues, compactions, latency spikes, downtime); and business continuity, including data protection (true point-in-time consistent snapshots) and disaster recovery (mirroring).
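As a rough illustration of what "random reads and writes with a standard POSIX interface" means to an application, the Python sketch below updates one record in the middle of a file in place and then reads records out of order, using only ordinary file calls. The file name and record layout are made up for the example, and a temporary local directory stands in for what would, by assumption, be a POSIX mount of the cluster file system.

```python
import os
import tempfile

# Assumption: in a real deployment `path` would sit under a POSIX mount of the
# cluster file system; a temporary local directory stands in for it here.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "records.dat")

RECORD_SIZE = 8  # every record is exactly 8 bytes

# Write four fixed-size records sequentially: rec00000 .. rec00003.
with open(path, "wb") as f:
    for i in range(4):
        f.write(f"rec{i:05d}".encode("ascii"))

# Random write: overwrite record 2 in place, without rewriting the file.
with open(path, "r+b") as f:
    f.seek(2 * RECORD_SIZE)
    f.write(b"UPDATED!")

# Random read: jump straight to record 2, then back to record 0.
with open(path, "rb") as f:
    f.seek(2 * RECORD_SIZE)
    second = f.read(RECORD_SIZE)
    f.seek(0)
    first = f.read(RECORD_SIZE)

size = os.path.getsize(path)
```

The in-place overwrite is the step that a write-once, append-only file system cannot offer; there, changing one record means rewriting the file.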
Only a platform capable of providing a resource allocation framework that accommodates applications for both today and tomorrow, and a dependable and widely accessible unified system of record data store, can fulfill the promise of the big data platform of choice.
We are excited about the possibilities of Hadoop 2.x and YARN, and look forward to working together with our current and future customers to bring this promise to bear.