In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN) and describe the MapReduce job execution framework. We take a detailed look at how jobs are executed and managed in YARN, and at how YARN differs from MapReduce v1.
Note: The material from this blog post is from our free on-demand training course, Developing Hadoop Applications.
How Hadoop executes MapReduce (MapReduce v1) Jobs
To begin, a user runs a MapReduce program on the client node, which instantiates a JobClient object.
Next, the JobClient submits the job to the JobTracker.
Then the JobTracker creates a set of map and reduce tasks, which are sent to the appropriate TaskTrackers.
Each TaskTracker launches a child process, which in turn runs the map or reduce task.
Finally, the task continuously updates the TaskTracker with status and counters, and writes its output to its context.
When task trackers send heartbeats to the job tracker, they include other information such as task status, task counters, and (since task trackers are usually configured as file servers) data read/write status.
Heartbeats tell the JobTracker which TaskTrackers are alive. When the JobTracker stops receiving heartbeats from a TaskTracker:
The JobTracker reschedules the tasks that were running on the failed TaskTracker onto other TaskTrackers.
The JobTracker marks the TaskTracker as down and won't schedule subsequent tasks there.
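The failure-handling loop above can be sketched in plain Java. This is a hypothetical illustration, not Hadoop's actual implementation; the class and field names are invented, and only the default 10-minute expiry interval is taken from stock MRv1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the JobTracker's heartbeat bookkeeping: a TaskTracker that
// stays silent for longer than the expiry interval is marked dead, and
// its tasks are queued for rescheduling on other TaskTrackers.
class HeartbeatMonitor {
    // MRv1's default expiry (mapred.tasktracker.expiry.interval): 10 minutes.
    static final long EXPIRY_MS = 10 * 60 * 1000;

    final Map<String, Long> lastHeartbeat = new HashMap<>();
    final Map<String, List<String>> tasksOnTracker = new HashMap<>();
    final List<String> rescheduleQueue = new ArrayList<>();

    void heartbeat(String tracker, long now) {
        lastHeartbeat.put(tracker, now);
    }

    void assign(String tracker, String task) {
        tasksOnTracker.computeIfAbsent(tracker, t -> new ArrayList<>()).add(task);
    }

    // Called periodically; returns the trackers declared dead this round.
    List<String> expireDeadTrackers(long now) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > EXPIRY_MS) {
                dead.add(e.getKey());
            }
        }
        for (String tracker : dead) {
            lastHeartbeat.remove(tracker); // no longer eligible for new tasks
            List<String> orphaned = tasksOnTracker.remove(tracker);
            if (orphaned != null) rescheduleQueue.addAll(orphaned); // re-run elsewhere
        }
        return dead;
    }
}
```

The real JobTracker does more than this liveness check (for example, it also blacklists trackers that repeatedly fail tasks), but the shape of the logic is the same.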
Working with Schedulers
Two schedulers are available in Hadoop: the Fair scheduler and the Capacity scheduler.
The Fair scheduler is the default. Resources are shared evenly across pools, and each user gets their own pool by default. You can configure custom pools with guaranteed minimum access to prevent starvation. This scheduler supports preemption.
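As a sketch, custom pools and their guaranteed minimums are declared in the Fair Scheduler's allocation file. The pool name below is illustrative; the element names are from the Hadoop 1.x Fair Scheduler, so verify them against your distribution's documentation.

```xml
<?xml version="1.0"?>
<allocations>
  <!-- A pool with a guaranteed minimum number of map and reduce slots. -->
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <weight>2.0</weight> <!-- receives twice the share of a weight-1.0 pool -->
  </pool>
  <!-- Pools not listed here fall back to the per-user defaults. -->
</allocations>
```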
Hadoop also ships with the Capacity scheduler; here resources are shared across queues. You may configure hierarchical queues to reflect organizations and their weighted access to resources, and you can set soft and hard capacity limits for users within a queue. Queues have ACLs to prevent unauthorized users from accessing them. This scheduler supports resource-based scheduling and job priorities.
In MRv1, the effective scheduler is defined in $HADOOP_HOME/conf/mapred-site.xml. In MRv2, the scheduler is defined in $HADOOP_HOME/etc/hadoop/mapred-site.xml.
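In MRv1, for example, the scheduler class is selected with the mapred.jobtracker.taskScheduler property. The class names below are from stock Apache Hadoop 1.x; verify them against your distribution.

```xml
<!-- mapred-site.xml: choose the scheduler the JobTracker loads (MRv1). -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```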
The goal of the Fair scheduler is to give all users equitable access to cluster resources. Each pool has a configurable guaranteed capacity in slots, and within a pool, resources are divided evenly among that pool's jobs. Jobs are placed in flat (non-hierarchical) pools; the default is one pool per user.
The Fair scheduler algorithm works by dividing each pool's minimum map and reduce slots among that pool's jobs. When a slot frees up, the algorithm allocates it to the job furthest below its minimum share or, failing that, to the most starved job. Jobs from users who are over-using the cluster can be preempted.
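That allocation rule can be sketched in plain Java. The names below are hypothetical and the logic is greatly simplified; the real scheduler also accounts for pool weights, job priorities, and data locality.

```java
import java.util.List;

// Sketch of the Fair Scheduler's allocation rule: a free slot goes first
// to the pool furthest below its guaranteed minimum share; once every
// pool has its minimum, it goes to the pool running the fewest tasks.
class FairShareSketch {
    static class Pool {
        final String name;
        final int minShare; // guaranteed slots for this pool
        int running;        // slots this pool currently holds
        Pool(String name, int minShare, int running) {
            this.name = name;
            this.minShare = minShare;
            this.running = running;
        }
    }

    // Decide which pool receives the next free slot.
    static Pool nextPool(List<Pool> pools) {
        Pool best = null;
        // First pass: the most-starved pool below its guaranteed minimum.
        for (Pool p : pools) {
            if (p.running < p.minShare
                    && (best == null
                        || p.minShare - p.running > best.minShare - best.running)) {
                best = p;
            }
        }
        if (best != null) return best;
        // Second pass: all minimums are met; pick the fewest-running pool.
        for (Pool p : pools) {
            if (best == null || p.running < best.running) best = p;
        }
        return best;
    }
}
```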
You can examine the configuration and status of the Hadoop Fair scheduler in the MapR Control System (MCS), or by using the web UI running on the JobTracker.
The Fair Scheduler was developed at Facebook.
The goal of the capacity scheduler is to give all queues access to cluster resources. Shares are assigned to queues as percentages of total cluster resources.
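For instance, percentage shares might be declared as follows. The queue names are illustrative and the property names are from the stock Hadoop 1.x Capacity scheduler; note that the queues themselves must also be listed in mapred.queue.names in mapred-site.xml.

```xml
<!-- capacity-scheduler.xml: two queues splitting the cluster 70/30. -->
<property>
  <name>mapred.capacity-scheduler.queue.production.capacity</name>
  <value>70</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.research.capacity</name>
  <value>30</value>
</property>
```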
This scheduler was developed at Yahoo.
Moving to YARN (MapReduce v2)
The motivation for the development of YARN (Yet Another Resource Negotiator, a.k.a. MapReduce v2) was to address several limitations of MapReduce v1.
YARN generalizes resource management for use by new engines and frameworks, allowing resources to be allocated and reallocated for different concurrent applications sharing a cluster. Existing MapReduce applications can run on YARN without any changes. At the same time, because MapReduce is now merely another application on YARN, MapReduce is free to evolve independently of the resource management infrastructure.
The MapReduce Job Lifecycle in YARN
The Non-MapReduce Job Lifecycle in MRv2
How a Job Is Launched In MRv2
MapR Direct Shuffle Workflow in YARN
High-level Differences of YARN Job Management
As mentioned earlier, there are no JobTracker or TaskTracker web UIs in YARN; job management is handled through the ResourceManager and Job History Server web UIs instead.
The above image shows the Job History Server web UI. This screen shot shows the summary of a launched job in MapReduce v2. If you were looking at a live screen, you could drill down to view further job details.
The screen shot above shows the summary of a launched job in MRv2. Here again, you would be able to dig into the job to get more details.
The partial screen shot above depicts the counters for a launched MapReduce job.
The partial screen shot above depicts the configuration parameters that were effective for a particular MapReduce job.
In the next post, we will detail how to write a simple MapReduce program.