The following high availability (HA) features are available for the JobTracker:
- Restart. When a JobTracker service fails, the Warden service on that node attempts to restart the service.
- Failover. If the JobTracker service fails and cannot be restarted by the Warden service on that node, the Zookeeper and Warden services on each node work together to start a new JobTracker service.
- Recovery. After failover occurs, the new JobTracker resumes the tasks that were previously running.
- Discoverability. After failover occurs, Zookeeper directs JobTracker clients to the new JobTracker node.
By default, Warden attempts to restart a failed JobTracker service three times. If the JobTracker still does not start successfully, JobTracker failover occurs. The restart frequency is configured in warden.conf and applies to all services that Warden manages.
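The restart-then-failover decision described above can be sketched as follows. This is only an illustration of the logic, not Warden's actual code: `restart_or_fail_over` and the `start_service` callback are hypothetical names, and the default of three attempts mirrors the default stated in the text.

```python
from typing import Callable


def restart_or_fail_over(start_service: Callable[[], bool], max_attempts: int = 3) -> str:
    """Try to start a service up to max_attempts times, then give up locally.

    start_service is a hypothetical callback that returns True when the
    service came up successfully.
    """
    for attempt in range(1, max_attempts + 1):
        if start_service():
            return f"restarted on attempt {attempt}"
    # All local attempts failed, so the decision passes to failover.
    return "failover"


# Hypothetical service that only comes up on the second try.
outcomes = iter([False, True])
print(restart_or_fail_over(lambda: next(outcomes)))

# Hypothetical service that never comes up on this node.
print(restart_or_fail_over(lambda: False))
```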
If the JobTracker cannot be restarted on its current node, for example because the node itself has failed, the Warden on another JobTracker node starts a new instance of the JobTracker. The Warden on every JobTracker node watches the JobTracker’s znode for changes. When the active JobTracker’s znode is deleted, the Warden daemons on the other JobTracker nodes attempt to launch the JobTracker. The Warden service on each JobTracker node works with Zookeeper to ensure that only one JobTracker is running in the cluster.
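The watch-and-launch behavior can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the real Warden or Zookeeper client API: `FakeZookeeper`, `Warden`, the znode path, and all method names are hypothetical, and the only property modeled is that a single creator of the znode wins.

```python
from typing import Callable, Dict, List


class FakeZookeeper:
    """In-memory stand-in for the coordination service (not a real client)."""

    def __init__(self) -> None:
        self.znodes: Dict[str, str] = {}
        self.watchers: List[Callable[[str], None]] = []

    def create_if_absent(self, path: str, owner: str) -> bool:
        # Only the first caller creates the znode; later callers lose.
        if path in self.znodes:
            return False
        self.znodes[path] = owner
        return True

    def delete(self, path: str) -> None:
        if self.znodes.pop(path, None) is not None:
            # Tell every watcher that the znode disappeared.
            for callback in list(self.watchers):
                callback(path)


class Warden:
    JT_ZNODE = "/hypothetical/jobtracker/active"  # assumed path, for illustration

    def __init__(self, node: str, zk: FakeZookeeper) -> None:
        self.node = node
        self.zk = zk
        self.alive = True
        self.running_jobtracker = False
        zk.watchers.append(self.on_znode_deleted)

    def try_launch(self) -> None:
        # A Warden runs the JobTracker only if it wins the znode.
        if self.alive:
            self.running_jobtracker = self.zk.create_if_absent(self.JT_ZNODE, self.node)

    def on_znode_deleted(self, path: str) -> None:
        if path == self.JT_ZNODE:
            self.try_launch()

    def fail(self) -> None:
        # Simulate losing the node: its JobTracker stops and its znode goes away.
        self.alive = False
        was_active = self.running_jobtracker
        self.running_jobtracker = False
        if was_active:
            self.zk.delete(self.JT_ZNODE)


zk = FakeZookeeper()
wardens = [Warden(name, zk) for name in ("node-a", "node-b", "node-c")]
for warden in wardens:
    warden.try_launch()          # node-a wins the initial race

wardens[0].fail()                # the active JobTracker node goes down
print([w.node for w in wardens if w.running_jobtracker])
```

In this simulation exactly one Warden holds the znode at any time, which is the invariant the paragraph above describes.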
For failover to occur, at least two nodes in the cluster must include the JobTracker role. No further configuration is required.
When JobTracker failover occurs, the new JobTracker takes over from where the first JobTracker left off. Job and task activity persist in the JobTracker volume, so the new JobTracker can resume activity immediately upon launching. The TaskTrackers maintain information about the state of each task, so when they connect to the new JobTracker they can continue without interruption.
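The idea of resuming from persisted state can be sketched as below. This is a toy model only: the JSON file layout, the `save_job_state` and `load_job_states` helpers, and the job fields are hypothetical and do not reflect the actual on-disk format of the JobTracker volume.

```python
import json
import tempfile
from pathlib import Path


def save_job_state(recovery_dir: Path, job_id: str, state: dict) -> None:
    """Persist one job's state so that another process can pick it up later."""
    (recovery_dir / f"{job_id}.json").write_text(json.dumps(state))


def load_job_states(recovery_dir: Path) -> dict:
    """Read back every persisted job state, keyed by job ID."""
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(recovery_dir.glob("*.json"))
    }


with tempfile.TemporaryDirectory() as tmp:
    recovery_dir = Path(tmp)

    # The first (hypothetical) JobTracker records progress before it fails.
    save_job_state(recovery_dir, "job_0001",
                   {"completed_tasks": [0, 1], "pending_tasks": [2, 3]})

    # The replacement JobTracker reads the same directory and resumes
    # with the tasks that were still pending.
    recovered = load_job_states(recovery_dir)
    print(recovered["job_0001"]["pending_tasks"])
```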
By default, JobTracker recovery is enabled and requires no further configuration as long as more than one node can run the JobTracker service. However, you can configure the following recovery properties in the Hadoop 1.x mapred-site.xml file:
Default value: /var/mapr/cluster/mapred/jobTracker/recovery
| Property | Description | Default value |
|---|---|---|
| mapreduce.jobtracker.recovery.maxtime | Maximum time, in seconds, that the JobTracker should stay in recovery mode. | 120 |
| mapred.jobtracker.restart.recover | "true" to enable job recovery upon restart; "false" to start afresh. | true |
| mapreduce.jobtracker.recovery.job.initialization.maxtime | Maximum time, in seconds, that the JobTracker waits to initialize jobs before starting recovery. This property's default value is equal to the value of the mapreduce.jobtracker.recovery.maxtime property. | 480 |
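As an illustration of how these properties would appear in mapred-site.xml, the following sketch builds the standard Hadoop `<configuration>`/`<property>` structure with Python's standard library. The property names and values are the ones listed above; the `build_configuration` helper is hypothetical, and the script only prints the XML rather than writing to a real configuration file.

```python
import xml.etree.ElementTree as ET

# Property names and default values as listed in the section above.
recovery_properties = {
    "mapreduce.jobtracker.recovery.maxtime": "120",
    "mapred.jobtracker.restart.recover": "true",
    "mapreduce.jobtracker.recovery.job.initialization.maxtime": "480",
}


def build_configuration(properties: dict) -> ET.Element:
    """Build a Hadoop-style <configuration> element from name/value pairs."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return configuration


configuration = build_configuration(recovery_properties)
ET.indent(configuration)  # pretty-printing helper, available in Python 3.9+
print(ET.tostring(configuration, encoding="unicode"))
```

To change a value, edit the corresponding entry in `recovery_properties` and copy the resulting `<property>` element into the existing mapred-site.xml.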