MapR 5.0 Documentation : Recovery for the ResourceManager

After a restart or failover, the active ResourceManager recovers the ResourceManager state based on the checkpoints provided in the ResourceManager state store. During recovery, the ResourceManager resumes applications and tasks that were running prior to the failover but were not completed.

Two implementations of the ResourceManager state store are available:

  • FileSystemRMStateStore.  Enables implicit write access to a single ResourceManager node. MapR-FS provides fencing implicitly and its state store implementation provides better scalability and failover performance than the ZKRMStateStore. The state store is also naturally protected by MapR-FS replication. By default, FileSystemRMStateStore is the state store implementation for the ResourceManager and the ResourceManager state store is maintained in the following MapR-FS volume: /var/mapr/cluster/yarn/rm/system.
  • ZKRMStateStore. Enables implicit write access to a single ResourceManager node. This is usually recommended for HA implementations where YARN is running on HDFS. However, FileSystemRMStateStore is recommended in a MapR cluster.   
For recovery to occur, all ResourceManager nodes must have access to the ResourceManager state store.

ResourceManager Recovery Administration

To change the default behavior, update the ResourceManager configuration in the yarn-site.xml files and restart the ResourceManager(s). The yarn-site.xml is located in the following directory: /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/

You may want to perform the following tasks:

Disabling the restart of applications after failover

You can configure the ResourceManager to not recover its state after a restart or failover occurs.

  •  Set the value of yarn.resourcemanager.recovery.enabled to false in yarn-site.xml on each ResourceManager node. 

Configuring Maximum Attempts for Applications

When an ApplicationMaster fails, the ResourceManager restarts the ApplicationMaster as long as the number of restart attempts does not exceed the max-attempt values set at the ResourceManager and ApplicationMaster level . By default, the maximum attempt value is set to 2.

To configure the maximum number of ApplicationMaster attempt retries for all applications run by the ResourceManager:

    • Set the value of in yarn-site.xml. The value defaults to 2. 

To configure the number of ApplicationMaster attempts allowed for the MapReduce ApplicationMaster:

    •  Set the value of in mapred-site.xml. The value defaults to 2. 

For more information about these properties, see ResourceManager Configuration Properties

Configuring the MapR-FS State Store 

By default, the Resource Manager stores its state in the MapR-FS. However, you can change the values for the following properties related to the MapR-FS state store:

To configure the URI to the state store location:

    •  Set the value of yarn.resourcemanager.fs.state-store.uri in yarn-site.xml. The value defaults to the ResourceManager volume (/var/mapr/cluster/yarn/rm/system).

To configure the retry policy used by the state store client to connect with MapR-FS:

    • Set the value of yarn.resourcemanager.fs.state-store.retry-policy-spec in yarn-site.xml. The value defaults to (2000,500).

To configure the number of completed applicatons retained by the state store:

    • Set the value of yarn.resourcemanager.state-store.max-completed-applications in yarn-site.xml. The value defaults to 10000.

For more information about these properties, see ResourceManager Configuration Properties

Enabling ZooKeeper Based State Store

By default, the Resource Manager stores its state in the MapR-FS. However, you can use the Zookeeper based state store instead.

To configure the ResourceManager to use the Zookeeper state store: 

    • Set the value of to org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore in the yarn-site.xml.
    • Set the value of yarn.resourcemanager.zk-address to a comma-separated list of host:port pairs for each ZooKeeper server used by the ResourceManager. This property needs to be set in yarn-site.xml.

For more information about these properties and additional properties that you might want to configure, see ResourceManager Configuration Properties