The CLDB service automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If the CLDB process dies, it is automatically restarted on the node. All jobs and processes wait for the CLDB to return, and resume from where they left off, with no data or job loss.

If the node itself fails, the CLDB data is still safe, and the cluster can continue normally as soon as the CLDB is started on another node. In an M5-licensed cluster, a failed CLDB node automatically fails over to another CLDB node without user intervention and without data loss. On an M3 cluster, recovery from a failed CLDB node is also possible, but it is a manual procedure.

Recovering from a Failed CLDB Node on an M3 Cluster

To recover from a failed CLDB node, perform the steps listed below:

  1. Restore ZooKeeper - if necessary, install ZooKeeper on an additional node.
  2. Locate the CLDB data - locate the nodes where replicas of the CLDB data are stored, and choose one to serve as the new CLDB node.
  3. Stop the selected node - stop the node you have chosen, to prepare for installing the CLDB service.
  4. Install the CLDB on the selected node - install the CLDB service on the new CLDB node.
  5. Configure the selected node - run configure.sh to inform the CLDB node where the CLDB and ZooKeeper services are running.
  6. Start the selected node - start the new CLDB node.
  7. Restart all nodes - stop each node in the cluster, run configure.sh on it, and start it.

After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the config save command to set the cldb.replication.manager.start.mins parameter.
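As an illustration of changing this delay, the following minimal Python sketch builds the command line for maprcli config save without running it. The helper name replication_delay_command is hypothetical, and passing the parameter as a JSON object to the -values option is an assumption based on common maprcli usage; verify the syntax against the documentation for your MapR version before running the command.

```python
import json


def replication_delay_command(minutes):
    """Build (but do not run) the argv that sets the replication start delay.

    Assumption: `maprcli config save` takes a JSON object via `-values`.
    """
    if minutes < 0:
        raise ValueError("delay must be zero or a positive number of minutes")
    # Compact separators produce {"key":"value"} with no embedded spaces.
    values = json.dumps(
        {"cldb.replication.manager.start.mins": str(minutes)},
        separators=(",", ":"),
    )
    return ["maprcli", "config", "save", "-values", values]


# Show the command that would set a 30-minute delay.
print(" ".join(replication_delay_command(30)))
```

The returned list can be passed to subprocess.run on a cluster node if the syntax is confirmed for your installation.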

Restore ZooKeeper

If the CLDB node that failed was also running ZooKeeper, install ZooKeeper on another node to maintain the minimum required number of ZooKeeper nodes.

Locate the CLDB Data

After restoring the ZooKeeper service on the M3 cluster, use the maprcli dump zkinfo command to identify the latest epoch of the CLDB, identify the nodes where replicas of the CLDB data are stored, and select one of those nodes to serve as the new CLDB node.

Perform the following steps on any cluster node:

  1. Log in as root or use sudo for the following commands.
  2. Issue the maprcli dump zkinfo command using the -json flag.

    # maprcli dump zkinfo -json

    The output displays the ZooKeeper znodes.

  3. In the /datacenter/controlnodes/cldb/epoch/1 directory, locate the CLDB with the latest epoch.

    {
        "/datacenter/controlnodes/cldb/epoch/1/KvStoreContainerInfo":" Container ID:1
        VolumeId:1 Master:10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Servers:
        10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Inactive Servers: Unused Servers:
        Latest epoch:13"
    }
    

    The Latest epoch field identifies the current epoch of the CLDB data. In this example, the latest epoch is 13.

  4. Select a node from among those that hold a copy at the latest epoch. For example, an entry such as 10.250.2.41:5660--13-VALID indicates that the node at 10.250.2.41 has a copy at epoch 13 (the latest epoch).
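The selection in steps 3 and 4 can be scripted. The following is a minimal Python sketch, not part of the MapR tooling. It assumes the KvStoreContainerInfo value follows the layout shown in the example output (address entries ending in --<epoch>-<STATE>, and a Servers: list followed by Inactive Servers:). The function name candidates_at_latest_epoch and the SAMPLE string are illustrative only.

```python
import re

# One server entry: one or more "ip:port-" groups, then "-<epoch>-<STATE>".
ENTRY = re.compile(r"((?:\d{1,3}(?:\.\d{1,3}){3}:\d+-)+)-(\d+)-([A-Z]+)")

# The KvStoreContainerInfo value from the example output, joined on one line.
SAMPLE = (
    "Container ID:1 VolumeId:1 "
    "Master:10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID "
    "Servers: 10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID "
    "Inactive Servers: Unused Servers: Latest epoch:13"
)


def candidates_at_latest_epoch(info):
    """Return (latest_epoch, servers), where servers lists the addresses of
    each active server holding a VALID copy at the latest epoch."""
    epoch_match = re.search(r"Latest epoch:\s*(\d+)", info)
    # The leftmost "Servers:" is the active list; it ends at "Inactive Servers:".
    servers_match = re.search(r"\bServers:(.*?)Inactive Servers:", info, re.S)
    if epoch_match is None or servers_match is None:
        raise ValueError("unexpected KvStoreContainerInfo layout")
    latest = int(epoch_match.group(1))
    candidates = []
    for addrs, epoch, state in ENTRY.findall(servers_match.group(1)):
        if int(epoch) == latest and state == "VALID":
            # Each server may list several interface addresses.
            candidates.append(addrs.rstrip("-").split("-"))
    return latest, candidates


latest, servers = candidates_at_latest_epoch(SAMPLE)
print("latest epoch:", latest)
for addresses in servers:
    print("candidate:", ", ".join(addresses))
```

Each candidate is a list because a single node can report several interface addresses. Any one of them identifies the node.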

You can now install a new CLDB on the selected node.

Stop the Selected Node

On the node you have selected for installation of the CLDB, stop the node by following the Stopping a Node procedure.

Install the CLDB on the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

  1. Log in as root or use sudo for the following commands.
  2. Install the CLDB service on the node:
    • RHEL/CentOS: yum install mapr-cldb
    • Ubuntu: apt-get install mapr-cldb
  3. Wait until the failover delay expires. If you try to start the CLDB before the failover delay expires, the following message appears:

    CLDB HA check failed: not licensed, failover denied: elapsed time since last failure=<time in minutes> minutes

Configure the Selected Node

On the node you have selected for installation of the CLDB, run configure.sh to specify where the CLDB and ZooKeeper services are running, following the Configuring a Node procedure.

Start the Selected Node

On the node you have selected for installation of the CLDB, start the node by following the Starting a Node procedure.

Restart All Nodes

On each node in the cluster, perform the following procedures in order:

  1. Stop the node, following the Stopping a Node procedure.
  2. Configure the node with the new CLDB and ZooKeeper addresses by running configure.sh, following the Configuring a Node procedure.
  3. Start the node, following the Starting a Node procedure.