Steps to Deploy a MapR Cluster: Part 2 of 2

Deploying a MapR Cluster

In part 1 of this series, we talked about how to make sure all the nodes in your cluster are ready for a MapR deployment. Deploying a MapR cluster consists mainly of installing the right services on the right nodes. Let’s take an in-depth look at these services.

What are Services?

A MapR cluster is a full Hadoop distribution. Hadoop itself consists of a storage layer and a MapReduce layer. In addition, MapR provides cluster management tools, data access via NFS, and a few behind-the-scenes services that keep everything running. Some applications and Hadoop components, such as HBase, are implemented as services; others, such as Pig, are simply applications that you run as needed. We will lump them together here, but the distinction is worth making.

  • MapReduce services: JobTracker, TaskTracker
  • Storage services: CLDB, FileServer, HBase RegionServer, NFS
  • Management services: HBase Master, Webserver, ZooKeeper

A daemon called the warden runs on every node to make sure that the proper services are running (and to allocate resources for them). The only service that the warden doesn’t control is the ZooKeeper. Part of the ZooKeeper’s job is to have knowledge of the whole cluster; in the event that a service fails on one node, it is the ZooKeeper that tells the warden to start the service on another node.

Service Layout

Before installing services, it’s important to determine which services to run and where to run them. The decision depends largely on the size of your cluster, the anticipated load, and the kinds of jobs you plan to run. Here are a few tips to help you understand service layout.

  • General Tips
    If you are using MapR Metrics, avoid running the MySQL server that supports the MapR Metrics service on a CLDB node. Consider running the MySQL server on a machine external to the cluster to prevent the MySQL server’s resource needs from affecting services on the cluster. For the same reason, avoid running the webserver on CLDB nodes. Queries to the MapR Metrics service can impose a bandwidth load that reduces CLDB performance.
    In general, the FileServer should run on all nodes.

  • Large Clusters
    In a large cluster, the JobTracker, CLDB, and ZooKeeper nodes are likely to get a lot of requests. With the ZooKeeper especially, latency is harmful. The ZooKeeper is not particularly resource-hungry, but if it gets slowed down waiting for some other process to relinquish CPU cycles, it can introduce problems to the cluster. For this reason, you should avoid putting the ZooKeeper on the same node as the CLDB or JobTracker in a large cluster. You should even consider isolating the ZooKeeper completely; that is, on ZooKeeper nodes, it’s best not to run any other services at all. On clusters over 250 nodes, isolate the JobTracker service on a dedicated node.

    If possible, you should run three or five ZooKeepers on different racks. ZooKeeper maintains the integrity of its information through majority consensus: if more than half the ZooKeeper nodes agree on something, it is considered true. If half or fewer of the ZooKeeper nodes are present or able to agree, then ZooKeeper stops the cluster until the problem is resolved. This means that if you run three ZooKeeper nodes, you can tolerate one ZooKeeper failure (two is more than half of three). If you run five ZooKeeper nodes, you can tolerate two ZooKeeper failures (three is more than half of five). An even number of ZooKeeper nodes adds no fault tolerance: if you run four, for example, you can still tolerate only one ZooKeeper failure, since a second failure would leave two nodes, and two is not more than half of four.
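
    The majority arithmetic above is easy to check for any ensemble size. Here is a minimal Python sketch; the function names are ours, chosen for illustration, and are not part of ZooKeeper or MapR.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that is strictly more than half the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can fail while a majority is still present."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare odd and even ensemble sizes side by side.
    for n in range(1, 8):
        print(f"{n} ZooKeeper nodes: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

    Running it shows that each even size tolerates exactly as many failures as the odd size just below it, which is why three or five is the usual choice.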

  • Small Clusters
    On very small clusters of just a few nodes, it’s impractical to isolate services on dedicated nodes. One layout approach is to run one CLDB and one ZooKeeper on the same node, leaving the other nodes free to run the TaskTracker. All nodes should run the FileServer. If you need high availability (HA) in a small cluster, you will end up running the CLDB and ZooKeeper on additional nodes.

    See Planning the Cluster for more information.
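
    One way to sanity-check a planned layout against the tips above is to write it down as data and test it. The Python sketch below is only an illustration: the node names, the three-node layout, and the check_layout function are hypothetical, and the rules encode our reading of the tips rather than any MapR tool.

```python
from __future__ import annotations

# Services the tips single out as heavily loaded in large clusters.
HEAVY_SERVICES = {"CLDB", "JobTracker"}


def check_layout(layout: dict[str, set[str]], large_cluster: bool = False) -> list[str]:
    """Return warnings for a layout given as {node name: set of service names}."""
    warnings = []

    # Majority consensus works best with an odd number of ZooKeeper nodes.
    zk_count = sum(1 for services in layout.values() if "ZooKeeper" in services)
    if zk_count % 2 == 0:
        warnings.append(f"{zk_count} ZooKeeper nodes: use an odd number")

    for node, services in sorted(layout.items()):
        # Assumption: a node dedicated to ZooKeeper is exempt from the FileServer tip.
        if "FileServer" not in services and services != {"ZooKeeper"}:
            warnings.append(f"{node}: FileServer is missing")
        if "CLDB" in services and "Webserver" in services:
            warnings.append(f"{node}: Webserver shares a node with the CLDB")
        if large_cluster and "ZooKeeper" in services and services & HEAVY_SERVICES:
            warnings.append(f"{node}: ZooKeeper shares a node with CLDB or JobTracker")
    return warnings


# Hypothetical three-node cluster: CLDB and ZooKeeper together, as in the small-cluster tip.
SMALL_LAYOUT = {
    "node1": {"CLDB", "ZooKeeper", "FileServer"},
    "node2": {"JobTracker", "TaskTracker", "FileServer", "Webserver"},
    "node3": {"TaskTracker", "FileServer"},
}

if __name__ == "__main__":
    for warning in check_layout(SMALL_LAYOUT) or ["no warnings"]:
        print(warning)
```

    The same layout passes as a small cluster but is flagged when checked with large_cluster=True, because node1 then runs the ZooKeeper alongside the CLDB.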

  • Installation Tips
    Once you’ve planned your layout, you are ready to install services on your nodes. The procedure for installing services is covered in detail in the Installation Guide, in the section titled Installing MapR Software. Here are a few tips to keep in mind as you go through the procedure step by step:

    If you are installing on more than a few nodes, consider downloading the MapR packages and setting up a local repository on-site, so that you don’t have to download the packages over the Internet repeatedly. For more information, see Using a Local Repository.

    As you install the services and the warden starts them, keep in mind that the chain of events leading to a running cluster takes some time. The ZooKeeper nodes have to start and get in communication with each other; the CLDB has to start and settle; the Webserver has to get going. This means that it can take a few minutes before you are able to get to the MapR Control System or run any maprcli commands to verify that the installation has succeeded. As services start, you can use the jps command to see them come up. Here is an example showing the Warden, ZooKeeper (QuorumPeerMain), and other services:

          [root@nmk-rh5-1 ~]# jps
          3704 WardenMain
          3524 QuorumPeerMain
          5240 CLDB
          6547 JobTracker
          6883 CommandServer
          1654 Jps

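
    If you are watching many nodes, it can help to check jps output programmatically instead of by eye. The following Python sketch parses output in the format shown above and reports which expected services have not come up yet. The function names and the list of expected services are our own assumptions for illustration; adjust the list to match what you installed on each node.

```python
from __future__ import annotations


def parse_jps(output: str) -> dict[str, int]:
    """Map each Java main-class name to its PID, given raw jps output."""
    processes = {}
    for line in output.splitlines():
        parts = line.split()
        # A regular jps line is "<pid> <name>"; skip prompts, blanks, and anything else.
        if len(parts) == 2 and parts[0].isdigit():
            processes[parts[1]] = int(parts[0])
    return processes


def missing_services(output: str, expected: set[str]) -> list[str]:
    """Return the expected service names that do not appear in the jps output."""
    running = parse_jps(output)
    return sorted(name for name in expected if name not in running)


# The sample output from this article, including the shell prompt line.
SAMPLE = """\
[root@nmk-rh5-1 ~]# jps
3704 WardenMain
3524 QuorumPeerMain
5240 CLDB
6547 JobTracker
6883 CommandServer
1654 Jps
"""

if __name__ == "__main__":
    expected = {"WardenMain", "QuorumPeerMain", "CLDB", "TaskTracker"}
    print("Not running yet:", missing_services(SAMPLE, expected))
```

    To check a live node, you could pass in the text captured from running jps there (for example with subprocess.run(["jps"], capture_output=True, text=True).stdout) in place of SAMPLE.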
  • After Installation
    After the cluster is installed, monitor the server load on the nodes in your cluster that are running high-demand services such as JobTracker, ZooKeeper, or CLDB. If you are running the TaskTracker service on nodes that are also running a high-demand service, you can reduce the number of task slots provided by the TaskTracker service. Tune the number of task slots according to the acceptable load levels for nodes in your cluster.

    Once your cluster is installed and tuned, you can install the MapR Client on users’ computers to let them run Hadoop jobs without direct ssh access to the cluster itself. The MapR Client is a distribution of the Hadoop shell, compatible with Windows, Mac, and Linux workstations. For more information, see Setting Up the Client.

    Armed with the above knowledge, you are now ready to proceed with installing services and deploying a MapR cluster. In a future installment, we’ll discuss a few things you should know about configuring the cluster once it’s running.