Cluster Hardware

This section describes important hardware architecture considerations for your cluster.

When planning the hardware architecture for the cluster, make sure all hardware meets the node requirements listed in Preparing Each Node.

The architecture of the cluster hardware is an important consideration when planning a deployment. Factors to weigh include the anticipated data storage and network bandwidth needs, including intermediate data generated when jobs and applications run. The type of workload also matters: consider whether the planned cluster usage will be CPU-intensive, I/O-intensive, or memory-intensive. Think about how data will be moved into and out of the cluster, and how much data is likely to be transmitted over the network.

Planning a cluster often involves tuning key ratios, such as:
  • Disk I/O speed to CPU processing power
  • Storage capacity to network speed
  • Number of nodes to network speed

Typically, the CPU is less of a bottleneck than network bandwidth and disk I/O. To the extent possible, network and disk transfer rates should be balanced to meet the anticipated data rates using multiple NICs per node. It is not necessary to bond or trunk the NICs together; MapR is able to take advantage of multiple NICs transparently. Each node should provide raw disks and partitions to MapR, with no RAID or logical volume manager, as MapR takes care of formatting and data protection.

The following example architecture provides specifications for a standard, general-purpose MapR Hadoop compute/storage node. This configuration is highly scalable in a typical data center environment. MapR can make effective use of more drives per node than standard Hadoop, so each node should present enough face plate area to allow a large number of drives.

Standard Compute/Storage Node

  • 2U Rack Server/Chassis
  • Dual CPU socket system board
  • 2x8 core CPU (16 physical cores, 32 logical cores with HT enabled)
  • 8x8GB DIMMs, 64GB RAM (DIMM count must be multiple of CPU memory channels)
  • 12x2TB SATA drives
  • 10GbE network interface
  • OS using entire single drive, not shared as data drive
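
As a rough illustration, the totals implied by the list above can be derived from the per-component figures. The sketch below is not a MapR tool; the `NodeSpec` class and its field names are hypothetical, and it assumes that all 12 drives are data drives, with the OS on a separate drive.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical container for the example node figures listed above."""
    sockets: int = 2
    cores_per_socket: int = 8
    threads_per_core: int = 2   # 2 when Hyper-Threading is enabled
    dimms: int = 8
    dimm_gb: int = 8
    data_drives: int = 12       # assumption: the OS drive is not one of these
    drive_tb: int = 2

    @property
    def logical_cores(self) -> int:
        return self.sockets * self.cores_per_socket * self.threads_per_core

    @property
    def ram_gb(self) -> int:
        return self.dimms * self.dimm_gb

    @property
    def raw_storage_tb(self) -> int:
        return self.data_drives * self.drive_tb


node = NodeSpec()
print("logical cores:", node.logical_cores)                  # 32
print("RAM (GB):", node.ram_gb)                              # 64
print("raw storage (TB):", node.raw_storage_tb)              # 24
print("RAM per logical core (GB):", node.ram_gb / node.logical_cores)  # 2.0
```

Changing a single field (for example `data_drives`) shows how the ratios between storage, memory, and cores shift for a candidate node design.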

Minimum Cluster Size

All MapR clusters, except for MapR Edge, must have a minimum of 5 data nodes. A data node is defined as a node that runs a FileServer process and is responsible for storing data on behalf of the entire cluster. Deploying additional nodes with control-only services such as CLDB and ZooKeeper is recommended, but these nodes do not count toward the minimum node count because they do not contribute to overall availability of data.

Note: Dedicated control nodes are not needed on clusters with fewer than 10 data nodes.
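
The counting rule can be sketched as a small check over a planned layout. This is only an illustration: the dictionary format, the service name strings, and the function names are hypothetical and do not correspond to any MapR API. MapR Edge is simply skipped, because its own minimum is not stated here.

```python
MIN_DATA_NODES = 5  # minimum stated above for non-Edge clusters


def count_data_nodes(layout):
    """Count nodes that run a FileServer; control-only nodes are ignored."""
    return sum(1 for services in layout.values() if "fileserver" in services)


def meets_minimum(layout, is_edge=False):
    """Return True if the layout satisfies the minimum data node count."""
    if is_edge:
        return True  # assumption: Edge is exempt; its own rules are not checked
    return count_data_nodes(layout) >= MIN_DATA_NODES


# Hypothetical plan: four data nodes plus two control-only nodes.
plan = {
    "node1": {"fileserver"},
    "node2": {"fileserver"},
    "node3": {"fileserver"},
    "node4": {"fileserver"},
    "ctl1": {"cldb", "zookeeper"},
    "ctl2": {"cldb", "zookeeper"},
}

print(count_data_nodes(plan))   # 4: control-only nodes do not count
print(meets_minimum(plan))      # False

plan["node5"] = {"fileserver"}
print(meets_minimum(plan))      # True
```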

Best Practices

Hardware recommendations and cluster configuration vary by use case. For example, is the application a MapR-DB application? Is the application latency-sensitive?

The following recommendations apply in most cases:
Disk Drives
  • Drives should be JBOD, using single-drive RAID0 volumes to take advantage of the controller cache.
  • SAS drives can provide better I/O latency, and SSDs lower latency still. (However, SSDs may not be cost-effective.)
  • Match aggregate drive throughput to network throughput. As a rule of thumb, one 10GbE link roughly matches the throughput of 10-12 drives.
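
The drive-to-network rule of thumb can be checked with simple arithmetic. The sketch below assumes a sustained per-drive throughput of 100 to 125 MB/s, which is an illustrative figure rather than a value from this guide, and it ignores protocol overhead. The function name is hypothetical.

```python
def drives_to_match_nic(nic_gbit_s, drive_mb_s):
    """Number of drives whose combined throughput equals the link rate.

    nic_gbit_s: nominal link rate in gigabits per second (10 for 10GbE)
    drive_mb_s: assumed sustained throughput of one drive in MB/s
    """
    nic_mb_s = nic_gbit_s * 1000 / 8  # gigabits per second -> megabytes per second
    return nic_mb_s / drive_mb_s


print(drives_to_match_nic(10, 125))  # 10.0
print(drives_to_match_nic(10, 100))  # 12.5
```

Under these assumptions a single 10GbE link is matched by roughly 10 to 12 drives. Faster drives or additional NICs move the balance point accordingly.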
Cluster Size
  • In general, it is better to have more nodes. For example, a 5-node cluster is very small and not very resilient to failure.
  • For smaller clusters, all nodes are likely to fit on a single non-blocking switch. Larger clusters require a well-designed Spine/Leaf fabric that can scale.
Operating System and Server Configuration
  • CentOS or RHEL 6.x is recommended. CentOS 7 is supported, but you may encounter a few issues.
  • Install the minimal server configuration. Use a product like Cobbler to PXE boot and install a consistent OS image.
  • Install the full JDK (1.7 or 1.8).
  • For best performance, avoid deploying a MapR cluster on virtual machines. However, VMs are supported for use as clients or edge nodes.
Memory, CPUs, Number of Cores
  • Make sure the DIMM count is an exact multiple of the number of memory channels the selected CPU provides.
  • Use CPUs with as many cores as you can. Having more cores is more important than having a slightly higher clock speed.
  • MapR-DB benefits from lots of RAM: 256GB per node or more.
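
The DIMM count rule above can be expressed as a one-line divisibility check. This is a minimal sketch: the function name is hypothetical, the figure of 4 memory channels per CPU is an assumed example rather than a property of any particular processor, and multiplying by the socket count is an assumption about how the rule extends to multi-socket boards.

```python
def dimm_count_is_balanced(dimm_count, channels_per_cpu, sockets=1):
    """True if the DIMM count is an exact multiple of the total memory channels."""
    total_channels = channels_per_cpu * sockets
    return dimm_count > 0 and dimm_count % total_channels == 0


# Dual-socket board, assuming 4 memory channels per CPU (8 channels in total).
print(dimm_count_is_balanced(8, channels_per_cpu=4, sockets=2))    # True
print(dimm_count_is_balanced(12, channels_per_cpu=4, sockets=2))   # False
print(dimm_count_is_balanced(16, channels_per_cpu=4, sockets=2))   # True
```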