The architecture of the cluster hardware is an important consideration when planning a deployment. Among the considerations are anticipated data storage and network bandwidth needs, including intermediate data generated during MapReduce job execution. The type of workload is important: consider whether the planned cluster usage will be CPU-intensive, I/O-intensive, or memory-intensive. Think about how data will be loaded into and out of the cluster, and how much data is likely to be transmitted over the network.
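
As a back-of-the-envelope illustration of the last point, the short Python sketch below estimates how long a bulk data load takes for a given dataset size and aggregate network bandwidth. The dataset size, link count, link speed, and utilization factor are illustrative assumptions, not recommendations:

    # Estimate bulk-load time over the network (illustrative assumptions only).
    def load_time_hours(dataset_tb, links, link_gbps, utilization=0.7):
        """Hours to move dataset_tb of data over `links` network links of
        `link_gbps` Gb/s each, at the given average utilization."""
        dataset_bits = dataset_tb * 1e12 * 8                  # TB -> bits
        usable_bps = links * link_gbps * 1e9 * utilization    # effective bandwidth
        return dataset_bits / usable_bps / 3600

    # Example: 10 TB loaded through 2 x 10 Gb/s uplinks at ~70% utilization.
    print(f"{load_time_hours(10, links=2, link_gbps=10):.1f} hours")   # ~1.6 hours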

Typically, the CPU is less of a bottleneck than network bandwidth and disk I/O. To the extent possible, balance network and disk transfer rates against the anticipated data rates, using multiple NICs per node if necessary. It is not necessary to bond or trunk the NICs together; MapR takes advantage of multiple NICs transparently. Each node should provide raw disks and partitions to MapR, with no RAID or logical volume manager, as MapR handles formatting and data protection itself.
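
To get a feel for what "balanced" means in practice, the sketch below compares a node's aggregate sequential disk throughput with its aggregate NIC throughput. The per-drive rate and the drive and NIC counts are assumptions modeled on the example node described below, not measured figures:

    # Compare a node's aggregate disk throughput with its aggregate NIC throughput
    # (illustrative assumptions; real rates depend on the hardware and workload).
    DRIVE_MB_S = 100           # assumed sequential rate of one 7200-RPM SATA drive
    NIC_MB_S = 1000 / 8        # a 1 Gb/s NIC carries roughly 125 MB/s

    def node_throughput(drives, nics):
        disk = drives * DRIVE_MB_S
        net = nics * NIC_MB_S
        return disk, net

    for nics in (2, 4):
        disk, net = node_throughput(drives=12, nics=nics)
        print(f"12 drives ~{disk} MB/s vs {nics} NICs ~{net:.0f} MB/s")

Under these assumptions, twelve drives can stream far more data than even four 1 Gb/s interfaces can carry, which is one reason the example node specification below calls for multiple NICs.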

Example Architecture

The following example architecture provides specifications for a general-purpose standard compute/storage node, along with two sample rack configurations built from those standard nodes. MapR can make effective use of more drives per node than standard Hadoop, so each node should provide enough faceplate area to accommodate a large number of drives. The standard node specification allows for either two or four 1 Gb/s Ethernet network interfaces.

Standard Compute/Storage Node

  • 2U Chassis
  • Single motherboard, dual socket
  • 2 x 4-core CPUs + 32 GB RAM, or 2 x 6-core CPUs + 48 GB RAM
  • 12 x 2 TB 7200-RPM drives
  • 2 or 4 network interfaces
    (on-board NIC plus additional NICs)
  • OS on a single partition of one drive (remainder of the drive used for storage)

Standard 50 TB Rack Configuration

  • 10 standard compute/storage nodes
    (10 x 12 x 2 TB storage; 3x replication, 25% margin)
  • 24-port 1 Gb/s rack-top switch with 2 x 10 Gb/s uplinks
  • Add a second switch if each node uses four network interfaces

Standard 100 TB Rack Configuration

  • 20 standard compute/storage nodes
    (20 x 12 x 2 TB storage; 3x replication, 25% margin; see the capacity sketch after this list)
  • 48-port 1 Gb/s rack-top switch with 4 x 10 Gb/s uplinks
  • Add a second switch if each node uses four network interfaces
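
As a rough check on the nominal rack capacities above, the sketch below derives usable capacity from raw storage, the replication factor, and the reserved margin. Treating the 25% margin as space held back after replication is an assumption about how the figures were derived; the point is simply that both configurations comfortably cover their nominal ratings:

    # Usable-capacity estimate for the example rack configurations
    # (assumes the 25% margin is reserved out of post-replication capacity).
    def usable_tb(nodes, drives_per_node=12, drive_tb=2, replication=3, margin=0.25):
        raw_tb = nodes * drives_per_node * drive_tb
        return raw_tb / replication * (1 - margin)

    for nodes, nominal_tb in ((10, 50), (20, 100)):
        print(f"{nodes} nodes: about {usable_tb(nodes):.0f} TB usable "
              f"(nominal rating {nominal_tb} TB)")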

To grow the cluster, simply add more nodes and racks, along with additional service instances as needed. MapR rebalances the cluster automatically.