TeraSort is a popular benchmark that measures the time taken to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure the MapReduce performance of an Apache™ Hadoop® cluster. The following report compares the performance of a YARN-scheduled TeraSort job on MapR and other distributions.
The MapR Distribution including Apache Hadoop continues to be the fastest Hadoop distribution on the market. As seen in the figure, MapR sorted 1 TB of data on a 21-node cluster in 494 seconds, while the other distribution (Cloudera CDH was chosen for comparison purposes) took 822 seconds under the same conditions. That makes MapR roughly 1.66 times faster, or about 40% less time for the same job. Please refer to the Appendix for test environment details.
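The comparison above can be reduced to a few derived figures. The following is a minimal sketch that uses only the two run times quoted in this report; the function names are illustrative, and treating 1 TB as 10^12 bytes is an assumption made for the throughput estimate.

```python
# Run times quoted in this report: seconds to sort 1 TB on the 21-node cluster.
MAPR_SECONDS = 494
OTHER_SECONDS = 822

# Assumption: 1 TB is taken as 10**12 bytes (decimal terabyte).
DATA_BYTES = 10**12


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    return baseline_s / candidate_s


def time_saved_fraction(baseline_s: float, candidate_s: float) -> float:
    """Fraction of the baseline run time that the candidate run avoids."""
    return (baseline_s - candidate_s) / baseline_s


def throughput_gb_per_s(data_bytes: int, seconds: float) -> float:
    """Aggregate sort throughput for the whole cluster, in GB per second."""
    return data_bytes / seconds / 1e9


if __name__ == "__main__":
    print(f"speedup:       {speedup(OTHER_SECONDS, MAPR_SECONDS):.2f}x")
    print(f"time saved:    {time_saved_fraction(OTHER_SECONDS, MAPR_SECONDS):.1%}")
    print(f"MapR rate:     {throughput_gb_per_s(DATA_BYTES, MAPR_SECONDS):.2f} GB/s")
    print(f"other rate:    {throughput_gb_per_s(DATA_BYTES, OTHER_SECONDS):.2f} GB/s")
```

These are cluster-wide figures; they say nothing about per-node behavior, which depends on the hardware listed in the Appendix.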
MapR shows a significant performance advantage over other distributions for two primary reasons:
MapR Data Platform Advantage
MapR has set world records for MapReduce performance because of numerous differentiated features for performance including:
All of these features continue to provide performance benefits and lower infrastructure footprints when applied to MapReduce v2 jobs scheduled using YARN.
Taking Disk I/O into Account for YARN Scheduling
To calculate the system resources required for a job, the YARN scheduler today takes the memory and CPU characteristics of the nodes into account. For instance, for a MapReduce job, the optimum number of map and reduce slots is calculated from the CPU and memory available across the nodes.
MapR allows the YARN scheduler to also take disk I/O characteristics into account when calculating system resources. This ensures that disk bottlenecks are correctly identified during resource allocation, which makes YARN jobs perform much better.
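The effect of adding disk as a scheduling dimension can be illustrated with a small model. This is only a sketch under stated assumptions, not the MapR or YARN implementation: the `NodeResources` and `TaskDemand` types, the `max_tasks` function, the 128 GB of memory, and the per-task demands are all hypothetical. Only the 32 cores and 11 disks come from the Appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    vcores: int
    memory_mb: int
    disks: int


@dataclass(frozen=True)
class TaskDemand:
    vcores: int
    memory_mb: int
    disks: float  # hypothetical: share of one disk's bandwidth a task is assumed to use


def max_tasks(node: NodeResources, task: TaskDemand, use_disk: bool) -> tuple[int, str]:
    """Return (concurrent tasks the node can host, name of the limiting resource)."""
    limits = {
        "cpu": node.vcores // task.vcores,
        "memory": node.memory_mb // task.memory_mb,
    }
    if use_disk:
        # Only a disk-aware scheduler sees this third limit.
        limits["disk"] = int(node.disks / task.disks)
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


if __name__ == "__main__":
    # 2x16 cores and 11 disks are from the Appendix; 128 GB of memory is assumed.
    node = NodeResources(vcores=32, memory_mb=131072, disks=11)
    # Hypothetical I/O-heavy task: 1 vcore, 2 GB of memory, half a disk.
    task = TaskDemand(vcores=1, memory_mb=2048, disks=0.5)

    print(max_tasks(node, task, use_disk=False))  # CPU and memory only
    print(max_tasks(node, task, use_disk=True))   # disk included
```

With these assumed numbers, a scheduler that sees only CPU and memory would place 32 tasks on the node, while one that also accounts for disk would stop at 22 because disk is the tighter constraint. Oversubscribing the disks in the first case is the kind of bottleneck the paragraph above describes.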
MapR provides the best Hadoop performance for a variety of workloads, proven by MapReduce v1, MapReduce v2 (YARN), and YCSB benchmarks. Along with high reliability and the random read-write NFS capability, the MapR performance advantage continues to be one of many key benefits for end users. MapR clusters have proven to be the most cost-efficient Hadoop deployments by requiring a much smaller hardware footprint compared to other distributions.
MapR World-Record Setting Benchmark
MapR holds the TeraSort world record, sorting 1 TB in 54 seconds on 1003 virtual nodes on the Google Cloud platform. Details of the MapR world-record setting benchmark can be found in the MapR blogs.
Appendix: Test Environment
Number of Nodes: 20 + 1 node for NameNode/CLDB + YARN Resource Manager
Disks: 11 Disks—110 GB
CPU: 2x16 cores
Network: 10 GbE
CDH Version: CDH 5.1.0 YARN
MapR Version: MapR 4.0.1 YARN