TeraSort Benchmark Comparison for YARN

TeraSort is a popular benchmark that measures the time needed to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure the MapReduce performance of an Apache™ Hadoop® cluster. The following report compares the performance of a YARN-scheduled TeraSort job on MapR and on another distribution.

Test Results

The MapR Distribution including Apache Hadoop continues to be the fastest Hadoop distribution on the market. As seen in the figure, MapR sorted 1 TB of data on a 21-node cluster in 494 seconds. The distribution chosen for comparison (Cloudera CDH), run under the same conditions, took 822 seconds, which makes MapR roughly 1.66 times faster. Please refer to the Appendix for test environment details.
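The arithmetic behind these two timings can be checked with a short sketch. It assumes that 1 TB means 10^12 bytes (the text does not say whether decimal or binary units were used), and the function names are illustrative only.

```python
# Timings reported in the text; the data size assumes decimal units (10^12 bytes).
DATA_BYTES = 10**12
MAPR_SECONDS = 494
OTHER_SECONDS = 822


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate finished than the baseline."""
    return baseline_s / candidate_s


def time_saved_fraction(baseline_s: float, candidate_s: float) -> float:
    """Fraction of the baseline run time that the candidate saved."""
    return 1 - candidate_s / baseline_s


def throughput_mb_per_s(data_bytes: int, seconds: float) -> float:
    """Aggregate sort throughput for the whole cluster, in decimal MB/s."""
    return data_bytes / seconds / 1e6


if __name__ == "__main__":
    print(f"speedup:          {speedup(OTHER_SECONDS, MAPR_SECONDS):.2f}x")
    print(f"time saved:       {time_saved_fraction(OTHER_SECONDS, MAPR_SECONDS):.1%}")
    print(f"MapR throughput:  {throughput_mb_per_s(DATA_BYTES, MAPR_SECONDS):.0f} MB/s")
    print(f"other throughput: {throughput_mb_per_s(DATA_BYTES, OTHER_SECONDS):.0f} MB/s")
```

Under that assumption the cluster-wide sort rate works out to about 2.0 GB/s for MapR and about 1.2 GB/s for the other distribution.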

MapR shows a significant performance advantage over other distributions for two primary reasons:

MapR Data Platform Advantage

MapR has set world records for MapReduce performance thanks to several performance-oriented features, including:

  • Distributed metadata to eliminate bottlenecks
  • C++ implementation in key components
  • Fast, direct disk I/O (vs. layered I/O on top of the Linux file system)
  • Optimized MapReduce shuffle algorithm

All of these features continue to provide performance benefits and lower infrastructure footprints when applied to MapReduce v2 jobs scheduled using YARN.

Taking Disk I/O into Account for YARN Scheduling

To calculate the system resources required for a job, the YARN scheduler today takes the memory and CPU characteristics of the nodes into account. For a MapReduce job, for instance, the optimum number of map and reduce slots is calculated from the CPU and memory available across the nodes.

MapR allows the YARN scheduler to also take disk I/O characteristics into account when calculating system resources. This ensures that disk bottlenecks are correctly identified during resource allocation, which makes YARN jobs perform much better.
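The effect of adding disk as a third resource can be illustrated with a minimal sketch. This is not MapR's scheduler code: the class names, the `containers_per_node` function, and the per-task disk share of 0.5 are hypothetical values chosen for illustration, and the node figures are taken from the test environment on the simplifying assumption that all of a node's memory and cores are offered to YARN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    memory_mb: int
    vcores: int
    disks: float


@dataclass(frozen=True)
class TaskRequest:
    memory_mb: int
    vcores: int
    disks: float  # hypothetical: share of one disk's I/O a single task needs


def containers_per_node(node: NodeResources, task: TaskRequest, use_disk: bool):
    """Return (container count, limiting resource) for one node.

    Each resource allows a certain number of concurrent tasks; the
    smallest of those counts is the bottleneck.
    """
    limits = {
        "memory": node.memory_mb // task.memory_mb,
        "cpu": node.vcores // task.vcores,
    }
    if use_disk:
        limits["disk"] = int(node.disks // task.disks)
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


if __name__ == "__main__":
    # 128 GB RAM, 2x16 cores, 11 disks (from the test environment).
    node = NodeResources(memory_mb=128 * 1024, vcores=32, disks=11)
    # 1024 MB per map task (from the test parameters); 1 vcore and
    # 0.5 disk per task are assumptions made for this example.
    map_task = TaskRequest(memory_mb=1024, vcores=1, disks=0.5)

    print(containers_per_node(node, map_task, use_disk=False))
    print(containers_per_node(node, map_task, use_disk=True))
```

With these assumed numbers, a memory-and-CPU-only calculation allows 32 concurrent map tasks per node, limited by CPU, while the disk-aware calculation caps the node at 22 tasks and names disk as the limiting resource.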


MapR provides the best Hadoop performance for a variety of workloads, as shown by MapReduce v1, MapReduce v2 (YARN), and YCSB benchmarks. Along with high reliability and the random read-write NFS capability, the MapR performance advantage continues to be one of many key benefits for end users. MapR clusters have proven to be the most cost-efficient Hadoop deployments because they require a much smaller hardware footprint than other distributions.

MapR World-Record Setting Benchmark

MapR holds the TeraSort world record, sorting 1 TB in 54 seconds on 1003 virtual nodes on the Google Cloud platform. Details of the MapR world-record setting benchmark can be found in the MapR blogs.

Test Environment Details

Number of Nodes: 20 worker nodes + 1 node for NameNode/CLDB and YARN Resource Manager

RAM: 128GB

Disks: 11 Disks—110 GB

CPU: 2x16 cores

Network: 10 GbE

CDH Version: CDH 5.1.0 YARN

MapR Version: MapR 4.0.1 YARN

Test parameters*                                      Value
mapreduce.reduce.memory.mb                            3072
mapreduce.map.memory.mb                               1024
mapred.maxthreads.generate.mapoutput                  2
mapreduce.tasktracker.reserved.physicalmemory.mb.low  0.95
mapred.maxthreads.partition.closer                    2
mapreduce.map.sort.spill.percent                      0.99
mapreduce.reduce.merge.inmem.threshold                0
mapreduce.job.reduce.slowstart.completedmaps          1
mapreduce.reduce.shuffle.parallelcopies               40
mapreduce.map.speculative                             false
mapreduce.reduce.speculative                          false
mapreduce.map.output.compress                         false
mapreduce.job.reduces                                 160
mapreduce.task.io.sort.mb                             480
mapreduce.task.io.sort.factor                         400
mfs.heapsize                                          35
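
One way such parameters could be passed to a run is as `-D` options on the `hadoop jar` command line of the stock TeraSort example. The sketch below only builds and prints the command lines. The jar name and HDFS paths are placeholders, not values from the test, and `mfs.heapsize` is left out on the assumption that it configures the MapR file server rather than the job. The report does not state how the parameters were actually applied.

```python
import shlex

# Job-level parameters from the table above.
TERASORT_PARAMS = {
    "mapreduce.reduce.memory.mb": "3072",
    "mapreduce.map.memory.mb": "1024",
    "mapred.maxthreads.generate.mapoutput": "2",
    "mapreduce.tasktracker.reserved.physicalmemory.mb.low": "0.95",
    "mapred.maxthreads.partition.closer": "2",
    "mapreduce.map.sort.spill.percent": "0.99",
    "mapreduce.reduce.merge.inmem.threshold": "0",
    "mapreduce.job.reduce.slowstart.completedmaps": "1",
    "mapreduce.reduce.shuffle.parallelcopies": "40",
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
    "mapreduce.map.output.compress": "false",
    "mapreduce.job.reduces": "160",
    "mapreduce.task.io.sort.mb": "480",
    "mapreduce.task.io.sort.factor": "400",
}

# TeraGen writes 100-byte rows, so 1 TB (10^12 bytes) is 10^10 rows.
ROWS_PER_TB = 10**12 // 100


def d_flags(params: dict) -> list:
    """Turn a parameter mapping into Hadoop generic -Dkey=value options."""
    return [f"-D{key}={value}" for key, value in params.items()]


def teragen_command(jar: str, rows: int, out_dir: str) -> list:
    """Command that generates the input data set."""
    return ["hadoop", "jar", jar, "teragen", str(rows), out_dir]


def terasort_command(jar: str, in_dir: str, out_dir: str, params: dict) -> list:
    """Command that sorts in_dir into out_dir with the given job parameters."""
    return ["hadoop", "jar", jar, "terasort", *d_flags(params), in_dir, out_dir]


if __name__ == "__main__":
    jar = "hadoop-mapreduce-examples.jar"   # placeholder jar name
    in_dir = "/benchmarks/terasort-input"    # placeholder path
    out_dir = "/benchmarks/terasort-output"  # placeholder path
    print(shlex.join(teragen_command(jar, ROWS_PER_TB, in_dir)))
    print(shlex.join(terasort_command(jar, in_dir, out_dir, TERASORT_PARAMS)))
```

Keeping the parameters in one mapping makes it straightforward to run both distributions with identical job settings and to record exactly which values were used.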