MapR 5.0 Documentation : Tune the Map Phase For MapReduce v1

This section contains the following topics:

  • Create a Map Task Pipeline to Prefetch Tasks
  • Reuse JVMs
  • Use Pluggable Sorting Algorithms
  • Use Local Volume for Temporary Sort Data Directory
  • Use Serialization

Create a Map Task Pipeline to Prefetch Tasks

When a task completes, the TaskTracker informs the JobTracker that a slot is available. MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes before slots actually open up, creating a pipeline: a certain percentage of tasks are prefetched in anticipation of tasks in progress finishing, so the TaskTracker can launch each map task as soon as the previous map task completes. Setting this value correctly is important. If it is too low, time is wasted waiting for communication via heartbeats; if it is too high, tasks arrive too soon, must wait to be processed, and parallelism suffers. The number of tasks to over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the mapreduce.tasktracker.prefetch.maptasks parameter in mapred-site.xml.
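For example, assuming the parameter expresses prefetching as a fraction of the total map slots (the value 0.25, corresponding to 25%, is illustrative; check the parameter's documented range and default for your MapR release), the setting in mapred-site.xml looks like this:

```xml
<!-- Illustrative value: over-schedule about 25% of the total map slots. -->
<property>
  <name>mapreduce.tasktracker.prefetch.maptasks</name>
  <value>0.25</value>
</property>
```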

Reuse JVMs

When the TaskTracker is instructed to start a map task, it tries to find an available JVM. Reusing JVMs is very important for performance, because starting a new JVM takes approximately one second and consumes significant CPU. Set the value of the mapred.job.reuse.jvm.num.tasks parameter to -1 so that JVMs are not torn down after a set number of tasks.
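In mapred-site.xml, the setting looks like this (-1 disables JVM teardown after a fixed number of tasks, so each JVM is reused indefinitely):

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```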

Use Pluggable Sorting Algorithms

Starting in version 3.0.2 of the MapR distribution for Hadoop, you can specify the DMExpress custom sorting algorithm from SyncSort for your MapReduce job. The JAR file dmxhadoop_mrv1_mapr.jar is installed along with DMExpress in the /lib subdirectory of your DMExpress installation directory.

Before you use the DMExpress custom sorting algorithm, add the required DMExpress library directories to the LD_LIBRARY_PATH environment variable on each TaskTracker node, and then restart Warden on each TaskTracker node.

  • To update LD_LIBRARY_PATH, add the following line to the /opt/mapr/conf/ file:
    export LD_LIBRARY_PATH=<full path to directory>:<full path to directory>:$LD_LIBRARY_PATH
  • To restart Warden, run the following commands:
    sudo service mapr-warden stop
    sudo service mapr-warden start
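Because the export line is typically appended to a configuration file that may be sourced more than once, guarding against duplicate path entries can be useful. The helper below is a sketch only, not part of MapR; prepend_ld_path and the example path are hypothetical:

```shell
# Hypothetical helper (not part of MapR): prepend a directory to
# LD_LIBRARY_PATH only if it is not already present, so sourcing the
# configuration file repeatedly does not grow the variable endlessly.
prepend_ld_path() {
  case ":${LD_LIBRARY_PATH}:" in
    *":$1:"*) ;;  # already present; nothing to do
    *) export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
}

# Example (path is illustrative):
# prepend_ld_path /usr/dmexpress/lib
```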

Running Terasort with DMExpress 7.13.4

The 7.13.4 release of the DMExpress custom sorting algorithm writes temporary sorting data to the directory specified by the Hadoop property mapred.local.dir.

To run Terasort with the 7.13.4 release of DMExpress, use the following command:

# hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar terasort \
    -Dmapreduce.job.reduce.shuffle.consumer.plugin.class=com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
    -Ddmx.home.dir=/usr/dmexpress \
    -Ddmx.key.length=10 \
    -Ddmx.reduce.memory=3072 \
    -libjars dmxhadoop_mrv1_mapr.jar \
    -Dmapred.local.dir=/mapr/dmx/scratchSpace \
    -Ddmx.mapr.nfs.mount=/mapr/perf \
    /tsort_in/in /tsort_out/out

Running Terasort with DMExpress 7.13.7

The planned release of DMExpress 7.13.7 writes temporary sorting data to a directory chosen from a comma-separated list of directories specified by the dmx.sortwork.dirs property.

To run Terasort with the 7.13.7 release of DMExpress, use the following command:

# hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar terasort \
    -Dmapreduce.job.reduce.shuffle.consumer.plugin.class=com.syncsort.dmexpress.hadoop.DMXShuffleConsumerPlugin \
    -Ddmx.home.dir=/usr/dmexpress \
    -Ddmx.key.length=10 \
    -Ddmx.reduce.memory=3072 \
    -libjars dmxhadoop_mrv1_mapr.jar \
    -Ddmx.sortwork.dirs=/mapr/dmx/scratchSpace \
    -Ddmx.mapr.nfs.mount=/mapr/perf \
    /tsort_in/in /tsort_out/out

Use Local Volume for Temporary Sort Data Directory

Storing temporary sort data on a single disk can create an I/O bottleneck on your MapR cluster. To prevent this issue, mount a node-local MapR volume into the local file system over NFS and set that mount point as the temporary sort data directory. The MapR filesystem can then take advantage of all the disks available on the node, resulting in improved performance.

To take advantage of this performance boost, your cluster must have a MapR license that enables multiple NFS servers. All nodes on the cluster must have the mapr-nfs package installed and be running the mapr-nfs service.

The following sample script performs the NFS mount on each node (the node list and the $localNFSMountDir variable are assumed to be defined earlier in the script):

for nodeName in $nodeList; do
    echo "Mount each node's empty dir $localNFSMountDir to local volume $nodeName:/mapr/perf/var/mapr/local/$nodeName/mapred"
    ssh $nodeName "umount -l $localNFSMountDir"
    ssh $nodeName "mount -t nfs ${nodeName}:/mapr/perf/var/mapr/local/${nodeName}/mapred ${localNFSMountDir}"
done

Use Serialization

Using sequence files can also improve serialization and deserialization performance, because they store native Hadoop data types in a binary format. If necessary, consider writing a custom comparator that compares keys in their serialized form, which avoids deserialization during sorts and partitioning.
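The core idea behind such a comparator is to compare keys byte-by-byte in their serialized form, so they never need to be deserialized during the sort. The plain-Java sketch below shows only that byte-level comparison; in a real MapReduce job you would implement Hadoop's org.apache.hadoop.io.RawComparator interface, and the class and method names here are illustrative:

```java
// Sketch of the byte-level comparison a raw comparator performs.
// Plain Java, no Hadoop dependency; class and method names are illustrative.
public class RawKeyComparator {
    // Lexicographic comparison of two serialized key ranges,
    // treating each byte as unsigned, without deserializing the keys.
    public static int compareBytes(byte[] a, int aOff, int aLen,
                                   byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff;   // unsigned byte value
            int y = b[bOff + i] & 0xff;
            if (x != y) return x - y;
        }
        return aLen - bLen;               // shorter key sorts first
    }
}
```

Because the comparison never materializes key objects, the sort and partition phases avoid per-record deserialization cost entirely.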