Create a Map Task Pipeline to Prefetch Tasks
When a task completes, the TaskTracker informs the JobTracker that a slot is available. MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes before slots become available, creating a pipeline. Prefetching a share of tasks in anticipation of running tasks finishing avoids idle time and lets the TaskTracker launch each map task as soon as the previous map task finishes. Setting this correctly is important: if the value is too low, time is wasted waiting for communication via heartbeats; if it is too high, parallelism suffers because tasks arrive too soon and must wait to be processed. The number of tasks to over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the mapreduce.tasktracker.prefetch.maptasks parameter in the mapred-site.xml file.
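As a rough illustration of the 25-50% guideline, the following Python sketch computes the range of tasks to over-schedule for a node with a given number of map slots. The function name prefetch_range and its rounding choices are assumptions made for this example, not part of MapR. How the resulting value must be expressed for mapreduce.tasktracker.prefetch.maptasks (a task count or a fraction of map slots) should be confirmed against the documentation for your MapR version.

```python
import math


def prefetch_range(map_slots, low=0.25, high=0.50):
    """Return (min_tasks, max_tasks) to over-schedule on one node.

    Hypothetical helper: applies the 25-50% guideline to a slot count.
    """
    if map_slots < 1:
        raise ValueError("map_slots must be at least 1")
    # Round the lower bound up so at least one task is prefetched.
    min_tasks = math.ceil(map_slots * low)
    # Round the upper bound down, but never below the lower bound.
    max_tasks = max(min_tasks, math.floor(map_slots * high))
    return min_tasks, max_tasks


for slots in (1, 8, 10):
    print(slots, prefetch_range(slots))
```

For a node with 8 map slots this gives a range of 2 to 4 tasks; for 10 slots, 3 to 5.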
When the TaskTracker is instructed to start a map task, it tries to find an available JVM. Reusing JVMs is very important for performance, because starting a new JVM takes approximately one second and uses a lot of CPU. Set the value of the mapred.job.reuse.jvm.num.tasks parameter to -1 so that JVMs are not restarted after a set number of tasks.
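The setting above goes into a Hadoop configuration file as a property entry with a name and a value. The following Python sketch renders such an entry; the helper name property_xml is an assumption made for this example, and the output still has to be placed inside the configuration element of your own mapred-site.xml.

```python
import xml.etree.ElementTree as ET


def property_xml(name, value):
    """Render one Hadoop-style <property> entry as a string.

    Hypothetical helper for illustration; it does not edit any file.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


# -1 means a JVM is reused for an unlimited number of tasks.
print(property_xml("mapred.job.reuse.jvm.num.tasks", -1))
```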
Use Pluggable Sorting Algorithms
Starting in version 3.0.2 of the MapR distribution for Hadoop, you can specify the DMExpress custom sorting algorithm from SyncSort for your MapReduce job. The JAR file dmxhadoop_mrv1_mapr.jar is installed along with DMExpress in the /lib subdirectory of your DMExpress installation directory.
Running Terasort with DMExpress 7.13.4
The 7.13.4 release of the DMExpress custom sorting algorithm writes temporary sorting data to the directory specified by the Hadoop property mapred.local.dir.
To run Terasort with the 7.13.4 release of DMExpress, use the following command:
Running Terasort with DMExpress 7.13.7
The planned release of DMExpress 7.13.7 writes temporary sorting data to a directory chosen from a comma-separated list of directories specified by the
To run Terasort with the 7.13.7 release of DMExpress, use the following command:
Use Local Volume for Temporary Sort Data Directory
Using a single disk to store temporary sort data can create a performance bottleneck on your MapR cluster. To avoid this, mount the local volume on the local file system and set the mounted directory as the temporary sort data directory. The MapR filesystem can then use all the disks that are available on the node, which improves performance.
To take advantage of this performance boost, your cluster must have a MapR license that enables multiple NFS servers. All nodes on the cluster must have the mapr-nfs package installed and be running the mapr-nfs service.
The following sample script performs the NFS mount:
Using sequence files can also improve serialization and deserialization performance because they use native Hadoop data types. If necessary, consider writing a custom comparator in your code to reduce serialization and deserialization overhead during sorts and partitioning.
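One way a custom comparator can reduce this overhead is by comparing keys in their serialized form instead of deserializing each key first. In a Hadoop job the comparator would be written in Java; the Python sketch below only illustrates the idea and does not use any Hadoop API. It assumes keys are unsigned 32-bit integers encoded as 4 big-endian bytes, an encoding chosen for this example because byte order then matches numeric order.

```python
import struct


def serialize_key(n):
    """Encode an unsigned 32-bit key as 4 big-endian bytes."""
    return struct.pack(">I", n)


def deserialize_key(raw):
    """Decode 4 big-endian bytes back into an integer key."""
    return struct.unpack(">I", raw)[0]


keys = [70000, 3, 256, 255, 1]
raw_keys = [serialize_key(k) for k in keys]

# Path 1: deserialize every key before comparing.
by_value = sorted(raw_keys, key=deserialize_key)

# Path 2: compare the serialized bytes directly, with no decoding.
by_bytes = sorted(raw_keys)

# Both paths produce the same order for this encoding.
assert by_value == by_bytes
print([deserialize_key(r) for r in by_bytes])  # prints [1, 3, 255, 256, 70000]
```

The shortcut depends on the encoding: it holds for fixed-width, unsigned, big-endian keys, but not for little-endian or signed encodings, where a comparator would need to inspect the bytes more carefully.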