9 min read
With MapR’s enterprise-grade distribution of Hadoop, there are 3 unique features that make this task of profiling MapReduce code easier. They are:
This blog will describe how these features enable easy profiling of Hadoop jobs on MapR. We also describe how to enable and use Hadoop’s MapReduce profiling.
The first step in profiling an application is to enable some MapReduce properties. Here are some of these properties and some sample values:
The above parameters enable profiling, specify which HPROF options to pass to the JVM when the tasks are started, and specify how many of the mappers and reducer tasks to profile. Typically, one only needs a few of these profiles to analyze, and all mappers and reducers generally have very similar profiling signatures.
One way of specifying these parameters is on the Hadoop command line. But what if you have a series of MapReduce jobs to profile, and you’d prefer to modify the Hadoop configuration file (mapred-site.xml) to add these properties and profile all jobs? This would typically involve making the changes to mapred-site.xml, and copying that file across all nodes in the cluster.
MapR’s Central Configuration feature lets you place a new configuration file in the MapR XD distributed file system so it is available to all nodes, and then have each node pull a copy to the local configuration directory that Hadoop uses.
So where do we place the new mapred-site.xml file in MapR XD? MapR has a list of paths for the various services at which it locates central configuration files. These list of paths live in files in this directory:
For example, for Hadoop, we use:
$ cat /opt/mapr/servicesconf/tasktracker
The MapR XD location for central configuration files is here:
There are 2 sub-directories located here – one called ‘default’ and one called ‘nodes’. The ‘nodes’ directory allows per-node central configuration. We will assume all nodes will use the same mapred-site.xml file, therefore we will create the central configuration file in ‘defaults’.
In order to do this, we simply copy an existing mapred-site.xml file from the local per-node configuration directory (/opt/mapr) to the central location in MapR XD, make the change there, then MapR will automatically pull that file from the distributed file system location in MapR XD to the local directory. So here are the steps (note that we’re using MapR’s Direct NFS feature here so we can use regular Linux commands rather than ‘hadoop fs’ (which of course would also work):
$ mkdir –p /mapr/my.cluster.com/var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf
$ cd /mapr/my.cluster.com/var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf
$ cp /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml .
$ vi mapred-site.xml
Here are the changes to add to the end of mapred-site.xml to enable profiling:
Now that we’ve set up this central config file, the MapR warden process will check every 5 minutes (can be customized) and will invoke /opt/mapr/server/pullcentralconfig to bring the new file over to to local /opt/mapr directory. Note that if the modification time on the local mapred-site.xml file is newer than the central file, it will not be overridden.
MapR XD is API-compatible with HDFS, but has tremendous capabilities in terms of ease-of-use, dependability, and performance/scale. For the purposes of this blog, we’ll focus on one of the features.
Unlike traditional Hadoop distributions, MapR can handle small files extremely efficiently. We can therefore keep all output from MapReduce jobs, including log files, map outputs, and TaskTracker and JobTracker messages in the MapR XD rather than the local file system.
Another benefit of this is all of our profile data will be located in MapR XD rather than the local file systems of the nodes. This makes it much easier to gather and analyze the profile output.
After making the changes to the central mapred-site.xml file, we run the Hadoop jobs as we normally would. Upon completion, it’s now time to find the profile output and analyze the results.
With MapR, any node has access to all of the profile files (the file name is “profile.out”) because this output was written to the distributed file system. The location in MapR XD would be in:
But there’s an even easier way of accessing the profile output files with MapR’s Centralized logging. Basically, you provide a job-id, and MapR builds a directory of links to the associated files regardless of which node they came from. Here is the command to do this:
$ maprcli job linklogs -jobid job_201302040944_0003 -todir /user/mapr/mylogdir
Now assuming you’re running MapR NFS, simply navigate to the new ‘mylogdir’ directory and find the profile files located there (again, they are simply links to the MapR XD location):
$ cd /mapr/my.cluster.com/user/mapr/mylogdir/job_201302040944_0003
$ ls –l
drwxr-xr-x 5 mapr mapr 3 Feb 8 13:49 hosts
drwxr-xr-x 2 mapr mapr 59 Feb 8 13:49 mappers
drwxr-xr-x 2 mapr mapr 1 Feb 8 13:49 reducers
So basically we have the log directories grouped my hosts, mappers, and reducers. So in our case, we can go to the mappers directory and see for all nodes the log directories:
Let’s see what is in one of these directories:
$ ls attempt_201302040944_0003_m_000000_0
log.index profile.out stderr stdout syslog
So we can see the mapper’s standard out and error files, and other log information, as well as our profile.out file. Now let’s look for all profile outputs:
$ ls */profile*
Here we have all 7 of our mapper profile outputs conveniently located in one hierarchy regardless of what node they executed on (recall we set mapred.task.profile.maps=0-6). The single reducer profile output can be found in the reducers directory:
$ ls <sup>attempt_201302040944_0003_r_000000_0 </sup>
log.index profile.out stderr stdout syslog
Now that we’ve conveniently located the profile data, it’s a matter of performing the analysis. A full description of HPROF is beyond the scope of this blog. But the key things to look for in the profile.out file is the CPU sampling, which shows which Java methods were the hottest, and the “SITES” section, which shows which Java objects are using the most memory. Both of these sections are at the end of the profile.out file:
SITES BEGIN (ordered by live bytes) Tue Feb 5 10:47:39 2013
percent live alloc'ed stack class
rank self accum bytes objs bytes objs trace name
1 29.25% 29.25% 4411600 269 4969200 303 313427 char
2 18.27% 47.52% 2755200 168 5510400 336 313268 char
3 14.64% 62.16% 2207952 269 2487024 303 313424 byte
4 9.14% 71.30% 1378944 168 4210704 513 301504 byte
5 1.03% 72.34% 156024 2622 1383512 20399 303897 char
CPU SAMPLES BEGIN (total = 1707) Tue Feb 5 10:47:39 2013
rank self accum count trace method
1 10.54% 10.54% 180 300072 java.util.zip.ZipFile.open
2 3.81% 14.35% 65 303897 java.util.Arrays.copyOfRange
3 3.34% 17.69% 57 313140 org.apache.hadoop.mapred.IFile$Reader.nextRawValue
4 2.99% 20.68% 51 303946 java.lang.Object.<init>
5 2.93% 23.61% 50 304069 java.lang.Object.<init>
Note the SITES section shows 30% of the live memory is a char array. If you want to see what stack generated that allocation simply search the file for the “stack trace” value, which is 313427. That will show which thread and stack did the allocation.
In the CPU SAMPLES section we see the method java.util.zio.ZioFile.open made up about 10% of the reduce task.
Profiling Hadoop MapReduce jobs can be a bit tricky in terms of configuring for profiling, and chasing down the profile data that is generated. MapR provides features such as central configuration, central logging, MapR XD, and NFS that makes the job of profiling easier.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.