The MapR Metrics service collects and displays analytics information about the Hadoop jobs, tasks, and task attempts that run on the nodes in your cluster. You can use this information to examine specific aspects of your cluster's performance at a very granular level, enabling you to monitor how your cluster responds to changing workloads and optimize your Hadoop jobs or cluster configuration. The analytics information collected by the MapR Metrics service is stored in a MySQL database. The server running MySQL does not have to be a node in the cluster, but the nodes in your cluster must have access to the server.

View this video for an introduction to Job Metrics...
The MapR Control System presents the jobs running on your cluster and the tasks that make up a specific job as a sortable list, along with histograms and line charts that represent the distribution of a particular metric. You can sort the list by the metric you're interested in to quickly find any outliers, then display specific detailed information about a job or task attempt that you want to learn more about. The filtering capabilities of the MapR Control System enable you to narrow down the display of data to the ranges you're interested in.

The MapR Control System displays data using histograms (for jobs) and line charts (for jobs and task attempts). All histograms and charts are implemented in HTML5, CSS and JavaScript to enable display on your browser or mobile device without requiring plug-ins. The histograms presented by MapR Metrics divide continuous data, such as a range of job durations, into a sequence of discrete bins. For example, a range of durations from 0 to 10000 seconds could be presented as 20 individual bins that cover a 500-second band each. The height of the histogram's bar for each bin represents the number of jobs with a duration in the bin's range. The line charts in MapR Metrics display the trend over time for the value of a specific metric.


The Job Metrics Database

Metrics information is kept in a MySQL database that you configure when you install MapR. The Metrics database provides the following standard tables:

| Tables_in_metrics                 |
| JOB                               |
| JOB_ATTRIBUTES                    |
| JOB_EVENT                         |
| METRIC_TRANSACTION                |
| NODE                              |
| TASK                              |
| TASK_ATTEMPT                      |
| TASK_ATTEMPT_EVENT                |
| TASK_EVENT                        |


  • The JOB and JOB_ATTRIBUTES tables hold job metadata while a job is running. Information from the JOBSEL and JOB_ATTRIBUTES tables is written to the /var/mapr/<cluster name>/mapred/jobTracker/jobs/history/ directory. If a request is made at the MCS for a job that has already been purged from the Metrics database, that data is reloaded from the relevant directory.
  • The METRIC_TRANSACTION_* tables hold job transaction data such as counters. The transactional data is written to the /var/mapr/<cluster name>/mapred/jobTracker/jobs/history/metrics/ directory on each host. This directory depends on the base path of the JobTracker directory. These transactional data files are named <hostname>job<jobID>_<fileID>_metrics.
  • The NODE table holds information about the node ID, hostname, host ID, cluster ID, and creation time.
  • The TASK, TASK_ATTEMPT, and TASK_ATTEMPT_ATTRIBUTES tables hold information related to a job's tasks and task attempts. These tables update while the job is running.

    If a job's task data has not been accessed within a configurable time limit, the data from the TASK, TASK_ATTEMPT, and TASK_ATTEMPT_ATTRIBUTES tables is purged. The db.joblastaccessed.limit.hours parameter in the db.conf file sets the number of hours that define this time limit. The default value for this parameter is 48.


The job metrics cover the following categories:

  • Cluster resource use (CPU and memory)
  • Duration (epoch)
  • Task count (map, reduce, failed map, failed reduce)
  • Map rates (record input and output, byte input and output)
  • Reduce rates (record input and output, shuffle bytes)
  • Task attempt counts (map, reduce, failed map, failed attempt)
  • Task attempt durations (average map, average reduce, maximum map, maximum reduce)

The task attempt metrics cover the following categories:

  • Times (task attempt duration, garbage collection time, CPU time)
  • Local byte rate (read and written)
  • Mapr-FS byte rate (read and written)
  • Memory usage (bytes of physical and virtual memory)
  • Records rates (map input, map output, reduce input, reduce output, skipped, spilled, combined input, combined output)
  • Reduce task attempt input groups
  • Reduce task attempt shuffle bytes

Results Filtering

By default, the MapR cluster software applies limits to the maximum number of records used for job and task histograms. You can configure these records from the MCS through System Settings > Metrics or by editing the db.conf file directly.

These limits are applied before filtering.

Metrics Protocol Buffers

The protocol buffer definition for cluster metrics data is clustermetrics.proto, which is located in the libprotodefs.jar artifact.

Example: Using MapR Metrics To Diagnose a Faulty Network Interface Card (NIC)

In this example, a node in your cluster has a NIC that is intermittently failing. This condition is leading to abnormally long task completion times due to that node being occasionally unreachable. In the Metrics interface, you can display a job's average and maximum task attempt durations for both map and reduce attempts. A high variance between the average and maximum attempt durations suggests that some task attempts are taking an unusually long time. You can sort the list of jobs by maximum map task attempt duration to find jobs with such an unusually high variance.

Click the name of a job name to display information about the job's tasks, then sort the task attempt list by duration to find the outliers. Because the list of tasks includes information about the node the task is running on, you can see that several of these unusually long-running task attempts are assigned to the same node. This information suggests that there may be an issue with that specific node that is causing task attempts to take longer than usual.

When you display summary information for that node, you can see that the Network I/O speeds are lower than the speeds for other similarly configured nodes in the cluster. You can use that information to examine the node's network I/O configuration and hardware and diagnose the specific cause.