MapR-DB JSON MapReduce API Library

This API library extends the Apache Hadoop MapReduce framework so that you can write your own MapReduce applications that read data from one MapR-DB JSON table and write it to another.

Prerequisites to Using this API Library

  • Ensure that you have a firm grasp of MapReduce concepts and experience writing MapReduce applications.
  • Before running a MapReduce application that uses this API, ensure that the destination JSON table or tables already exist and that any column families other than the default are already created on the destination tables.

Classes

The following table summarizes the classes in this library. Refer to the javadoc for complete details of each class.

Category | Class | Description
Utility | MapRDBMapReduceUtil | Simplifies the use of the API for most use cases.
Input formatter | TableInputFormat | Describes how to read documents from MapR-DB JSON tables.
Record reader | TableRecordReader | Reads documents (records) from MapR-DB JSON tables.
Record writer - bulk load | BulkLoadRecordWriter | Bulk loads documents into MapR-DB JSON tables.
Record writer - table mutation | TableMutationRecordWriter | Modifies documents that are in MapR-DB JSON tables.
Record writer - table | TableRecordWriter | Writes documents to MapR-DB JSON tables.
Output formatter - bulk load | BulkLoadOutputFormat | Describes how to bulk load documents into MapR-DB JSON tables.
Output formatter - table | TableOutputFormat | Describes how to write documents to MapR-DB JSON tables.
Serializer - document | DocumentSerialization | Defines the serializer and deserializer for passing Document objects between the map and reduce phases.
Serializer - mutation | MutationSerialization | Defines the serializer and deserializer for passing DocumentMutation objects between the map and reduce phases.
Partitioner - table | TablePartitioner | Specifies how to partition data from the source JSON table.
Partitioner - total order | TotalOrderPartitioner&lt;K,V&gt; | Globally sorts data by row key and then partitions the sorted data. This class is useful when the destination table has been pre-split into two or more tablets.
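To illustrate how the input-side classes fit together, the following is a minimal sketch of a mapper that consumes documents delivered by TableInputFormat. It assumes (per the serialization classes above) that TableRecordReader hands each row to the mapper as an OJAI Value row key and an OJAI Document; the "status" field and the class name are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;
import org.ojai.Document;
import org.ojai.Value;

/**
 * Sketch of a mapper over a MapR-DB JSON source table. TableInputFormat
 * delivers each document's row key as an OJAI Value and the document
 * itself as an OJAI Document; DocumentSerialization carries the Document
 * objects between the map and reduce phases.
 */
public class AbandonedOrderMapper
        extends Mapper<Value, Document, Value, Document> {

    @Override
    protected void map(Value rowKey, Document doc, Context context)
            throws IOException, InterruptedException {
        // Pass through only documents whose (hypothetical) "status"
        // field marks an abandoned order.
        if ("abandoned".equals(doc.getString("status"))) {
            context.write(rowKey, doc);
        }
    }
}
```

Running this sketch requires the MapR-DB and OJAI jars on the classpath and a cluster to submit to; it is a shape to adapt, not a complete application.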

Using MapRDBMapReduceUtil to Set Default Values in Configurations and Jobs

The centerpiece of this API is the MapRDBMapReduceUtil class, which you can use in the createSubmittableJob() method of your applications to perform these actions:

  • Set default values in the configuration for a MapReduce job and set the input and output format classes. You can do so with these methods:
    configureTableInputFormat(org.apache.hadoop.mapreduce.Job job, String srcTable)
    This method performs these actions:
    • Sets the serialization classes for Document and Value objects. These interfaces are part of the OJAI (Open JSON Application Interface) API.
    • Sets the field INPUT_TABLE in TableInputFormat to the path and name of the source table, and passes this value to the configuration for the MapReduce job.
    • Sets the input format class for the job to TableInputFormat.
    configureTableOutputFormat(org.apache.hadoop.mapreduce.Job job, String destTable)
    This method performs these actions:
    • Sets the field OUTPUT_TABLE in TableOutputFormat to the path and name of the destination table, and passes this value to the configuration for the MapReduce job.
    • Sets the output format class for the job to TableOutputFormat.
    If you want to set values for other fields in TableInputFormat or TableOutputFormat, or write your own logic for them, you can pass field values to configurations and specify these classes for jobs as you would in a typical MapReduce application.
  • Set default types for output keys and values. You can also set types for output keys and values from the map phase, if those types will differ from the final output types.
    • setMapOutputKeyValueClass(org.apache.hadoop.mapreduce.Job job)
    • setOutputKeyValueClass(org.apache.hadoop.mapreduce.Job job)
  • Configure a TotalOrderPartitioner and return the number of reduce tasks to use for a job.

    For example, suppose that in your application's method for creating a job, you include this line:

    int numReduceTasks =
        MapRDBMapReduceUtil.setPartitioner(job, destPath);
    Here, job is the org.apache.hadoop.mapreduce.Job being configured and destPath is the path and name of the destination table. The setPartitioner() method finds out whether the table has been pre-split into two or more tablets, counts the tablets, writes the count to a partitioner file, and passes that file to an instance of TotalOrderPartitioner. The call also returns the number of tablets in numReduceTasks. Your code can then use that variable to set the number of reducers, like this:
    job.setNumReduceTasks(numReduceTasks);
The sample application gives an example of how to use MapRDBMapReduceUtil.
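Putting the pieces above together, a createSubmittableJob() method might look like the following sketch. This is not the sample application itself: the class names, table-path parameters, and the MapRDBMapReduceUtil import path shown here are assumptions, and MyMapper/MyReducer are placeholders for your own classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// The exact package of MapRDBMapReduceUtil may differ in your release;
// check the javadoc shipped with your MapR client.
import com.mapr.db.mapreduce.tools.MapRDBMapReduceUtil;

public class CopyJsonTableJob {

    public static Job createSubmittableJob(Configuration conf,
                                           String srcTable,
                                           String destTable)
            throws Exception {
        Job job = Job.getInstance(conf, "copy-json-table");
        job.setJarByClass(CopyJsonTableJob.class);

        // Sets the serialization classes, the INPUT_TABLE field, and
        // TableInputFormat as the job's input format class.
        MapRDBMapReduceUtil.configureTableInputFormat(job, srcTable);

        // Sets the OUTPUT_TABLE field and TableOutputFormat as the
        // output format class; also turns off speculative execution.
        MapRDBMapReduceUtil.configureTableOutputFormat(job, destTable);

        // Default key/value types for the map phase and final output.
        MapRDBMapReduceUtil.setMapOutputKeyValueClass(job);
        MapRDBMapReduceUtil.setOutputKeyValueClass(job);

        // If the destination table is pre-split, use a
        // TotalOrderPartitioner with one reducer per tablet.
        int numReduceTasks =
            MapRDBMapReduceUtil.setPartitioner(job, destTable);
        job.setNumReduceTasks(numReduceTasks);

        job.setMapperClass(MyMapper.class);   // placeholder
        job.setReducerClass(MyReducer.class); // placeholder
        return job;
    }
}
```

Your driver would then call createSubmittableJob() and submit the returned job with job.waitForCompletion(true).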

Mutating Rows in Destination Tables

Use the MutationSerialization and TableMutationRecordWriter classes when you need to mutate rows.

For example, suppose that you are tracking the number of users who are performing various actions on your retail website. To do this, at intervals you run your MapReduce application and save the results in OJAI documents in MapR-DB. Suppose that you count the number of users who went through the order process but abandoned their orders. After every run of the application, you want to update an OJAI document by adding the current count to the total count and by updating a field that tracks the date and time that the MapReduce application was last run.

You could do that by setting values in a DocumentMutation object (see the javadoc for OJAI (Open JSON Application Interface)). You would then serialize that and write it to the table with TableMutationRecordWriter.
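The pattern in this example can be sketched as a reducer that emits DocumentMutation objects; MutationSerialization carries them between phases and TableMutationRecordWriter applies them to the table. The field names, output key type, and the MapRDB factory call below are illustrative assumptions, not the documented sample.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.ojai.store.DocumentMutation;
import com.mapr.db.MapRDB;

/**
 * Sketch: sum the per-run count of abandoned orders, then emit a
 * DocumentMutation that adds the run's count to a running total and
 * records when the application last ran. Field names are hypothetical.
 */
public class AbandonedOrderReducer
        extends Reducer<Text, LongWritable, Text, DocumentMutation> {

    @Override
    protected void reduce(Text rowKey, Iterable<LongWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        long runCount = 0;
        for (LongWritable c : counts) {
            runCount += c.get();
        }

        DocumentMutation mutation = MapRDB.newMutation()
            // Add this run's count to the stored total.
            .increment("abandonedOrders.totalCount", runCount)
            // Record the date and time of this run.
            .set("abandonedOrders.lastRun",
                 java.time.Instant.now().toString());

        // TableMutationRecordWriter applies the mutation to the
        // document with this row key in the destination table.
        context.write(rowKey, mutation);
    }
}
```

Because increment() adds to the existing value rather than replacing it, re-running the job updates the total without your code reading the document first.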

Compiling and Running Applications

Compile applications as described in Compiling and Running Applications that Access JSON Tables and Documents.

Important: Turn off speculative execution

Speculative execution of MapReduce tasks is on by default. For custom applications that load MapR-DB tables, it is recommended to turn speculative execution off. When it is on, the tasks that import data might run multiple times. Multiple tasks for an incremental bulkload could insert one or more versions of a record into a table. Multiple tasks for a full bulkload could cause loss of data if the source data continues to be updated during the load.

If your custom MapReduce application uses MapRDBMapReduceUtil.configureTableOutputFormat(), you do not have to turn off speculative execution manually. This method turns it off automatically.

Turn off speculative execution by using either of these methods:
  • Set either of the following MapReduce parameters to false, depending on the version of MapReduce that you are using:
    • MRv1: mapred.map.tasks.speculative.execution
    • MRv2: mapreduce.map.speculative
  • Include the following line in the method in your application that sets parameters for jobs:
    job.setSpeculativeExecution(false);
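For the first method, a minimal sketch of setting the MRv2 parameter on a job's configuration (assuming job is an already-created org.apache.hadoop.mapreduce.Job) looks like this:

```java
import org.apache.hadoop.conf.Configuration;

// Turn off speculative execution for map tasks (MRv2 parameter).
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.speculative", false);
```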