You can use the
HFileOutputFormat.configureIncrementalLoad() method when writing custom MapReduce jobs that perform bulk loads. Although the method's name implies that it supports only incremental bulk loads, it also works for full bulk loads, provided that the
Bulkload parameter for a table is set to true, as described in Bulk Loading and MapR-DB Tables.
The HFileOutputFormat class on MapR clusters distinguishes between Apache HBase tables and MapR tables, behaving appropriately for each type. Existing workflows that rely on the
HFileOutputFormat class, such as the
ImportTsv utilities, support both types of tables without further configuration.
If you have a custom MapReduce application that does not use
HFileOutputFormat.configureIncrementalLoad(), simply use the path to the MapR-DB table that you want to load. However, using
HFileOutputFormat.configureIncrementalLoad() gives you at least two advantages:
- This method performs a number of tasks that your application would otherwise need to do explicitly:
  - Inspect the table to configure a total order partitioner
  - Upload the partitions file to the cluster and add it to the DistributedCache
  - Set the number of reduce tasks to match the current number of regions
  - Set the output key/value class to match
  - Set the reducer up to perform the appropriate sorting (either
- This method turns off Speculative Execution automatically. For details, see the note below.
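To make the partitioning tasks above more concrete, here is a minimal, standard-library-only sketch of the idea behind a total order partitioner driven by region boundaries. It is not the HFileOutputFormat implementation: the class name, the method name, and the split keys are hypothetical, and plain String comparison stands in for the byte-wise row key comparison that a real table uses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RegionPartitionSketch {

    // Returns the index of the region (and so of the reduce task) that owns rowKey.
    // splitKeys holds the sorted start keys of every region except the first,
    // so n split keys describe n + 1 regions and n + 1 reduce tasks.
    static int partitionFor(String rowKey, List<String> splitKeys) {
        int pos = Collections.binarySearch(splitKeys, rowKey);
        // Exact match: rowKey is the start key of region pos + 1.
        // No match: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the number of split keys smaller than rowKey.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        // Hypothetical region boundaries, for illustration only.
        List<String> splitKeys = Arrays.asList("g", "n", "t");
        System.out.println("reduce tasks: " + (splitKeys.size() + 1));
        for (String key : new String[] {"apple", "g", "kiwi", "zebra"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, splitKeys));
        }
    }
}
```

Because every key sent to reducer i sorts before every key sent to reducer i + 1, each reducer can write sorted output for exactly one region. This is why the number of reduce tasks is tied to the current number of regions.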
Turn Off Speculative Execution
Speculative Execution of MapReduce tasks is on by default. For custom applications that load MapR-DB tables, it is recommended that you turn Speculative Execution off. When it is on, the tasks that import data might run more than once. Duplicate tasks in an incremental bulk load could insert one or more extra versions of a record into a table. Duplicate tasks in a full bulk load could cause data loss if the source data continues to be updated during the load.
If your custom MapReduce job uses
HFileOutputFormat.configureIncrementalLoad(), you do not have to turn off Speculative Execution manually.
HFileOutputFormat.configureIncrementalLoad() turns it off automatically. Speculative Execution is automatically turned off for MapReduce utilities such as
If you are writing a custom MapReduce job that does not use the
HFileOutputFormat.configureIncrementalLoad() method for bulk loading, you must turn off Speculative Execution manually.
Turn off Speculative Execution by setting either of the following MapReduce parameters to false, depending on the version of MapReduce that you are using:
If you write the job programmatically, you can turn off Speculative Execution at the code level:
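As a minimal sketch, the following standard-library-only helper collects the settings that disable Speculative Execution for each MapReduce version. The property names are the standard Hadoop ones for MRv1 and MRv2, used here as an assumption; they may not match the exact parameters this page intends, and the sketch disables speculation for both map and reduce tasks. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SpeculativeExecutionOff {

    // Returns the properties to set to "false" for the given MapReduce version.
    // Assumption: standard Hadoop property names (MRv1 for version 1, MRv2 otherwise).
    static Map<String, String> settings(int mapReduceVersion) {
        Map<String, String> props = new LinkedHashMap<>();
        if (mapReduceVersion == 1) {
            props.put("mapred.map.tasks.speculative.execution", "false");
            props.put("mapred.reduce.tasks.speculative.execution", "false");
        } else {
            props.put("mapreduce.map.speculative", "false");
            props.put("mapreduce.reduce.speculative", "false");
        }
        return props;
    }

    // Renders the settings as -D generic options for a job launched from the command line.
    static String toDFlags(Map<String, String> props) {
        return props.entrySet().stream()
                .map(e -> "-D" + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toDFlags(settings(1)));
        System.out.println(toDFlags(settings(2)));
    }
}
```

In a real driver, you would typically apply each entry to the job configuration before submitting the job, for example with `job.getConfiguration().set(name, value)`. The Hadoop `Job` class also provides `setSpeculativeExecution(false)`, which covers both map and reduce tasks in one call.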