MapR 5.0 Documentation: Bulk Loading and MapR-DB Tables

The most common way of loading data into a MapR-DB table is with a put operation. At large scales, however, bulk loads offer a performance advantage over put operations.

Tools for bulk loads

Bulk loading is supported by the following tools, which can be used for both full and incremental bulk load operations:

  • MapR CopyTable Utility.
    This utility is different from Apache HBase's CopyTable utility. When copying data to MapR-DB tables, use the MapR version, which copies table metadata, access control expressions, and more in addition to table data.

  • The ImportFiles tool, which imports HFile or Result files into a MapR table.
    hbase com.mapr.fs.hbase.mapreduce.ImportFiles -Dmapred.reduce.tasks=2 -inputDir /test/tabler.kv -table /table2 -format Result

    If you are running on an HBase 0.98 client but the exported files were generated with HBase 0.94, include -Dhbase.import.version=0.94 in the ImportFiles job. 
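As a minimal sketch of the version flag described above, the following script adds -Dhbase.import.version=0.94 to the ImportFiles command shown earlier. The run helper and the import_094_export function are hypothetical names introduced for illustration; the paths and the reduce-task count are reused from the example above. The script only prints the command unless RUN=1 is set, so it can be reviewed before being run on a cluster node.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# import_094_export INPUT_DIR TABLE FORMAT
# Imports files that were exported with HBase 0.94 from an HBase 0.98 client.
import_094_export() {
  run hbase com.mapr.fs.hbase.mapreduce.ImportFiles \
    -Dhbase.import.version=0.94 \
    -Dmapred.reduce.tasks=2 \
    -inputDir "$1" -table "$2" -format "$3"
}

# Same input directory, table, and format as the example above.
import_094_export /test/tabler.kv /table2 Result
```

Placing the version property next to the other -D option keeps all job properties together ahead of the tool's own arguments.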

Types of bulk load

A bulk load can be performed as a full bulk load or as an incremental bulk load.

Full bulk loads

Full bulk loads offer the best performance, but they apply only to empty tables. A full bulk load skips the write-ahead log (WAL) that Apache HBase and MapR-DB table operations normally use, which is the source of the performance gain.

Note: You can perform a full bulk load only on empty tables that have the bulk load attribute set to true. You can set this value only when creating a table.

Tables are unavailable for normal client operations, including put, get, and scan operations, while a full bulk load operation is in progress.

Creating a MapR-DB table with support for a full bulk load

When you create a MapR-DB table that you intend to populate with a full bulk load, you must set the table's bulk load attribute to true. Attempting a full bulk load on a table that does not have the bulk load attribute set to true results in an incremental bulk load being performed instead.

  • Using the maprcli table create command:
    Specify the value of the -bulkload parameter as true.

  • Using the hbase shell: 
    Specify the value of the BULKLOAD parameter as true, as in the following example:

    hbase> create '/a0', 'f1', BULKLOAD => 'true'

    If you want to pre-split a table, separate the BULKLOAD parameter from the SPLITS parameter, as in this example:

    hbase> create '/t1', 'f1', {SPLITS => ['10', '20', '30']}, {BULKLOAD => 'true'}

  • Using the MapR Control System (MCS): 
    Select the Bulkload check box under Table Properties.

After completing a full bulk load, make the table available to client applications with one of the following commands. You do not have to use either of these commands if you used the CopyTable utility, which makes tables available to client applications automatically after a load is completed.

From the command line:

    # maprcli table edit -path /user/juser/mytable -bulkload false

From the hbase shell:

    hbase> alter '/user/juser/mytable', 'f2', BULKLOAD => 'false'
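The full sequence (create the table with the bulk load attribute, load it, then release it to clients) might be scripted roughly as follows. This is a sketch under stated assumptions, not a definitive procedure: the run helper and full_bulk_load function are hypothetical, the input directory is borrowed from the ImportFiles example, and the script assumes that any column families the input expects are created separately before the load step. It prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# full_bulk_load TABLE INPUT_DIR
# Stops at the first failing step so the table is not released after a bad load.
full_bulk_load() {
  # 1. Create the empty table with the bulk load attribute set.
  run maprcli table create -path "$1" -bulkload true || return 1
  # (Assumption: column families are created here, outside this sketch.)
  # 2. Load the data; the table is unavailable to clients during this step.
  run hbase com.mapr.fs.hbase.mapreduce.ImportFiles \
    -inputDir "$2" -table "$1" -format Result || return 1
  # 3. Make the table available to client applications.
  run maprcli table edit -path "$1" -bulkload false
}

full_bulk_load /user/juser/mytable /test/tabler.kv
```

Step 3 is not needed when the load is done with the CopyTable utility, which releases the table on its own.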

After you perform a full bulk load on a table, you cannot perform a full bulk load on it again. Neither the maprcli table edit command nor the hbase shell alter command can set the bulk load attribute back to true. In MCS, the Bulkload check box remains selectable after a table is created, but selecting it and clicking OK in the Table Properties dialog generates an error message.

Incremental bulk loads

Incremental bulk loads can add data to existing tables concurrently with other table operations, with better performance than put operations. This type of bulk load makes use of write-ahead log files.

You can use incremental bulk loads to ingest large amounts of data into an existing table. Tables remain available for standard client operations, such as put, get, and scan operations, while the bulk load is in progress. A table can run multiple incremental bulk load operations simultaneously.
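To illustrate simultaneous incremental loads, the sketch below starts two ImportFiles jobs against the same table as background jobs and waits for both. The run helper, the incremental_load function, and the two input directories (/test/batch1.kv and /test/batch2.kv) are hypothetical; the target table is assumed to exist already without the bulk load attribute set. The script prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# incremental_load INPUT_DIR TABLE
incremental_load() {
  run hbase com.mapr.fs.hbase.mapreduce.ImportFiles \
    -inputDir "$1" -table "$2" -format Result
}

TABLE=/table2

# Start both loads in the background; the table stays available to clients.
incremental_load /test/batch1.kv "$TABLE" &
pid1=$!
incremental_load /test/batch2.kv "$TABLE" &
pid2=$!

# Wait for each job individually so a failure in either one is noticed.
status=0
wait "$pid1" || status=1
wait "$pid2" || status=1

if [ "$status" -eq 0 ]; then
  echo "all incremental loads finished"
else
  echo "at least one incremental load failed" >&2
fi
```

Waiting on each process ID, rather than calling wait with no arguments, preserves the exit status of the individual jobs.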

Creating a MapR-DB table with support for incremental loads

Whether you create a table with the maprcli table create command, with the hbase shell’s create command, or in MCS, incremental loads are supported by default.