Integrate MapR Pig and MapR HBase

This document shows an example of MapR Pig and MapR HBase integration. The example loads data from the MapR Filesystem into Pig and then stores the data in an HBase table.

Configuring MapR Pig and MapR HBase

No additional configuration is needed to integrate MapR HBase and MapR Pig.

MapR Pig and MapR HBase Integration Example

  1. Create sample data, and upload the data to the MapR Filesystem:
    1. Create a sample data file:
      vim input.csv
    2. Add data to the file:
      1,aaa,bbb
      2,ccc,ddd
      3,rrr,fff
      4,ttt,yyy
    3. Upload the data to the MapR Filesystem:
      hadoop fs -put input.csv /user/mapr/input.csv
  2. Create a sample table in HBase:
    1. Start the HBase shell:
      hbase shell
    2. Create a table:
      hbase(main):012:0> create 'sample_names', 'info'
  3. Load the data into Pig, and store the data in HBase:
    1. Start the Pig shell:
      pig
    2. Load the data into Pig:
      raw_data = LOAD '/user/mapr/input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);
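    3. Store the data in HBase. HBaseStorage uses the first field of each tuple (listing_id) as the row key and maps the remaining fields, in order, to the listed columns: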
      STORE raw_data INTO 'sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('info:fname info:lname');
  4. Verify the data in HBase:
    1. Start the HBase shell:
      hbase shell
    2. Query the data:
      hbase(main):017:0> scan 'sample_names'
    The result is:
    ROW   COLUMN+CELL
     1    column=info:fname, timestamp=1574946889082, value=aaa
     1    column=info:lname, timestamp=1574946889082, value=bbb
     2    column=info:fname, timestamp=1574946889091, value=ccc
     2    column=info:lname, timestamp=1574946889091, value=ddd
     3    column=info:fname, timestamp=1574946889091, value=rrr
     3    column=info:lname, timestamp=1574946889091, value=fff
     4    column=info:fname, timestamp=1574946889091, value=ttt
     4    column=info:lname, timestamp=1574946889091, value=yyy
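
Steps 1 through 3 can also be run without interactive editors or shells. The following is a minimal sketch, not part of the original procedure: it writes the same sample data with a here-document, saves the two Pig statements to a script file, and runs the cluster commands only when the hadoop, hbase, and pig clients are on the PATH. The file name load_to_hbase.pig is a hypothetical name chosen for illustration, and piping the create command into hbase shell is assumed to be acceptable in your environment.

```shell
#!/bin/sh
# Sketch: non-interactive version of steps 1-3.
# load_to_hbase.pig is a hypothetical file name used only for this example.
set -eu

# Step 1: create the sample data file without opening an editor.
cat > input.csv <<'EOF'
1,aaa,bbb
2,ccc,ddd
3,rrr,fff
4,ttt,yyy
EOF

# Step 3 as a script file. The first field (listing_id) becomes the HBase
# row key; fname and lname map to info:fname and info:lname.
cat > load_to_hbase.pig <<'EOF'
raw_data = LOAD '/user/mapr/input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);
STORE raw_data INTO 'sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('info:fname info:lname');
EOF

# Run the cluster-side commands only where all three clients are installed.
if command -v hadoop >/dev/null 2>&1 \
    && command -v hbase >/dev/null 2>&1 \
    && command -v pig >/dev/null 2>&1; then
    # Upload the data (this fails if the target file already exists).
    hadoop fs -put input.csv /user/mapr/input.csv
    # Step 2: create the table by sending the command to the HBase shell.
    echo "create 'sample_names', 'info'" | hbase shell
    # Step 3: run the Pig script in batch mode.
    pig load_to_hbase.pig
else
    echo "hadoop, hbase, or pig not found; only the local files were generated" >&2
fi
```

After the script finishes on a cluster node, the scan in step 4 can be used unchanged to verify the rows.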