Integrate Pig and HBase

This document shows an example of integrating Pig with HBase. The example loads data from the MapR Filesystem into Pig and then stores that data in an HBase table.

Configuring Pig and HBase

No additional configuration is needed to integrate HBase and Pig.

Pig and HBase Integration Example

  1. Create sample data, and upload the data to the MapR Filesystem:
    1. Create a sample data file:
      vim input.csv
    2. Add data to the file:
      1,aaa,bbb
      2,ccc,ddd
      3,rrr,fff
      4,ttt,yyy
    3. Upload the data to the MapR Filesystem:
      hadoop fs -put input.csv /user/mapr/input.csv
  2. Create a sample table in HBase:
    1. Start the HBase shell:
      hbase shell
    2. Create a table:
      hbase(main):012:0> create 'sample_names', 'info'
  3. Load the data into Pig, and store the data in HBase:
    1. Start the Pig shell:
      pig
    2. Load the data into Pig:
      raw_data = LOAD '/user/mapr/input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);
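    3. Store the data in HBase. HBaseStorage uses the first field of each record (listing_id) as the row key and maps the remaining fields, in order, to the listed columns:
      STORE raw_data INTO 'sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname');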
  4. Verify the data in HBase:
    1. Start the HBase shell:
      hbase shell
    2. Query the data:
      hbase(main):017:0> scan 'sample_names'
    The result is:
      ROW    COLUMN+CELL
      1      column=info:fname, timestamp=1574946889082, value=aaa
      1      column=info:lname, timestamp=1574946889082, value=bbb
      2      column=info:fname, timestamp=1574946889091, value=ccc
      2      column=info:lname, timestamp=1574946889091, value=ddd
      3      column=info:fname, timestamp=1574946889091, value=rrr
      3      column=info:lname, timestamp=1574946889091, value=fff
      4      column=info:fname, timestamp=1574946889091, value=ttt
      4      column=info:lname, timestamp=1574946889091, value=yyy
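
If you prefer to script step 1 rather than edit the file by hand, the following is a minimal sketch that writes the same sample data with a here-document and runs two sanity checks before upload. The checks are not part of the original procedure; they are an assumption about what is worth verifying. The first field of each record ends up as the HBase row key, so the sketch confirms that every line has three fields and that no key repeats. It assumes a POSIX shell with awk available.

```shell
#!/bin/sh
# Sketch only: create the sample file from step 1 without an editor,
# then sanity-check it. The validation logic is illustrative.
set -eu

cat > input.csv <<'EOF'
1,aaa,bbb
2,ccc,ddd
3,rrr,fff
4,ttt,yyy
EOF

# Count lines that do not have exactly three comma-separated fields
# (listing_id, fname, lname in the LOAD statement).
bad_lines=$(awk -F',' 'NF != 3 { n++ } END { print n + 0 }' input.csv)

# Count repeated values in the first field. Records that share a key
# would be written to the same HBase row.
dup_keys=$(awk -F',' 'seen[$1]++ { n++ } END { print n + 0 }' input.csv)

if [ "$bad_lines" -ne 0 ] || [ "$dup_keys" -ne 0 ]; then
  echo "input.csv failed validation" >&2
  exit 1
fi

echo "input.csv looks valid"
# Next, upload the file as in step 1:
#   hadoop fs -put input.csv /user/mapr/input.csv
```

The upload command is left as a comment so that the script can run on a machine without a Hadoop client; uncomment it on a cluster node.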