MapR 5.0 Documentation : Hive and MapR-DB Integration

You can create MapR-DB tables from Hive that can be accessed by both Hive and MapR-DB. You can run Hive queries on MapR-DB tables, convert existing MapR-DB tables into Hive-MapR-DB tables,  and run Hive queries on those tables as well.

Install and Configure Hive

  1. Install and configure Hive if it is not already installed.
  2. Execute the jps command and ensure that all relevant Hadoop, MapR, and Zookeeper processes are running.


    $ jps
    21985 HRegionServer
    1549 jenkins.war
    15051 QuorumPeerMain
    30935 Jps
    15551 CommandServer
    15698 HMaster
    15293 JobTracker
    15328 TaskTracker
    15131 WardenMain
  3. Open the hive-site.xml file with your favorite editor, or create a hive-site.xml file if it doesn't already exist:

    $ cd $HIVE_HOME
    $ vi conf/hive-site.xml
  4.  Copy the following XML code and paste it into the hive-site.xml file.

    If you already have an existing hive-site.xml file with a configuration element block, just copy the property element block code below and paste it inside the configuration element block in the hive-site.xml file. Be sure to use the correct values for the paths to your auxiliary JARs and ZooKeeper IP numbers.

    Example configuration:

    file:///opt/mapr/hbase/hbase-<version>/lib/hbase-client-<version>-mapr.jar, file:///opt/mapr/hbase/hbase-<version>/lib/hbase-server-<version>-mapr.jar,file:///opt/mapr/hbase/hbase-<version>/lib/hbase-protocol-<version>-mapr.jar,file:///opt/mapr/zookeeper/zookeeper-<version>/zookeeper-<version>.jar</value>
      <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
      <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
      <description>The Zookeeper client port. The MapR default clientPort is 5181.</description>
  5. Save and close the hive-site.xml file.

If you have successfully completed all of the steps in this section, you're ready to begin the tutorial in the next section. 

Getting Started with Hive and MapR-DB Integration

In this tutorial we will:

  • Create a Hive table
  • Populate the Hive table with data from a text file
  • Query the Hive table
  • Create a Hive-MapR-DB table
  • Introspect the Hive-MapR-DB table from the HBase shell
  • Populate the Hive-MapR-DB table with data from the Hive table
  • Query the Hive-MapR-DB table from Hive
  • Convert an existing MapR-DB table into a Hive-MapR table

Be sure that you have successfully completed all of the steps in the Installing Hive and review the MapR-DB topics before beginning this Getting Started tutorial.

This Getting Started tutorial is based on the Hive-HBase Integration section of the Apache Hive Wiki. However, please note that there are some significant differences.

Create a Hive table with two columns

Change to your Hive installation directory if you're not already there and start Hive:

$ bin/hive

Execute the CREATE TABLE command to create the Hive pokes table

hive> CREATE TABLE pokes (foo INT, bar STRING);

To see if the pokes table has been created successfully, execute the SHOW TABLES command

Time taken: 0.74 seconds

The pokes table appears in the list of tables. 

Populate the Hive pokes table with data:

The kv1.txt file is provided in the $HIVE_HOME/examples/files directory. Execute the LOAD DATA LOCAL INPATH command to populate the Hive pokes table with data from the kv1.txt file.

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

A message appears confirming that the table was created successfully, and the Hive prompt reappears:

Copying data from file:
Time taken: 0.278 seconds

Execute a SELECT query on the Hive pokes table

hive> SELECT * FROM pokes WHERE foo = 98;

The SELECT statement executes, runs a MapReduce job, and prints the job output:

98      val_98
98      val_98
Time taken: 18.059 seconds

The output of the SELECT command displays two identical rows because there are two identical rows in the Hive pokes table with a key of 98.

Hive tables can have multiple identical keys. As we will see shortly, MapR-DB tables cannot have multiple identical keys, only unique keys.

Create a Hive-MapR-DB table

Enter these four lines of code at the Hive prompt:

hive> CREATE TABLE mapr_table_1(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    > TBLPROPERTIES ("" = "/user/mapr/xyz");

After a brief delay, a message appears confirming that the table was created successfully:

Time taken: 5.195 seconds

Note: The TBLPROPERTIES command is not required, but those new to Hive-MapR-DB integration may find it easier to understand what's going on if Hive and MapR-DB use different names for the same table.

In this example, Hive will recognize this table as "mapr_table_1" and MapR-DB will recognize this table as "xyz".

Start the HBase shell

Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell:

$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, rUnknown, Wed Nov  9 17:35:00 PST 2011


Execute the list command to see a list of HBase tables

hbase(main):001:0> list
1 row(s) in 0.8260 seconds

HBase recognizes the Hive-MapR-DB table named xyz in directory /user/mapr. This is the same table known to Hive as mapr_table_1.

Display the description of the /user/mapr/xyz table in the HBase shell

hbase(main):004:0> describe "/user/mapr/xyz"
DESCRIPTION                                          ENABLED
 {NAME => '/user/mapr/xyz', FAMILIES => [{NAME => 'cf1', DATA_B true
 S => '0', TTL => '2147483647', KEEP_DELETED_CELLS =
 > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
 e', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.0240 seconds

From the Hive prompt, insert data from the Hive table pokes into the Hive-MapR-DB table mapr_table_1

hive> INSERT OVERWRITE TABLE mapr_table_1 SELECT * FROM pokes WHERE foo=98;
2 Rows loaded to mapr_table_1
Time taken: 13.384 seconds

Query mapr_table_1 to see the data we have inserted into the Hive-MapR-DB table

hive> SELECT * FROM mapr_table_1;
98      val_98
Time taken: 0.56 seconds

Even though we loaded two rows from the Hive pokes table that had the same key of 98, only one row was actually inserted into mapr_table_1. This is because mapr_table_1 is a MapR-DB table, and although Hive tables support duplicate keys, MapR-DB tables only support unique keys. MapR-DB tables arbitrarily retain only one key, and silently discard all of the data associated with duplicate keys. 

Convert a pre-existing MapR-DB table to a Hive-MapR-DB table

To convert a pre-existing MapR-DB table to a Hive-MapR-DB table, enter the following four commands at the Hive prompt.

Note that in this example the existing MapR-DB table is my_mapr_table in directory /user/mapr.

hive> CREATE EXTERNAL TABLE mapr_table_2(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
    > TBLPROPERTIES("" = "/user/mapr/my_mapr_table");

Now we can run a Hive query against the pre-existing MapR-DB table /user/mapr/my_mapr_table that Hive sees as mapr_table_2:

hive> SELECT * FROM mapr_table_2 WHERE key > 400 AND key < 410;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
401     val_401
402     val_402
403     val_403
404     val_404
406     val_406
407     val_407
409     val_409
Time taken: 9.452 seconds