Hive and WebHCat Integration

The WebHCat server provides a REST-like web API for HCatalog. Applications make HTTP requests to run Pig, Hive, and HCatalog DDL commands.

Configuring the WebHCat Server

The properties to configure WebHCat are in the /opt/mapr/hive/hive-<version>/hcatalog/etc/webhcat/webhcat-site.xml file.

When you set up WebHCat, you can configure either MapR-FS or ZooKeeper as the storage layer. The steps below use MapR-FS.

  1. To configure MapR-FS as storage for WebHCat, add the following properties:

    <property>
        <name>templeton.storage.class</name>
        <value>org.apache.hive.hcatalog.templeton.tool.HDFSStorage</value>
    </property>
    <property>
        <name>templeton.storage.root</name>
        <value>/user/mapr/webhcat</value>
        <description>The path to the directory to use for storage</description>
    </property>
  2. To configure WebHCat for Pig:
    1. Compress the Pig installation, then copy the compressed file to the MapR-FS layer.

      # cd /opt/mapr/pig 
      # tar -czvf /tmp/pig-<version>.tar.gz pig-<version>/
      # hadoop fs -mkdir /user/mapr/webhcat
      # hadoop fs -put /tmp/pig-<version>.tar.gz /user/mapr/webhcat/
      
    2. Set the value of the templeton.pig.archive property to the location of the compressed file.

      <property>
          <name>templeton.pig.archive</name>
          <value>maprfs:///user/mapr/webhcat/pig-<version>.tar.gz</value>
      </property>

    3. Set the value of the templeton.pig.path property to the path inside the compressed Pig file where the Pig binary is located.

      <property>
          <name>templeton.pig.path</name>
          <value>pig-<version>.tar.gz/pig-<version>/bin/pig</value>
      </property>
  3. To configure WebHCat for Hive:
    1. Compress the Hive installation, then move the compressed file to the MapR-FS layer.

      # cd /opt/mapr/hive  
      # tar -czvf /tmp/hive-<version>.tar.gz hive-<version>/ 
      # hadoop fs -mkdir /user/mapr/webhcat
      # hadoop fs -put /tmp/hive-<version>.tar.gz /user/mapr/webhcat
    2. Set the value of the templeton.hive.archive property to the location of the compressed file.

      <property>
          <name>templeton.hive.archive</name>
          <value>maprfs:///user/mapr/webhcat/hive-<version>.tar.gz</value>
      </property>
    3. Set the value of the templeton.hive.path property to the path inside the compressed Hive file where the Hive binary is located.

      <property>
           <name>templeton.hive.path</name>
           <value>hive-<version>.tar.gz/hive-<version>/bin/hive</value>
      </property>
  4. To configure WebHCat for streaming:
    1. Copy the Streaming JAR to the MapR-FS layer.

      # hadoop fs -put \
          /opt/mapr/hadoop/hadoop-<version>/contrib/streaming/hadoop-<version>-dev-streaming.jar \
          /user/mapr/webhcat
    2. Set the templeton.streaming.jar property to the location of the streaming JAR.

      <property>
          <name>templeton.streaming.jar</name>
          <value>maprfs:///user/mapr/webhcat/hadoop-<version>-dev-streaming.jar</value>
      </property>
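
After you finish these steps, you can confirm that the Pig and Hive archives and the streaming JAR are all in place by listing the WebHCat storage directory (this assumes the default /user/mapr/webhcat location used above):

# hadoop fs -ls /user/mapr/webhcat

Restart the WebHCat server after you edit webhcat-site.xml so that the new property values take effect.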

Managing the WebHCat Server

As of Hive 0.13-1504 and Hive 1.0-1504, WebHCat is managed by Warden. Therefore, you can start and stop WebHCat using maprcli and the MapR Control System (MCS).

Starting the WebHCat Server

(applies to versions prior to Hive 0.13-1504 and Hive 1.0-1504)

# ./webhcat_server.sh start

Starting WebHCat using the maprcli

  1. Make a list of nodes on which Hive Metastore is configured.
  2. Issue the maprcli node services command: 

    maprcli node services -name hcat -action start -nodes <space delimited list of nodes>
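
    For example, to start WebHCat on two nodes (the hostnames here are placeholders):

    maprcli node services -name hcat -action start -nodes nodea.example.com nodeb.example.com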

Stopping WebHCat using the maprcli

  1. Make a list of nodes on which Hive Metastore is configured.
  2. Issue the maprcli node services command: 

    maprcli node services -name hcat -action stop -nodes <space delimited list of nodes>

Starting or Stopping WebHCat using the MapR Control System

  1. In the Navigation pane, expand the Cluster Views pane and click Dashboard.
  2. In the Services pane, click WebHCat to open the Nodes screen, which displays all the nodes on which Hive Metastore is configured.
  3. On the Nodes screen, click the hostname of each node to display its Node Properties screen.
  4. On each Node Properties screen, use the Stop/Start button in the WebHCat row under Manage Node Services to start or stop WebHCat.

Checking the Error Logs

Go to the /opt/mapr/hive/hive-<version>/logs/<user.name>/webhcat folder. 

If you are running a Hive 0.13 version prior to Hive 0.13-1504, go to the /tmp/<user.name>/webhcat folder to view the error logs.
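
For example, to watch the log as requests come in (the webhcat.log file name is an assumption; check the folder for the exact file names on your system):

# tail -f /opt/mapr/hive/hive-<version>/logs/<user.name>/webhcat/webhcat.log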

Verifying the Server's Status

In a web browser, navigate to http://hostname:50111/templeton/v1/status?user.name=root. A healthy server will return the string {"status":"ok","version":"v1"}. You can change the port number from the default value of 50111 by editing the webhcat-site.xml file.
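
You can run the same check from the command line with curl; a minimal sketch, assuming the server runs on the local host at the default port:

curl -s 'http://localhost:50111/templeton/v1/status?user.name=root'
{"status":"ok","version":"v1"}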

Running Jobs on the WebHCat Server

REST Calls in WebHCat

The base URI for REST calls in WebHCat is http://<host>:<port>/templeton/v1/. The following table lists the resources that can be appended to the base URI, including DDL commands.

URI                                           Description

Server Information

/status                                       Shows WebHCat server status.
/version                                      Shows WebHCat server version.

DDL Commands

/ddl/database                                 Lists existing databases.
/ddl/database/<mydatabase>                    Shows properties for the database named mydatabase.
/ddl/database/<mydatabase>/table              Shows tables in the database named mydatabase.
/ddl/database/<mydatabase>/table/<mytable>    Shows the table definition for the table named mytable in the database named mydatabase.
/ddl/database/<mydatabase>/table/<mytable>/property
                                              Shows the table properties for the table named mytable in the database named mydatabase.
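
For example, to list existing databases through the /ddl/database resource (localhost and the default port are assumptions):

curl -s 'http://localhost:50111/templeton/v1/ddl/database?user.name=<username>'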

Launching a MapReduce job with WebHCat

WebHCat launches two jobs for each MapReduce job that you submit. The first job, TempletonControllerJob, has a single map task that launches the actual job requested by the REST API call. Check the status of both jobs and the contents of the output directory.

  1. Copy the MapReduce example job to the MapR-FS layer:

    hadoop fs -put /opt/mapr/hadoop/hadoop-<version>/hadoop-<version>-dev-examples.jar /user/mapr/webhcat/examples.jar
  2. Use the curl utility to launch the job: 

    curl -s -d jar=examples.jar -d class="terasort" -d arg=teragen.test -d arg=whop3 'http://localhost:50111/templeton/v1/mapreduce/jar?user.name=<username>'
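
    The call returns a JSON response that contains the ID of the controller job. You can poll that job through the queue resource described later in The Job Queue; a sketch, assuming a returned ID of job_201504010001_0001:

    curl -s 'http://localhost:50111/templeton/v1/queue/job_201504010001_0001?user.name=<username>'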

Launching a Streaming MapReduce job with WebHCat

  1. Use the curl utility to launch the job: 

    curl -s -d arg=teragen.test -d output=mycounts -d mapper=/bin/cat -d reducer="/usr/bin/wc -w" 'http://localhost:50111/templeton/v1/mapreduce/streaming?user.name=<username>'
  2. Check the job status for both WebHCat jobs at the jobtracker page in the MCS.
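
Once both jobs complete, you can inspect the reducer output; a sketch, assuming the mycounts directory is created relative to your home directory in MapR-FS:

hadoop fs -cat mycounts/part-*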

Launching a Pig job with WebHCat

  1. Copy a data file into MapR-FS: 

    hadoop fs -put $HIVE_HOME/examples/files/kv1.txt /user/<user name>/
  2. Create a test.pig file with the following contents: 

    A = LOAD 'kv1.txt' using PigStorage('\u0001') AS(key:INT, value:chararray);
    STORE A INTO 'pig.output';
  3. Copy the test.pig file into MapR-FS: 

    hadoop fs -put test.pig /user/<user name>/
  4. Run the Pig REST API command: 

    curl -s -d file=test.pig -d arg=-v 'http://localhost:50111/templeton/v1/pig?user.name=<username>'
  5. Monitor the contents of the pig.output directory.
  6. Check the JobTracker page for two jobs: TempletonControllerJob and PigLatin.
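
To confirm the result, print the files that Pig wrote; a sketch, assuming pig.output was created under your home directory in MapR-FS:

hadoop fs -ls /user/<user name>/pig.output
hadoop fs -cat /user/<user name>/pig.output/part-*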

Launching a Hive job with WebHCat

  1. Create a table: 

    curl -s -d execute="create+external+table+ext3(t+TIMESTAMP)+location /user/<user name>/ext3'" 'http://localhost:50111/templeton/v1/hive?user.name=<username>'
  2. Load data into the table: 

    curl -s -d execute="insert+overwrite+table+ext3+select+*+from+datetable" 'http://localhost:50111/templeton/v1/hive?user.name=<username>'
  3. List the tables: 

    curl -s -d execute="show+tables" -d statusdir='hive.output' 'http://localhost:50111/templeton/v1/hive?user.name=<username>'

    The list of tables is in hive.output/stdout.
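
    For example, a sketch that prints the result, assuming statusdir paths resolve under your home directory in MapR-FS:

    hadoop fs -cat /user/<username>/hive.output/stdout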

The Job Queue

To show HCatalog jobs for a particular user, navigate to the following address:

http://<hostname>:<port>/templeton/v1/queue/?user.name=<username>

The default port for HCatalog is 50111.
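
For example, a sketch using curl on the WebHCat node itself:

curl -s 'http://localhost:50111/templeton/v1/queue/?user.name=<username>'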

A known HCatalog bug causes this request to fetch information for any valid job, instead of checking that the job is an HCatalog job or that it was started by the specified user.