Quick Tips for Using the Hive Shell Inside Scripts

Contributed by

19 min read

There are many great examples out there for using the Hive shell, as well as examples of ways to automate many of the animals in our Hadoop zoo. However, if you’re just getting started, or need something fast that won’t stay around long, then all you need to do is throw a few lines of code together with some existing programs in order to avoid re-inventing the workflow. In this blog post, I’ll share a few quick tips on using the Hive shell inside scripts. We’ll take a look at a simple script that needs to pull an item or count, and then look at two ways to use the Hive shell to get an answer.

All of the steps in this example were executed on Hive 0.13 on a MapR cluster using CentOS.

A Look at the Data Setup

In this setup, we have a data set that consists of travel times for specific traffic routes. We’ve pulled together travel times for days of the year, so when someone wants to know what to expect, we can quickly pull the info. In the past this was done on a traditional RDBMS, but we moved it off that in order to save on processing time and money, and are now using the MapR Distribution including Hadoop. The table is an external table and is tab-delimited. There are more efficient ways to store the data, but we’ll cover those comparisons in a later post. The table is also partitioned by collection day. If you’re not familiar with a partition, it just means that when we enter data into our table, it is grouped into folders of records on the file system. The data in each folder in this case represents a day.

CREATE EXTERNAL TABLE traffic_speeds (
  `id` STRING,
  `collectiontime` string,
  `speed` STRING,
  `traveltime` STRING,
  `linkpoints` STRING)
PARTITIONED BY (
  `collectionday` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION
 'maprfs:/mapr/my_cluster/data/traffic/trafficspeeds';

Here is a sample of some typical data:

Ways to Use Hive

In this example, we’re going to use a simple Bash script to extract some values out of a Hive table using the Hive shell. The same methods can be used for just about any scripting language. In this case, we are going to take a few parameters as arguments and then execute them on Hive using the Hive shell. The results are captured as a variable and then echoed to standard out. There are two basic ways to pass things off to the Hive shell to execute. This script will walk through both of those options. The first option will consist of the Hive command running an external HQL script and returning a value. The second option will consist of passing all the needed commands to Hive at one time and returning a value.

Option 1: Using Hive in Conjunction with an HQL Script

One way to use Hive inside a script is to have the Hive shell execute an HQL file. This is accomplished with the file option. hive -f my_script.hql

An HQL script is just a series of Hive query language commands. They are the same ones you would use in the Hive shell.

example.hql:

show databases;
show tables;
set dfs.block.size=1073741824;
select * from my_database.my_table limit 10;

In order to make this a process that can be used more generically, Hive allows you to pass in variables using the -hiveconf option hive -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql

Inside the HQL file, it is used like this:

show databases;
show tables;
set dfs.block.size=${hiveconf:MY_ VAR1};
select * from ${hiveconf:MY_ VAR2} limit 10;

Operational messages are not typically passed back to the parent script and can be removed with the silent options -S. This will make things far less chatty. hive -S -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql

Pulling a value from the command will vary depending on your script language of choice, but one thing that will hold true for all of them is ensuring that only one Hive command that returns rows or values is returned. In the example above, we have three. If the data from the select statement is what I want to get cleanly, then I would remove the show commands from the HQL file. set dfs.block.size=${hiveconf:MY_ VAR1}; select * from ${hiveconf:MY_ VAR2} limit 10;

An example of pulling a value using Bash looks something like this: my_value=hive -S -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql

Option 2: Passing Commands Directly to Hive

Another way to use Hive inside a script is to pass the HQL commands as a query string for the Hive shell to execute. This is accomplished with the -e option. hive -e "select * from my_database.my_table limit 10;"

You can actually add several HQL commands in that string, which comes in handy when you have to specify a database because the next command only has a table option. An example is when you load or add a partition with the ALTER command. hive -e "USE my_database; alter table my_table add if not exists partition(my_partition=my_value);"

As with the previous example, the only values returned are values pulled from the dataset. Hive operational messages may show up on the screen, but are not included in the response. You can remove that chatter by specifying silent mode again with the -S option.

hive -S -e "USE my_database; alter table my_table add if not exists partition(my_partition=my_value); select * from my_table  limit 10;"

As I mentioned earlier, pulling a value from the command will vary depending on your script language of choice. The one thing that will hold true for all of them is ensuring that only one Hive command has rows or values that are returned. Another example of pulling a value using Bash looks something like this:

my_value=$( hive -S -e "USE my_database; alter table my_table add if not exists partition(my_partition=my_value); select * from my_table  limit 10;")

An Example Script That Uses Hive

Let’s take a look at a Bash script that runs either of the options mentioned above in a more generic fashion in order to pull average traffic transit times for a given station.

Script Overview

The script for the rest of the doc is called “hive_scripts.sh”. It allows for two operations: “pull_with_hql_script” or “pull_with_one_liner”. The basic flow is: ./hive_scripts.sh command [options]

Example Using External Hive Files with an HQL Script

Let’s take a look at using Hive HQL files in conjunction with a Bash script in order to ascertain a given station’s average traffic speed.

Command arguments

The arguments for “./hive_script.sh pull_with_hql_script”

Example execution

In this example, the command line arguments are parsed, and passed to Hive to execute in our HQL script.

Command: ./hive_scripts.sh pull_with_hql_script -c id=A2F36 -d default -t traffic_speed -r speed

The select statement is built to execute an average of the traffic times throughout the range specified. In this case, a single station id was selected. select=<font color="red">"AVG($requested_column)"</font>

The Hive command is executed and the value is echoed to standard out. The variable Hive_script was hard coded, but could be changed to a passed object like all the other variables.

hive_script=<font color="red">"hive_script.hql"</font>
my_value=<font color="red">`hive -S -hiveconf MY_SELECT=$select -hiveconf MY_COMPARISON=$comparison -hiveconf MY_DATABASE=$database -hiveconf MY_TABLE=$table -f $hive_script`</font>
echo <font color="red">"returned value = $my_value"</font>

Example passing all commands to Hive directly

Let’s take a look at passing Hive a set of HQL commands as a string with a Bash script to find a given station’s average traffic speed:

Command arguments

The arguments for “./hive_script.sh pull_with_one_liner”

Example execution

In this example, the command line arguments are parsed, and passed to Hive to execute as a string. Additionally, there is an example of checking to see if partitions exist and are loaded.

Command: ./hive_scripts.sh pull_with_one_liner -c id=A2F36 -d default -t traffic_speed -r speed -b 20130621 -e 20130629

The select statement is built to execute an average of the traffic times throughout the range specified. In this case, a single station id was selected. A date range was also passed and could be used to enhance the data selection and filtering process, but in this case it was used to load partitions for the given range if they were not found.

As in the example above, the select statement is built for use in the command. select=<font color="red">"AVG($requested_column)"</font>

The values are passed and used to build the query. The query is passed to Hive and executed. The value is stored as my_value and is echoed to standard out.

<font color="green"># lets build the query we will execute in the hive shell</font>
my_query=<font color="red">"set mapred.reduce.tasks=30;"</font>
my_query=<font color="red">"$my_query SELECT $select"</font>
my_query=<font color="red">"$my_query FROM ("</font>
my_query=<font color="red">"$my_query SELECT DISTINCT collectiontime, id, collectionday, speed"</font>
my_query=<font color="red">"$my_query FROM $database.$table"</font>
my_query=<font color="red">"$my_query WHERE $comparison"</font>
my_query=<font color="red">"$my_query ) t"</font>
my_query=<font color="red">"$my_query GROUP by id, collectionday;"</font>

<font color="green"># echo the query passed to the hive shell just because</font>
echo <font color="red">"hive -S -e \"$my_query\""</font> 
my_value=$(hive -e <font color="red">"$my_query"</font>)
echo <font color="red">"returned value = $my_value"</font>

hive_scripts.sh

<font color="green">#!/bin/bash
# hive_scripts.sh

# this will print the usage statements and exit</font>
usage() {
    <font color="pink">case</font> $<font color="blue">1</font> <font color="pink">in</font>
        <font color="red">""</font>)
            echo <font color="red">""</font>
            echo <font color="red">"Usage: hive_scripts.sh command [options]"</font>
            echo <font color="red">"      for info on each command: --> hive_scripts.sh command -h|--help"</font>
            echo <font color="red">"Commands:"</font>
            echo <font color="red">"      hive_scripts.sh pull_with_hql_script [options]"</font>
            echo <font color="red">"      hive_scripts.sh pull_with_one_liner [options]"</font>
            echo <font color="red">""</font>
            echo <font color="red">""</font>
            ;;
        pull_with_hql_script)
            echo <font color="red">""</font>
            echo <font color="red">"Usage: hive_scripts.sh pull_with_hql_script [-h|--help]"</font>
            echo <font color="red">""</font>
            echo <font color="red">"  This is a quick example of using a hql script with a bash script to pull a value:"</font>
            echo <font color="red">""</font>
            echo <font color="red">"Params:"</font>
            echo <font color="red">"      -c|--comparison comparison_statement: column=value"</font>
            echo <font color="red">"      -d|--database database_name"</font>
            echo <font color="red">"      -h|--help: print this help info and exit"</font>
            echo <font color="red">"      -r|--requested_column: column to get averaged value from"</font>
            echo <font color="red">"      -t|--table table_name"</font>
            echo <font color="red">"Examples:"</font>
            echo <font color="red">""</font>
            echo <font color="red">"        ./hive_scripts.sh pull_with_hql_script -c a_column_name=a_value -d a_database_name -t a_table_name -r a_column_name"</font>
            echo <font color="red">""</font>
            ;;
        pull_with_one_liner)
            echo <font color="red">""</font>
            echo <font color="red">"Usage: hive_scripts.sh pull_with_one_liner [-h|--help]"</font>
            echo <font color="red">""</font>
            echo <font color="red">"  This is a quick example of passing query strings directly to a hive shell with a bash script to pull a value:"</font>
            echo <font color="red">""</font>
            echo <font color="red">"Params:"</font>
            echo <font color="red">"      -b|--begin_date yyyymmdd"</font>
            echo <font color="red">"      -c|--comparison comparison_statement: column=value"</font>
            echo <font color="red">"      -d|--database database_name"</font>
            echo <font color="red">"      -e|--end_date yyyymmdd"</font>
            echo <font color="red">"      -h|--help: print this help info and exit"</font>
            echo <font color="red">"      -r|--requested_column: column to get averaged value from"</font>
            echo <font color="red">"      -t|--table table_name"</font>
            echo <font color="red">"Examples:"</font>
            echo <font color="red">""</font>
            echo <font color="red">"        ./hive_scripts.sh pull_with_one_liner -c a_column_name=a_value -d a_database_name -t a_table_name -r a_column_name -b 20120122 -e 20130122"</font>
            echo <font color="red">""</font>
            ;;        
    <font color="pink">esac
    exit</font>
}

<font color="green"># this will process command line arguments enough to get you to a specific function</font>
args() {
    echo <font color="red">"processing command request"</font>
    <font color="pink">case</font> $<font color="blue">1</font> <font color="pink">in</font>
        pull_with_hql_script)
            <font color="pink">shift</font>
            pull_with_hql_script $@
            ;;
        pull_with_one_liner)
            <font color="pink">shift</font>
            pull_with_one_liner $@
            ;;

        *)
            echo >&<font color="blue">2</font> <font color="red">"Invalid comand: $1"</font>
            usage
            ;;
    <font color="pink">esac</font>
}

pull_with_hql_script() {
    <font color="green"># init params</font>
    requested_column=<font color="red">""</font>
    database=<font color="red">""</font>
    table=<font color="red">""</font>
    comparison=<font color="red">""</font>
    hive_script=<font color="red">"hive_script.hql"</font>
    select=<font color="red">""</font>
    begin_date=<font color="red">""</font>
    end_date=<font color="red">""</font>

    <font color="green"># process args for this block</font>
    <font color="pink">while</font> test $# -gt <font color="blue">0</font>
    <font color="pink">do</font>
    <font color="pink">case</font> $<font color="blue">1</font> <font color="pink">in</font>
            -c|--comparison)
                <font color="pink">shift</font>
                comparison=$<font color="blue">1</font>
                ;;
            -d|--database)
                <font color="pink">shift</font>
                database=$<font color="blue">1</font>
                ;;
            -h|--help)
                usage pull_with_hql_script
                ;;
            -r|--requested_column)
                <font color="pink">shift</font>
                requested_column=$<font color="blue">1</font>
                ;;
            -t|--table)
                <font color="pink">shift</font>
                table=$<font color="blue">1</font>
                ;;
            *)
                echo >&<font color="blue">2</font> <font color="red">"Invalid argument: $1"</font>
                usage <font color="red">"pull_with_hql_script"</font>
                ;;
        <font color="pink">esac
        shift
    done</font>

    <font color="green"># determine if any option is missing</font>    
    <font color="pink">if [ x<font color="red">"$requested_column"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing requested column: -r|--requested_column column_name_to_count"</font>
        usage <font color="red">"pull_with_hql_script"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$database"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing database name: -d|--database database_name"</font>
        usage <font color="red">"pull_with_hql_script"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$table"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing table name: -t|--table table_name"</font>
        usage <font color="red">"pull_with_hql_script"</font>
    <font color="pink">fi</font>

    if</font> [ x<font color="red">"$comparison"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing comparison clause WHERE <comparison>: -c|--comparison comparison"</comparison></font>
        usage <font color="red">"pull_with_hql_script"</font>
    <font color="pink">fi</font>

    <font color="green"># set select statement</font>
    select=<font color="red">"AVG($requested_column)"</font>

    <font color="green"># echo the command used to run the hive hql script just because</font>
    echo <font color="red">"hive -S -hiveconf MY_SELECT=$select -hiveconf MY_COMPARISON=$comparison -hiveconf MY_DATABASE=$database -hiveconf MY_TABLE=$table -f $hive_script"</font> 
    my_value=<font color="red">`hive -S -hiveconf MY_SELECT=$select -hiveconf MY_COMPARISON=$comparison -hiveconf MY_DATABASE=$database -hiveconf MY_TABLE=$table -f $hive_script`</font>
    echo <font color="red">"returned value = $my_value"</font>
    <font color="pink">exit</font>

}

pull_with_one_liner() {
    <font color="green"># init params</font>
    requested_column=<font color="red">""</font>
    distinct=<font color="red">""</font>
    database=<font color="red">""</font>
    table=<font color="red">""</font>
    comparison=<font color="red">""</font>
    hive_script=<font color="red">"hive_script.hql"</font>
    select=<font color="red">""</font>
    begin_date=<font color="red">""</font>
    end_date=<font color="red">""</font>
    load=true

    <font color="green"># process args for this block</font>
    <font color="pink">while</font> test $# -gt <font color="blue">0</font>
    <font color="pink">do</font>
    <font color="pink">case</font> $<font color="blue">1</font> <font color="pink">in</font>
            -b|--begin_date)
                <font color="pink">shift</font>
                begin_date=$<font color="blue">1</font>
                ;;
            -c|--comparison)
                <font color="pink">shift</font>
                comparison=$<font color="blue">1</font>
                ;;
            -d|--database)
                <font color="pink">shift</font>
                database=$<font color="blue">1</font>
                ;;
            -e|--end_date)
                <font color="pink">shift</font>
                end_date=$<font color="blue">1</font>
                ;;
            -h|--help)
                usage pull_with_one_liner
                ;;
            -r|--requested_column)
                <font color="pink">shift</font>
                requested_column=$<font color="blue">1</font>
                ;;
            -t|--table)
                <font color="pink">shift</font>
                table=$<font color="blue">1</font>
                ;;
            *)
                echo >&<font color="blue">2</font> <font color="red">"Invalid argument: $1"</font>
                usage <font color="red">"pull_with_one_liner"</font>
                ;;
        <font color="pink">esac
        shift
    done</font>

    <font color="green"># determine if any option is missing</font>    
    <font color="pink">if</font> [ x<font color="red">"$requested_column"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing requested column: -r|--requested_column column_name_to_count"</font>
        usage <font color="red">"pull_with_one_liner"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$database"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing database name: -d|--database database_name"</font>
        usage <font color="red">"pull_with_one_liner"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$table"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing table name: -t|--table table_name"</font>
        usage <font color="red">"pull_with_one_liner"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$comparison"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing comparison clause WHERE <comparison>: -c|--comparison comparison"</font>
        usage <font color="red">"pull_with_one_liner"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$begin_date"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing start date <2012-01-28>: -b|--begin_date yyyy-mm-dd"</font>
        usage <font color="red">"pull_with_one_liner"</font>
    <font color="pink">fi

    if</font> [ x<font color="red">"$end_date"</font> == <font color="red">"x"</font> ]; <font color="pink">then</font>
        echo <font color="red">"missing ending date <2014-02-17>: -b|--begin_date yyyy-mm-dd"</font>
        usage <font color="red">"pull_with_one_liner"</font>
    <font color="pink">fi</font>

    <font color="green"># get start date and end date in correct format</font>
    begin_date=<font color="red">`date --date="$begin_date" '+%Y%m%d'`</font>
    end_date=<font color="red">`date --date="$end_date" '+%Y%m%d'`</font>

    <font color="green"># set select statement in affect overriding any set by argument</font>
    select=<font color="red">"avg($requested_column)"</font>

    <font color="green"># before we query for information we may want to make sure we have all the partitions loaded?
    # in this case the table was external so there may be partitions that were not loaded 
    # when the table was created
    # Here is an example of executing two commands with hive in a bash shell</font>
    partitions=$(hive -e <font color="red">"USE $database; SHOW PARTITIONS $table"</font>)

    <font color="green"># loop through all the partitions and load if not present 
    # in most case this is not needed.</font>
    my_date=$begin_date
    <font color="pink">while</font> test  $my_date -le $end_date
    <font color="pink">do</font>
        <font color="green"># Load partitions</font>
        <font color="pink">if</font> [ <font color="red">"$load"</font> = true ]; <font color="pink">then</font>
            <font color="green"># check if partition is present</font>
            add_partition=true
            <font color="pink">for</font> partition <font color="pink">in</font> $partitions
            <font color="pink">do</font>
                my_timestamp=<font color="red">`date -d "$my_date" "+%s"`</font>
             <font color="pink">if</font> [ <font color="red">"$partition"</font> == <font color="red">"collectionday=$my_timestamp"</font> ]; <font color="pink">then</font>
                    add_partition=false;
             <font color="pink">fi
            done
            if</font> [ <font color="red">"$add_partition"</font> = true ]; <font color="pink">then</font>
             <font color="green"># Here is an example of executing two statements in the hive shell</font>
                echo <font color="red">"hive -e \"USE $database; alter table $table add if not exists partition(collectionday=$my_timestamp);\""</font>
                hive -e <font color="red">"USE $database; alter table $table add if not exists partition(collectionday=$my_timestamp);"</font>
            <font color="pink">fi
        fi</font>
        my_date=<font color="red">`date --date="$my_date + 1 day" '+%Y%m%d'`</font>
    <font color="green"># end of while loop</font>
    <font color="pink">done</font>

    <font color="green"># lets build the query we will execute in the hive shell</font>
    my_query=<font color="red">"set mapred.reduce.tasks=30;"</font>
    my_query=<font color="red">"$my_query SELECT $select"</font>
    my_query=<font color="red">"$my_query FROM ("</font>
    my_query=<font color="red">"$my_query SELECT DISTINCT collectiontime, id, collectionday, speed"</font>
    my_query=<font color="red">"$my_query FROM $database.$table"</font>
    my_query=<font color="red">"$my_query WHERE $comparison"</font>
    my_query=<font color="red">"$my_query ) t"</font>
    my_query=<font color="red">"$my_query GROUP by id, collectionday;"</font>

    <font color="green"># echo the query passed to the hive shell just because</font>
    echo <font color="red">"hive -S -e \"$my_query\""</font> 
    my_value=$(hive -e <font color="red">"$my_query")</font>
    echo <font color="red">"returned value = $my_value"</font>
    <font color="pink">exit</font>

}

<font color="green"># -------------------------------------------------------------------------------------
# Beginning of script execution
#</font>

args $@

hive_script.hql

<font color="gray">set</font> mapred.reduce.tasks=30;
use ${hiveconf:MY_DATABASE};
<font color="gray">SELECT</font> ${hiveconf:MY_SELECT}
<font color="gray">FROM</font> ( <font color="gray">SELECT DISTINCT</font> collectiontime, id, collectionday, speed <font color="gray">FROM</font> ${hiveconf:MY_TABLE} <font color="gray">WHERE</font> ${hiveconf:MY_COMPARISON}) t
<font color="gray">GROUP</font> by id, collectionday;

Using the Hive shell inside scripts is a great way to avoid re-inventing the workflow. In this blog post, I’ve showed you how to use Hive in conjunction with an HQL script, as well as how to pass commands directly to Hive.

If you have any questions, please ask them in the comments section below.


This blog post was published December 11, 2014.
Categories

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.


Get our latest posts in your inbox

Subscribe Now