Secondary Indexing for MapR Database using Elasticsearch

Contributed by

6 min read

In the wide column data model of MapR Database, all rows are stored by a row key, column family, column qualifier, value, and timestamps. In the current version, the row key is the only field that is indexed, which fits the common pattern of queries based on the row key. However, some use cases might require scans based on column data. Let's take an example of a simple MapR Database table that stores person data. The table contains a person ID as the row key and details in the column family. The “details” column family stores “fname,” “lname,” and “address” as the columns. For this table, if access patterns are based on columns “fname” or “lname,” queries will result in a full table scan. This article describes a solution to avoid full table scans when queries are based on values other than the row key. This solution uses Hive for interacting with MapR Database and Elasticsearch. And since MapR Database uses the HBase API, this solution also uses the HBaseStorageHandler to interact with MapR Database, and EsStorageHandler to store indexes of MapR Database tables in Elasticsearch.

There are two relevant processes here: Data Ingestion and data retrieval.

  1. Data ingestion: Steps 1 and 2 in the diagram below entail data ingestion and are typically done in a single process. MapR Database can alternatively handle step 2 automatically using the built-in Elasticsearch integration (more on this later).
  2. Data retrieval: Steps 3 and 4 entail searching the index first on any value, and then using the row key to fetch the complete data.

This example assumes some working knowledge of MapR Database, Elasticsearch, and Hive. Here are the steps:

Step 1) Create some sample data in MapR Database, consisting of row keys and other attributes. This data table will be used for creating an external table in Hive.
Below is an example schema and sample data:

Create the table in MapR Database (note that we are using the HBase shell for accessing MapR Database): hbase(main):001:0> create '/user/user01/person', {NAME=>'details'}
Insert data in the table.
hbase(main):008:0> put '/user/user01/person', '1','details:fname','Tom’
hbase(main):008:0> put '/user/user01/person', '1','details:lname','John’
hbase(main):008:0> put '/user/user01/person', '1','details:address','350 Holger Way’
hbase(main):008:0> put '/user/user01/person', '2','details:fname','David’
hbase(main):008:0> put '/user/user01/person', '2','details:lname','Warner’
hbase(main):008:0> put '/user/user01/person', '2','details:address','1950 Holger Way’
hbase(main):008:0> put '/user/user01/person', '3','details:fname','Aaron’
hbase(main):008:0> put '/user/user01/person', '3','details:lname','Finch’
hbase(main):008:0> put '/user/user01/person', '3','details:address','1951 Holger Way’

Step 2) Create an external table in Hive pointing to the MapR Database table.

hive> CREATE EXTERNAL TABLE person_hbase(key int, fname string, lname string, address string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,details:fname,details:lname,details:address")
TBLPROPERTIES("hbase.table.name" = "/user/user01/person");

After the table is created, verify the data can be retrieved from the hbase table.

hive> select * from person_hbase;
OK
1    Tom    John        350 Holger Way
2    David    Warner                1950 Holger Way
3    Aaron    Finch        1951 Holger Way
Time taken: 1.563 seconds, Fetched: 3 row(s)
hive>

Step 3) Next, index the attributes in Elasticsearch including the row key. Create an external Hive table with data from the HBase table which will be used to populate the Elasticsearch table.

Download elasticsearch-hadoop-2.0.0.RC1.zip and unzip it.
% wget http://download.elasticsearch.org/hadoop/elasticsearch-hadoop-2.0.0.RC1.zip
% unzip elasticsearch-hadoop-2.0.0.RC1.zip

Use the jar and set hive.aux.jars.path as shown in below command. This is for resolving org.elasticsearch.hadoop.hive.EsStorageHandler class used in creation of external table.
% hive -hiveconf hive.aux.jars.path=elasticsearch-hadoop-2.0.0.RC1.jar

hive>CREATE EXTERNAL TABLE person_es (id BIGINT, fname STRING, lname STRING)
STORED BY ‘org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'contact/person','es.id.field' = 'id', 'es.index.auto.create' = 'true');

Load index data into the Elasticsearch table from the data in the person_hbase table.

hive> INSERT INTO table person_es select * FROM person_hbase;

Note that the last “insert” command above simplifies the indexing process for the sake of this tutorial, and as shown earlier in this article, the indexing step is typically done by the application in a real environment. Alternatively, the entire step above of updating external search indexes on columns, column families, or entire tables in MapR Database into Elasticsearch can also be done automatically and in near real-time in the 5.0 release of MapR. As new data is inserted into MapR Database, it is instantly replicated in Elasticsearch. See the blog How to Index MapR Database Data into Elasticsearch for more information.

Step 4) Now that the data is indexed in Elasticsearch, we can now query Elasticsearch for fname/lname and use the row key returned for searching the MapR Database table and retrieve the complete row including the address for the fname/lname. The Hive script below illustrates these steps.

$cat hive_script.hql
#!/bin/sh
query_es="select key from person_hbase where fname =Tom"
row_key_from_es=$(hive -e "$query_es")
echo "returned row key value = $row_key_from_es"
query_hbase="select * from person_hbase where key = $row_key_from_es"
return_val_from_hbase=$(hive -e "$query_hbase")
echo "returned value = $return_val_from_hbase"

$ sh hive_script.hql

returned value = 1    Tom    John  350 Holger Way

Summary:

This solution is just one simple example of how you can leverage Elasticsearch as a means of creating secondary indexes in MapR Database. In this example, we used two steps in Hive, the first for the retrieval of a row key using Elasticsearch and then using that key to query MapR Database. An alternate solution is to write a UDF (user defined function) in Hive and use it in the where clause of the MapR Database query.


This blog post was published January 25, 2016.
Categories

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.


Get our latest posts in your inbox

Subscribe Now