Secondary Indexing for MapR-DB using Elasticsearch

In the wide column data model of MapR-DB, each cell is addressed by a row key, column family, column qualifier, and timestamp. In the current version, the row key is the only field that is indexed, which fits the common pattern of queries based on the row key. However, some use cases require scans based on column data. Take the example of a simple MapR-DB table that stores person data: the table uses a person ID as the row key and a "details" column family that stores "fname," "lname," and "address" as columns. For this table, any query on the "fname" or "lname" columns results in a full table scan.

This article describes a solution that avoids full table scans when queries filter on values other than the row key. The solution uses Hive to interact with both MapR-DB and Elasticsearch. Since MapR-DB supports the HBase API, it uses the HBaseStorageHandler to interact with MapR-DB and the EsStorageHandler to store indexes of MapR-DB tables in Elasticsearch.
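To make the scan problem concrete, here is a minimal Python sketch (hypothetical, with a dict standing in for the MapR-DB table) contrasting a row-key lookup with a query on a column value:

```python
# Model the person table: row key -> {column family -> {qualifier -> value}}.
# This is only an in-memory stand-in for MapR-DB, to illustrate access cost.
person = {
    "1": {"details": {"fname": "Tom",   "lname": "John",   "address": "350 Holger Way"}},
    "2": {"details": {"fname": "David", "lname": "Warner", "address": "1950 Holger Way"}},
    "3": {"details": {"fname": "Aaron", "lname": "Finch",  "address": "1951 Holger Way"}},
}

def get_by_row_key(table, key):
    # Indexed access: a single keyed lookup.
    return table.get(key)

def get_by_fname(table, fname):
    # No index on "fname": every row must be examined (a full table scan).
    return [key for key, fams in table.items()
            if fams["details"]["fname"] == fname]

print(get_by_row_key(person, "1"))
print(get_by_fname(person, "Tom"))  # scans all 3 rows to find ["1"]
```

The second function is cheap at this scale, but on a table with millions of rows it touches every row, which is exactly what a secondary index avoids.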

There are two relevant processes here: data ingestion and data retrieval.

  1. Data ingestion: Steps 1 and 2 in the diagram below entail data ingestion and are typically done in a single process. MapR-DB can alternatively handle step 2 automatically using the built-in Elasticsearch integration (more on this later).
  2. Data retrieval: Steps 3 and 4 entail searching the index first on any value, and then using the row key to fetch the complete data.
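The two processes can be sketched in Python, with dicts standing in for MapR-DB and the Elasticsearch index (hypothetical stand-ins; in the real setup the writes go through Hive, as shown in the steps below):

```python
# Primary store (stand-in for MapR-DB): row key -> columns.
maprdb = {}
# Secondary index (stand-in for Elasticsearch): fname -> list of row keys.
es_index = {}

def ingest(row_key, fname, lname, address):
    # Step 1: write the full row to the primary store.
    maprdb[row_key] = {"fname": fname, "lname": lname, "address": address}
    # Step 2: index the searchable attribute together with the row key.
    es_index.setdefault(fname, []).append(row_key)

def lookup_by_fname(fname):
    # Step 3: search the index for matching row keys...
    keys = es_index.get(fname, [])
    # Step 4: ...then fetch the complete rows from the primary store by key.
    return [maprdb[k] for k in keys]

ingest("1", "Tom", "John", "350 Holger Way")
ingest("2", "David", "Warner", "1950 Holger Way")
print(lookup_by_fname("Tom"))
```

Note that retrieval never scans the primary store: the index narrows the search to row keys, and each row key is a direct lookup.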

This example assumes some working knowledge of MapR-DB, Elasticsearch, and Hive. Here are the steps:

Step 1) Create some sample data in MapR-DB, consisting of row keys and other attributes. This data table will be used for creating an external table in Hive.
Below is an example schema and sample data:

Create the table in MapR-DB (note that we are using the HBase shell to access MapR-DB):

hbase(main):001:0> create '/user/user01/person', {NAME=>'details'}
Insert data in the table.
hbase(main):008:0> put '/user/user01/person', '1','details:fname','Tom'
hbase(main):008:0> put '/user/user01/person', '1','details:lname','John'
hbase(main):008:0> put '/user/user01/person', '1','details:address','350 Holger Way'
hbase(main):008:0> put '/user/user01/person', '2','details:fname','David'
hbase(main):008:0> put '/user/user01/person', '2','details:lname','Warner'
hbase(main):008:0> put '/user/user01/person', '2','details:address','1950 Holger Way'
hbase(main):008:0> put '/user/user01/person', '3','details:fname','Aaron'
hbase(main):008:0> put '/user/user01/person', '3','details:lname','Finch'
hbase(main):008:0> put '/user/user01/person', '3','details:address','1951 Holger Way'

Step 2) Create an external table in Hive pointing to the MapR-DB table.

hive> CREATE EXTERNAL TABLE person_hbase(key int, fname string, lname string, address string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,details:fname,details:lname,details:address")
TBLPROPERTIES("hbase.table.name" = "/user/user01/person");

After the table is created, verify that the data can be retrieved from the MapR-DB table:

hive> select * from person_hbase;
1    Tom      John      350 Holger Way
2    David    Warner    1950 Holger Way
3    Aaron    Finch     1951 Holger Way
Time taken: 1.563 seconds, Fetched: 3 row(s)

Step 3) Next, index the attributes in Elasticsearch, including the row key. Create an external Hive table backed by Elasticsearch, which will be populated with data from the HBase table.

Download the elasticsearch-hadoop connector (version 2.0.0.RC1 in this example) and unzip it.
% wget
% unzip

Start Hive with hive.aux.jars.path pointing at the connector jar, as shown in the command below. This makes the org.elasticsearch.hadoop.hive.EsStorageHandler class, used when creating the external table, available to Hive.
% hive -hiveconf hive.aux.jars.path=elasticsearch-hadoop-2.0.0.RC1.jar

hive> CREATE EXTERNAL TABLE person_es (id BIGINT, fname STRING, lname STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'contact/person', 'es.mapping.id' = 'id', 'es.index.auto.create' = 'true');

Load index data into the Elasticsearch table from the data in the person_hbase table.

hive> INSERT INTO TABLE person_es SELECT key, fname, lname FROM person_hbase;
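Conceptually, this INSERT walks every row of the primary table and writes an index document containing the searchable fields plus the row key. A minimal Python sketch of that backfill, with dicts standing in for both stores (hypothetical names, for illustration only):

```python
def backfill_index(primary_rows):
    """Rebuild a secondary index (fname -> row keys) from existing rows,
    analogous to populating person_es from person_hbase in Hive."""
    index = {}
    for row_key, cols in primary_rows.items():
        index.setdefault(cols["fname"], []).append(row_key)
    return index

# Rows as they exist in the primary store before indexing.
rows = {
    "1": {"fname": "Tom",   "lname": "John"},
    "2": {"fname": "David", "lname": "Warner"},
    "3": {"fname": "Aaron", "lname": "Finch"},
}
print(backfill_index(rows))  # {'Tom': ['1'], 'David': ['2'], 'Aaron': ['3']}
```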

Note that the "insert" command above simplifies the indexing process for the sake of this tutorial; as shown earlier in this article, the indexing step is typically done by the application in a real environment. Alternatively, the entire step of updating external search indexes on columns, column families, or entire tables in MapR-DB can be done automatically and in near real-time in the 5.0 release of MapR: as new data is inserted into MapR-DB, it is replicated to Elasticsearch. See the blog How to Index MapR-DB Data into Elasticsearch for more information.

Step 4) Now that the data is indexed in Elasticsearch, we can query it by fname or lname and use the returned row key to fetch the complete row, including the address, from the MapR-DB table. The shell script below, which invokes Hive via hive -e, illustrates these steps.

$ cat hive_script.sh
query_es="select id from person_es where fname = 'Tom'"
row_key_from_es=$(hive -e "$query_es")
echo "returned row key value = $row_key_from_es"
query_hbase="select * from person_hbase where key = $row_key_from_es"
return_val_from_hbase=$(hive -e "$query_hbase")
echo "returned value = $return_val_from_hbase"

$ sh hive_script.sh

returned value = 1    Tom    John  350 Holger Way


This solution is just one simple example of how you can leverage Elasticsearch as a means of creating secondary indexes for MapR-DB. In this example, we used two steps in Hive: first retrieving a row key via Elasticsearch, then using that key to query MapR-DB. An alternate solution is to write a UDF (user-defined function) in Hive and use it in the where clause of the MapR-DB query.

This blog post was published January 25, 2016.
