One of the common challenges of deploying a search engine is keeping the search indexes synchronized with the source data. In some cases, a batch process using custom code to periodically index new documents is satisfactory, but in many enterprise environments today, real-time (or near real-time) synchronization is required.
In the 5.0 release of MapR, you can create external search indexes in Elasticsearch on columns, column families, or entire tables in MapR Database, and keep those indexes updated in near real-time. That is, when a MapR Database table gets updated, the new data is replicated over to Elasticsearch almost instantly. As shown in this post, this capability requires only a few configuration steps to set up.
Even if you’re not an existing MapR Database customer, you can still try this out. Download the free Community Edition to get started.
Granted, the configuration described in this post is not recommended for production use, but it suffices to demonstrate the integration of MapR Database and Elasticsearch.
This post assumes you have a MapR Database (either Enterprise Database Edition or Community Edition) cluster up and running with the HBase client package installed as below.
[root@node1 roles]# rpm -qa | grep hbase
Below is a list of services I have configured in my test MapR Database cluster.
There are a few ways to install Elasticsearch. In this post, I'm going to install it from a tarball, using the following values (all of which appear in the commands later in this post):
- Elasticsearch node: 10.10.70.115
- ES_HOME: /opt/elasticsearch-1.4.4
- Elasticsearch cluster name: AbizerElasticsearch
Note: It’s assumed you have a node with an OS (such as CentOS 6.5) installed that’s ready to be dedicated as an Elasticsearch node.
1) Download the Elasticsearch tarball (elasticsearch-1.4.4.tar.gz) and copy it over to the Elasticsearch node under /opt.
2) Gunzip and untar the tarball you downloaded and copied to the host (tar's -z flag does both in one step):
tar -xzvf elasticsearch-1.4.4.tar.gz
3) Edit /opt/elasticsearch-1.4.4/config/elasticsearch.yml
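At minimum, set the cluster name. Below is a minimal sketch; it assumes the Elasticsearch cluster name should match the target name registered with MapR later in this post, and that you want to bind Elasticsearch to the node's IP:

```yaml
# /opt/elasticsearch-1.4.4/config/elasticsearch.yml
cluster.name: AbizerElasticsearch   # assumption: matches the target name used later in this post
network.host: 10.10.70.115          # assumption: bind to the Elasticsearch node's IP
```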
4) Run the command below from the Elasticsearch home directory to start Elasticsearch as a background daemon:
bin/elasticsearch -d
5) You can verify that Elasticsearch is running and has the right cluster name by querying the root endpoint; the JSON response includes the cluster_name field:
curl http://localhost:9200
Register your Elasticsearch cluster with MapR
The next step is to make the MapR cluster aware of the Elasticsearch cluster. This is done with the “register-elasticsearch” script.
Run the command below on a MapR cluster node:
/opt/mapr/bin/register-elasticsearch -r 10.10.70.115 -e /opt/elasticsearch-1.4.4 -u root -y
-r the IP address for the Elasticsearch node that needs to be registered
-e the home directory for Elasticsearch
-u the user who can login to ES_NODE and read all the files under the ES_HOME directory (default user is the user who is running the register command)
-y omit interactive prompts
Wait until it finishes. Once the command completes successfully, you will have an Elasticsearch target cluster registered in your MapR cluster.
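To confirm the registration, you can ask the same script for a listing. This is a sketch; the -l flag is an assumption based on the register-elasticsearch utility's listing option in MapR 5.x:

```shell
# List the Elasticsearch clusters registered with this MapR cluster
# (assumption: the script supports -l for listing registered targets)
/opt/mapr/bin/register-elasticsearch -l
```

The registered target name (AbizerElasticsearch in this post) should appear in the output.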
Create a source table
We can use the tool “loadtest” to load sample data in our source table:
/opt/mapr/server/tools/loadtest -table /srctable -numrows 10
The command above creates table “/srctable” and inserts 10 rows into the table for our demonstration.
Source table verification
Verify the table indeed has 10 rows (see totalrows) with the following command:
maprcli table info -path /srctable -json
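Since the HBase client package is installed, another quick sanity check is to count the rows through the HBase shell; this assumes the usual MapR Database convention of addressing tables by their filesystem path:

```shell
# Count the rows in the source table via the HBase shell
# (assumption: MapR Database tables are addressed by path, e.g. '/srctable')
echo "count '/srctable'" | hbase shell
```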
Index and set up replication from the MapR Database table to Elasticsearch
To map a MapR Database source table into an Elasticsearch type (a type is a class of similar documents in Elasticsearch), we run the following command:
maprcli table replica elasticsearch autosetup -path /srctable -target AbizerElasticsearch -index sourcedoc -type json
-path the source table path
-target the target Elasticsearch cluster name
-index the name of the index you want to use in Elasticsearch. In the RDBMS world this can be thought of as a database.
-type the name of the type you want to use within the Elasticsearch index. In the RDBMS world, this can be thought of as a table.
This command registers the destination Elasticsearch type as a replica of the source table, copies the existing content of the source table into the Elasticsearch cluster by running CopyTable in the background, and then starts the replication stream to keep the Elasticsearch indexes up to date. Updates to the source table are replicated in near real-time by the replication stream; replication of data to Elasticsearch indexes is asynchronous.
Once the command above finishes successfully (it might take a while for huge tables), the Elasticsearch replicas of the source table can be listed from the MapR cluster.
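The replica listing can be retrieved with maprcli; the command below uses the standard table replica subcommand, though the exact JSON fields in the output may vary by release:

```shell
# List all replicas (including Elasticsearch targets) of the source table
maprcli table replica list -path /srctable -json
```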
Verify data in Elasticsearch
There are a few ways to verify that the data made it to the Elasticsearch cluster. Elasticsearch has a very good monitoring plugin called Marvel. Its dashboard displays the essential metrics you need to monitor an Elasticsearch cluster, such as the indexing rate and the number of documents written. It also provides an overview of the nodes and indexes, displayed in two clean tables with the relevant key metrics.
Marvel is very simple to install and use. Below are the steps:
1. Go to the Elasticsearch home directory and run the command below to install the Marvel plugin.
bin/plugin -i elasticsearch/marvel/latest
2. Restart Elasticsearch. The two commands below stop and then start Elasticsearch on the node where it is installed:
curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
bin/elasticsearch -d
Once the plugin is installed on the nodes, you can access the Marvel UI by viewing http://10.10.70.115:9200/_plugin/marvel/
As seen in the above screen shot under the “Indices” table, there are 10 documents which have been replicated from the MapR Database table as expected (10 rows converted to 10 documents).
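If you prefer the command line to Marvel, the count API gives the same confirmation. The index name (sourcedoc), type name (json), and node IP below are the ones used earlier in this post:

```shell
# Count the documents in the replicated index/type via the Elasticsearch count API
# A response containing "count":10 confirms all 10 rows were replicated
curl 'http://10.10.70.115:9200/sourcedoc/json/_count'
```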
In this blog post, you’ve learned how to index MapR Database data into Elasticsearch. If you have any further questions, please ask them in the comments section below.