How to Set Up MapR Database to Elasticsearch Replication

Use Case for Elasticsearch Replication

Automatic replication of MapR Database data to Elasticsearch is useful for many environments.

There are some great use cases I can think of for taking advantage of this feature:

  1. Full text search of data in MapR Database
  2. Geospatial searches for location data (think mobile user data here)
  3. Kibana visualization of the data, especially useful for time series data like sensor data or performance/network metrics
  4. ES as a secondary index for a MapR Database table (won’t be needed from MapR 6.0 when JSON DB tables will support secondary indices)
  5. Change Data Capture (arguably)

The MapR Gateway replication feature makes it possible to get the data into Elasticsearch 2.2 without any code!

Let’s learn how we can do it using the latest MapR Sandbox (available for free). There is no better way to learn than to do after all!

By and large, the MapR documentation of this feature is sufficient for an experienced MapR admin to get the replication working. However, the documentation isn’t task-focused. What I contribute in this post is a start-to-finish how-to that uses sample data and covers the whole process step by step.

Limitations

Some important notes about limitations of the MapR Database to Elasticsearch replication:

  1. Only Elasticsearch 2.2 is supported
    • Later versions of ES will NOT work
  2. No support for JSON DB tables

Sample dataset

For this tutorial, we’ll use pump sensor data that is used in other training materials and blogs such as Real-time Streaming with Kafka and HBase by Carol MacDonald. I have modified it a bit to add an id column.

[mapr@maprdemo ~]$ head -n 3 /mapr/demo.mapr.com/user/mapr/sensordata.csv
1,COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
2,COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
3,COHUTTA,3/10/14,1:05,9.56,1.734,883,1.35,99,0.68

The data columns include a date, a time, and some metrics related to sensor readings from a pump such as those used in the oil industry (psi, flow, etc.). There are 47899 rows in this dataset. While this is tiny by the standards of production MapR Database, it’s more than enough to demonstrate the technology working on the sandbox.

Get the data here.

Send the data to the sandbox using the following command (while the sandbox is running of course!):

$> scp -P 2222 sensordata.csv mapr@localhost:

You will be prompted for the password; it’s “mapr”.

Alternatively, you can wget the data directly from the sandbox: copy the dataset’s URL and paste it after wget while logged into the sandbox.

$> wget <URL to dataset>

Remember that to log in to the sandbox from your favorite shell, just type:

$> ssh -p 2222 mapr@localhost

Finally, copy the data to MapR. This ensures the command to import data into MapR Database will run as-is:

$> cp sensordata.csv /mapr/demo.mapr.com/user/mapr
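
Before importing, a quick sanity check of the file can save a failed job later. Here is a minimal sketch, assuming the ten comma-separated fields shown in the sample rows above. The name check_csv is a helper I made up for illustration; it is not part of any MapR tooling.

```shell
# check_csv FILE EXPECTED_FIELDS (hypothetical helper)
# Prints the total row count and how many rows have an unexpected field count.
# Offending line numbers go to stderr.
check_csv() {
    file=$1
    expected=$2
    awk -F',' -v n="$expected" '
        NF != n { bad++; printf "line %d has %d fields\n", NR, NF > "/dev/stderr" }
        END     { printf "rows=%d bad=%d\n", NR, bad + 0 }
    ' "$file"
}

# Example (path as used in this post):
#   check_csv /mapr/demo.mapr.com/user/mapr/sensordata.csv 10
```

On the unmodified dataset I would expect this to report 47899 rows and no bad ones, but treat that as something to confirm rather than a guarantee.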

MapR Database Replication Using the MapR Gateway Service

MapR Database is a NoSQL database that follows in the footsteps of Google BigTable. More specifically, it started as a reimplementation of Apache HBase designed from the ground up to take advantage of the advanced inner workings of the killer distributed platform known as the MapR Data Platform. It also now has native JSON support to more easily handle hierarchical, nested, and evolving data formats.

At its core, the MapR Database replication feature was designed to let a MapR Database table be replicated automatically to a MapR Database instance running on another cluster. One primary use case is for a global enterprise to improve speed of access and get multi-region HA automatically, with a guarantee on data consistency. This feature can get really fancy with bi-directional replication, where applications can read from and write to either replica and still know both are kept up to date.

More info can be found here and here.

Setup Guide

Choices in Solution Design

If you just want to try this feature out, then the MapR Sandbox is a great way to get started quickly and I’ll make sure to cover that in this guide.

For those who may want to use this feature on a production cluster though, there are a couple of configurations to ponder:

  • Co-locate the ES cluster with the MapR cluster
  • Use an external ES cluster

Unsurprisingly, if you have plenty of hardware servers then the external ES cluster should be the preferred solution, to isolate services and reduce failure impact as well as reserve the cluster resources for actual big data processing.

While putting the ES cluster on separate nodes is the recommended solution for a production cluster, it is also possible to colocate part or all of an ES cluster with MapR nodes. Keep in mind that memory resources taken by ES are not available to the cluster.

For sizing of the ES cluster, the main factors are storage needs and incoming data throughput. The more data, the more nodes will be needed. The sizing issue is well explained in the MapR documentation.

Preparation

Install ES (single node or cluster mode)

  1. Install elasticsearch and run it (as root)
    $> wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.2.0/elasticsearch-2.2.0.rpm
    $> rpm -i elasticsearch-2.2.0.rpm
    $> service elasticsearch start
    
  2. Check installation

    $> curl localhost:9200
    {
    "name" : "D'Spayre",
    "cluster_name" : "elasticsearch",
    "version" : {
    "number" : "2.2.0",
    "build_hash" : "8ff36d139e16f8720f2947ef62c8167a888992fe",
    "build_timestamp" : "2016-01-27T13:32:39Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
    },
    "tagline" : "You Know, for Search"
    }
    

    NOTE: it's installed in /usr/share/elasticsearch/ and runs as user elasticsearch when installing with rpm

  3. Update elasticsearch config

    $> vi /etc/elasticsearch/elasticsearch.yml
    cluster.name: mapr-elastic
    network.host: 10.0.2.15 # <- ip of sandbox, hostname -i
    $> service elasticsearch restart
    
  4. Verify config

    $> curl maprdemo:9200
    {
    "name" : "Angelo Unuscione",
    "cluster_name" : "mapr-elastic",
    "version" : {
    "number" : "2.2.0",
    "build_hash" : "8ff36d139e16f8720f2947ef62c8167a888992fe",
    "build_timestamp" : "2016-01-27T13:32:39Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
    },
    "tagline" : "You Know, for Search"
    }
    

Optional: add port forwarding to access ES from your host. In VirtualBox, I added TCP port 9200 to the list of Port Forwarding Rules.

VirtualBox

Keep a note of the hostname of the ES instances, and remember that the supported ES version is 2.2. This is important, or else there is a good chance the replication will fail.
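
Since the version matters so much, it can be handy to check it from a script rather than by eye. The sketch below parses the pretty-printed banner that curl returned above. The helper names es_field and check_es_version are mine, and accepting any 2.2.x release is my assumption; I have only tested 2.2.0.

```shell
# es_field NAME (hypothetical helper)
# Reads the ES banner on stdin (one field per line, as printed above)
# and prints the value of the first string field called NAME.
es_field() {
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# check_es_version (hypothetical helper)
# Succeeds only if the banner on stdin reports a 2.2.x version.
check_es_version() {
    v=$(es_field number)
    case "$v" in
        2.2.*) echo "OK: Elasticsearch $v" ;;
        *)     echo "Unsupported Elasticsearch version: ${v:-unknown}" >&2
               return 1 ;;
    esac
}

# Example:
#   curl -s maprdemo:9200 | check_es_version
#   curl -s maprdemo:9200 | es_field cluster_name
```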

Create a MapR Database Table

There are a variety of ways to create a MapR Database table. We’ll use the command line, but it’s equally possible (and very easy!) to use MCS to do it visually.

$> maprcli table create -path /user/mapr/pumps
$> maprcli table cf create -path /user/mapr/pumps -cfname data

That’s it! Inserting the data will create the columns automatically. We don’t need to worry about data types as MapR Database only stores bytes and it’s up to the application to convert the data to/from bytes. This is a common pattern for NoSQL databases.

Add data to the MapR Database Table

We’re using HBase’s ImportTsv functionality to import the CSV formatted dataset directly into the MapR Database table we just created. Again, no code required. So much for “Hadoop is difficult”, right?

$> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=,  -Dimporttsv.columns="HBASE_ROW_KEY,data:resid,data:date,data:time,data:hz,data:disp,data:flow,data:sedppm,data:psi,data:ch1ppm" /user/mapr/pumps /user/mapr/sensordata.csv

This launches a YARN MapReduce application to bulk import the data, meaning it will scale up to any size CSV, from megabytes to petabytes. The main point to check here is that the “data:” prefix is the column family. Adjust the columns to match your own use case; they can span many column families and it will work just fine!

NOTE: the column names are meaningful. They must match up with the Elasticsearch type mappings we define later on. This is important to get everything working!
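
Typing that long column list by hand is error-prone, so here is a small sketch that builds it from a column family and a list of column names. build_columns is a hypothetical helper of mine; it assumes the first CSV field is the row key and that all remaining columns go into a single column family.

```shell
# build_columns CF COL... (hypothetical helper)
# Prints a value suitable for -Dimporttsv.columns, with the first CSV
# field used as the row key and every other column placed in family CF.
build_columns() {
    cf=$1
    shift
    cols="HBASE_ROW_KEY"
    for c in "$@"; do
        cols="$cols,$cf:$c"
    done
    printf '%s\n' "$cols"
}

# Example, reproducing the import command above:
#   COLS=$(build_columns data resid date time hz disp flow sedppm psi ch1ppm)
#   hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, \
#       -Dimporttsv.columns="$COLS" /user/mapr/pumps /user/mapr/sensordata.csv
```

Keeping the column names in one place like this also makes it easier to compare them against the Elasticsearch mapping later.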

Install MapR Gateway Service

First, install the mapr-gateway package on one or more nodes. On a production cluster, it’s always recommended to have at least two gateways to enable high availability. The number of nodes running the gateway should be based on the network bandwidth requirement as well as cluster hardware and available resources.

To install the package, log in as root (either su root after logging on as mapr, or log in as root directly; the password is also ‘mapr’). Then install the package using yum:

$> yum install -y mapr-gateway

After installing the package, and still as ‘root’, configure the system again:

$> /opt/mapr/server/configure.sh -R
$> service mapr-warden restart

The details are all available on the MapR documentation site.

Register Elasticsearch

Next we need to register ES with the MapR cluster. This basically means copying over some libraries for the gateway to use. An ES cluster needs to be registered only once per MapR cluster, and can then be reused to replicate many tables to different indices/types.

To do this, run the script /opt/mapr/bin/register-elasticsearch as root.

Parameters:

  • -c : a tag that will be used as the target in the replica setup command. The recommended value is the ES cluster name, but it could be anything. Remember it, since the replication command refers to it!
  • -r <ES hostname/IP>
  • -t use the transport client. This is the only client supported by MapR 5.2 and is required in conjunction with the -r parameter.
  • -e the directory where ES is installed. Note that if ES is installed via the RPM/Deb package, this parameter is not necessary.
  • -y do not prompt for values. If following the steps here, it’s safe to use.

Using the sandbox, this command, run as root, will register ES:

$> /opt/mapr/bin/register-elasticsearch -c elastic -r maprdemo -t -y  
Copying ES files from maprdemo to /tmp/es_register_mapr...
The authenticity of host 'maprdemo (10.0.2.15)' can't be established.
RSA key fingerprint is 6a:24:76:81:7d:53:ab:4d:3e:b5:29:0a:cb:ab:dd:9a.
Are you sure you want to continue connecting (yes/no)? yes
Registering ES cluster elastic on local MapR cluster.
Your ES cluster elastic has been successfully registered on the local MapR cluster.

Doing this as root on a fresh sandbox, expect only the prompt for “Are you sure you want to continue connecting (yes/no)”. Answer yes of course. If you run the command as user mapr, it will not work when Elasticsearch was installed from the RPM, because the script needs access to the elasticsearch.yml file, which the RPM installs in the /etc/elasticsearch folder.

In practice, this adds some shared libs and other required data to MapR XD under the folder /mapr/demo.mapr.com/opt/external/elasticsearch. You can verify that the ‘clusters’ subfolder contains ‘elastic’.

To verify ES is registered properly, you can then enter this command (notice the -l parameter):

$> /opt/mapr/bin/register-elasticsearch -l
Found 1 items
drwxr-xr-x   - mapr mapr          3 2016-10-27 21:28 /opt/external/elasticsearch/clusters/elastic

We are now done with registering the Elasticsearch cluster with the MapR cluster. This only needs to be done once for each Elasticsearch cluster, regardless of how many tables replicate to ES.

Add Elasticsearch Mappings

This part is critical and the source of most issues. Get the mappings wrong and the replication will fail.

$> curl -X PUT maprdemo:9200/pumps/ -d '
{
    "mappings" : {
        "pumpsdata" : {
            "properties" : {
                "pumpsdata" : {
                    "dynamic" : "true",
                    "properties" : {
                        "resid" : {"type":"string"},
                        "date" : {"type":"date", "format":"MM/dd/yy"},
                        "time" : {"type":"string"},
                        "hz" : {"type":"double"},
                        "disp" : {"type":"double"},
                        "flow" : {"type":"double"},
                        "sedppm" : {"type":"double"},
                        "psi" : {"type":"double"},
                        "ch1ppm" : {"type":"double"}
                    }
                }
            }
        }
    }
}'

The mappings are really critical because MapR Database binary tables, just like HBase and many NoSQL databases, have no information about the data. They only store bytes. As such, the replication gateway needs to convert the data from bytes into whatever is in the mapping for the type defined in Elasticsearch. If a conversion fails, it throws an exception and the replication fails.

Check this page in the MapR documentation to validate that your mapping is indeed correct given your data.

Personally, the date/timestamp types caused me a lot of grief. It was fiddly to get it working properly until I got the hang of it. The mappings above are tested and work.
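
Since a single value that doesn’t convert makes the replication fail, it may help to scan the CSV before importing. This is a rough sketch, assuming fields 5 to 10 of the CSV are the ones mapped as double above (hz through ch1ppm). It uses a deliberately simple pattern for decimal numbers, so exponent notation would be flagged too. check_numeric is a hypothetical helper name.

```shell
# check_numeric FILE FIRST LAST (hypothetical helper)
# Reports every value in fields FIRST..LAST that does not look like a
# plain decimal number. Exit status is 1 if any such value was found.
check_numeric() {
    file=$1
    first=$2
    last=$3
    awk -F',' -v a="$first" -v b="$last" '
        {
            for (i = a; i <= b; i++)
                if ($i !~ /^-?[0-9]+(\.[0-9]+)?$/) {
                    printf "line %d, field %d: \"%s\" is not numeric\n", NR, i, $i
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$file"
}

# Example:
#   check_numeric /mapr/demo.mapr.com/user/mapr/sensordata.csv 5 10
```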

Set Up Replication

We are finally there! Time to start the actual replication. Related documentation is found here in MapRDocs.

This is done using the maprcli utility as user ‘mapr’:

$> maprcli table replica elasticsearch autosetup -path /user/mapr/pumps -target elastic -index pumps -type pumpsdata

Once this command is run, MapR will launch a MapReduce job to do an initial bulk replication of the data currently stored in the MapR Database table. This could take a long time if the table is already holding a lot of data. With our very small test data (47899 rows), this should take less than one minute, mostly because of the startup cost of a MapReduce job. You can see it running by opening the ResourceManager UI (http://maprdemo:8088/cluster/apps).

If planning to use replication from the start, it’s probably a good idea to set it up when the table has just a bit of data to make the initial bulk load run quickly. While it’s possible to enable replication on an empty table, I wouldn’t recommend it since there is no way to make sure the replication is set up properly until data is added, which could be in production. I tend to prefer to detect errors and fix issues as early as possible.

From there on out, as data is added to the MapR Database table, the data will be automatically replicated to ES by the gateway. It’s magic. :-)

Verifying the Replication

In MCS we should now be able to see that the replication has indeed been successful.

In Elasticsearch we can also make sure that we have 47899 hits, one for each row we have replicated so far:

$> curl maprdemo:9200/pumps/pumpsdata/_count
{"count":47899,"_shards":{"total":5,"successful":5,"failed":0}}
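
To avoid comparing numbers by eye, the count can be checked against the source file. This is a minimal sketch, assuming one document per CSV line (unique ids, with a trailing newline on the last line) and the index and type passed to the autosetup command earlier. es_count and compare_counts are helper names I made up.

```shell
# es_count (hypothetical helper)
# Reads an ES _count response on stdin and prints the numeric count.
es_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# compare_counts FILE JSON (hypothetical helper)
# Succeeds if the number of lines in FILE equals the count in JSON.
compare_counts() {
    file=$1
    json=$2
    rows=$(wc -l < "$file" | tr -d ' ')
    hits=$(printf '%s\n' "$json" | es_count)
    if [ "$rows" = "$hits" ]; then
        echo "OK: $hits documents for $rows rows"
    else
        echo "MISMATCH: $hits documents for $rows rows" >&2
        return 1
    fi
}

# Example:
#   compare_counts /mapr/demo.mapr.com/user/mapr/sensordata.csv \
#       "$(curl -s maprdemo:9200/pumps/pumpsdata/_count)"
```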

MCS

Above, we can see a screenshot of MCS showing the replicas tab of the /user/mapr/pumps table, which clearly shows that Elasticsearch replication is on.

The first load is a bulk load; all subsequent inserts/updates are added to ES in a streaming fashion as they are added to MapR Database.

Potential Issues

Some sources of issues to be careful about:

  1. Make sure the user running the replication command has POSIX permissions to the MapR Database table. In our case, we’re creating it with user ‘mapr’ and running the command as ‘mapr,’ so that’s OK. Permissions in MapR matter.
  2. Double check that your index is created and the mappings are well matched to the data. If you’re using our test data and mappings though, it should be smooth sailing!
  3. Finally, ensure that the input data are UTF-8 encoded strings in this particular example. The gateway decodes the bytes stored in MapR Database as a UTF-8 string, so if the input used another encoding (UTF-16, or Latin-1 with accented characters, for example), the decoded output will be weird and ES will complain. Plain ASCII is fine, since it is a subset of UTF-8. UTF-8 is the default text encoding on most modern systems, so it should be fine, but it’s something to keep in mind.
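
For the encoding point, one way to check a file up front is to ask iconv to decode it as UTF-8 and see whether it complains. This is only a sketch: is_utf8 is a hypothetical helper, and it relies on iconv being installed and exiting with a non-zero status on invalid input.

```shell
# is_utf8 FILE (hypothetical helper)
# Succeeds if iconv can decode FILE as UTF-8 from start to finish.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example:
#   is_utf8 sensordata.csv && echo "looks like valid UTF-8" \
#                          || echo "not valid UTF-8, convert it first"
```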

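If the job fails, edit Elasticsearch’s logging.yml file to set the logging level to DEBUG. It lives in elasticsearch-2.2.0/config for a tarball install, or in /etc/elasticsearch when installed from the RPM as above. Tailing the Elasticsearch log (under elasticsearch-2.2.0/logs, or /var/log/elasticsearch for the RPM install; the file is named after the cluster) will give the most information about conversion errors.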

Wrap Up

Replication to Elasticsearch can be a very useful feature, with a lot of great use cases as I described above. It’s pretty easy to set up and will work reliably in the background to keep your data synchronized. Hopefully it will encourage more MapR users to experiment with this feature and take advantage of it on their production clusters.

This blog post was published August 02, 2017.