MapR 5.0 Documentation : Configuring MapR Gateways When Combining Table Replication and Indexing

There are two basic ways of combining table replication with indexing data in Elasticsearch:

  • Replicating and indexing from a single MapR source cluster
  • Replicating to a MapR cluster that is also indexing data in Elasticsearch

Replicating and Indexing from a Single MapR Source Cluster

You can replicate to one or more MapR clusters and index table data in one or more Elasticsearch clusters all from a single MapR cluster.

For example, in this diagram the customers table in the MapR cluster newyork is being replicated to the MapR cluster sanfrancisco. The gateways for replication are in the sanfrancisco cluster. The customers table is also being indexed in a single Elasticsearch cluster.

In the next diagram, the customers table in the MapR cluster newyork is being replicated to the MapR cluster sanfrancisco. The gateways for replication are in the sanfrancisco cluster. The customers table is also being indexed in multiple Elasticsearch clusters. The gateways for indexing are running on newyork. Updates to the customers table are sent to the gateways, which distribute them to Elasticsearch nodes running in the different Elasticsearch clusters. Those nodes distribute the updates to the nodes where shards of the destination index are located.

For replication to another MapR cluster, MapR-DB knows to use the gateways that are in the remote MapR cluster because the name of that cluster is associated with the gateways.

For example, if you use the maprcli cluster gateway set command on the newyork cluster, in the -dstcluster parameter of this command you would specify the name of the remote MapR cluster: sanfrancisco. MapR-DB would then understand that replication to this cluster goes through gateways A and B.
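
As a minimal sketch of the command described above, the following Python snippet only assembles the command line; it does not run it. The -dstcluster parameter comes from the text. The -gateways parameter, passed as one space-delimited argument, is an assumption about the command's syntax, and the hostnames are hypothetical stand-ins for gateways A and B.

```python
import shlex


def gateway_set_command(dst_cluster, gateways):
    """Build the argument list for `maprcli cluster gateway set`.

    -dstcluster is taken from the text; -gateways (a single
    space-delimited argument) is an assumed parameter.
    """
    if not gateways:
        raise ValueError("at least one gateway is required")
    return [
        "maprcli", "cluster", "gateway", "set",
        "-dstcluster", dst_cluster,
        "-gateways", " ".join(gateways),
    ]


# Hypothetical hostnames standing in for gateways A and B in sanfrancisco.
replication_gateways = [
    "gw-a.sanfrancisco.example.com",
    "gw-b.sanfrancisco.example.com",
]
command = gateway_set_command("sanfrancisco", replication_gateways)

# shlex.join quotes the gateway list so that it stays a single argument
# when the line is pasted into a shell on a newyork node.
print(shlex.join(command))
```

Building the command as an argument list keeps the gateway list together as one argument, which matters because the list itself contains spaces.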

If you choose to use a DNS record, the record would look like this, where A and B are the hostnames or IP addresses of the gateways in the sanfrancisco cluster: gateway.sanfrancisco IN TXT "A B"

If you choose not to use either of these methods but instead rely on an entry in your cluster’s mapr-clusters.conf file (assuming that gateways A and B are on CLDB nodes in sanfrancisco), the entry in this file would start with the cluster name “sanfrancisco”.

For indexing table data in an Elasticsearch cluster, MapR-DB knows to use the gateways that are in the source MapR cluster because the name of that cluster is associated with the gateways.

When you install the mapr-gateway package on a node, you specify the MapR cluster that the node is a part of.

If you use the maprcli cluster gateway set command on the newyork cluster, in the -dstcluster parameter of this command you would specify the name “newyork”. MapR-DB would then understand that indexing to the Elasticsearch cluster goes through gateways C and D.

If you choose to use a DNS record, the record would look like this, where C and D are the hostnames or IP addresses of the gateways in the newyork cluster: gateway.newyork IN TXT "C D"
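
The DNS records for replication and for indexing follow the same pattern: the record name is gateway.<cluster name>, and the TXT value is a space-delimited list of gateways. The sketch below formats both records from this section. The helper name gateway_txt_record is hypothetical, and A, B, C, and D are the placeholder gateway names used in the text.

```python
def gateway_txt_record(cluster, gateways):
    """Format a TXT record as: gateway.<cluster> IN TXT "<gateway list>"."""
    if not gateways:
        raise ValueError("at least one gateway is required")
    return 'gateway.{} IN TXT "{}"'.format(cluster, " ".join(gateways))


# Replication: the record is named for the remote cluster that hosts A and B.
replication_record = gateway_txt_record("sanfrancisco", ["A", "B"])

# Indexing: the record is named for the source cluster that hosts C and D.
indexing_record = gateway_txt_record("newyork", ["C", "D"])

print(replication_record)
print(indexing_record)
```

In both cases the cluster name in the record is the name of the cluster where the gateways run, not necessarily the cluster where the lookup happens.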

If you choose not to use either of these methods but instead rely on an entry in your cluster’s mapr-clusters.conf file (assuming that gateways C and D are on CLDB nodes in the MapR cluster newyork), the entry in this file would start with the cluster name “newyork”.
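
To illustrate the mapr-clusters.conf fallback, the sketch below looks up the entry that starts with a given cluster name and returns the hosts listed on it. The entry layout (cluster name, optional key=value options such as secure=false, then CLDB host:port pairs), the port 7222, and all hostnames are assumptions for illustration; check the entries in your own file before relying on this layout.

```python
def cldb_hosts_for(cluster, conf_text):
    """Return the hosts on the entry whose first field is `cluster`.

    Assumed entry layout: <cluster name> [key=value ...] <host>[:port] ...
    """
    for line in conf_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != cluster:
            continue
        # Skip option fields such as secure=false and drop any :port suffix.
        return [f.split(":")[0] for f in fields[1:] if "=" not in f]
    return []


# Hypothetical file contents; the hostnames stand in for CLDB nodes
# that also run gateways A and B (sanfrancisco) and C and D (newyork).
sample_conf = """\
newyork secure=false cldb-c.newyork.example.com:7222 cldb-d.newyork.example.com:7222
sanfrancisco secure=false cldb-a.sanfrancisco.example.com:7222 cldb-b.sanfrancisco.example.com:7222
"""

indexing_hosts = cldb_hosts_for("newyork", sample_conf)
replication_hosts = cldb_hosts_for("sanfrancisco", sample_conf)
print(indexing_hosts)
print(replication_hosts)
```

This fallback applies only when the gateways are on the CLDB nodes, which is the assumption stated in the text.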

Replicating to a MapR Cluster That is Also Indexing Data in Elasticsearch

You can also replicate from a MapR cluster to a cluster that is indexing data in Elasticsearch. Use the maprcli cluster gateway set command if you want to use one subset of your gateways for replication and the other subset for indexing.

For example, suppose that you want to replicate from the source MapR cluster sanfrancisco to the destination MapR cluster newyork. You also want to index table data in Elasticsearch cluster newyork_es. You envision a configuration like this one:

As in the next diagram, you configure four gateways in the cluster newyork, planning to use two for table replication and two for indexing. For indexing, you plan to place the two gateways physically in the Elasticsearch cluster, though logically they will be part of the cluster newyork.

However, as the next diagram shows, what will happen is that table replication will use all of the gateways, and so will indexing.

This configuration will not necessarily slow down the Elasticsearch nodes where the gateways are located. However, if you do notice a slowdown as you test your configuration, you could try partitioning the gateways by using the maprcli cluster gateway set command. When you run this command on the sanfrancisco cluster, specify only gateways A and B. When you run this command on the newyork cluster, specify only gateways C and D. By specifying only a subset of the gateways in the newyork cluster each time, you partition the gateways so that each one handles traffic either for table replication or for indexing, but not both.
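
The partitioning described above can be checked with a small model before you change a live configuration. In this sketch, each dictionary entry stands for one run of maprcli cluster gateway set: the key is the cluster where the command is run, and the value is the subset of newyork gateways specified. The names gateway_settings and check_partition are hypothetical, and the code models the configuration rather than calling maprcli.

```python
# The four gateways configured in the newyork cluster (placeholder names).
ALL_NEWYORK_GATEWAYS = {"A", "B", "C", "D"}

# One entry per run of `maprcli cluster gateway set`:
# key = cluster where the command is run, value = gateways specified.
gateway_settings = {
    "sanfrancisco": ["A", "B"],  # table replication into newyork
    "newyork": ["C", "D"],       # indexing from newyork into Elasticsearch
}


def check_partition(settings, all_gateways):
    """Return True if the subsets are disjoint and together cover all gateways."""
    seen = set()
    for gateways in settings.values():
        subset = set(gateways)
        if seen & subset:
            return False  # a gateway would carry both kinds of traffic
        seen |= subset
    return seen == set(all_gateways)


is_partitioned = check_partition(gateway_settings, ALL_NEWYORK_GATEWAYS)

# The unpartitioned default, where both kinds of traffic use every gateway.
shared = {"sanfrancisco": ["A", "B", "C", "D"], "newyork": ["A", "B", "C", "D"]}
is_shared_partitioned = check_partition(shared, ALL_NEWYORK_GATEWAYS)

print(is_partitioned, is_shared_partitioned)
```

The check fails as soon as one gateway appears in both subsets, which is the situation that the partitioning is meant to avoid.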