10 min read
Businesses increasingly feel the need for data that can be shared and updated across data centers, whether on premise or from premise to cloud, at massive scale. And they need this to be done with low latency and high consistency, reliably and conveniently, with reliable control over replication as well. In some cases the need is global, with data centers and data sources in multiple countries.
This pressure for an effective global data fabric is being met by emerging big data technologies. One such innovation is the MapR Data Platform, in which fundamental capabilities for distributed files, NoSQL tables and data streams have been engineered into the same technology: they run on the same cluster as part of the same code. This converged system allows MapR to offer unique advantages for geo-distributed data and computation, even on a global scale, as described here.
One of the most important features for effective geo-distribution of big data is the ability to carry out multi-master replication across data centers. Take for example the situation in which you have a database in different data centers located in two cities (such as San Francisco and New York). You want both databases to have the same information. In order to insure this, with an older architecture (Part A of this figure, showing master-slave table replication) you would have all data sources, even those in distant cities, send updates to the master data center. Updates are then copied to the secondary site. That works, but with some risk: the long hop between some data sources and the primary data center makes updates less reliable due to the potential for interruptions in the connection. How can we improve on that design?
Figure showing advantages of multi-master replication: from O’Reilly data report “Data Where You Want It: Geo-Distribution of Big Data and Analytics” by Ted Dunning and Ellen Friedman © 2017, used with permission.
A better solution (shown in part B of the figure) depends on technology that can handle multi-master bi-directional table replication, as is done by MapR-DB, the NoSQL database capability built into the MapR Data Platform. With MapR, you can have data sources report to their nearest data center, reducing the risk of data loss on ingestion. Almost immediately after local updates in either data center, multi-master table replication updates the database in the other data center. This design provides a big advantage when working across multiple data centers, and with MapR, this advantage holds whether you are working entirely on premises or with a mix of on-premise and in-cloud clusters.
Note that multi-master table replication in the MapR system doesn’t repeal physics or change the CAP theorem. When you have multiple data centers that form a very large distributed computing system, things are inherently going to be more complicated than working with a single cluster or a single computer. What makes the MapR solution unusual is the fact that within a single cluster, the system opts for strong consistency. Between clusters, however, consistency is relaxed by using timestamps to implement last-write-wins. The foundation of strong consistency in each component cluster, however, makes it relatively simple to implement vector counters to get better guarantees than simple last-write-wins.
The take-away lesson is that multi-master replication is a valuable capability, and it’s reliable and convenient to do it using MapR.
The ability to do multi-master table replication using MapR-DB offers an obvious benefit for geo-distributed architectures, but it may surprise you to know that with the MapR you also have the ability to do this with streaming data. MapR Streams is the message transport capability built into the MapR Data Platform technology, and it makes this geo-distribution of message streams practical. This convergence means that the message stream transport happens on the same cluster as stream processing (and the same as data storage in distributed files and NoSQL database). That’s a big advantage even with just one data center and even more so when you need to move data across different locations.
Figure shows the fundamental capabilities of distributed files, NoSQL database and message stream transport that are engineered together into a single technology, the MapR Data Platform.
As with tables, MapR Streams can be replicated in a multi-master fashion, on the same cluster or across data centers, on premise or in cloud.
From the developer’s point of view, MapR Stream replication makes building an application the same whether the data source is local or near a distant data center.
Not only can you directly replicate a MapR Stream on the same cluster or to a distant cluster on premise or in cloud, you can do so with low latency and very little administrative hassle. In addition, replication of MapR Streams can be configured with loops and redundant paths in the replication pattern. That makes the global data fabric more rip-proof and resilient to failures or changes. This ability to have multi-path replication is not something that Apache Kafka can handle because replication isn’t built in with Kafka, but is implemented at the application level using tools like MirrorMaker. Keep in mind, too, that while Kafka has topics, there is no equivalent structure to the collection of topics that is a MapR Stream.
Let’s look at how this ability to do efficient multi-master geo-distributed stream replication with MapR plays out in a use case. In the short O’Reilly book Streaming Architecture, my co-author Ted Dunning and I described in Chapter 7 a hypothetical transportation example that would require MapR Streams’ unique capabilities.
This example involves a global shipping firm that owns ships and containers, as well as leasing space for containers owned by other companies. Additional stakeholders are the port authority in each city where the ship docks as well as the various owners of the goods being shipped. The goal is to efficiently track the location and status of containers using data emitted by many sensors placed on or near the containers. Keep in mind that some stakeholders, such as the shipping company, should have access to data from all containers; other stakeholder should be allowed access only the data related to their containers or goods.
Here’s where MapR Stream replication is important. Each ship can have a small on-board MapR cluster that ingests data from sensors in containers on the ship, as shown in the following figure.
Here’s the scenario:
Figure shows a shipping example in which direct stream replication between MapR clusters provides the capabilities needed to keep multiple data centers in different locations updated with current information on the status of containers. From “Streaming Architecture by Ted Dunning and Ellen Friedman © 2016, used with permission.
Sensors report to a small onboard cluster, and when the ship docks in Tokyo, a temporary link is formed with an onshore cluster. MapR Stream replication updates the cluster in port as well as a data center back at corporate headquarters for the shipping company in London (A)
Now the ship continues to its next port-of-call, Singapore. While at sea, data from sensors on containers is collected on the onboard cluster. But long before the ship arrives, data from the Tokyo cluster is distributed via MapR Stream replication to the onshore cluster in Singapore, so stakeholders there know what to expect when the ship arrives. Once in port, a temporary connection once again updates the Singapore cluster for what has happened at sea (B).
The last step (C) adds a little humor as a reminder that sometimes the data being streamed is needed for real time insights. Here we have containers slide off the back of the ship into the sea, suggesting the humidity readings from sensors on these containers would increase sharply.
The need to move data between locations is not new, but the ability to do this efficiently, affordably, at large scale and low latency is relatively new. Multi-master replication with MapR-DB and MapR Streams provides an unusually powerful way to handle data across distant data centers, on premise or in cloud, and extends options for geo-distributed data even beyond the excellent mirroring capabilities you may already be familiar with for the MapR Data Platform.
Issues related to geo-distribution of both data and compute power including hybrid cloud architecture and the importance of being able to run stateful applications in stateless containers is discussed in a new O’Reilly data report called Data Where You Want It: Geo-Distribution of Big Data and Analysis that you can download free from the MapR website.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.