Hadoop came to prominence because of its economical data storage and the bulk processing capabilities it provided for extremely large data sets. More recently, there has been considerable interest and experimentation in applying Hadoop to data streaming applications. This has been prompted partly by the fact that the Hadoop ecosystem has a streaming component (Storm) and partly by Spark, which can be used in a streaming context because of its in-memory and micro-batch capabilities. To explain why this matters, we’ll briefly describe what streaming is from an application perspective:
An event stream is a stream of data records or events (or both) passing from one location to another. Usually the event data will include identifying information so that senders and receivers are speaking a common language.
A simple event stream might be nothing more than a continual broadcast from one application to another. A more sophisticated arrangement is a publish-subscribe (pub/sub) model, in which more than one data provider can publish (write) data to the stream and more than one data subscriber can read the data. A common example of this, with just one publisher and multiple subscribers, is an RSS feed that provides news stories from a particular media outlet. A more sophisticated example would involve an aggregator capturing multiple RSS feeds and allowing multiple subscribers to select the ones they want.
The primary advantage of the publish-subscribe paradigm is that it is very efficient. The publisher only needs to publish data once (rather than once for every subscriber), and the subscribers only receive the data they want – and it can be delivered to them as soon as it is published. The subscriber does not need to continually check for data.
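To make the publish-once, deliver-to-many idea concrete, here is a minimal in-memory sketch in Python. It is purely illustrative: the `Broker` class and its `subscribe` and `publish` methods are hypothetical names invented for this example, not part of Hadoop, Kafka or any MapR product, and a real system would add persistence, networking and delivery guarantees.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Toy in-memory pub/sub broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Maps each topic name to the handlers registered for it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler that is called for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Publish once; return the number of subscribers that were notified."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)  # pushed to the subscriber, no polling needed
        return len(handlers)


broker = Broker()
alice_inbox: List[str] = []
bob_inbox: List[str] = []

# Alice wants only news; Bob wants news and sport.
broker.subscribe("news", alice_inbox.append)
broker.subscribe("news", bob_inbox.append)
broker.subscribe("sport", bob_inbox.append)

delivered = broker.publish("news", "Story 1")   # one publish call, two deliveries
broker.publish("sport", "Match report")         # reaches Bob only
```

The publisher makes a single `publish` call per message regardless of how many subscribers exist, and each subscriber sees only the topics it registered for, which is the efficiency argument made above.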
There is a Hadoop ecosystem component, Apache Kafka, which implements a distributed messaging system, including a publish-subscribe capability, that could be used to build a Hadoop streaming capability. However, it does not integrate with an existing Hadoop cluster and instead requires its own cluster. As a result, providing a streaming infrastructure for use with Hadoop has, so far, been a do-it-yourself experience fraught with pitfalls.
We were reminded of this recently when MapR presented us with its latest Hadoop release, MapR 5.1. The new release is impressive, and with it, MapR is clearly pursuing its own roadmap for the evolution of data streams and Hadoop.
It is worth noting that, from its first release, MapR chose to implement its own file system, MapR-FS, rather than vanilla HDFS. It had reasons for this. The company was not focused only on providing a robust scale-out clustering capability. It intended to deliver a globally distributed Hadoop capability with Hadoop clusters able to synchronize data, not just between clusters within a data center, but between data centers on opposite sides of the planet. This became clear when MapR introduced MapR-DB, which included a replication capability. Because this replication was bidirectional, it could be used both to replicate data and to keep data stored in multiple clusters in step.
Taken together, MapR-FS and MapR-DB extend the functionality of Hadoop and Spark significantly. They transform Hadoop from a highly scalable, shared file system, suited perhaps for use as a data hub, into a globally distributed file system for any data application. The Hadoop user is no longer constrained to piling data up in a large central heap; data can be distributed geographically to create multiple remote data depots or, if desired, within the data center to create multiple Hadoop clusters of local data. But if such a capability is to be practical, it requires more than MapR-FS and MapR-DB.
MapR Streams is a new MapR component released with MapR 5.1 that complements and augments the distributed capabilities of MapR-FS and MapR-DB. It is a pub/sub data transport and event streaming capability. MapR Streams exposes units called “streams,” and each contains multiple topics that can be configured with security, retention and replication policies. Publishers (data producers) write to specific topics, and subscribers (data consumers) read data from specific topics. So applications can publish data to MapR Streams, and it will be delivered directly and immediately to all registered subscribers.
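The relationship between a stream, its topics and its policies can be pictured with a small data model. The Python sketch below is an assumption-laden illustration, not the MapR Streams API: the `Stream` class, its `retention_seconds` field and the `publish`, `expire` and `read` methods are all hypothetical names, and only a retention policy is modeled (security and replication are omitted).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Stream:
    """Hypothetical model: a stream groups topics under one shared policy."""

    name: str
    retention_seconds: int  # assumed policy: drop events older than this
    # topic name -> list of (timestamp, event) pairs in arrival order
    topics: Dict[str, List[Tuple[float, str]]] = field(default_factory=dict)

    def publish(self, topic: str, event: str, now: float) -> None:
        """Append an event to a topic, creating the topic on first use."""
        self.topics.setdefault(topic, []).append((now, event))

    def expire(self, now: float) -> None:
        """Apply the stream-level retention policy to every topic."""
        cutoff = now - self.retention_seconds
        for topic in list(self.topics):
            self.topics[topic] = [
                (ts, ev) for ts, ev in self.topics[topic] if ts >= cutoff
            ]

    def read(self, topic: str) -> List[str]:
        """Return the events currently retained for a topic."""
        return [ev for _, ev in self.topics.get(topic, [])]


sensors = Stream(name="sensors", retention_seconds=60)
sensors.publish("temperature", "21.5C", now=0.0)
sensors.publish("temperature", "22.0C", now=50.0)
sensors.publish("humidity", "40%", now=50.0)

# At time 100 the cutoff is 40, so the event published at time 0 is dropped.
sensors.expire(now=100.0)
temperature_events = sensors.read("temperature")
humidity_events = sensors.read("humidity")
```

The point of the sketch is that the policy is set once on the stream and applies to every topic inside it, so related topics can be managed as a unit.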
It helps to understand that MapR Streams provides real-time delivery, so whether data is published to it in batches or as a record-by-record (event-by-event) data feed, it is transmitted to subscribers immediately. Events (data records) are secure, and data delivery is guaranteed. There is no limit to the number of publishers or subscribers.
Figure 1 provides an overview of the MapR Converged Data Platform with its three outward facing components – MapR-FS, MapR Streams and MapR-DB – and the applications they serve.
You can think in terms of three distinct kinds of applications accessing the data platform. The first, labeled applications and/or data sources, are typical business applications or data transfer applications that read data from or write to Hadoop. They may simply make use of the data platform to store data, or they may access MapR-DB to read or write data. Additionally, they can publish data to be sent to other applications using MapR Streams.
What distinguishes the second group of applications, labeled bulk processing, is that they use Hadoop components such as Hive, MapReduce, Spark, Drill, etc., which require the scale-out capabilities that Hadoop has become famous for. They are also able to access any one of, or a combination of, MapR-FS, MapR-DB and MapR Streams.
Finally, there are streaming applications that will most likely be using the micro-batch capability of Spark or the streaming capabilities of Storm. These can also make use of any or all of MapR-FS, MapR-DB and MapR Streams.
The important thing to understand about MapR Streams is that it is engineered to have a real-time, guaranteed events streaming capability. This means that if you need to rapidly transfer data from any application or any location to another application or location, you can use MapR Streams. For the MapR Platform, MapR Streams will quickly become a fundamental business capability that can be leveraged in a multitude of ways.
While MapR can already use MapR-DB to replicate data between Hadoop instances, MapR Streams gives it the capability to replicate any data in Hadoop to other Hadoop instances anywhere or, if need be, to any application anywhere. In fact, it could be used to transfer data between two applications that otherwise have nothing to do with Hadoop. And it is not just about moving data. MapR Streams can establish and drive real-time feeds throughout the organization. It can take external feeds from data providers or from the cloud. It can take internal feeds from log files or Internet of Things (IoT) sensors. It can aggregate feeds and send them to multiple recipients.
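As a rough illustration of aggregating feeds and sending the result to multiple recipients, the following Python sketch merges two time-ordered feeds and fans the merged feed out to two inboxes. Everything here is an assumption made for the example: the `aggregate` function, the event tuple layout and the recipient names are hypothetical, and the sketch says nothing about how MapR Streams does this internally.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

# (timestamp, source, payload); tuples compare by timestamp first.
Event = Tuple[int, str, str]


def aggregate(*feeds: Iterable[Event]) -> List[Event]:
    """Merge feeds that are each already sorted by timestamp."""
    return list(heapq.merge(*feeds))


# Two internal feeds, for example a log file and an IoT sensor.
log_feed: List[Event] = [(1, "log", "service started"), (4, "log", "request served")]
sensor_feed: List[Event] = [(2, "sensor", "temp=21.5"), (3, "sensor", "temp=21.7")]

merged = aggregate(log_feed, sensor_feed)

# Fan the aggregated feed out to every recipient.
recipients: Dict[str, List[Event]] = {"dashboard": [], "archive": []}
for event in merged:
    for inbox in recipients.values():
        inbox.append(event)
```

`heapq.merge` assumes each input feed is already sorted, which matches the common case of feeds that emit events in time order; unsorted feeds would need to be sorted or buffered first.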
In short, it provides a global, distributed real-time capability.
Hadoop has been rapidly evolving. In recent times, it has been depicted as the natural bedrock for a data lake. Indeed, the opportunity to house all of an organization’s data – and possibly a great deal of externally sourced data – in a single, scaled-out collection of servers is an attractive notion. With strong disaster recovery and security measures, the goal of a single physical or logical location for data makes sense. This is especially so in this era of cloud computing, where location flexibility is far greater than it ever was.
Right now, other Hadoop offerings are constrained in their event streaming capabilities. In general, where event streaming or messaging systems are used, they are managed as separate clusters, with all the inherent latencies those clusters create, and they fail to provide a global capability. Because they do not offer a versatile distributed capability, they tend to be deployed in an awkwardly centralized manner.
The global data platform that MapR delivers both supports and enables data distribution in a rational and very practical way. Multiple Hadoop clusters can be created to satisfy different workloads and priorities, and data that genuinely needs to reside in multiple geographical locations can be secure and resilient against technology failure. It is thus eminently suited, for example, to be the platform for the corporate system of record (SOR), segregated, as it should be, from other data and other workloads. The platform is capable of serving traditional business applications as well as big data and real-time applications.
The trend towards deploying Hadoop and Spark for real-time and near real-time applications is in its infancy, yet it is destined to become increasingly important in years to come, as event-based architectures become more common. With MapR Streams, the MapR Platform can play a primary role in delivering such applications.
In our view, MapR was always ahead of its competition in shaping the evolution of Hadoop and providing the enterprise-grade technical infrastructure it requires. With MapR 5.1 and MapR Streams, it has clearly thrown down the gauntlet to its competitors. Whether they will be able to make up the lost ground remains to be seen. If you’re reviewing the possibilities of Hadoop and Spark, particularly if you see them as occupying a central strategic role in the corporate IT environment, you would do well to consider MapR. The company has put itself far ahead of its competitors in a variety of ways.