MapR Streams brings integrated publish/subscribe messaging to the MapR Converged Data Platform. In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.
Topics are logical collections of messages that are managed by MapR Streams. Topics decouple sources, which are the producers of data, from consumers, which are applications for processing, analyzing, and sharing data. Topics organize events: producers publish to a relevant topic and consumers subscribe to the topics of interest to them.
Topics are partitioned for throughput and scalability. Partitions, which exist within topics, are parallel, ordered sequences of messages that are continually appended to. Partitions make topics scalable by spreading the load for a topic across multiple servers. MapR Streams load-balances what producers publish across partitions, and consumers can be grouped to read in parallel.
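To make partitioning concrete, here is a minimal Python sketch. It is a toy in-memory model, not the MapR Streams client API, and it assumes a hash-of-key placement and a round-robin consumer assignment; the post does not say which strategy MapR Streams actually uses. The names choose_partition and assign_partitions are hypothetical.

```python
import zlib
from collections import defaultdict


def choose_partition(key, num_partitions):
    """Map a message key to a partition index with a stable hash.

    Assumption: hash-of-key placement is used here only for illustration.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the members of one consumer group."""
    assignment = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Each partition is an ordered, append-only list of messages.
partitions = defaultdict(list)
readings = [("pump-1", 101.2), ("pump-2", 98.7), ("pump-1", 101.9)]
for sensor_id, value in readings:
    partitions[choose_partition(sensor_id, 4)].append((sensor_id, value))

# Two consumers in one group split the four partitions between them.
group = assign_partitions(4, ["analytics-a", "analytics-b"])
```

Because the hash is stable, every reading from the same sensor lands in the same partition, so its readings stay in the order they were published.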
A stream is a collection of topics that you can manage together. Streams can be asynchronously replicated between MapR clusters, with publishers and listeners existing anywhere, enabling truly global applications.
The MapR Streams replication feature gives your users real-time access to live data distributed across multiple clusters and multiple data centers around the world.
You can replicate streams in a master-slave, many-to-one, or multi-master configuration between thousands of geographically distributed clusters interconnected arbitrarily – in a tree, a ring, a star, or a mesh. MapR Streams detects loops and prevents message duplication.
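One way to picture loop detection is the toy Python sketch below. It is only an illustration of the general idea: the post does not describe how MapR Streams detects loops, so the message ID, the per-cluster "seen" set, and the replicate function are all assumptions made for this example.

```python
from collections import deque


def replicate(links, stores, origin, message_id):
    """Forward one message along replication links and count the copies made.

    links:  dict mapping a cluster name to the clusters it replicates to
    stores: dict mapping a cluster name to the set of message IDs it holds
    Assumption: a cluster skips any message ID it already holds.
    """
    stores[origin].add(message_id)
    queue = deque([origin])
    copies = 0
    while queue:
        cluster = queue.popleft()
        for target in links.get(cluster, []):
            if message_id in stores[target]:
                continue  # target already has it: the loop stops here
            stores[target].add(message_id)
            queue.append(target)
            copies += 1
    return copies


# Three clusters replicating in a ring: us -> eu -> asia -> us
ring = {"us": ["eu"], "eu": ["asia"], "asia": ["us"]}
stores = {name: set() for name in ring}
copies = replicate(ring, stores, "us", "msg-1")
```

In this ring the message is copied twice (to eu, then to asia) and the final hop back to us is skipped, so each cluster ends up with exactly one copy.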
With Streams replication, you can create a backup copy of a stream for producers and consumers to fail over to if the original stream goes offline. This feature significantly reduces the risk of data loss should a site-wide disaster occur, making it essential for your disaster recovery strategy.
To show you how these concepts fit together, we will go through an example of the flow of messages from a producer to a consumer.
Imagine that you are using MapR Streams as part of a system to monitor oil wells globally.
Your producers include sensors in the oil pumps, weather stations, and an application which generates warning messages. Your consumers are various analytical and reporting tools.
In a volume in a MapR cluster, you create the stream /path/oilpump_metrics. In that stream, you create the topics Pressure, Temperature, and Warnings.
Of all of the sensors (producers) that your system uses to monitor oil wells and related data, let's choose an oil pump sensor that is in Venezuela. We'll follow messages generated by this sensor and published in the Pressure topic. When you create a topic, you can also create several partitions within it to help spread the load among the different nodes in your MapR cluster and to help improve the performance of your consumers. For simplicity in this example, we'll assume that each topic has only one partition.
A consumer application that correlates oil well pressure with weather conditions is subscribed to the Pressure topic. Many more consumers could be subscribed to it, too.
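The flow from one producer to several subscribers can be sketched with a small in-memory model in Python. This is not the MapR Streams API; the Partition and Consumer classes, and the idea of each consumer keeping its own read position (called cursor here), are assumptions made for illustration.

```python
class Partition:
    """Toy append-only log; a message's offset is its position in the log."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        """Add a message at the end and return its offset."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self.messages[offset:]


class Consumer:
    """Toy subscriber that remembers how far it has read."""

    def __init__(self, name):
        self.name = name
        self.cursor = 0  # offset of the next message to read

    def poll(self, partition):
        batch = partition.read_from(self.cursor)
        self.cursor += len(batch)
        return batch


# The Venezuelan pump sensor publishes two readings to the Pressure topic.
pressure = Partition()
pressure.append({"sensor": "venezuela-pump", "psi": 2150})
pressure.append({"sensor": "venezuela-pump", "psi": 2162})

# Two independent subscribers each receive both readings.
weather_correlator = Consumer("weather-correlator")
dashboard = Consumer("dashboard")
first = weather_correlator.poll(pressure)
second = dashboard.poll(pressure)
```

Reading does not remove anything: after both consumers have polled, the partition still holds both messages, and a consumer that polls again gets only what was published since its last read.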
Since messages remain in the partition even after delivery to a consumer, when are they deleted? When you create a stream, you can set the time-to-live for messages.
Once a message has been in the partition for the specified time-to-live, it expires. An automatic process reclaims the disk space that the expired messages are using.
The time-to-live can be as long as you need it to be. If the time-to-live is zero, messages never expire and remain in the partition indefinitely.
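The expiry rule can be written down in a few lines of Python. This is a sketch of the rule as described here, not MapR's implementation; the function names and the use of seconds as the unit are assumptions.

```python
def is_expired(published_at, now, ttl_seconds):
    """Return True once a message has outlived the stream's time-to-live.

    A time-to-live of zero means messages never expire.
    """
    if ttl_seconds == 0:
        return False
    return now - published_at >= ttl_seconds


def reclaim(messages, now, ttl_seconds):
    """Keep only the (published_at, payload) pairs that have not expired."""
    return [
        (published_at, payload)
        for published_at, payload in messages
        if not is_expired(published_at, now, ttl_seconds)
    ]


# Messages published at t=0, t=50 and t=90 seconds, checked at t=100.
log = [(0, "reading-a"), (50, "reading-b"), (90, "reading-c")]
kept_with_ttl = reclaim(log, now=100, ttl_seconds=60)
kept_forever = reclaim(log, now=100, ttl_seconds=0)
```

With a 60-second time-to-live only the message published at t=0 is dropped; with a time-to-live of zero all three messages are kept.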
You don’t have to worry about partitions getting too big to store on a single server; partitions are re-allocated every 2 GB to balance storage. MapR Streams can intelligently move partitions around in a cluster to spread the data out, which allows topics to serve as infinite, persistent storage.
MapR Streams provides reliable, global, IoT-scale publish-subscribe event streaming for the real-time collection and processing of events. This paves the way to many use cases, such as real-time alerting, monitoring, fraud detection, offers, and more. In this example, we saw how data can move from a producer like an oil well sensor to a consumer like an analytics application.
To find out more about MapR Streams, visit our MapR Streams page.