Nick Amato, Director of Technical Marketing at MapR, explains the advantages of a converged environment for streaming applications versus running these services in separate clusters.
Here's the transcription:
Hi, I'm Nick Amato, director of technical marketing with MapR. Let's talk a little bit about what it means to have a Converged Data Platform. We hear a lot in the market about what it means to move to real time, and we see a lot of customers wanting to move to a real-time model. To do that, all of the components of the architecture have to support real time as well. If you have any components in the data platform that are batch-oriented, or don't support real time, it's going to be a long, slow project to get there. You really can't move to a real-time model if you have parts of the architecture that are batch-oriented.
For example, let’s look at a traditional approach, maybe in an IoT scenario. We have producers at the top putting data into the cluster, and these might be devices speaking MQTT, or maybe JSON data coming in. You see a lot of devices out there now that have REST APIs, and there may be log data and other types of data coming into nodes in your data architecture.
In this traditional approach, I have different services deployed in the architecture. I have Hadoop and Spark for analytics. I have Kafka if I want a streaming model, where I have data coming in and consumers of that data that want to listen to different offsets of it, or to different topics within the stream. I have operational nodes that are maybe running HBase or Cassandra. Finally, I want all of that data to be persisted to a system of record, maybe my enterprise storage.
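To make the streaming model above concrete, here is a minimal sketch of topics and offsets in plain Python. It is not Kafka or any MapR API: `InMemoryStream`, the topic names, and the sample readings are all hypothetical stand-ins, chosen only to show how two consumers can read the same topic from different offsets.

```python
from collections import defaultdict


class InMemoryStream:
    """Toy stand-in for a stream: each topic is an append-only list."""

    def __init__(self):
        self._topics = defaultdict(list)

    def produce(self, topic, message):
        """Append a message to a topic and return the offset it was stored at."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def consume(self, topic, offset, max_messages=10):
        """Return up to max_messages starting at offset, plus the next offset."""
        log = self._topics[topic]
        batch = log[offset:offset + max_messages]
        return batch, offset + len(batch)


stream = InMemoryStream()

# Hypothetical producers: device readings on one topic, log lines on another.
readings = [
    {"device": "a", "temp": 21.5},
    {"device": "b", "temp": 19.0},
    {"device": "a", "temp": 22.1},
]
for reading in readings:
    stream.produce("sensor-readings", reading)
stream.produce("device-logs", "device a rebooted")

# Two independent consumers of the same topic, each tracking its own offset.
dashboard_batch, dashboard_offset = stream.consume("sensor-readings", offset=0)
alerting_batch, alerting_offset = stream.consume("sensor-readings", offset=2)
```

The consumer, not the stream, remembers where it left off, which is why one consumer can replay the topic from the start while another only looks at the latest data.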
This is a type of model that we would call a connected approach, where I have different services running on the cluster, maybe using a subset of nodes, or different nodes within the same data center. You might look at this and say, "Well, I don't really have multiple clusters. I don't have different clusters running these different services." But if you have a set of nodes in the platform, and let's say a subset of the nodes is running Kafka and some nodes are running Cassandra, what you have is a connected type of architecture, where you have to write code and build different components that transfer data between the different services. I have to write some code that transfers data between Kafka and Cassandra, for example. Different nodes are running different services. This is what we call a connected type of architecture.
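As an illustration of the kind of transfer code a connected architecture calls for, here is a minimal sketch of a bridge loop in plain Python. It does not use the real Kafka or Cassandra client libraries: `bridge`, `poll`, and `write_row` are hypothetical names, and the record fields (`device`, `ts`, `temp`) are assumed for the example.

```python
import json


def bridge(poll, write_row, max_batches=100):
    """Copy records from a stream service into an operational store.

    poll() returns a list of raw JSON strings (empty when caught up);
    write_row(key, row) stores one record. Both are hypothetical callables
    standing in for real client libraries.
    """
    copied = 0
    for _ in range(max_batches):
        batch = poll()
        if not batch:
            break  # nothing left to transfer
        for raw in batch:
            record = json.loads(raw)                 # decode the stream's format
            key = (record["device"], record["ts"])   # assumed row key
            write_row(key, record)                   # hand off to the other service
            copied += 1
    return copied


# In-memory stand-ins so the sketch runs without any cluster.
pending = [
    ['{"device": "a", "ts": 1, "temp": 21.5}', '{"device": "b", "ts": 1, "temp": 19.0}'],
    ['{"device": "a", "ts": 2, "temp": 22.1}'],
]
table = {}

copied = bridge(lambda: pending.pop(0) if pending else [], table.__setitem__)
```

Even in this toy form, the bridge is a separate component to deploy and monitor, and a production version would also have to deal with offset tracking, retries, and schema changes for each pair of services it connects.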
On the other hand, this is a converged platform, and that's what we offer at MapR. With a converged platform, you have all of these services running on all of the nodes of the cluster. For example, say I have the same kind of IoT scenario, where I have data coming in from different producers, MQTT or JSON data. It's the same type of data as before, but in this architecture it's converged. I have different topics in a stream that I can write data to for producing the data, and the things that I'm doing on the cluster are running on the same software platform. If I'm using Spark, Storm, or Flink, for example, any applications that are consuming the data coming in with Streams can run on that same platform. They're sharing resources, and I'm able to use the same architecture.
Finally, with MapR Database you can actually persist the data as part of the platform as well. If I want to persist binary data, I can do that, and we also have a new API that allows you to persist JSON documents directly in the database. You can do all of that on this same platform.
This is what it means to have a converged platform versus a connected architecture, where I'm transferring data between the different types of services.
Thanks for watching, and be sure to check out our recent benchmark paper, where we show the architecture scaling to well over 7,000,000 messages per second. If you have any feedback for us, leave a comment below.