In this week’s Whiteboard Walkthrough, Ankur Desai, Senior Product Marketing Manager at MapR, describes how Apache Kafka Connect and a REST API simplify and improve agility in working with streaming data from a variety of data sources, including legacy databases and data warehouses. He also explains how this architecture differs when you use MapR Event Store instead of Kafka for data transport.
Here is the full video transcription:
Hi, I'm Ankur Desai. I'm with the product team here at MapR. Welcome to my Whiteboard Walkthrough. Today I'm going to talk about streaming architecture and how new advances can make it more agile and simpler. Let's talk about how it all works.
This is a typical streaming architecture. On the left-hand side, you have data sources such as social media, sensors, and all sorts of devices producing data. Then you would use a data collector like Flume to get the data from those sources, and Flume acts as the producer to Kafka. Remember, you also have legacy data sources such as databases and data warehouses. To get the data from these sources and put it in Kafka, you would typically write custom code that acts as a producer to Kafka, or you could use a data collector once again.
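As a rough sketch of what that custom producer code does, here's how rows from a legacy database might be serialized into Kafka-ready messages. The table rows and the `to_kafka_message` helper are hypothetical; a real producer would hand these bytes to a Kafka client (for example, kafka-python's `KafkaProducer`):

```python
import json

def to_kafka_message(row):
    # Hypothetical helper: turn one database row into the
    # (key, value) byte pair a Kafka producer would send.
    key = str(row["id"]).encode("utf-8")
    value = json.dumps(row, sort_keys=True).encode("utf-8")
    return key, value

# Rows as they might come back from a legacy database query.
rows = [
    {"id": 1, "customer": "acme", "amount": 42.5},
    {"id": 2, "customer": "globex", "amount": 17.0},
]

messages = [to_kafka_message(r) for r in rows]
# A real producer would then, for each (key, value), call something like:
#   producer.send("transactions", key=key, value=value)
```

Keying each message by the row's primary key keeps updates to the same row in the same Kafka partition, which preserves their ordering downstream.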
Once the data is in Kafka, Kafka acts as the messaging system for the streaming architecture; it is the transport layer. From there, Kafka can serve the data to stream processing engines such as Spark Streaming and Flink. The stream processing layer is used for purposes such as ETL, analytics, and aggregation. Once the processing is done, you would want to store the results in a persistence layer to make them available to downstream applications.
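To make the aggregation step concrete, here's a toy illustration in plain Python (not actual Spark Streaming or Flink code) of the kind of tumbling-window count a stream processing engine would compute at scale:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    # Assign each (timestamp, key) event to a fixed, non-overlapping
    # window and count occurrences per key within each window.
    counts = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        counts[window_start][key] += 1
    return {w: dict(c) for w, c in counts.items()}

# Events as (timestamp_in_seconds, event_type) pairs.
events = [(0, "click"), (3, "click"), (7, "view"), (12, "click")]
result = tumbling_window_counts(events, window_seconds=10)
# result: {0: {"click": 2, "view": 1}, 10: {"click": 1}}
```

A real engine adds the hard parts this sketch omits: distributing the computation, handling late-arriving events, and checkpointing state for fault tolerance.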
Let's talk about how we can make this whole architecture more agile and a little simpler. Let's start with the REST API. Let me just draw it here to explain it. The REST API allows any programming language in any environment to write data into Kafka over HTTP. At the same time, remember, we also have legacy data sources that often need to talk to Kafka. For those, the community has developed a framework called Kafka Connect. Kafka Connect is a set of pre-built connectors that can help you get data from your legacy systems into Kafka. With these two pieces, you can now get data in and out of Kafka without custom glue code, and the whole architecture is simpler.
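For example, the Kafka REST Proxy accepts produce requests as JSON over HTTP. A minimal sketch of building such a request in Python follows; the topic name and proxy URL are made up, while the envelope shape and content type follow the REST Proxy's v2 JSON embedded format:

```python
import json

# The REST Proxy's v2 JSON produce envelope: a list of records,
# each with an optional key and a value.
payload = {
    "records": [
        {"key": "sensor-1", "value": {"temp_c": 21.4}},
        {"key": "sensor-2", "value": {"temp_c": 19.8}},
    ]
}

body = json.dumps(payload)
headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

# A real client would POST this to the proxy, e.g. with `requests`:
#   requests.post("http://rest-proxy:8082/topics/sensor-readings",
#                 data=body, headers=headers)
```

Because the producer side is just an HTTP POST, any language with an HTTP client can write to Kafka without a native Kafka library.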
Kafka Connect offers pre-built connectors so you don't have to write custom code every time you want to get data in and out of a legacy system. And Kafka Connect is not only a data import tool; it can also export data from Kafka to external targets. Next, let's talk about how we can converge certain components of this architecture into one platform, on one cluster, in one system.
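A Kafka Connect connector is configured declaratively rather than coded. As an illustration, here's a sketch of a JDBC source connector config built as a Python dict; the property names follow Confluent's JDBC source connector, while the connection URL, table, and topic prefix are placeholders:

```python
import json

# Declarative connector config: the JDBC source polls a legacy
# database table and writes new rows to a Kafka topic. All
# connection details below are placeholders.
connector = {
    "name": "legacy-db-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://legacy-db:5432/sales",
        "mode": "incrementing",            # detect new rows via an id column
        "incrementing.column.name": "id",
        "table.whitelist": "transactions",
        "topic.prefix": "legacy-",         # rows land in topic "legacy-transactions"
    },
}

body = json.dumps(connector)
# A real deployment would POST this to the Connect worker's REST API:
#   requests.post("http://connect:8083/connectors", data=body,
#                 headers={"Content-Type": "application/json"})
```

Swapping the source connector for a sink connector (for example, an HDFS or Elasticsearch sink) is the export direction mentioned above: same framework, opposite data flow.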
With the MapR Data Platform, we replace Kafka with MapR Streams (now called MapR Event Store for Apache Kafka), which, by the way, uses the same API, so all your Kafka applications will work on MapR too. The MapR Data Platform converges all the required components for transport, processing, and persistence on one single platform, in one cluster, in one system. Everything you see here inside this red box is actually running on the same platform in the same cluster; it is all converged on MapR. This eliminates data movement between different clusters. As a result, we extend the agility and simplicity of the architecture even further: because you no longer have to move data between clusters, you reduce latency and gain a simplicity that was not available before.
There you have it: how you can make your architecture simpler and more agile using the MapR Data Platform. Thank you for watching. If you have any questions, please feel free to write them in the comments below.