In this week's whiteboard walkthrough, Tugdual Grall, technical evangelist at MapR, explains the advantages of a publish-subscribe model for real-time data streams.
Here's the transcription:
Hello, I'm Tug, and I work as a technical evangelist at MapR. Let's talk today about data streams. As you know, MapR is a converged data platform that allows you to store and process data of any type, at any volume. One of the latest things we have added to the platform is streams, MapR Streams (now called MapR Event Store), which gives you an API and an infrastructure layer to stream data into and out of the platform.
What I want to explain now is why streaming data is important for your company, for your business. It's all related to real time and real-time processing of the data. Like always, you have many data sources. You have data coming from application logs, and you have data coming from the applications themselves. It could be business data, or it could be user clicks that you want to capture. You can even go down to the system or database level when you want to capture specific events. For example, when somebody is inserting, modifying, or deleting a record, you want to be able to push that into your platform. In the past, what we were doing was using a batch approach: taking some logs, putting the logs in the file system, and doing some processing and transformation.
Using a stream approach, you stream the data into the system in real time, and you use the same model every time you add a new data source. You can transform messages as you get them. For example, you may want to add some information to a message, or correlate it, using machine learning techniques, with all the other information inside the system. Once you have done that, and this is very important, the processing is completely disconnected from the data sources. You can push the data into a target application. It's not really a push, because it is the target application that subscribes to the event and gets the data out of the system, for example to create a data lake. You want to be able to take your data, your user interactions, transform them, and push them for analytics, a 360-degree view of a customer, for example.
Also, using the same publish-subscribe approach, you can create APIs or real-time alerts. I will take one concrete example to show the difference between a batch approach, which will not be in real time, and streams, which allow you to do data management and data processing in real time. Suppose you are a credit card company and you have to capture an event every time somebody pays with a credit card in a shop. From anywhere, from any type of data source, you capture this event and send the message into the platform, to the brokers. The broker just captures the message. All the applications that are subscribed to this specific event will grab the message and work with it. For example, you want to do all the analytical parts: you take the message and put it in the data lake, adding some information about the customer and about the shop where the credit card was used.
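To make the publish-subscribe flow described here concrete, the following is a minimal in-memory sketch in Python. It is not the MapR Streams API: the `Broker` class, the `payments` topic, the `CUSTOMERS` and `SHOPS` lookup tables, and the message fields are all hypothetical names chosen for illustration. As a simplification, this toy broker calls each subscriber directly, whereas in a real streaming platform the subscribing application pulls messages from the broker.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = dict
Handler = Callable[[Message], None]


class Broker:
    """Toy in-memory stand-in for a stream broker (hypothetical, not the MapR API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Any number of applications can subscribe to the same topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> None:
        # Each subscriber gets its own copy, so one consumer's changes
        # never leak into another consumer's view of the message.
        for handler in self._subscribers[topic]:
            handler(dict(message))


# Hypothetical reference data used to enrich payment events.
CUSTOMERS = {"card-1": {"customer": "Alice"}}
SHOPS = {"shop-9": {"city": "Paris"}}

data_lake: List[Message] = []
alert_inbox: List[Message] = []


def to_data_lake(msg: Message) -> None:
    # Analytical flow: add customer and shop information, then store.
    enriched = {**msg, **CUSTOMERS.get(msg["card"], {}), **SHOPS.get(msg["shop"], {})}
    data_lake.append(enriched)


def to_alerting(msg: Message) -> None:
    # Alerting flow: receives the same event, independently of the data lake.
    alert_inbox.append(msg)


broker = Broker()
broker.subscribe("payments", to_data_lake)
broker.subscribe("payments", to_alerting)

# The producer knows only the topic name, not who consumes the event.
broker.publish("payments", {"card": "card-1", "shop": "shop-9", "amount": 42.0})
```

The point of the design is that the producer and the two consumers never reference each other: adding a third service means adding one more `subscribe` call, with no change to the code that publishes the payment.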
Also, with any chosen application, you can add some services to the application, for example real-time alert services. In this case, when the message is captured, independently of the data lake flow, you analyze it using some machine learning, for example Spark machine learning, to see if this credit card has been used on two different continents at the same time. If you find this, the alert mechanism will receive the message and send an alert in real time.
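As an illustration of the alert service, here is a small Python sketch of the "two continents at the same time" check. The talk mentions machine learning; this sketch deliberately swaps in a plain rule instead, to keep the example short. The `ContinentCheck` class, its field names, and the one-hour window standing in for "at the same time" are assumptions, not something taken from the talk or from any MapR or Spark API.

```python
from typing import Dict, Optional, Tuple

# Assumed threshold for "at the same time": two uses within one hour.
WINDOW_SECONDS = 3600


class ContinentCheck:
    """Rule-based stand-in for the alert service (hypothetical names)."""

    def __init__(self, window_seconds: int = WINDOW_SECONDS) -> None:
        self.window_seconds = window_seconds
        # Last known (continent, timestamp) for each card.
        self._last_seen: Dict[str, Tuple[str, float]] = {}

    def on_payment(self, card: str, continent: str, timestamp: float) -> Optional[str]:
        """Process one payment event; return an alert text or None."""
        previous = self._last_seen.get(card)
        self._last_seen[card] = (continent, timestamp)
        if previous is None:
            return None
        prev_continent, prev_timestamp = previous
        close_in_time = abs(timestamp - prev_timestamp) <= self.window_seconds
        if prev_continent != continent and close_in_time:
            return (
                f"ALERT: card {card} used in {prev_continent} "
                f"and {continent} within {self.window_seconds}s"
            )
        return None


check = ContinentCheck()
first = check.on_payment("card-1", "Europe", timestamp=0)
second = check.on_payment("card-1", "Asia", timestamp=200)
```

Here `first` is `None` because there is nothing to compare against yet, and `second` carries an alert because the same card shows up on another continent 200 seconds later. In a streaming setup, `on_payment` would be called by a subscriber for each incoming message, so the alert is raised as the event arrives rather than after an hourly or daily batch.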
In the same way, you may want to capture information from the logs about authentication failures. Every time somebody logs in to a system, it will capture the message, do some work, and put that in the data lake to build up information about user behavior on your website. When you have this message, you can in the same way offer new services, such as an API about the user, and capture authentication failures. If a specific user has tried too many times to log in from different IP addresses, you can treat that as a fraud alert inside your system.
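The authentication example can be sketched the same way. The Python below keeps a sliding window of failed logins per user and flags the user once there are too many failures from enough distinct IP addresses. The `LoginFailureMonitor` class and its thresholds (`max_failures`, `min_distinct_ips`, `window_seconds`) are hypothetical values chosen for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class LoginFailureMonitor:
    """Flags users with many recent failed logins from several IPs (hypothetical)."""

    def __init__(
        self,
        max_failures: int = 5,
        min_distinct_ips: int = 3,
        window_seconds: float = 600,
    ) -> None:
        self.max_failures = max_failures
        self.min_distinct_ips = min_distinct_ips
        self.window_seconds = window_seconds
        # Recent (timestamp, ip) failures for each user, oldest first.
        self._failures: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)

    def on_failure(self, user: str, ip: str, timestamp: float) -> bool:
        """Record one failed login; return True if it should raise an alert."""
        events = self._failures[user]
        events.append((timestamp, ip))
        # Drop failures that have fallen out of the sliding window.
        while events and timestamp - events[0][0] > self.window_seconds:
            events.popleft()
        distinct_ips = {addr for _, addr in events}
        return (
            len(events) >= self.max_failures
            and len(distinct_ips) >= self.min_distinct_ips
        )


monitor = LoginFailureMonitor(max_failures=3, min_distinct_ips=2, window_seconds=60)
results = [
    monitor.on_failure("bob", "10.0.0.1", 0),
    monitor.on_failure("bob", "10.0.0.1", 10),
    monitor.on_failure("bob", "10.0.0.2", 20),
]
```

With these thresholds, only the third failure returns `True`: it is the third failure within 60 seconds and the second distinct IP address. Repeated failures from a single address never trigger this particular rule, which matches the "different IP addresses" condition in the talk.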
So that is the big difference. In a batch-oriented approach, you take all the data into the system every hour or every day, and then you do the work. With a stream-based architecture, you do that in real time. You take the data. You subscribe. The broker is there to handle all the messages, and applications subscribe to get the data in real time.
Thank you for watching. If you have any questions or comments, just put them below.