In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, describes a valuable capability of Apache Flink stream processing: grouping together events that occurred within a configurable window of time, measured in event time.
For other related resources:
Download the O’Reilly book *Introduction to Apache Flink* by Ellen Friedman and Kostas Tzoumas (September 2016)
Download the O’Reilly book *Streaming Architecture: New Designs Using Apache Kafka and MapR Streams* by Ted Dunning and Ellen Friedman (March 2016)
Here is the transcription:
Hello, my name is Stephan. I'm one of the creators of Apache Flink and CTO of data Artisans. Today, I'm going to teach you something about event stream processing with Apache Flink.
In order to illustrate what event time is and why it's an incredibly useful thing, let me introduce you to a typical stream processing scenario. We have a set of producers. Think of a typical Internet of Things setup: event producers such as cell phones, cell phone towers, and smart cars are all continuously producing events and pushing them, at some point, into a data center. This is a simplified view; there are usually multiple steps involved, but just think of it as the simplified version for a while. The events are being pushed, and they end up at some point in a message queue in the data center. It's important to note that these producers are frequently connected and disconnected. Now, over this sequence of events produced by these IoT data producers, at some point you may want to run a set of streaming analytics: filtering out garbage, correlating events, detecting anomalies and outliers, or just computing statistics over time. All of these are typical use cases.
We're going to use Apache Flink for this, and we're looking at a very simple analytical scenario here. We have basically two operators running in Apache Flink: a source operator, which simply picks the data up from the message queue, and something called a windowing operator. Windowing means grouping together events that belong together by some notion, and the most prominent notion is time. For example, grouping events that occurred within, say, 15 seconds of each other, or within the same hour, whatever is relevant for the application.
Let's take the example of actually grouping these events together by time, say by minute. We have all these producers pushing messages into the queue, and we have the Flink application that receives them. Now, if we group them by minute, the first thing we have to think about is what kind of time we are actually referring to. Do we refer to the minute in which the stream processor sees the event, or to the minute in which the event was produced at the device? Those two notions are called processing time and event time.
Event time refers to the point in time when the data was actually produced. Processing time refers to the point in time when it was received and analyzed by the stream processor. Now, if you think about devices that are always connected, you will see that event time and processing time are often fairly close together. In modern infrastructure, it doesn't take long for an event to travel through the network and the message queue to the data center and the analytical application.

However, in a more Internet of Things kind of scenario, you will see that the two can diverge heavily, mainly because not all of these devices are connected all of the time. A cell phone may happen to be in an area where coverage is bad. It may run out of battery and shut down, store the events it hasn't yet transmitted, and transmit them when it's activated again. Similar things happen with a smart car, and even certain types of sensors gather data but don't always transmit it, simply to save energy, because transmitting data is so energy intensive. What that means is that by the time these events arrive for analysis, you are at a very different point in real-world clock time than when they were produced. So if you correlate events, say events from sensors that monitor forest fires, it's probably far more interesting to correlate them by when they were observed by the sensor than by when they were seen by the stream processor, and that means analyzing them by event time rather than by processing time. When analyzing data by event time, it's very important to account for the fact that data arrives delayed or out of order, and here is what Apache Flink does internally to help you do that.
If you want to analyze data by event time, what you're going to tell the system is how it can figure out the event time of each individual event. Most sensors, or most producing devices, attach a time stamp directly to the event that you can refer to at the source, and then the source is going to classify this event as part of that event time point.
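To make the idea concrete, here is a minimal sketch in Python of what "figuring out the event time of each event" amounts to. The event shape and all names (`produced_at`, `extract_event_time`) are illustrative assumptions, not Flink's actual API, where you would supply a timestamp extractor to the source instead.

```python
# Hypothetical event shape: each record carries the timestamp the
# producing device attached at production time (names are illustrative,
# not the real Flink API).
def extract_event_time(event: dict) -> float:
    """Return the event-time timestamp (epoch seconds) the producer attached."""
    return event["produced_at"]

event = {"sensor_id": "tower-17", "produced_at": 1_700_000_000.0, "reading": 42}
ts = extract_event_time(event)  # the source classifies the event by this time
```

In Flink proper, this logic lives in a user-provided timestamp assigner that the source applies to every record as it is read from the message queue.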
The source is also going to create something like a progress report in event time, a concept we internally call event time low watermarks. These progress reports indicate how far event time has progressed in the data you have seen so far. It's a mechanism that is user and application specific: it takes into account how late you expect data to arrive, how much late data you want to tolerate, and where you want to define a cutoff point.
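One common way to generate such a progress report is to let the watermark trail the highest event timestamp seen so far by a fixed, application-chosen delay. The sketch below, in Python with illustrative names, shows that idea; Flink ships its own watermark generators with a different API.

```python
class BoundedOutOfOrdernessWatermarks:
    """Sketch of an event-time low-watermark generator: the watermark
    trails the highest event timestamp seen so far by a fixed delay
    that the application is willing to tolerate."""

    def __init__(self, max_delay_seconds: float):
        self.max_delay = max_delay_seconds
        self.max_seen = float("-inf")

    def on_event(self, event_time: float) -> float:
        # Watermarks never move backwards, so track the maximum timestamp.
        self.max_seen = max(self.max_seen, event_time)
        return self.current_watermark()

    def current_watermark(self) -> float:
        # Meaning: "events with timestamps up to this point have been seen."
        return self.max_seen - self.max_delay

wm = BoundedOutOfOrdernessWatermarks(max_delay_seconds=5.0)
wm.on_event(100.0)  # watermark advances to 95.0
wm.on_event(90.0)   # an out-of-order event; the watermark stays at 95.0
```

The `max_delay_seconds` parameter is exactly the application-specific choice described above: how much lateness you expect and are willing to wait for.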
When the stream processor picks these events up from the message queue, classifies them, and attaches an event time stamp to them, it puts them into windows according to that event time stamp. So unlike processing time, where windows complete strictly one after another, at any point in time many event time windows can be in progress. Let's assume we have a window from 12:00 to 12:30, a window from 12:30 to 1:00, and so on. At any point in time, the operator picks up an event, classifies it, and injects it into the corresponding window for that particular time.
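The classification step for tumbling time windows is just arithmetic on the timestamp. A small sketch, assuming timestamps in seconds and the half-hour windows from the example:

```python
def window_for(event_time: float, window_size: float) -> tuple:
    """Map an event-time timestamp (in seconds) to its tumbling
    window [start, end). Half-hour windows use window_size=1800."""
    start = event_time - (event_time % window_size)
    return (start, start + window_size)

# Measuring from 12:00 = second 0: an event at 12:10 lands in the
# 12:00-12:30 window, an event at 12:31 in the 12:30-1:00 window.
window_for(610.0, 1800.0)   # (0.0, 1800.0)
window_for(1860.0, 1800.0)  # (1800.0, 3600.0)
```

Because assignment depends only on the event's own timestamp, events for many different windows can be classified while all of those windows remain open.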
The window operator is also going to wait for the watermark. You can think of the watermark as the progress report that event time has reached a certain point, meaning all events that belong to that period of time have now been observed. At some point, when the watermark tells the operator that 12:30 in event time has been reached (this may actually be at 12:45 in wall clock time, or at 12:30 and two seconds, depending on how late and how out of order your data is), it will evaluate the window, compute the final aggregate (the final count, a sum, a standard deviation), and send it downstream as a result. You can think of the window operator as a point that reorders out-of-order data, that buffers data and waits for late data, and only emits results once the system thinks: okay, I've actually seen everything of interest now. That way, events that occurred at the same time in the real world, but took a longer or shorter time to arrive at the application, are brought back together.
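The buffer-then-fire behavior described above can be sketched as follows. This is a simplified Python model with invented names, not Flink's window operator: it buffers events per window and emits a window's aggregate (here, just a count) only once a watermark passes the window's end.

```python
from collections import defaultdict

class EventTimeWindowOperator:
    """Sketch: buffer events into tumbling event-time windows and emit a
    window's aggregate (a count) once the watermark passes its end."""

    def __init__(self, window_size: float):
        self.window_size = window_size
        self.buffers = defaultdict(list)   # window start -> buffered events
        self.results = {}                  # window start -> emitted count

    def on_event(self, event_time: float, value) -> None:
        start = event_time - (event_time % self.window_size)
        self.buffers[start].append(value)

    def on_watermark(self, watermark: float) -> None:
        # Fire (and discard) every window whose end the watermark passed.
        for start in sorted(self.buffers):
            if start + self.window_size <= watermark:
                self.results[start] = len(self.buffers.pop(start))

op = EventTimeWindowOperator(window_size=1800.0)
op.on_event(100.0, "a")    # lands in window [0, 1800)
op.on_event(1900.0, "b")   # lands in window [1800, 3600)
op.on_event(200.0, "c")    # out of order, still lands in [0, 1800)
op.on_watermark(1800.0)    # closes [0, 1800); [1800, 3600) stays open
```

Note how "b" arrives before "c" in processing order, yet "c" is still counted in the earlier window: the buffering effectively reorders out-of-order data.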
Now, this can obviously lead to delays in stream processing: if you have to wait for late data, you inevitably introduce a delay while waiting for it. Because of that, these windows can actually use combinations of event time and processing time. This means you can tell the windows to group the data by event time, say in these half-hour windows from 12:00 to 12:30, but not to wait forever if data up to 12:30 is still missing: wait for a maximum of, say, five minutes, then close, emit, and drop anything that happens to come later. That way, you can catch a bounded amount of lateness and still get much better results than if you worked purely by processing time, but you also bound your latency.

What you can also do is combine the two so that you tell the system to produce a final result in event time, which is going to be as complete as possible, but also produce early previews of the result based on processing time. You could say that by the time you reach 12:30, you want to emit a result that contains everything observed up to that point. But you hold on to the data, keep refining it with all the late data that keeps coming, and emit further results the more you see, until you emit a final result that contains everything. With that mechanism, you can balance the trade-offs between making more sense of the data (because you correlate it properly by event time), completeness, and latency.
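The early-result-plus-refinement idea can be sketched for a single window as below. This Python model with illustrative names is a simplification of Flink's richer trigger and allowed-lateness machinery: it emits a preliminary count when the watermark reaches the window's end, keeps refining as late events arrive, and finalizes once the watermark passes the lateness cutoff.

```python
class LateTolerantWindow:
    """Sketch of one event-time window with a bounded-lateness cutoff:
    emit an early result at the window's end, refine with late events,
    and finalize (dropping anything later) after allowed_lateness."""

    def __init__(self, start: float, size: float, allowed_lateness: float):
        self.start, self.end = start, start + size
        self.cutoff = self.end + allowed_lateness
        self.count = 0
        self.closed = False
        self.firings = []   # (label, count) results sent downstream

    def on_event(self, event_time: float) -> None:
        if self.closed or not (self.start <= event_time < self.end):
            return   # wrong window, or arrived after the cutoff: dropped
        self.count += 1

    def on_watermark(self, watermark: float) -> None:
        if self.closed:
            return
        if watermark >= self.cutoff:
            self.firings.append(("final", self.count))
            self.closed = True
        elif watermark >= self.end:
            self.firings.append(("early", self.count))

w = LateTolerantWindow(start=0.0, size=1800.0, allowed_lateness=300.0)
w.on_event(100.0)
w.on_watermark(1800.0)   # emits an early, possibly incomplete result
w.on_event(500.0)        # a late event refines the window
w.on_watermark(2100.0)   # emits the final result and closes the window
```

Downstream consumers then see a stream of progressively refined results for the same window, trading completeness against latency exactly as described above.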
That's it — thanks for watching! If you have any questions, feel free to put them in the little bottom box below the video.