Savepoints in Apache Flink Stream Processing – Whiteboard Walkthrough


In this week's Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, explains how to use savepoints, a unique feature in Apache Flink stream processing, to reprocess data, fix bugs, handle upgrades, and do A/B testing.

Here's the unedited transcription:

Hello. My name is Stephan. I'm one of the original creators of Apache Flink, and CTO of data Artisans. I'd like to tell you a little bit about stream processing with Apache Flink today. In particular, I'm going to talk about a unique feature in Flink called savepoints, and the applicability of savepoints to several operational problems in stream processing, such as: How do you do reprocessing? How do you do upgrades of running programs? How do you do bug fixes? How do you do A/B testing?

For this talk, I'm going to assume a pretty classical stream processing setup. We have a message bus or a log service. It could be MapR Event Store or Apache Kafka. Events are being added to the log. On the other hand, we have a stream processor, which here is Apache Flink, which picks up messages from the log and computes streaming analytics over these events.

The example I've picked here is that the stream processor computes a set of statistics based on sessionization of these events. You can think of these sessions as being bursts of sensor data, or being user interaction sessions with a certain service. The stream processor sessionizes the events, and then classifies certain sessions as outliers, or just computes session lengths, or classifies them. Once you have this application set up, it's going to run continuously. Events are being added to the message bus, and the stream processor picks them up, appends them, and grows the sessions.
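To make the sessionization idea concrete, here is a minimal sketch of gap-based sessionization in plain Python. This is illustrative only, not the Flink API: events are hypothetical (key, timestamp) pairs, and a new session starts whenever the gap since the key's previous event exceeds a threshold.

```python
def sessionize(events, gap=30):
    """Group a time-ordered list of (key, timestamp) events into sessions.

    A session for a key is extended while consecutive events arrive within
    `gap` time units of each other; a larger gap starts a new session.
    """
    sessions = {}  # key -> list of sessions, each a list of timestamps
    for key, ts in events:
        key_sessions = sessions.setdefault(key, [])
        if key_sessions and ts - key_sessions[-1][-1] <= gap:
            key_sessions[-1].append(ts)   # within the gap: extend session
        else:
            key_sessions.append([ts])     # gap exceeded: start a new one
    return sessions

events = [("a", 0), ("a", 10), ("b", 5), ("a", 100), ("b", 200)]
print(sessionize(events))
```

In Flink itself this corresponds to keying the stream and applying a session window with a gap; the sketch just shows the grouping logic the talk describes.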

At some point in time, what you may actually want to do is upgrade your streaming application, because you came up with a new way to classify outliers, or you may want to do the sessionization slightly differently. Another possibility could be that you introduced a bug a while back in your application and you actually want to reprocess the data. Say, go back a week, use the bug-fixed version of the program, and just reprocess the last week of data, replacing everything with the proper result.

Let me take that as an example. How do you do this in a continuous application like the streaming application? The way you would do this in Apache Flink is by using the savepoints feature. The savepoints feature allows you to take a point-in-time snapshot of your entire streaming application. That point-in-time snapshot is going to contain information about exactly where in the input stream you were, and it's going to contain all of the information about the pending state, or the in-flight sessions, that you have at that point in time. You could think of it as conceptually "stopping" the streaming application, taking a picture of it, and then continuing it. In practice, though, it does this without stopping the application; it runs completely in the background.
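Conceptually, then, a savepoint pairs the read position in each input partition with the in-flight operator state at one consistent point in time. The following is a hypothetical Python model of that idea, not Flink's actual savepoint format; all names and values are made up for illustration.

```python
from dataclasses import dataclass

# Conceptual model of what a savepoint captures (illustrative only):
# where each input partition was read up to, plus the in-flight state
# (here, open sessions) at that consistent point in time.
@dataclass(frozen=True)
class Savepoint:
    offsets: dict  # input partition -> read position in the log
    state: dict    # key -> in-flight state, e.g. an open session

running_offsets = {"partition-0": 4210, "partition-1": 3977}
open_sessions = {"user-17": ["click", "scroll"]}  # hypothetical open session

sp = Savepoint(offsets=dict(running_offsets), state=dict(open_sessions))
# Restoring means rewinding the sources to sp.offsets and
# re-initializing the operators from sp.state.
print(sp.offsets["partition-0"])
```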

If we run this (or any) streaming application, you can just use the command-line utilities and tell it, "Please take a savepoint of that application now." The savepoint is typically stored on something like a distributed file system. It will, like I said, contain the metadata about where in the input streams the current streaming application is, as well as the pending state: the currently in-flight state of the streaming application (the live sessions up to that point). It will not, for example, contain a session that has already been completed, and it will not contain any events that have not been processed yet. The savepoint contains the information of both of these parts. What you can then do is take the program, stop it, and just tell the system, "Please restart that program from that savepoint." This will conceptually rewind the entire application back by a week, or however far back you took the savepoint. You can also take a modified variant of the program and have it resume from that savepoint. That way you can upgrade a program, or you can say, "I upgraded the program from a savepoint a week back, but in the course of the upgrade I introduced a bug." You would then go back once more and reprocess with the fixed version.
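With Flink's command-line client, the trigger-and-restore cycle described above looks roughly like this. The job ID, savepoint path, and jar name are placeholders; the exact flags can vary between Flink versions, so check the documentation for the release you run.

```shell
# Trigger a savepoint for a running job (job ID is a placeholder).
# The command prints the path the savepoint was written to.
bin/flink savepoint <jobID>

# Stop the running job.
bin/flink cancel <jobID>

# Resume the same program -- or a modified, bug-fixed variant -- from
# the savepoint (path and jar name are placeholders).
bin/flink run -s hdfs:///savepoints/savepoint-xyz my-streaming-app.jar
```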

Another very powerful way to use this feature is for A/B testing. Let's call this variant of the program "variant A." At a certain point in time, you have an idea of how to improve your recommender, or how to improve the outlier or anomaly classification. What you would do in that case is take a savepoint of program A, store it, and then write a B variant of the program. The B variant of the program could then be started from that savepoint as well. It would run concurrently with variant A and pick up exactly at the point where you took the savepoint of variant A, picking up the in-flight sessions from variant A and just continuing the computation from there.
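In CLI terms, the A/B setup is just a second job started from A's savepoint while A keeps running. Again, the job ID, savepoint path, and jar names are placeholders for illustration.

```shell
# Variant A is running; take a savepoint of it.
bin/flink savepoint <jobID-of-A>
# Suppose it prints: hdfs:///savepoints/savepoint-a-0001

# Start variant B from A's savepoint. A keeps running; B picks up A's
# in-flight sessions at that point and computes concurrently alongside A.
bin/flink run -s hdfs:///savepoints/savepoint-a-0001 variant-b.jar
```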

The savepoint feature actually does not tie savepoints and jobs together in a strict way, which lets you go even further: you can create a savepoint from B and another savepoint from A, and then move program B onto the savepoint from program A. Or you can even start a variant C from a later savepoint of A or B. What you get is almost a versioning and snapshot mechanism for your entire continuous application. The versioning takes care of both where you are in your input and where you are in your application state.

Another powerful way to use this is when you only have limited capacity on the message bus (something that is especially relevant for Apache Kafka, where data is retired after a certain retention period). You want to be able to rewind applications at least up to the point where the data is still in Kafka, but you never want to have to rewind further, or rely on data from before a certain point in time. Savepoints give you exactly such points behind which you can retire the data, because you always have a fallback point that is not too far in the past and does not rely on earlier data in the message queue.
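The retention reasoning above can be sketched in a few lines. This is a hypothetical helper, not part of Flink: given the read offsets recorded in the savepoints you still want to restore from, any log data strictly before the minimum recorded offset per partition is safe to retire.

```python
def retirable_offsets(savepoint_offsets):
    """Compute, per partition, the offset before which log data can be retired.

    savepoint_offsets: a list of dicts (partition -> read offset), one per
    savepoint you still want to be able to restore from. Restoring any of
    them never needs data earlier than the minimum offset per partition.
    """
    safe = {}
    for offsets in savepoint_offsets:
        for partition, off in offsets.items():
            safe[partition] = min(safe.get(partition, off), off)
    return safe

kept = [
    {"p0": 1000, "p1": 800},   # an older savepoint we still keep
    {"p0": 1500, "p1": 1200},  # a newer savepoint
]
print(retirable_offsets(kept))
```

Everything behind the returned offsets can be deleted from the log without losing the ability to restore any kept savepoint.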

Thanks for watching. If you have any questions, please add them to the comments section below.

This blog post was published July 21, 2016.
