Apache Spark vs. Apache Flink – Whiteboard Walkthrough

Contributed by Balaji Mohanam

In this week's whiteboard walkthrough, Balaji Mohanam, Product Manager at MapR, explains the difference between Apache Spark and Apache Flink and how to decide which one to use.

Here is the transcription:

My name is Balaji Mohanam, and I'm a product manager at MapR Technologies. Today, I'm going to talk about Apache Spark and Apache Flink. There's been a lot of buzz going around Spark and Flink, and as a result, there are a lot of questions around when to use Spark and when to use Flink. I'm going to talk about several features comparing and contrasting Spark and Flink, as well as give you some use cases which would help you decide when to use Apache Spark Streaming and when to use Flink.

Both Spark Streaming and Flink provide you with an exactly-once guarantee, which means that every record will be processed exactly once, thereby eliminating any duplicates that might otherwise occur. Both Spark Streaming and Flink also provide you with very high throughput compared to other processing systems like Storm.

The overhead of fault tolerance is low in both processing engines. Where Spark Streaming and Flink differ is in their computation model. While Spark has adopted micro-batching, Flink has adopted a continuous flow, operator-based streaming model.

As far as window criteria go, Spark has time-based window criteria, whereas Flink has record-based or any custom user-defined window criteria. While Spark provides configurable memory management, Flink provides automatic memory management, although as of the Spark 1.6 release, Spark has moved toward automating memory management as well.
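To make the window criteria concrete, here is a minimal sketch in plain Python rather than in either engine's API. The `events` list and the helper names `time_windows` and `count_windows` are hypothetical and exist only for illustration. It groups the same records once by time and once by record count.

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, value) records, for illustration only.
events = [(0.5, 10), (1.2, 20), (1.9, 30), (4.1, 40), (4.3, 50)]


def time_windows(records, size_seconds):
    """Group records into fixed, non-overlapping windows by timestamp."""
    windows = defaultdict(list)
    for timestamp, value in records:
        # The window a record belongs to depends only on when it arrived.
        window_start = int(timestamp // size_seconds) * size_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


def count_windows(records, size_records):
    """Group records into windows of size_records records each."""
    values = [value for _, value in records]
    # The last window may be partial; a real engine might hold it
    # back until enough records have arrived to fill it.
    return [values[i:i + size_records]
            for i in range(0, len(values), size_records)]


time_based = time_windows(events, size_seconds=2)
count_based = count_windows(events, size_records=2)

print(time_based)
print(count_based)
```

Notice that the time-based grouping produces windows of uneven size when records arrive in bursts, while the record-based grouping keeps the size fixed and lets the time span vary.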

Having looked at the various features of Spark and Flink, let's look at the three different computation models, namely batch, micro-batch, and continuous flow operator. Batch is essentially processing data at rest: taking a large amount of data at once, processing it, and then writing out the output. Micro-batch, interestingly, combines aspects of both batch and the continuous flow operator: it divides the input into a series of micro-batches, which are basically atomic, deterministic, and usually time-based, and which are then executed. As a result, micro-batching is essentially a “collect and then process” kind of computational model, whereas a continuous flow operator processes data as it arrives, without any delay in collecting or processing the data.

To give you a good analogy, imagine collecting flowing water in a bucket and then pouring it out, versus putting a pipe there and letting the water flow continuously without any intermediate delays. That's essentially the difference between a micro-batch and a continuous flow operator.
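The bucket-versus-pipe difference can be sketched as a small simulation. This is plain Python under simplifying assumptions, not Spark or Flink code: the arrival times are made up, processing time is treated as zero, and the function names are hypothetical.

```python
import math

# Hypothetical record arrival times, in seconds.
arrivals = [0.1, 0.4, 0.9, 1.3, 2.7]


def micro_batch_emit_times(arrival_times, batch_interval):
    """Bucket: a record waits until the end of the interval it arrived in."""
    return [math.floor(t / batch_interval + 1) * batch_interval
            for t in arrival_times]


def continuous_emit_times(arrival_times):
    """Pipe: a record is handled the moment it arrives."""
    return list(arrival_times)


batch_emits = micro_batch_emit_times(arrivals, batch_interval=1.0)
stream_emits = continuous_emit_times(arrivals)

# Extra waiting time each model adds on top of the arrival time.
batch_delays = [emit - t for emit, t in zip(batch_emits, arrivals)]
stream_delays = [emit - t for emit, t in zip(stream_emits, arrivals)]

print(batch_emits)
print(max(batch_delays), max(stream_delays))
```

In this simplified model, the micro-batch delay can be as large as the batch interval itself, which is why the interval ends up setting a floor on latency.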

Spark essentially started as a batch processor, and eventually started adding more and more capabilities that make it more of a real-time stream processor as well. Flink, during its initial research stages, started out solving problems around batch, but along the way, its researchers identified several interesting challenges in the real-time streaming paradigm. As a result, they pivoted more toward a continuous flow operator-based model and treated batch as a special case of real-time streaming.

In terms of deciding whether to do batch, micro-batch, or continuous flow operator, there are several tradeoffs that need to be made, mainly latency, throughput, and reliability.

Why is latency so important? The traditional wisdom was that data has a lot of value no matter how old it is. As processing capabilities increased, businesses started realizing that the value of information is highest at the moment the event happens, or when the data is gathered. They want to process data as and when it happens, which dictates a need for a real-time processing system.

Now I'm going to talk about some sample use cases to help you decide whether to use micro-batches or real-time streaming for your specific use case. The first use case that I'm going to talk about is finance, specifically credit card fraud detection. Fraud detection is very different from fraud prevention. Detection is something that can happen over a micro-batch or real-time streaming, whereas fraud prevention has to happen in real time. Imagine a user is making a transaction and you want your system to see whether it’s a fraudulent transaction or a valid transaction. That usually has to happen in less than a second; otherwise, you're keeping your user waiting for a long period of time. In that case, real-time streaming is very much required.

The second thing that I'm going to talk about is the ad tech industry, where there are two different use cases. Let's talk about the first one, where you're aggregating requests from different IP addresses to classify whether each IP address should be blacklisted or whitelisted, in which case you could use either micro-batch or real-time streaming. Now let's say you're trying to prevent processing a particular request because you already know it's from a blacklisted IP. In that case, things have to happen in real time, requiring very low latencies, for which you need real-time streaming.

The third use case I'm going to talk about is the telecom industry. Let's talk about two different types of use cases. One is usage aggregation, for example, how much bandwidth a particular user has used, in which case you really don't need a real-time streaming solution. Here, a micro-batch is probably the more efficient solution. The other is network anomaly detection, which really has to happen in real time, so a real-time streaming solution is very much required.

The final thing is in IoT. For example, let's say you have two different use cases: one is where you're running a trend line to see the usage of a particular piece of industrial equipment. The other is where you want to create an alert when a certain threshold is reached. In the first case, either micro-batching or real-time streaming serves your purpose. In the second scenario, where alerting is critical and you want to raise an alert as soon as a particular threshold is reached, real-time streaming is very much required. Micro-batching is not an efficient solution for it.
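The two IoT cases can be sketched side by side. This is a minimal plain-Python illustration, not code for either engine. The sensor readings, the `THRESHOLD` value, and the helper names are all hypothetical. The trend line is computed per micro-batch, while the alert check runs on each record as it arrives.

```python
THRESHOLD = 90.0  # hypothetical alert threshold

# Hypothetical sensor readings, in arrival order.
readings = [70.0, 85.0, 95.0, 60.0, 92.0, 88.0]


def trend_line(values, batch_size):
    """Trend line: one average per micro-batch of batch_size readings."""
    averages = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        averages.append(sum(batch) / len(batch))
    return averages


def alerts(stream, threshold):
    """Alerting: check every reading as it arrives, without waiting."""
    for index, value in enumerate(stream):
        if value >= threshold:
            yield index, value


averages = trend_line(readings, batch_size=3)
raised = list(alerts(readings, THRESHOLD))

print(averages)
print(raised)
```

With these made-up numbers, both batch averages stay below the threshold even though two individual readings reach it, so a per-batch summary is fine for the trend line but is not a substitute for checking each record as it comes in.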

I hope you enjoyed the video, and I hope that the sample use cases that I talked about will help you decide whether to use micro-batch or real-time processing. If you have any questions, please leave a comment below. Thank you for watching the video.

This blog post was published January 13, 2016.
