A Quick Guide to Spark Streaming


Stream processing is a capability that was added alongside Spark Core and its original design goal of rapid in-memory data processing. (Learn more about Spark’s purposes and uses in the ebook Getting Started with Apache Spark: From Inception to Production.) Although now considered a key element of Spark, streaming capabilities were only introduced to the project in its 0.7 release in February 2013.

Other stream processing solutions exist, including projects like Apache Storm and Apache Flink. In each of these, stream processing is a key design goal, offering advantages to developers whose sole requirement is the processing of data streams. These solutions typically process the data stream event-by-event, while Spark divides the stream into chunks (called “micro-batches”) to maintain compatibility and interoperability with Spark Core and Spark’s other modules.

The Details of Spark Streaming

Spark’s real and sustained advantage over other stream processing alternatives is the tight integration between its stream and batch processing capabilities. In real-world application scenarios, where stream-based analysis of current events is often augmented by observation of historical trends, this integration goes a long way toward streamlining the development process; for workloads in which streamed data must be combined with data from other sources, Spark remains a strong and credible option. When running in a production environment, Spark Streaming normally relies upon capabilities from external projects like ZooKeeper and HDFS to deliver resilient scalability.


A streaming framework is only as good as its data sources, so a strong messaging platform is the best way to ensure solid performance for any streaming system.

Spark Streaming supports the ingest of data from a wide range of sources, including live streams from Apache Kafka, Apache Flume, Amazon Kinesis, and Twitter, as well as sensors and other devices connected via TCP sockets. Data can also be streamed out of storage services such as HDFS and AWS S3.

Spark Streaming processes data with a range of algorithms and high-level data processing functions like map, reduce, join, and window. Once processed, data can be passed to a range of external file systems or used to populate live dashboards.
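As a rough sketch of what this looks like in practice (not taken from the original post: the host, port, output path, and batch interval below are placeholders), a DStream read from a TCP socket can be transformed with map and a windowed reduce, then written out to a file system:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))        // 2-second micro-batches

// Assume a TCP source emitting one "key value" event per line on port 9999.
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines
  .map(line => (line.split(" ")(0), 1))                  // map: extract a key per event
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
    Seconds(30), Seconds(10))                            // count per key over a 30s window, sliding every 10s

counts.saveAsTextFiles("hdfs:///tmp/stream-counts")      // or feed a live dashboard via foreachRDD

ssc.start()
ssc.awaitTermination()
```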

[Figure: Spark Streaming input data stream]

Logically, Spark Streaming represents a continuous stream of input data as a discretized stream, or DStream. Internally, Spark actually stores and processes this DStream as a sequence of RDDs. Each of these RDDs is a snapshot of all data ingested during a specified time period, which allows Spark's existing batch processing capabilities to operate on the data.

[Figure: Spark RDD]

The data processing capabilities in Spark Core and Spark's other modules are applied to each of the RDDs in a DStream in exactly the same manner as they would be applied to any other RDD: Spark modules other than Spark Streaming have no awareness that they are processing a data stream, and no need to know.
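One way to see this is through foreachRDD, which hands each micro-batch to your code as an ordinary RDD. The operation below is purely illustrative and assumes the `lines` DStream from the earlier sketch:

```scala
// Each micro-batch arrives as a plain RDD, so any Spark Core operation applies.
lines.foreachRDD { rdd =>
  val distinctLines = rdd.distinct().count()   // ordinary RDD transformations and actions
  println(s"Distinct lines in this batch: $distinctLines")
}
```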

A basic RDD operation, flatMap, can be used to extract individual words from lines of text in an input source. When that input source is a data stream, flatMap simply works as it normally would, as shown below.

[Figure: Spark RDD model]
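In code, the transformation looks the same whether it is applied to a plain RDD or, as in this sketch, to the `lines` DStream created above:

```scala
val words  = lines.flatMap(_.split(" "))                 // one element per word
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                           // show a few results from each batch
```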

The Spark Driver

[Figure: Spark Streaming driver]

Activities within a Spark cluster are orchestrated by a driver program using the SparkContext. In the case of stream-based applications, the StreamingContext is used. It exploits the cluster management capabilities of an external tool like Mesos or Hadoop's YARN to allocate resources to the executor processes that actually work with data.

In a distributed and generally fault-tolerant cluster architecture, the driver is a potential single point of failure, and can place a heavy load on cluster resources.

Particularly in the case of stream-based applications, there is an expectation and requirement that the cluster will be available and performing at all times. Potential failures in the Spark driver must therefore be mitigated, wherever possible. Spark Streaming introduced the practice of checkpointing to ensure that data and metadata associated with RDDs containing parts of a stream are routinely replicated to some form of fault-tolerant storage. This makes it feasible to recover data and restart processing in the event of a driver failure.
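A common pattern, sketched below with a placeholder checkpoint directory, is to create the StreamingContext through StreamingContext.getOrCreate so that a restarted driver rebuilds its state from the checkpoint rather than starting from scratch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/streaming-app"   // some form of fault-tolerant storage

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedStream")
  val ssc  = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// After a driver failure, the context and its metadata are recovered
// from the checkpoint instead of being built from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```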

Processing Models

Spark Streaming supports commonly understood semantics for the processing of items in a data stream. These semantics ensure that the system is delivering dependable results, even in the event of individual node failures. Items in the stream are understood to be processed in one of the following ways:

  • At most once: Each item will either be processed once or not at all.
  • At least once: Each item will be processed one or more times, increasing the likelihood that data will not be lost but also introducing the possibility that items may be duplicated.
  • Exactly once: Each item will be processed exactly once.

Different input sources to Spark Streaming will offer different guarantees for the manner in which data will be processed. With version 1.3 of Spark, a new API enables exactly once ingest of data from Apache Kafka, improving data quality throughout the workflow. This is discussed in more detail in the Integration Guide.
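For reference, the direct (receiver-less) Kafka API introduced in Spark 1.3 looks roughly like the sketch below. It requires the spark-streaming-kafka dependency; the broker addresses and topic name are placeholders, and `ssc` is assumed to be a StreamingContext as in the earlier examples:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics      = Set("checkpoints")

// Offsets are tracked by Spark Streaming itself rather than by a receiver,
// which is what enables exactly-once ingest from Kafka.
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

val values = messages.map(_._2)   // each record is a (key, value) pair
```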

Picking a Processing Model

From a stream processing standpoint, at most once is the easiest model to build, because the application accepts that some data could be lost. Most people would assume there could never be a use case that tolerates at most once, but consider a media streaming service.

Let’s say a customer of a movie streaming service is watching a movie, and every few seconds the movie player emits checkpoints that detail the current point in the movie back to the media streaming service. The most recent checkpoint would be used if the movie player crashed and the customer restarted the movie. With at most once processing, the worst-case scenario if a checkpoint is missed is that the customer rewatches a few additional seconds of the movie from the time of the most recent checkpoint that was created. This would have a very minimal impact on customers.

At least once guarantees that none of those checkpoints will be lost. With at least once processing, the previous use case would change in that the same checkpoint could potentially be replayed multiple times. If the stream processor handling a checkpoint saved the checkpoint, then crashed before it could be acknowledged, the checkpoint would be replayed when the server came back online.

This brings up an important note about replaying the same request multiple times. At least once guarantees every request will be processed one or more times, so care must be taken to make the processing functions idempotent. (An idempotent action is one that can be repeated multiple times without ever producing a different result. For example, x = 4 is idempotent and x++ is not.)

If at least once were the requirement for this media streaming example, we could add a field to the checkpoint that allows it to be acted upon idempotently:

`"usersTime": "20150519T13:15:14"`

With that extra piece of information, the function that persists the checkpoint can check whether the incoming usersTime is older than the one already stored. Refusing to overwrite a newer value with an older one makes the function idempotent.
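A minimal sketch of such an idempotent write is shown below; the Checkpoint case class and the in-memory store are hypothetical stand-ins for whatever persistence layer is actually used:

```scala
import scala.collection.mutable

case class Checkpoint(userId: String, movieId: String, usersTime: String, positionSec: Int)

// Keyed by (user, movie); a real system would use a database or key-value store.
val store = mutable.Map[(String, String), Checkpoint]()

def persistCheckpoint(cp: Checkpoint): Unit = {
  val key = (cp.userId, cp.movieId)
  store.get(key) match {
    // Skip the write if we already hold a checkpoint at least as new
    // (lexicographic comparison works for this timestamp format), so
    // replaying the same message a second time changes nothing.
    case Some(existing) if existing.usersTime >= cp.usersTime => ()
    case _ => store(key) = cp
  }
}
```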

Within the world of streaming, it is important to understand these concepts and when they matter for a streaming implementation. Exactly once has a dramatic impact on throughput and performance because of the guarantees it makes, so the trade-offs must be weighed before going down this route. It is the most costly model to implement, and if processing functions can be made idempotent, it adds little value. Generally, the goal of any stream processing system should be to implement at least once with idempotent functions. Workloads whose functions cannot be made idempotent, but that still require such a guarantee, have little choice but to implement exactly once processing.

Micro-Batches: The Difference Between Spark Streaming and Other Platforms

Spark Streaming operates on the concept of micro-batches, which means that it should not be considered a real-time stream processing engine. This is perhaps the single biggest difference between Spark Streaming and other platforms such as Apache Storm or Apache Flink. It is difficult to make comparisons between Spark Streaming and other stream processing engines because Spark Streaming is based on micro-batches. (A micro-batch consists of a series of events batched together over a period of time. The batch interval generally ranges from as little as 500ms to about 5,000ms.)

Spark Streaming can deliver higher throughput than nearly every system that is not based on micro-batches, but the cost is latency in processing the events in a stream. Shortening the batch interval brings it closer to real time but also increases the overhead the system has to endure, because more micro-batches, and therefore more RDDs, must be created and scheduled in the same amount of time. The inverse is also true: the longer the interval, the further from real time the processing is, and the less overhead is incurred per micro-batch.
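The batch interval is fixed when the StreamingContext is created, so this trade-off is a deliberate design decision. A rough sketch, assuming `conf` is a SparkConf as in the earlier examples:

```scala
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

// A 500ms interval keeps latency low but schedules many more micro-batches...
val ssc = new StreamingContext(conf, Milliseconds(500))

// ...while a 5-second interval reduces per-batch overhead at the cost of latency:
// val ssc = new StreamingContext(conf, Seconds(5))
```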

The argument is often made that micro-batching sacrifices the fine-grained timing information of each discrete event. This can make it more difficult to know whether events arrived out of order, though that may or may not be relevant to a given business use case. An application built upon Spark Streaming cannot react to every event as it occurs. This is not necessarily a bad thing, but it is very important to make sure that people understand the limitations and capabilities of Spark Streaming.

Limitations

Two of the biggest complaints about running Spark Streaming in production are back pressure and out-of-order data. Back pressure occurs when the volume of events coming across a stream is more than the stream processing engine can handle. In version 1.5 of Spark, there will be changes that enable more dynamic ingestion rate capabilities and make back pressure less problematic. In addition, more work is being performed to enable user-defined time extraction functions. This will enable developers to check event time against events already processed.
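The back pressure improvements arriving in Spark 1.5 are exposed through configuration. A sketch of enabling them is shown below; the static rate cap alongside is the older mechanism, and its value here is arbitrary:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BackPressureExample")
  .set("spark.streaming.backpressure.enabled", "true")   // dynamic ingestion rate, Spark 1.5+
  .set("spark.streaming.receiver.maxRate", "10000")      // static cap: records/sec per receiver
```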


This blog post was published October 06, 2015.