Fast data processing pipeline for predicting flight delays using Apache APIs: Kafka, Spark Streaming and Machine Learning (part 2)

Contributed by Carol McDonald

Editor's Note: You can find Part 1 of this series here

According to Bob Renner, CEO of MapR Partner Liaison Technologies, in Forbes Machine Learning predictions for 2018, the possibility to blend machine learning with real-time transactional data flowing through a single platform is opening a world of new possibilities, such as enabling organizations to take advantage of opportunities as they arise. Leveraging these opportunities requires fast, scalable data processing pipelines which process, analyze, and store events as they arrive.

This is the second in a series of blogs, which discusses the architecture of a data pipeline that combines streaming data with machine learning and fast storage. The first post discussed creating a machine learning model to predict flight delays. This second post will discuss using the saved model with streaming data to do real-time analysis of flight delays. The third post will discuss fast storage with MapR-DB.

Saved model with streaming data

Microservices, Data Pipelines, and Machine Learning Logistics

The microservice architectural style is an approach to developing an application as a suite of small independently deployable services. A common architecture pattern combined with microservices is event sourcing using an append only publish subscribe event stream such as MapR Event Streams (which provides a Kafka API).

Immediate access to operational and analytical data in MapR

Publish Subscribe Event Streams with MapR-ES

A central principle of publish/subscribe systems is decoupled communications, wherein producers don’t know who subscribes, and consumers don’t know who publishes; this system makes it easy to add new listeners or new publishers without disrupting existing processes. MapR-ES allows any number of information producers (potentially millions of them) to publish information to a specified topic. MapR-ES will reliably persist those messages and make them accessible to any number of subscribers (again, potentially millions). Topics are partitioned for throughput and scalability, producers are load balanced and consumer can be grouped to read in parallel. MapR-ES can scale to very high throughput levels, easily delivering millions of messages per second using very modest hardware.

Kafka API

You can think of a partition like a queue; new messages are appended to the end, and messages are delivered in the order they are received.

MapR Cluster


Unlike a queue, messages are not deleted when read; they remain on the partition, available to other consumers. Messages, once published, are immutable, and can be retained forever.

Not deleting messages

Not deleting messages when they are read allows for high performance at scale and also for processing of the same messages by different consumers for different purposes such as multiple views with polyglot persistence.

New subscribers of information

New subscribers of information 
can replay the data stream, specifying a starting point as far back as the data retention policy enables.

Create new view, index, cache

Read From new View

Data Pipelines

When you combine these messaging capabilities with the simple concept of microservices, you can greatly enhance the agility with which you build, deploy, and maintain complex data pipelines. Pipelines are constructed by simply chaining together multiple microservices, each of which listens for the arrival of some data, performs its designated task,
and optionally publishes its own messages to a topic. Development teams can deploy new services or service upgrades more frequently and with less risk, because the production version does not need to be taken offline. Both versions of the service simply run in parallel, consuming new data as it arrives and producing multiple versions of output. Both output streams can be monitored over time; the older version can be decommissioned when it ceases to be useful.

Machine Learning Logistics and Data Pipelines

Combining data pipelines with machine learning can handle the logistics of machine learning in a flexible way by:

  • Making input and output data available to independent consumers
  • Managing and evaluating multiple models and easily deploying new models

Machine Learning Logistics and Data Pipelines

Architectures for these types of applications are discussed in more detail in the ebooks Machine Learning logistics, Streaming Architecture, and Microservices and Containers.

Below is the data processing pipeline for this use case of predicting flight delays. This pipeline could be augmented to be part of the rendezvous architecture discussed in the Oreilly Machine Learning Logistics ebook.

Data Processing Pipeline

Spark Streaming Use Case Example Code

The following figure depicts the architecture for the part of the use case data pipeline discussed in this post:

architecture data pipeline

  1. Flight data is published to a MapR Streams topic using the Kafka API.
  2. A Spark streaming application, subscribed to the first topic:
    1. Ingests a stream of flight data
    2. Uses a deployed machine learning model to enrich the flight data with a delayed/not delayed prediction
    3. publishes the results in JSON format to another topic.
  3. (In the 3rd blog) A Spark streaming application subscribed to the second topic:
    1. Stores the input data and predictions in MapR-DB

Example Use Case Data

You can read more about the data set in part 1 of this series. The incoming and outgoing data is in JSON format, an example is shown below:

"dofW”:4, "carrier":"AA", "origin":"EWR",
"dest":"ORD”, "crsdeptime":705,
"crsarrtime”:851, "crselapsedtime":166.0,"dist":719.0}

Example Use Case Data

Spark Kafka Consumer Producer Code

Parsing the Data Set Records

We use a Scala case class and Structype to define the schema, corresponding to the input data.


Loading the Model

The Spark CrossValidatorModel class is used to load the saved model fitted on the historical flight data.

Loading the Model

Spark Streaming Code

These are the basic steps for the Spark Streaming Consumer Producer code:

  1. Configure Kafka Consumer and Producer properties.
  2. Initialize a Spark StreamingContext object. Using this context, create a DStream which reads message from a Topic.
  3. Apply transformations (which create new DStreams).
  4. Write messages from the transformed DStream to a Topic.
  5. Start receiving data and processing. Wait for the processing to be stopped.

We will go through each of these steps with the example application code.

1) Configure Kafka Consumer Producer properties

The first step is to set the KafkaConsumer and KafkaProducer configuration properties, which will be used later to create a DStream for receiving/sending messages to topics. You need to set the following paramters:

  • Key and value deserializers: for deserializing the message.
  • Auto offset reset: to start reading from the earliest or latest message.
  • Bootstrap servers: this can be set to a dummy host:port since the broker address is not actually used by MapR Streams.

For more information on the configuration parameters, see the MapR Streams documentation.

Configure Kafka

2) Initialize a Spark StreamingContext object.

ConsumerStrategies.Subscribe, as shown below, is used to set the topics and Kafka configuration parameters. We use the KafkaUtils createDirectStream method with a StreamingContext, the consumer and location strategies, to create an input stream from a MapR Streams topic. This creates a DStream that represents the stream of incoming data, where each message is a key value pair. We use the DStream map transformation to create a DStream with the message values.

Initialize a Spark StreamingContext Object 1

Initialize a Spark StreamingContext Object 2

3) Apply transformations (which create new DStreams)

We use the DStream foreachRDD method to apply processing to each RDD in this DStream. We read the RDD of JSON strings into a Flight Dataset. Then we display 20 rows with the Dataset show method . We also create a temporary view of the Dataset in order to execute SQL queries.

Apply transformations

Here is example output from the :

Example output from the

We transform the Dataset with the model pipeline, which will tranform the features according to the pipeline, estimate and then return the predictions in a column of a new Dateset. We also create a temporary view of the new Dataset in order to execute SQL queries.

Temporary view

4) Write messages from the transformed DStream to a Topic

The Dataset result of the query is converted to JSON RDD Strings, then the RDD sendToKafka method is used to send the JSON key-value messages to a topic (the key is null in this case).

Write messages from the transformed DStream

Write messages from the transformed DStream 2

Example message values (the output for temp.take(2) ) are shown below:


5) Start receiving data and processing it. Wait for the processing to be stopped.

To start receiving data, we must explicitly call start() on the StreamingContext, then call awaitTermination to wait for the streaming computation to finish. We use ssc.remember to cache data for queries.

Receive and process data

Streaming Data Exploration

Now we can query the cached streaming data in the input temporary view flights, and the predictions temporary view flightsp. Below we display a few rows from the flights view:

Flights view

Below we display the count of predicted delayed/not delayed departures by Origin:

Predicted Departures

Below we display the count of predicted delayed/not delayed departures by Destination:

Predicted Delayed

Below we display the count of predicted delayed/not delayed departures by Origin,Destination:

Count predicted delayed


In this blog post, you learned how to use a Spark machine learning model in a Spark Streaming application, and how to integrate Spark Streaming with MapR Event Streams to consume and produce messages using the Kafka API.


Running the Code

All of the components of the use case architecture we just discussed can run on the same cluster with the MapR Converged Data Platform.

MapR Converged Data Platform

There are several ways you can get Started with the Converged Data Platform:


This blog post was published January 10, 2018.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.