Introduction to Apache Flink

by Ellen Friedman and Kostas Tzoumas

Stream-First Architecture

There is a revolution underway in how people design their data architecture, not just for real-time or near real-time projects, but in a larger sense as well. The change is to think of stream-based data flow as the heart of the overall design, rather than as the basis just for specialized work. Understanding the motivations for this transformation to a stream-first architecture helps put Apache Flink and its role in modern data processing into context.

Flink, as part of a newer breed of systems, does its part to broaden the scope of the term “data streaming” way beyond real-time, low-latency analytics to encompass a wide variety of data applications, including what is now covered by stream processors, what is covered by batch processors, and even some stateful applications that are executed by transactional databases.

As it turns out, the data architecture needed to put Flink to work effectively is also the basis for gaining broader advantages from working with streaming data. To understand how this works, we will take a closer look at how to build the pipeline to support Flink for stream processing. But first, let’s address the question of what is to be gained from working with a stream-focused architecture instead of the more traditional approach.

Traditional Architecture versus Streaming Architecture

Traditionally, the typical architecture of a data backend has employed a centralized database system to hold the transactional data of the business. In other words, the database (be that a SQL or NoSQL database) holds the “fresh” (another word for “accurate”) data, which represents the state of the business right now. This might, for example, mean how many users are logged in to your system, how many active users a website has, or what the current balance of each user account is. Data applications that need fresh data are implemented against the database. Other data stores such as distributed file systems are used for data that need not be updated frequently and for which very large batch computations are needed.

This traditional architecture has served applications well for decades, but is now being strained under the burden of increasing complexity in very large-scale distributed systems. Some of the main problems that companies have observed are:

  • The pipeline from data ingestion to analytics is too complex and slow for many projects.

  • The traditional architecture is too monolithic: the database backend acts as a single source of truth, and all applications need to access this backend for their data needs.

  • Systems built this way have very complex failure modes that can make it hard to keep them running well.

Another problem of this traditional architecture stems from trying to maintain the current “state of the world” consistently across a large, distributed system. At scale, it becomes harder and harder to maintain such precise synchronization; stream-first architectures allow us to relax the requirements so that we only need to maintain much more localized consistency.

A modern alternative approach, streaming architecture, solves many of the problems that enterprises face when working with large-scale systems. In a stream-based design, data records flow continuously from data sources to applications and between applications. There is no single database that holds the global state of the world. Rather, the single source of truth is in shared, ever-moving event streams—this is what represents the history of the business. In this stream-first architecture, applications themselves build their local views of the world, stored in local databases, distributed files, or search documents, for instance.

Message Transport and Message Processing

What is needed to implement an effective stream-first architecture and to gain the advantages of using Flink? A common pattern is to implement a streaming architecture by using two main kinds of components, described briefly here and represented in Figure 2-1:

1. A message transport to collect data from continuous events produced by a variety of sources (producers) and deliver it to the applications and services that subscribe to it (consumers).

2. A stream processing system to (1) consistently move data between applications and systems, (2) aggregate and process events, and (3) maintain local application state (again consistently).

The excitement around real-time applications tends to direct people’s attention to component number 2 in our list, the stream processing system, and to the question of how to choose a stream processing technology that can meet the requirements of a particular project. Flink is not the only choice for data processing; other options include Spark Streaming, Storm, Samza, and Apex. We use Apache Flink as the stream processor in the rest of the examples in this book.
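
To make the two-component pattern concrete, here is a minimal sketch of a Flink job that reads events from the message transport and applies a simple transformation. It assumes a local Kafka broker, a hypothetical topic named "events", and the Flink 1.x connector for Kafka 0.9; the uppercase map is only a stand-in for real processing logic.

    import java.util.Properties;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class MinimalPipeline {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Connection details for the message transport (component 1).
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "flink-demo");

            // Subscribe to the hypothetical "events" topic as a consumer.
            DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props));

            // The stream processor (component 2) transforms each record;
            // uppercasing stands in for real aggregation or analysis.
            events.map(String::toUpperCase).print();

            env.execute("Minimal transport-plus-processing pipeline");
        }
    }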

As it turns out, it isn’t just the choice of the stream processor that makes a big difference in the design of an efficient stream-based architecture. The transport layer is also key. A big part of why modern systems can more easily handle streaming data at large scale is improvements in the way message-passing systems work and changes in how the processing elements interact with those systems.

The message transport layer needs to have certain capabilities to make streaming design work well. At present, two messaging technologies offer a particularly good fit to the required capabilities: Kafka and MapR Streams, which supports the Kafka API but is built into the MapR converged data platform. In this book, we assume that one or the other of these technologies provides the transport layer in our examples.

The Transport Layer: Ideal Capabilities

What are the capabilities needed by the message transport system in streaming architecture?

Performance with Persistence

One of the roles of the transport layer is to serve as a safety queue upstream from the processing step—a buffer to hold event data as a kind of short-term insurance against an interruption in processing as data is ingested. Until recently, message-passing technologies were limited by a tradeoff between performance and persistence. As a result, people tended to think of streaming data as going from the transport layer to processing and then being discarded: a “use it and lose it” approach.

The assumption that you can’t have both performance and persistence is one of the key ideas that had to change in order to design a modern streaming architecture. It’s important to have a message transport that delivers high throughput with persistence; both Kafka and MapR Streams do just that.

A key benefit of a persistent transport layer is that messages are replayable. This key capability allows a data processor like Flink to replay and recompute a specified part of the stream of events (discussed in further detail in Stateful Computation). For now, the key is to recognize that it is the interplay of transport and processing that allows a system like Flink to provide guarantees about correct processing and to do “time travel,” which refers to the ability to reprocess data.
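
As a small illustration of replayability, the sketch below uses the plain Kafka consumer API to rewind one partition to an earlier offset and read the same events again. The topic name "card-activity", the partition number, and the starting offset are assumptions made for illustration; Flink's Kafka connector relies on this same capability of the persistent log when it recomputes part of a stream.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplayFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "replay-demo");
            props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

            // Manually assign a partition, then rewind to a chosen offset.
            // Because the log is persistent, the same events can be read again.
            TopicPartition partition = new TopicPartition("card-activity", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 42L);  // replay from offset 42 onward

            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
            consumer.close();
        }
    }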

Decoupling of Multiple Producers from Multiple Consumers

An effective messaging technology enables collection of data from many sources (producers) and makes it available to multiple services or applications (consumers), as depicted in Figure 2-2. With Kafka and MapR Streams, data from producers is assigned to a named topic. Data sources push data to the message queue, and consumers (or consumer groups) pull data. Event data can only be read forward from a given offset in the message queue. Producers do not broadcast to all consumers automatically. This may sound like a small detail, but this characteristic has an enormous impact on how this architecture functions.

Figure 2-2. With message-transport tools such as Kafka and MapR Streams, data producers and data consumers (of which Flink applications would be included) are decoupled. Messages arrive ready for immediate use or to be consumed later. Consumers subscribe to messages from the queue instead of messages being broadcast. A consumer need not be running at the time a message arrives.

This style of delivery—with consumers subscribing to their topics of interest—means that messages arrive immediately, but they don’t need to be processed immediately. Consumers don’t need to be running when the messages arrive; they can make use of the data any time they like. New consumers or producers can also be added easily. Having a message-transport system that decouples producers from consumers is powerful because it supports a microservices approach and allows processing steps to hide their implementations, which gives them the freedom to change those implementations.
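
The sketch below shows the producer side of this decoupling, assuming a local Kafka broker and a hypothetical "card-activity" topic. The producer pushes records to a named topic without knowing which consumers exist or when they will read; any number of consumer groups can later pull the same records independently.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PosTerminalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // The producer addresses only the topic; it has no knowledge of
            // the consumers, whether they are running, or when they will read.
            producer.send(new ProducerRecord<>("card-activity", "card-123",
                "{\"card\":\"123\",\"amount\":42.50,\"terminal\":\"POS7\"}"));
            producer.close();
        }
    }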

Streaming Data for a Microservices Architecture

A microservices approach refers to breaking the functions of large systems into simple, generally single-purpose services that can be built and maintained easily by small teams. This design enables agility even in very large organizations. To work properly, the communication between services needs to be lightweight.

Note

“The goal [of microservices] is to give each team a job and a way to do it and to get out of their way.”

 

From Chapter 3 of Streaming Architecture, Dunning and Friedman (O’Reilly, 2016)

Using a message-transport system that decouples producers and consumers while still delivering messages at high throughput, sufficient for high-performance processors such as Flink, is a great way to build a microservices organization. Streaming data is a relatively new way to connect microservices, but it has considerable benefits, as you’ll see in the next couple of sections.

Data Stream as the Centralized Source of Data

Now you can put together these ideas to envision how message-transport queues interconnect various applications to become, essentially, the heart of the streaming architecture. The stream processor (Flink, in our case) subscribes to data from the message queues and processes it. The output can go to another message-transport queue. That way other applications, including other Flink applications, have access to the shared streaming data. In some cases, the output is stored in a local database. This approach is depicted in Figure 2-3.

Figure 2-3. In a stream-first architecture, the message stream (represented here as blank horizontal cylinders) connects applications and serves as the new shared source of truth, taking over the role that a huge centralized database used to play. In our example, Flink is used for various applications. Localized views can be stored in files or databases as needed for the requirements of microservices-based projects. An added advantage to this streaming style of architecture is that the stream processor, such as Flink, can help maintain consistency.
Note

In the streaming architecture, there need not be a centralized database. Instead, the message queues serve as a shared information source for a variety of different consumers.
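
A minimal sketch of this pattern, again assuming the Flink 1.x connector for Kafka 0.9 and hypothetical topic names, reads from one queue and publishes its results to another queue rather than into a private database:

    import java.util.Properties;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class SharedStreamPipeline {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "pipeline-demo");

            // Subscribe to the shared input stream.
            DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props));

            // A stand-in for real processing: drop empty records.
            DataStream<String> processed = raw.filter(e -> !e.isEmpty());

            // Publish results back to the transport layer so that any other
            // application, including other Flink jobs, can subscribe to them.
            processed.addSink(new FlinkKafkaProducer09<>(
                "localhost:9092", "processed-events", new SimpleStringSchema()));

            env.execute("Stream-to-stream pipeline");
        }
    }

Because the output goes to a queue rather than to a private table, adding a new downstream consumer later requires no change to this job at all.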

Fraud Detection Use Case: Better Design with Stream-First Architecture

The power of the stream-based microservices architecture is seen in the flexibility it adds, especially when the same data is used in multiple ways. Take the example of a fraud-detection project for a credit card provider. The goal is to identify suspicious card behavior as quickly as possible in order to shut down a potential theft with minimal losses. The fraud detector might, for example, use card velocity as one indicator of potential fraud: do sequential transactions take place across too great a distance in too short a time to be legitimately possible? A real fraud detector will use many dozens or hundreds of such features, but we can understand a lot by dealing with just this one.
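
As a rough sketch of how a velocity feature might be computed: the great-circle distance between two transactions divided by the time between them gives an implied travel speed. The haversine formula below is standard; the 900 km/h threshold and the other constants are illustrative assumptions, not values from a real detector.

    public class CardVelocity {
        static final double EARTH_RADIUS_KM = 6371.0;
        // Roughly the cruising speed of a commercial jet (an assumed threshold).
        static final double MAX_PLAUSIBLE_KMH = 900.0;

        // Great-circle distance between two lat/lon points (haversine formula).
        static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return EARTH_RADIUS_KM * 2 * Math.asin(Math.sqrt(a));
        }

        // Flag two sequential transactions whose implied speed is impossible.
        static boolean suspicious(double lat1, double lon1, long t1Millis,
                                  double lat2, double lon2, long t2Millis) {
            double km = distanceKm(lat1, lon1, lat2, lon2);
            double hours = Math.abs(t2Millis - t1Millis) / 3600000.0;
            if (hours == 0) {
                return km > 1.0;  // same timestamp but different places
            }
            return km / hours > MAX_PLAUSIBLE_KMH;
        }
    }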

The advantages of stream-based architecture for this use case are shown in Figure 2-4. In this figure, many point-of-sale terminals (POS1 through POSn) ask the fraud detector to make fraud decisions. These requests from the point-of-sale terminals need to be answered immediately and form a call-and-response kind of interaction with the fraud detector.

Figure 2-4. Fraud detection can benefit from a stream-based microservices approach. Flink would be useful in several components of this data flow: the fraud-detector application, the updater, and even the card analytics could all use Flink. Notice that by avoiding direct updates to a local database, streaming data for card activity can be used by other services, including card analytics, without interference. [Image credit: Streaming Architecture, Chapter 6 (O’Reilly, 2016).]

In a traditional system, the fraud-detection model would store a profile containing the last location for each credit card directly in the database. But in such a centralized database design, other consumers cannot easily make use of the card activity data due to the risk that their access might interfere with the essential function of the fraud-detection system, and they certainly wouldn’t be allowed to make changes to the schema or technology of that database without very careful and arduous review. The result is a huge slowing of progress resulting from all of the due diligence that must be applied to avoid breaking or compromising business-critical functions.

Compare that traditional approach to the streaming design illustrated in Figure 2-4. By sending the output of the fraud detector to an external message-transport queue (Kafka or MapR Streams) instead of directly to the database and then using a stream processor such as Flink to update the database, the card activity data becomes available to other applications such as card analytics via the message queue. The database of last card use becomes a completely local source of information, inaccessible to any other service. This design avoids any risk of overload due to additional applications.

Flexibility for Developers

This stream-based microservices architecture also provides flexibility for developers of the fraud-detection system. Suppose that this team wants to develop and evaluate an improved model for fraud detection. The card activity message stream makes this data available for the new system without interfering with the existing detector. Additional readers of the queue impose almost negligible load on the queue, and each additional service is free to keep historical information in any format or database technology that is appropriate. Moreover, if the messages in the card activity queue are expressed as business-level events rather than, say, database table updates, the exact form and content of the messages will tend to be very stable. When changes are necessary, they can often be made forward-compatible to avoid changes to existing applications.
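
For example, a business-level event for this use case might be modeled along the following lines (a hypothetical sketch, not a schema from a real system). Because consumers depend on the meaning of the event rather than on any table layout, a field added later to the serialized form can simply be ignored by older readers, which is what makes such changes forward-compatible.

    // A business-level event, expressed as an immutable value class.
    // It describes what happened, not how any particular database stores it.
    public class CardTransactionEvent {
        public final String cardId;
        public final double amount;
        public final double latitude;
        public final double longitude;
        public final long timestampMillis;
        // A later, hypothetical addition: older consumers that do not know
        // about this field can ignore it in the serialized form.
        public final String currency;

        public CardTransactionEvent(String cardId, double amount,
                                    double latitude, double longitude,
                                    long timestampMillis, String currency) {
            this.cardId = cardId;
            this.amount = amount;
            this.latitude = latitude;
            this.longitude = longitude;
            this.timestampMillis = timestampMillis;
            this.currency = currency;
        }
    }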

This credit card fraud detection use case is just one example of the way a stream-based architecture with a proper message transport (Kafka or MapR Streams) and a versatile and highly performant stream processor (Flink) can support a variety of different projects from a shared “source of truth”: the message stream.

Beyond Real-Time Applications

As important as they are, low-latency use cases are just one class of consumers for streaming data. Consider the various ways that streaming data can be used: stream-processing applications might, for example, subscribe to streaming data in a message queue to update a real-time dashboard (see the Group A consumers in Figure 2-5).

Other users could take advantage of the fact that persisted messages can be replayed (see the Group C consumers in Figure 2-5). In this case, the message stream acts as an auditable log or long-term history of events. Having a replayable history is useful, for example, for security analytics, as a part of the input data for predictive maintenance models in industrial settings, or for retrospective studies as in medical or environmental research.

Figure 2-5. The consumers of streaming data are not limited to just low-latency applications, although they are important examples. This diagram illustrates several of the classes of consumers that benefit from a streaming architecture. Group A consumers might be doing various types of real-time analytics, including updating a real-time dashboard. Group B consumers include various local representations of the current state of some aspect of the data, perhaps stored in a database or search document.

For other uses, the data queue is tapped for applications that update a local database or search document (see the Group B use cases in Figure 2-5). Data from the queue is not output directly to a database, by the way. Instead, it must be aggregated or otherwise analyzed and transformed by the stream processor first. This is another situation in which Flink can be used to advantage.
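
A minimal sketch of this aggregate-then-store pattern uses Flink's 1.x windowing API, with a few made-up (cardId, amount) records standing in for a real message-queue source; a JDBC or other database sink would replace print() in practice.

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SpendPerCard {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for a message-queue source: (cardId, amount) pairs.
            DataStream<Tuple2<String, Double>> txns = env.fromElements(
                Tuple2.of("card-1", 10.0),
                Tuple2.of("card-2", 5.0),
                Tuple2.of("card-1", 7.5));

            // Aggregate in the stream processor before anything reaches a
            // database: total spend per card over one-minute windows.
            txns.keyBy(0)
                .timeWindow(Time.minutes(1))
                .sum(1)
                .print();  // a database sink would go here instead

            env.execute("Aggregate before storing");
        }
    }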

Geo-Distributed Replication of Streams

Stream processing and a stream-first architecture are not experimental toys: these approaches are used in mission-critical applications, and these applications need certain features from both the stream processor and the message transport layer. A wide variety of these critical business uses depend on consistency across data centers, and as such, they not only require a highly effective stream processor, but also message transport with reliable geo-distributed replication. Telecoms, for example, need to share data between cell towers, users, and processing centers. Financial institutions need to be able to replicate data quickly, accurately, and affordably across distant offices. There are many other examples where it’s particularly useful if this geo-distribution of data can be done with streaming data.

In particular, to be most useful, this replication between data centers needs to preserve message offsets, to allow updates from any of the data centers to be propagated to any of the other data centers, and to allow bidirectional and cyclic replication of data. If message offsets are not preserved, programs cannot be restarted reliably in another data center. If updates are not allowed from every data center, some sort of master must be reliably designated. And cyclic replication is necessary to avoid a single point of failure in replication.

These capabilities are currently supported in the MapR Streams messaging system, but not yet in Kafka. The basic idea with MapR Streams transport is that many streaming topics are collected into first-class data structures known as streams that coexist with files, tables, and directories in the MapR data platform. These streams are then the basis for managing replication as well as time-to-live and access-control expressions (ACEs). Changes made to topics in a stream are tagged with the source cluster ID to avoid infinite cyclic replication, and these changes are propagated successively to other clusters while maintaining all message offsets.
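
The following is a minimal sketch of the loop-avoidance idea only, under stated assumptions, and not MapR's actual implementation: each change carries the ID of the cluster where it originated, a cluster drops changes that have looped back to their origin, and offsets travel with the data.

    import java.util.List;

    public class ReplicationLoopAvoidance {
        static class Change {
            final String sourceClusterId;  // tagged once, at the origin
            final long offset;             // preserved across all replicas
            final byte[] payload;

            Change(String sourceClusterId, long offset, byte[] payload) {
                this.sourceClusterId = sourceClusterId;
                this.offset = offset;
                this.payload = payload;
            }
        }

        // Forward incoming changes to the next cluster in the cycle, except
        // those that originated here: they have already completed the loop.
        static void propagate(String localClusterId, List<Change> incoming,
                              List<Change> outgoing) {
            for (Change c : incoming) {
                if (!c.sourceClusterId.equals(localClusterId)) {
                    outgoing.add(c);  // offsets and source tags stay intact
                }
            }
        }
    }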

This ability to replicate streams across data centers extends the usefulness of streaming data and stream processing. Take, for example, a business that serves online ads. Streaming data analysis can be useful in such a business in multiple ways. If you think in terms of the use classes described previously in Figure 2-5, in ad-tech, the real-time applications (Group A) might involve up-to-date inventory control, the current-state view in a database (Group B) might be cookie profiles, and replaying the stream (Group C) would be useful in models to detect clickstream fraud.

In addition to these considerations, there’s the challenge that different data centers are handling different bids for the same ads, but they are all drawing from the same pool of ad inventory. In a business where accuracy and speed are important, how do the different centers coordinate availability of inventory? With the message stream as the centrally shared “source of truth,” it’s particularly powerful in this use case to be able to replicate the stream across different data centers, which MapR Streams can do. This situation is shown in Figure 2-6.

Figure 2-6. An ad-tech industry example that analyzes streaming data in different data centers with various model-based applications for which Flink could be useful. Each local data center needs to keep its own current state of transactions, but they are all drawing from a common inventory. Another requirement is to share data with a central data center where Flink could be used for global analytics. This use case calls for efficient and accurate geo-distributed replication, something that can be done with the messaging system MapR Streams, but not currently with Kafka.

In addition to keeping the different parts of the business up to date with regard to shared inventory (a situation that would apply to many other sectors as well), the ability to replicate data streams across data centers has other advantages. Having more than one data center helps spread the load for high volume and decreases propagation delay by moving computation close to end users during bidding and ad placement. Multiple data centers also serve as backups in case of disaster.

In the first two chapters, we’ve seen that handling and processing data as streams is a good fit for the way data from continuous events naturally occurs. We’ve also explored the advantages of a stream-based architecture that combines effective message-transport technology, such as Kafka or MapR Streams, with Apache Flink as the stream processor.

In the next chapter, we will examine the key features of Flink and provide an overview of what Flink can do before diving deeper in later chapters into how Flink functions.