Machine Learning Logistics

by Ted Dunning and Ellen Friedman

What Matters in Model Management

The logistics required for successful machine learning go beyond what is needed for other types of software applications and services. This is a dynamic process that should be able to run multiple production models, across various locations and through many cycles of model development, retraining, and replacement. Management must be flexible and quickly responsive: you don’t want to wait until changes in the outside world reduce performance of a production system before you begin to build better or alternative models, and you don’t want delays when it’s time to deploy new models into production.

All this needs to be done in a style that fits the goals of modern digital transformation. Logistics should not be a barrier to fast-moving, data-oriented systems, or a burden to the people who build machine learning models and make use of the insights drawn from them. To make it much easier to do these things, we introduce the rendezvous architecture for management of machine learning logistics.

Ingredients of the Rendezvous Approach

The rendezvous architecture takes advantage of data streams and geo-distributed stream replication to maintain a responsive and flexible way to collect and save data, including raw data, and to make data and multiple models available when and where needed. A key feature of the rendezvous design is that it keeps new models warmed up so that they can replace production models without significant lag time. The design strongly supports ongoing model evaluation and multi-model comparison. It’s a new approach to managing models that reduces the burden of logistics while providing exceptional levels of monitoring so that you know what’s happening.

Many of the ingredients of the rendezvous approach—use of streams, containers, a DataOps style of design—are also fundamental to the broader requirements of building a global data fabric, a key aspect of digital transformation in big data settings. Others, such as use of decoy and canary models, are specific elements for machine learning.

With that in mind, in this chapter we explore the fundamental aspects of this approach that you will need in order to take advantage of the detailed architecture presented in The Rendezvous Architecture for Machine Learning.

DataOps Provides Flexibility and Focus

New technologies offer big benefits, not only to work with data at large scale, but also to be able to pivot and respond to real-world events as they happen. It’s imperative, then, not to limit your ability to take full advantage of these emerging technologies just because your business hasn’t also evolved its style of work. Traditionally siloed roles can prove too rigid and slow to be a good fit in big data organizations undergoing digital transformation. That’s where a DataOps style of work can help.

The DataOps approach is an emerging trend to capture the efficiency and flexibility needed for data-driven business. DataOps style emphasizes better collaboration and communication between roles, cutting across skill guilds to enable teams to move quickly, without having to wait at each step for IT to give permissions. It expands the DevOps philosophy to include not only specialists in software development and operations, but also data-heavy roles such as data engineering and data science. As with DevOps, architecture and product management roles also are part of the DataOps team.

Note

A DataOps approach improves a project’s ability to stay on time and on target.

Not all DataOps teams include exactly the same roles, as shown in Figure 2-1; overall goals determine which functions a particular team needs. Organizing teams across traditional silos does not increase the total size of the teams; it just changes the most-used communication paths. Note that the DataOps approach is about organizing around data-related goals to achieve faster time to value. DataOps does not require adding people. Instead, it’s about improving collaboration between skill sets for efficiency and better use of people’s time and expertise.

Figure 2-1. DataOps team members fill a variety of roles, notably including data engineering and data science. This is a cross-cutting organization that breaks down skill silos.

Just as each DataOps team may include a different subset of the potential roles for working with data, teams also differ as to how many people fill the roles. In the tensor chicken example presented in Why Model Management?, one person stretched beyond his usual strengths in software engineering to cover all required roles for this toy project—he was essentially a DataOps team of one. In contrast, in real-world business situations, it’s usually best to draw on the specialties of multiple team members. In large-scale projects, a particular DataOps role may be filled by more than one person, but it’s also common for some people to cover more than one role. Operations and software engineering skills may overlap; team members with software engineering experience may also be qualified as data engineers. Often, data scientists have data engineering skills. It’s rare, however, to see overlap between data science and operations. These are not meant as hard-edged definitions; rather, they are suggestions for how to combine useful skills for data-oriented work.

What generally lies outside the DataOps roles? Infrastructural capabilities around data platform and network—needs that cut across all projects—tend to be supported separately from the DataOps teams by support organizations, as shown in Figure 2-1.

What all DataOps teams share is a common goal: the data-driven needs of the services they support. This combination of skills and shared goals enhances both the flexibility needed to adjust to changes as situations evolve and the focus needed to work efficiently, making it more feasible to meet essential SLAs.

DataOps is an approach that is well suited to the end-to-end needs of machine learning. For example, this style makes it more feasible for data scientists to get the support of software engineering to provide what is needed when models are handed over to operations during deployment.

The DataOps approach is not limited to machine learning. This style of organization is useful for any data-oriented work, making it easier to take advantage of the benefits offered by building a global data fabric, as described later in this chapter. DataOps also fits well with a widely popular architectural style known as microservices.

Stream-Based Microservices

Microservices is a flexible style of building large systems whose value is broadly recognized across industries. Leading companies, including Google, Netflix, LinkedIn, and Amazon, demonstrate the advantages of adopting a microservices architecture. Microservices enables faster movement and better ability to respond in a more agile and appropriate way to changing business needs, even at the detailed level of applications and services.

What is required at the level of technical design to support a microservices approach? Independence between microservices is key. Services need to interact via lightweight connections. In the past, it has often been assumed that these connections would use synchronous request-and-response mechanisms, such as RPC or REST, that involve a call and an almost immediate response. That works, but a more modern, and in many ways more advantageous, method to connect microservices is via a message stream.

Stream transport can support microservices if it can do the following (a brief sketch follows the list):

  • Support multiple data producers and consumers
  • Provide message persistence with high performance
  • Decouple producers and consumers
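
To make these requirements concrete, here is a minimal sketch of topic-based transport using the kafka-python client (our choice for illustration; any Kafka-API-compatible client, including one pointed at MapR Streams, would look similar). The broker address and topic name are invented:

    # Producer side: any number of services can write to the same topic.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-events", b'{"device": "a-17", "temp": 71.3}')
    producer.flush()

    # Consumer side: consumers subscribe topic by topic, and each consumer
    # group keeps its own cursor into the persisted log, so adding a new
    # downstream service does not disturb the existing ones.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("sensor-events",
                             bootstrap_servers="localhost:9092",
                             group_id="dashboard")
    for message in consumer:
        print(message.value)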

It’s fairly obvious in a complex, large-scale system why the message transport needs to be able to handle data from multiple sources (data producers) and to have multiple applications running that consume that data, as shown in Figure 2-2. However, the other needs can be, at first glance, less obvious.

Figure 2-2. A stream-based design with the right choice of stream transport supports a microservices-style approach.

Clearly you want a high-performance system, but why do you need message persistence? Often when people think of using streaming data, they are concerned with some real-time or low-latency application, such as updating a real-time dashboard, and they may take a “use it and lose it” attitude toward the streaming data involved. If so, they are likely throwing away real value, because other groups—or even they themselves, in future projects—might need access to that discarded data. There are a number of reasons to want durable messages, but foremost in the context of microservices is that message persistence is required to decouple producers and consumers.

Note

A stream transport technology that decouples producers from consumers offers a key capability needed to take advantage of a flexible microservices-style design.

Why is having durable messages essential for this decoupling? Look again at Figure 2-2. The stream transport technologies of interest do not broadcast message data to consumers. Instead, consumers subscribe to messages on a topic-by-topic basis. Streaming data from the data sources is transported and made available to consumers immediately—a requirement for real-time or low-latency applications—but the message does not need to be consumed right away. Thanks to message persistence, consumers don’t need to be running at the moment the message appears; they can come online later and still be able to use data from earlier events. Consumers added at a later time don’t interfere with others. This independence of consumers from one another and from producers is crucial for flexibility.
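
As a sketch of this decoupling, a consumer that comes online long after the messages were produced can still read the full history, because the transport persists them (again using kafka-python; the names are illustrative):

    from kafka import KafkaConsumer

    # A new service added weeks later: it uses its own group_id, so it has
    # no effect on existing consumers, and it starts from the oldest
    # retained message rather than only new arrivals.
    late_consumer = KafkaConsumer(
        "sensor-events",
        bootstrap_servers="localhost:9092",
        group_id="new-analytics-service",
        auto_offset_reset="earliest",
    )
    for message in late_consumer:
        print(message.value)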

Traditionally, stream transport systems have had a trade-off between performance and persistence, but that’s not acceptable for modern stream-based architectures. Figure 2-2 lists two modern stream transport technologies that deliver excellent performance along with persistence of messages. These are Apache Kafka and MapR Streams, which uses the Kafka API but is engineered into the MapR Converged Data Platform. Both are good choices for stream transport in a stream-first architecture.

Streams Offer More

The advantages of a stream-first design go beyond just low-latency applications. In addition to support for microservices, having durable messages with high performance is helpful for a variety of use cases that need an event-by-event replayable history. Think of how useful that could be when an insurance company needs an auditable log, or someone doing anomaly detection as part of preventive maintenance in an industrial setting wants to replay data from IoT sensors for the weeks leading up to a malfunction.
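
For instance, here is a hedged sketch of replaying sensor history from a chosen point in time with kafka-python; the topic, partition, and timestamp are invented for illustration:

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    tp = TopicPartition("iot-sensors", 0)
    consumer.assign([tp])

    # Look up the offset of the first message at or after the chosen time
    # (milliseconds since the epoch), then rewind the cursor to it.
    start_ms = 1_500_000_000_000          # e.g., weeks before the malfunction
    start = consumer.offsets_for_times({tp: start_ms})[tp]
    if start is not None:                 # None if nothing exists after start_ms
        consumer.seek(tp, start.offset)

    for message in consumer:
        print(message.value)              # feed into your anomaly detector here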

Data streams are also excellent for machine learning logistics, as we describe in detail in The Rendezvous Architecture for Machine Learning. For now, one thing to keep in mind is that a stream works well as an immutable record, perhaps even better than a database.

Note

Databases were made for updates. Streams can be a safer way to persist data if you need an exact copy of input or output data for a model.

Streams are also a useful way to provide raw data to multiple consumers, including multiple machine learning models. Recording raw data is important for machine learning—don’t discard data that might later prove useful.

We’ve written about the advantages of a stream-based approach in the book Streaming Architecture: New Designs Using Apache Kafka and MapR Streams (O’Reilly, 2016). One advantage is the role of streaming and stream replication in building a global data fabric.

Building a Global Data Fabric

As organizations expand their use of big data across multiple lines of business, they need a highly efficient way to access a full range of data sources, types, and structures, while avoiding hidden data silos. They need fine-grained control over access privileges and data locality without a big administrative burden. All this needs to happen in a seamless way across multiple data centers, whether on premises, in the cloud, or in a highly optimized hybrid architecture, as suggested in Figure 2-3. What is needed goes beyond, and works much better than, a data lake. The solution is a global data fabric.

Preferably, the data fabric you build is managed under uniform administration and security, with fine-grained control over access privileges, yet each approved user can easily locate data—each “thread” of the data fabric can be accessed and used regardless of geo-location, whether on premises or in cloud deployments.

Geo-distributed data is a key element in a global data fabric, not only for remote back-up copies of data as part of a disaster recovery plan, but also for day-to-day functioning of the organization and data projects including machine learning. It’s important for different groups and applications in different places to be able to simultaneously use the same data.

Figure 2-3. A global data fabric provides an organization-wide view of data, applications, and operations while making it easy to find exactly the data that is needed. (Reprinted from the O’Reilly data report “Data Where You Want It: Geo-Distribution of Big Data and Analytics,” © 2017 Dunning and Friedman.)

How can you do that? One example of a technology uniquely designed with the capabilities needed to build a large-scale global data fabric is the MapR Converged Data Platform. MapR offers a range of data structures (files, tables, message streams) that are all part of a single technology, all under the same global namespace, administration, and security. Streaming data can be shared through subscription by multiple consumers, and because MapR Streams provides unique multi-master, omnidirectional stream replication, streaming data is also shared across different locations, in the cloud or on premises. In other words, you can focus on what your application is supposed to do, regardless of where the data source is located.

Administrators don’t need to worry about what each developer or data scientist is building—they can focus on their own concerns of maintaining the system, controlling access and data location as needed, all under one security system. Similarly, MapR’s direct table replication also contributes to this separation of concerns in building a global data fabric. Efficient mirroring of MapR data volumes with incremental updates goes even further to provide a way to extend the data fabric through replication of files, tables, and streams at regular, configurable intervals.

Note

A global data fabric suits the DataOps style of working and is a big advantage for the management of multiple applications, including machine learning models in development and production.

With a global data fabric, applications also need to run where you want them. The ability to deploy applications easily in predictable and repeatable environments is greatly assisted by the use of containers.

Making Life Predictable: Containers

It’s not surprising that you might encounter slight differences in conditions as you go from development to production or in running an application in a new location—but even small differences in the application version itself, or any of its dependencies, can result in big, unexpected, and generally unwanted differences in the behavior of your application. Here’s where the convenience of launching applications in containers comes to the rescue.

A container behaves like just another process running on a system, but containerization is a much lighter-weight approach than virtualization. It’s not necessary, for instance, to run a copy of the operating system in a container, as would be true for a virtual machine. With containers, you get environmental consistency that does away with surprises: you provide the same conditions for running your application in a variety of situations, making its behavior predictable and repeatable. You can package, distribute, and run the exact bits that make up applications (including machine learning models) along with their dependencies in a carefully curated environment. Containers are a good fit for flexible approaches, are useful for cloud deployments, and help you build a global data fabric. They’re particularly important for model management.

Keeping containers lightweight, however, raises a challenge for data that needs to live beyond the lifetime of the container. Storing data in the container, especially at scale, is generally not a good practice. You want to keep containers stateless, yet at times you need to run stateful applications, including machine learning models. How, then, can you have stateless containers running stateful applications? A solution, shown in Figure 2-4, is for containers to access and persist data directly to a data platform. Note that applications running in the containers can communicate with each other directly or via the data platform.

Figure 2-4. Containers can remain stateless even when running stateful applications if there is data flow to and from a platform. (Based on “Data Where You Want It: Geo-Distribution of Big Data and Analytics.”)

Scalable datastores such as Apache Cassandra could serve the purpose of persistence, although they are limited to one data type (tables). Files could be persisted to a specialized distributed file system, although the Hadoop Distributed File System (HDFS) has some limitations on read performance for container-based applications. An alternative to either of these is the MapR Converged Data Platform, which not only handles data persistence as tables or files at scale but also offers the option of persistence to message streams. For more detail on running stateful container-based applications with persistence to an underlying platform, see the data report “Data Where You Want It: Geo-Distribution of Big Data and Analytics”.
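
As a hedged sketch of that pattern, the service below keeps no state inside its container: the model file is read from a platform-mounted path, input arrives on one stream, and results go out on another, so the container can be killed and rescheduled anywhere. The path, topics, and the trivial scoring stub are all hypothetical:

    import json
    from kafka import KafkaConsumer, KafkaProducer

    MODEL_PATH = "/mapr/prod/models/churn/v7.bin"  # platform volume, not container disk

    def load_model(path):
        # Placeholder: deserialize whatever your training framework produced.
        with open(path, "rb") as f:
            return f.read()

    model = load_model(MODEL_PATH)

    consumer = KafkaConsumer("scoring-input",
                             bootstrap_servers="localhost:9092",
                             group_id="churn-scorer")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for message in consumer:
        features = json.loads(message.value)
        score = 0.5                                # stand-in for a real predict() call
        producer.send("scoring-output",
                      json.dumps({"id": features.get("id"),
                                  "score": score}).encode())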

Canaries and Decoys

So far, we’ve talked about issues that not only matter for model management but are also more broadly important in working with big data. Now, let’s take a look at a couple of challenges that are specific for machine learning.

The first issue is to have a way to accurately record the input data for a model. It’s important to have an exact and replayable copy of input data. One way to do this is to use a decoy model. The decoy is a service that appears to be a machine learning model but isn’t. The decoy sits in exactly the same position in a data flow as the actual model or models being developed, but the decoy doesn’t do anything except look at its input data and record it, preferably in a data stream. Managing Model Development describes in detail the use of decoy models as a key part of the rendezvous architecture.
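
A minimal sketch of a decoy, assuming the same hypothetical stream setup as in earlier examples: it subscribes to the same input stream as the live models but does nothing except archive the raw bytes, untouched, to a stream of its own:

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("model-input",
                             bootstrap_servers="localhost:9092",
                             group_id="decoy")     # its own group: sees every message
    archive = KafkaProducer(bootstrap_servers="localhost:9092")

    for message in consumer:
        # Record the input exactly as received so it can be replayed later.
        archive.send("model-input-archive", message.value)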

Another challenge for machine learning is to have a baseline reference for the behavior of a model in production. If a model is working well, perhaps at 90 percent of whatever performance is possible, introducing a new model usually should produce only a relatively small change, even if that change is a desirable improvement. It would be useful, then, to have a way to alert you to larger changes that may signal that something has been altered or gone wrong in your data chain, for instance, or that customers are behaving differently enough that customer-based data looks very different.

The solution is to deploy a canary model. The idea behind a canary is simple and harkens back to earlier times when miners carried a canary (a live bird, not software) into a mine as a check for good air. Canaries are particularly sensitive to toxic gases, so as long as the canary was still alive, the air was assumed to be safe for humans. The good news is that the use of a canary model in machine learning is less cruel but just as effective. The canary model runs alongside the current production model, providing a reference for baseline performance against which multiple other models can be measured.
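
One simple way to use the canary as a baseline is to compare score distributions. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the significance threshold and the synthetic stand-in data are arbitrary illustrations, not a prescription:

    import numpy as np
    from scipy.stats import ks_2samp

    def scores_diverge(canary_scores, new_scores, alpha=0.01):
        # A small p-value suggests the new model's score distribution has
        # drifted noticeably from the canary's baseline and deserves a look.
        statistic, p_value = ks_2samp(canary_scores, new_scores)
        return p_value < alpha

    canary = np.random.beta(2, 5, size=1000)       # stand-in for logged canary scores
    candidate = np.random.beta(2, 5, size=1000)    # stand-in for a new model's scores
    print(scores_diverge(canary, candidate))       # usually False: same distribution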

Types of Machine Learning Applications

The logistical issues we discuss apply to essentially all types of machine learning applications, but the solutions we propose—in particular the rendezvous architecture—are a best fit for a decisioning type of machine learning. To help you recognize how this relates to your own projects, here’s a brief survey of machine learning categories drawn with a broad brush.

Decisioning

Machine learning applications that fall under the description of “decisioning” basically seek a “correct answer.” Out of a short list of possible answers, we hope the first is correct, perhaps with partial credit for the second, and so on. Decisioning systems involve a query-response pattern with bounded input data and bounded output. They are typically built using supervised learning and usually require human judgments to build the training data.

You can think of decisioning applications as being further classified into two styles: discrete or flow. Discrete probably is more familiar. You have a single query followed by a single response, often in the moment, that goes back to the origin of the query. This is then repeated. With the flow style, there is a continuous stream of queries and the corresponding responses go back to a stream. The architecture we propose as a solution for managing models is a streaming system packaged to make it look discrete.
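
Here is a hedged sketch of how a streaming system can be packaged to look discrete: the caller writes a query tagged with a correlation ID to a request stream, then blocks until the matching response arrives on a response stream. The topic names and message shapes are invented:

    import json
    import uuid
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    responses = KafkaConsumer("responses",
                              bootstrap_servers="localhost:9092",
                              group_id="caller")

    def ask(features):
        # Flow style underneath, discrete style at the call site.
        request_id = str(uuid.uuid4())
        producer.send("requests",
                      json.dumps({"id": request_id,
                                  "features": features}).encode())
        producer.flush()
        for message in responses:        # block until our answer shows up
            reply = json.loads(message.value)
            if reply.get("id") == request_id:
                return reply["score"]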

Use cases that fall in the decisioning category include transaction fraud detection projects such as those based on credit cards, credit risk analysis of home mortgage applications, and identifying potential fraud in medical claims. Decisioning projects also include marketing response prediction (predictive analytics), churn prediction, and detection of energy theft from a smart meter. Deep learning systems for image classification or speech recognition also can be viewed as decisioning systems.

Search-Like

Another category of applications involves search or recommendations. These projects use bounded input and return a ranked list of results. Multiple answers in the list may be valid—in fact the goal of search is often to provide multiple desired results. Use cases involving search-like or recommendation-based applications include automated website organization, ad targeting or upsell, and product recommendation systems for retail applications. Recommendation is also used to customize web experience in order to encourage a user to spend more time on a website.

Interactive

This last broad category contains systems that tend to be more complex and often require an even higher level of sophistication than those we’ve already described. Answers are not absolute; the validity of the output generally depends on context, often in real-world and rapidly changing situations. These applications use continuous input and involve autonomous interactions with the world. They may also use reinforcement learning. The actions of the system also determine what future actions are possible.

Examples include chat bots, interactive robots, and autonomous cars. For the latter, a response such as “turn right” may or may not be correct, depending on the context and the exact moment. In the case of self-driving cars, the interactive nature of these applications involves a great deal of continuous input from the environment and the car’s own internal systems in order to respond in the moment. Other use cases involve sophisticated anomaly detection and alerting, such as detecting web system anomalies, security intrusions, or even predictive-maintenance issues.

Conclusion

All of these categories of machine learning applications could benefit from some aspects of the solutions we describe next, but for search-like projects or sophisticated interactive machine learning, the rendezvous architecture will likely need to be modified to work well. Solutions for model management for all these categories are beyond the scope of this short book. From here on, we focus on applications of the decisioning type.