Machine Learning Logistics

by Ted Dunning and Ellen Friedman

Models in Production

...developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive

D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems”, NIPS 2015

Modern machine learning systems are making it much easier to build a basic decisioning system. That ease, however, is a bit deceptive. Building and deploying a first decisioning system tends to go very well, and the results can be quite impressive for the right applications. Adding another system typically goes about as well. Before long, however, strange interactions can begin to appear in ways that should be impossible from a software engineering point of view. Changing one part of the system affects some other part of the system, even though tests in isolation might suggest that this is impossible.

The problem is that systems based on machine learning can have subtle properties that are very different from those of more traditional software systems. Partly, this difference comes from the fact that the outputs of machine learning systems have much more complex behaviors than those of typical software components. It also comes about, at least in part, because of the probabilistic nature of the judgments such systems are called upon to make.

This complexity and subtlety makes the management of such systems trickier than the management of traditional well-modularized software systems based on microservices. Complex machine learning systems can grow to exhibit pathological “change anything, change everything” behaviors—even though they superficially appear to be well-designed microservices with high degrees of isolation.

There are no silver bullet answers to these problems. Available solutions are pragmatic and have to do with making basic operations easier so that you have more time to think about what is really happening. The rendezvous architecture is intended to make the day-to-day operations of deploying and retiring machine learning models easier and more consistent. Getting rid of mundane logistical problems by controlling those processes is a big deal because it can give you the time you need to think about and understand your systems. That is a prime goal of the rendezvous architecture.

The architecture is also designed to provide an enormous amount of information about the inputs to those models and what exactly those models are doing. Multiple models can be run at the same time for comparison and cross-checking as well as helping to meet latency and throughput guarantees.

Note

A common first impression among people unfamiliar with running models in production is that the rendezvous architecture goes to extremes in measuring and comparing results. In contrast, engineers experienced in fielding production machine learning models often ask whether there are ways to get even more information about what is happening.

Life with a Rendezvous System

There are a few basic procedures that cover most of what must happen operationally with a rendezvous architecture. These include bringing new models into preproduction (also known as staging), rolling models into production, and retiring models. There are also a few critical processes that involve maintenance of the rendezvous system itself.

It should be remembered that no machine learning system except the most trivial is ever really finished. There will always be new ideas to try, and model performance is subject to decay over time. It’s important to monitor the ongoing performance of models to determine when to bring out a new edition.

Model Life Cycle

The life cycle of a model consists of five main phases: development, preproduction, production, post-production, and retirement. The goal of development is to deliver a container-based model that meets the data science goals for the service.

Note

The key criterion for exiting the development phase is that the model should have been run on archived input data and behaved well in terms of accuracy and runtime characteristics such as memory footprint and latency distribution.

Models enter preproduction as a container description that specifies the environment for the model, including all scaffolding for integrating the container into the rendezvous architecture. In addition to specifying the environment, the container description should refer to a version-controlled form of the model itself. Putting a model into preproduction consists of inspecting and rebuilding the model container and starting it on production hardware. On startup, the model is connected to streams with production inputs and production outputs. Next, you scale the container with enough replicas to give the required performance. You also need to bring any internal state in the model up to date by supplying a snapshot of the internal state and replaying transactions after the snapshot. If the internal state depends only on a relatively short window, you can just replay transactions on top of an initially zero state.
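
To make the state-replay step concrete, here is a minimal sketch in Python of bringing a model’s internal state up to date before it starts emitting results. The snapshot store, the transaction log interface, and the method names on the model are assumptions made for this illustration, not part of any particular library.

def warm_up(model, snapshot_store, transaction_log):
    # Load the most recent snapshot of the model's internal state, if any.
    snapshot = snapshot_store.latest()   # hypothetical: returns {"state": ..., "offset": ...} or None
    if snapshot is not None:
        model.restore_state(snapshot["state"])
        start_offset = snapshot["offset"]
    else:
        # If the internal state depends only on a short window, replaying
        # on top of an initially empty state is good enough.
        model.restore_state({})
        start_offset = transaction_log.earliest_offset()

    # Replay every transaction recorded after the snapshot so that the
    # model's state catches up to the live stream before it goes live.
    for txn in transaction_log.read_since(start_offset):
        model.apply(txn)   # update internal state only; no results are emitted yet

    return model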

At this point, the model is running in production except for the fact that the rendezvous server is ignoring its results. As far as the model is concerned, everything is exactly as it would be in production.

During the time a model is fully in preproduction and all internal state is live, you should monitor it for continued good behavior in terms of latency, and you can compare its output to the canary model and current production models. If the new model exhibits operational instability or has significantly lower accuracy than current champion and contender models, you may decide to send the model directly to retirement by stopping any containers running the model. Many models will go directly to retirement at this point, hopefully because your current champion is really difficult to beat.
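
For example, a comparison between the preproduction model, the current champion, and the canary might look something like the following sketch. The decisions_for helper and the model names are hypothetical; the point is simply to measure how often the candidate agrees with models you already trust.

import numpy as np

def agreement_report(candidate, champion="champion-v6", canary="canary"):
    # decisions_for is a hypothetical helper that reads a model's decisions
    # from the archived decision stream, aligned by request id.
    cand = np.asarray(decisions_for(candidate))
    report = {}
    for name in (champion, canary):
        other = np.asarray(decisions_for(name))
        report[name] = float(np.mean(cand == other))   # fraction of identical decisions
    return report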

After the preproduction model has enough runtime that you feel confident that it is stable and you have seen it give competitive accuracy relative to the current champion, you can soft-roll it into production by giving the rendezvous server new schedules that use the new model for progressively more and more production traffic, and use the previous champion for less and less traffic. You should keep the old champion running during this time and beyond in case you decide to roll back. The transition to the new champion should be slow enough for you to keep an eye out for any accidental data dependencies, but the transition should also be sudden enough to leave a sharp and easily detectable signature in the metrics for the rest of the system in case there is an adverse effect of the roll. You want to avoid a case in which rolling to the new model has a bad effect on the overall system, but happens so slowly that nobody notices the effect right away.
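
A soft roll can be as simple as a sequence of schedule updates, each shifting more traffic to the new model. The sketch below assumes a hypothetical set_schedule call on the rendezvous server and a wait_and_check_metrics helper; the step sizes are illustrative, not a recommendation.

ROLL_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic answered by the new model

def soft_roll(rendezvous, new_model, old_model, wait_and_check_metrics):
    for fraction in ROLL_STEPS:
        # The old champion keeps running (and answering the remaining traffic)
        # so that rolling back is just a matter of restoring the old schedule.
        rendezvous.set_schedule({
            new_model: fraction,
            old_model: 1.0 - fraction,
        })
        # Hold each step long enough to leave a clear signature in the metrics,
        # but not so long that a bad roll drags on unnoticed.
        wait_and_check_metrics(fraction)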

At this point, the model is in production, but nothing has changed as far as the model is concerned. It is still evaluating inputs and pumping out results exactly as it did in preproduction. The only difference is that the output of the model can now affect the rest of the world. For some models, their output inherently selects the training data for all the models and so rolling a new model into production can have a big effect data-wise. Especially with recommendation engines, this leads to a tension between giving the best recommendations that we know how to give versus giving recommendations that include some speculative material that we don’t know as much about. Presenting the best results is good, but so is getting a broader set of training data. We have discussed this effect in our book Practical Machine Learning: Innovations in Recommendation. If you suspect that something like this is happening, you may want to randomly select between the decisions of several models to see if there is a perceptible effect on output quality.
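
If you want to test for this kind of data feedback, randomizing which model’s decision is actually used is straightforward. In this sketch the model names, the weights, and the shape of the results argument are all made up for illustration.

import random

DECISION_WEIGHTS = {"champion": 0.90, "contender": 0.08, "explorer": 0.02}

def select_decision(results):
    # results maps model name -> that model's decision for a single request.
    names = [n for n in DECISION_WEIGHTS if n in results]
    weights = [DECISION_WEIGHTS[n] for n in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    # Return (and log) which model was chosen so you can later compare
    # output quality across the randomized groups.
    return chosen, results[chosen]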

Eventually, any champion model will be unseated by some new contender. This might be because the contender is more efficient or faster or more accurate. It could also happen because model performance commonly degrades over time as the world changes or, in the case of fraud models, due to innovation by adversaries.

Moving a model into post-production is inherent in the process of bringing the new champion into production and is done by giving the rendezvous server new schedules that ignore the old champion. In the post-production phase, the model will still handle all the production traffic but its outputs will be ignored by the rendezvous server. Post-production models should still be monitored to verify that they are safe to keep around as a fallback. Typically, you keep the last champion to enter post-production in the rendezvous schedule as a fallback in case the new champion fails to produce a result in a timely manner.

Eventually, a former champion is completely removed from the rendezvous schedule. After you are confident that you won’t need to use the output of the old champion any more, you can retire it. A few models should be kept in the post-production phase for a long time, serving as canary models. After a model is fully retired, you can stop all of the containers running it.

Upgrading the Rendezvous System

Eventually, the services making up the rendezvous architecture itself will need to be updated. This is likely to be far less common than rolling new models, but it will happen. Happily, the entire architecture is designed to allow zero-downtime upgrades of the system components.

Here are the key steps in upgrading the rendezvous system components:

  1. Start any new components. This could be the external state systems or the rendezvous server. All new components should start in a quiescent state and should ignore all stream inputs.
  2. Inject a state snapshot token into the input of the system. There should be one token per partition. As this snapshot token passes through the system, all external state components will snapshot their state to a location specified in the token and pass the snapshot token into their output stream.
  3. The new versions of the external state maintenance systems will look for the snapshots to complete and will begin updating their internal state from the offsets specified in the snapshots. After the new external state systems are up to date, they will emit snapshot-complete tokens.
  4. When all of the new external state systems have emitted their snapshot-complete tokens, transition tokens are injected into the input of the system, again one token per partition.
  5. As the external state systems receive transition tokens, the old versions will stop augmenting input records, and the new versions will start. This will provide a zero-delay hand-off to the new systems. The transition tokens will be passed to the models, which should simply pass the tokens through.
  6. When the old rendezvous server sees a transition token on the system input, it will stop creating new mailboxes for incoming transactions but will continue to watch for results that might apply to existing mailboxes.
  7. When the new rendezvous server sees a transition token on the system input, it will start creating mailboxes for all transactions after the token and watching for results for those requests.

When this sequence of steps is complete, the system will have upgraded itself to new versions with no latency bumps or loss of state. You can keep old versions of the processes running until you are sure that you don’t need to roll back to the previous version. At that time, you can stop the old versions.
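
To illustrate steps 6 and 7, here is a minimal sketch of how old and new rendezvous servers might react to a transition token. The token test, the mailbox helpers, and the record format are assumptions for this example rather than a specification of the architecture.

class RendezvousServer:
    def __init__(self, role, is_transition_token, new_mailbox):
        self.role = role                       # "old" or "new"
        self.accepting = (role == "old")       # the old server owns traffic before the token
        self.mailboxes = {}
        self.is_transition_token = is_transition_token
        self.new_mailbox = new_mailbox

    def on_input(self, record):
        if self.is_transition_token(record):
            # Old server: stop opening mailboxes but keep watching existing ones.
            # New server: start opening mailboxes for everything after the token.
            self.accepting = (self.role == "new")
            return
        if self.accepting:
            self.mailboxes[record["request_id"]] = self.new_mailbox(record)

    def on_result(self, result):
        # Both servers keep delivering results to any mailboxes they still hold,
        # so in-flight requests are not dropped during the hand-off.
        box = self.mailboxes.get(result["request_id"])
        if box is not None:
            box.deliver(result)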

Beware of Hidden Dependencies

If you build a simple system in which you use a rendezvous architecture to run and monitor a single kind of model, things are likely to work just fine with no big surprises. Unfortunately, life never stays that simple. Before long, you will have new models that do other things, so you will light up some more rendezvous systems to maintain them. Next, you’ll have models that compute some signal to be used as external state for other models. Then, you will have somebody who starts using a previously stable model as external state. Suddenly, one day, you will have a complicated system with models depending on other models.

When that day comes (and it probably will) you are likely to get some surprises due to model interactions. Even though the models are run completely independently in separate rendezvous systems, the fact that one model consumes the results of another can cause some very complex effects. Models can become coupled in very subtle ways purely by the data passing between them. The same thing could conceivably happen with ordinary microservices, but the data that passes between microservices doesn’t normally carry as much nuance as the data that passes between machine learning systems, thus limiting the amount of data coupling.

A Simplified Example of Data Coupling

Let’s look at a simplified example of a fairly common real-world scenario. In this example, we want to find fraud. Let’s consider only how much of the total fraud we find, and ignore the problem of false alarms, where we mark a clean transaction as fraud. Now, let’s suppose that there are actually two different kinds of transactions, red and blue, each equally common. We can’t actually see the difference, but we know that there are two kinds. For simplicity, let’s suppose that fraud is equally common in both colors.

We have two models that we want to use in a cascade such as that shown in Figure 6-1.

Figure 6-1. Two models cascaded to find as much fraud as possible.

Now, let’s assume that A and B have complementary performance when looking for fraud. Model A marks 80 percent of red frauds as fraud but never finds any blue frauds. This means that A by itself can find 40 percent of all fraud because only half of all frauds are red. Model B is just the opposite. It marks all red transactions as clean and finds 80 percent of the blue frauds. Working together in our cascaded model, A and B can find 80 percent of all the fraud.

Now suppose we “improve” A so that it finds 70 percent of fraud instead of just 40 percent. This sounds much better! Under the covers, let’s suppose that after the improvement, A finds 100 percent of all blue frauds, and 40 percent of all red frauds.

If we use A in combination with B as we did earlier, A will find 100 percent of the blue frauds and 40 percent of the red frauds. B, working on the transactions A said were clean, will have no blue frauds to find, and it never finds any red frauds anyway, so B won’t find any frauds at all. The final result will be that with our new and improved version of A, we now find 70 percent of all the fraud instead of the 80 percent we found with the original “lousy” version of A.
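
The arithmetic in this example is easy to check directly. The short sketch below reproduces the 80 percent and 70 percent figures; the per-color detection rates are exactly the ones described above, and the function name is made up.

def cascade_recall(a_red, a_blue, b_red, b_blue):
    # Fraction of all fraud caught when A runs first and B only sees the
    # transactions that A marked as clean. Red and blue fraud are equally
    # common, so the two colors are weighted equally.
    red_caught = a_red + (1 - a_red) * b_red
    blue_caught = a_blue + (1 - a_blue) * b_blue
    return 0.5 * red_caught + 0.5 * blue_caught

# Original A (80% of red frauds, no blue) with B (no red, 80% of blue):
print(round(cascade_recall(a_red=0.8, a_blue=0.0, b_red=0.0, b_blue=0.8), 2))   # 0.8

# "Improved" A (40% of red frauds, 100% of blue) with the same B:
print(round(cascade_recall(a_red=0.4, a_blue=1.0, b_red=0.0, b_blue=0.8), 2))   # 0.7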

How is it that nearly doubling the raw performance of A made things worse? The problem is that the “improvement” in A came at the expense of making A’s strength correlate with the strength of B. Originally, A and B complemented each other, and each compensated for the substantial weakness of the other. After the change, the strength of A completely overlaps and dominates the strength of B, but A has lost much of the strength it originally had. The net result is that B has nothing left to offer, and the overall result is worse than before.

In real life, unfortunately, this can easily happen. It seems easy to spot this kind of thing, but in practice we can’t tell the red transactions from the blue transactions. All we get to see is raw performance and the new A looks unconditionally better than the old A. In such a situation, we might have a graph of frauds found over time that looks like Figure 6-2.

Figure 6-2. The surprising effect of model dependencies. If improvements in model A cause performance of A to correlate with performance of model B, the cascaded performance may actually decrease even if A alone performs better.

If we were to look at the graph without knowing that it was just A that changed, we would quite reasonably suspect that somebody had made a disastrous change to B, because it was B that showed the drop in performance. Given that we know exactly when A’ was brought live, we can guess that there is a causal connection between upgrading A and losing B. In practice, the team running A should know that B is downstream and should watch for changes. Conversely, the team running B should know that A is upstream. With cascaded models like this, it is commonly the same team running A and B, but it is surprisingly common for problems like this to catch that team unawares.

The trick is either to avoid all data dependencies or to make sure you know as much as possible about any that do occur. One key way to limit surprise dependencies is to use permissions on streams to restrict consumers to systems that you know about, rather than letting accidental dependencies propagate through your systems as other teams set up models.

Of course, it isn’t feasible to completely eliminate all data dependencies, because processes that are completely outside of your control might accidentally couple via data streams that you import from the outside world. For example, a competitor’s marketing program can change the characteristics of the leads you get from your own marketing program, resulting in a coupling from their system to yours. Likewise, a fraudster’s invention of a new attack could change the nature of the frauds that you are seeing. You will see these effects, and you will need to figure out how to deal with the results.

Monitoring

Monitoring is a key part of running models in production. You need to be continually looking at your models, both from a data science perspective and from an operational perspective. In terms of data science, you need to monitor inputs and outputs, looking for anything that steps outside of established norms for the model. From an operational perspective, you need to be looking at latencies, memory size, and CPU usage for all containers running the models.
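
On the operational side, even a very simple latency check pays for itself. The sketch below assumes a hypothetical recent_latencies_ms helper that reads a model’s metrics stream; the threshold is purely illustrative and should reflect the norms you have established for each model.

import numpy as np

def check_latency(model_name, recent_latencies_ms, p99_limit_ms=50.0):
    # Summarize the latency distribution for one model's containers and
    # flag anything outside the established norm for that model.
    latencies = np.asarray(recent_latencies_ms(model_name), dtype=float)
    p50, p99 = np.percentile(latencies, [50, 99])
    return {
        "model": model_name,
        "p50_ms": float(p50),
        "p99_ms": float(p99),
        "within_norms": bool(p99 <= p99_limit_ms),
    }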

Meta Analytics presents more details on how to do this monitoring.