This chapter is about model evaluation in the context of *running production models*. You will learn practical tips to deal with model evaluation as an ongoing process of improvement—strengthening your models against real data, and solving real problems that change over time.

We won’t be covering (offline) model evaluation in the traditional sense; to learn more about that topic, we recommend you check out Alice Zheng’s free eBook *Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls* (O'Reilly).

When it comes to improving models *in production*, it becomes critical to compare the models against what they did yesterday, or last month—at some previous point in time. You may also want to compare your models against a known stable canary model, or against a known best model. So, let’s dig in.

In a working production system, there will be many models already in production or in pre- or post-production. All of these models will be generating scores against live data. Some of the models will be the best ones known for a particular problem. Moreover, if that production system is based on a rendezvous-style architecture, it will be very easy and safe to deploy new models so that they score live data in real time. In fact, with a rendezvous architecture it is probably easier to deploy a model into a production setting than it is to gather training data and do offline evaluation.

With a rendezvous architecture, it is also possible to replicate the input stream to a development machine without visibly affecting the production system. That lets you deploy models against real data with even lower risk.

The value in this is that you can take your new model’s output and compare that output, request by request, against whatever benchmark system you want to look at. Because of the way that the rendezvous architecture is built, you know that both models will have access to exactly the same input variables and exactly the same external state. If you replicate the input stream to a development environment, you also can run your new model on the live data and pass data down to replicas of all downstream models, thus allowing you to quantify differences in results all the way down the dependency chain.

The first and biggest question to be answered by these comparisons is, “What is the business risk of deploying this model?” Very commonly, the new model is just a refinement of an older model still in production, so it is very likely that almost all of the requests will produce nearly identical results. To the extent that the results are identical, you know that the performance is trivially the same, as well. This means that you already have a really solid bound on the worst-case risk of rolling out this new model. If you do a soft rollout and give the new model only 10 percent of the traffic, the worst-case risk is smaller by a factor of 10.
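As a rough sketch of this bound (the function and score data here are illustrative, not from a real system), you can count differing results and scale by the traffic fraction:

```python
# Bound the worst-case risk of a new model by how often its results
# differ from the current production model on the same requests.

def worst_case_risk(old_scores, new_scores, traffic_fraction=1.0, tol=1e-6):
    """Fraction of requests with differing results, scaled by the share
    of live traffic the new model will actually serve."""
    differing = sum(1 for a, b in zip(old_scores, new_scores) if abs(a - b) > tol)
    return traffic_fraction * differing / len(old_scores)

old = [0.10, 0.42, 0.87, 0.05, 0.63]
new = [0.10, 0.42, 0.91, 0.05, 0.63]  # differs on one request in five

full_rollout_risk = worst_case_risk(old, new)        # 0.2
soft_rollout_risk = worst_case_risk(old, new, 0.10)  # 10 percent of traffic
```

Identical results give a risk bound of zero for free; only the differing fraction needs closer examination.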

You can refine that estimate of expected risk/benefit of the new model by examining a sample of the differing records that is stratified by the score difference. Sometimes, you can get ground truth data for these records by simply waiting a short time, but often, finding the true outcome can take longer than you have. In such cases, you might need to use a surrogate indicator. For example, if you are estimating likelihood of response to emails, response rate at two hours is likely a very accurate surrogate for the true value of response rate after 30 days.
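A minimal sketch of such stratified sampling, assuming each record carries the two models' scores under hypothetical `old_score` and `new_score` keys:

```python
import random
from collections import defaultdict

def stratified_sample_by_diff(records, k_per_stratum=2, bin_width=0.1, seed=1):
    """Bin records by the difference between the new and old scores,
    then sample up to k records from each bin. Rare, large-difference
    records are represented instead of being swamped by the majority."""
    strata = defaultdict(list)
    for rec in records:
        diff = rec["new_score"] - rec["old_score"]
        strata[round(diff / bin_width)].append(rec)
    rng = random.Random(seed)
    sample = []
    for bucket in sorted(strata):
        recs = strata[bucket]
        sample.extend(rng.sample(recs, min(k_per_stratum, len(recs))))
    return sample

records = [{"old_score": 0.5, "new_score": 0.5}] * 20  # the models agree
records += [{"old_score": 0.1, "new_score": 0.6}]      # one big disagreement
sample = stratified_sample_by_diff(records)
# The single large-difference record always makes it into the sample.
```

These are the records worth checking against ground truth or a surrogate indicator.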

Where a new model has a very different implementation from existing production models, the new model may not have anything like the same output range or calibration. For models that produce numerical scores, comparisons are still possible against the production models if you reduce all scores for the new and standard models to quantiles relative to a historical set of scores. You can then do stratified sampling on the difference in quantile instead of the potentially meaningless difference in scores.
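For instance, converting raw scores to quantiles against a historical sample takes only a few lines; the score values here are made up to show two models on very different scales:

```python
import bisect

def to_quantile(score, historical_scores_sorted):
    """Map a raw score to its quantile (0 to 1) relative to a sorted
    historical sample of scores from the same model."""
    rank = bisect.bisect_right(historical_scores_sorted, score)
    return rank / len(historical_scores_sorted)

history_old = sorted([0.1, 0.2, 0.4, 0.8, 1.6])  # old model's score scale
history_new = sorted([10, 20, 40, 80, 160])      # new model, different scale

# Raw scores on wildly different scales map to comparable quantiles.
q_old = to_quantile(0.4, history_old)  # 0.6
q_new = to_quantile(40, history_new)   # 0.6
```

Differences in quantile are meaningful even when differences in raw score are not.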

In fact, reducing scores to quantiles relative to the distribution of recent scores is enormously helpful in comparing models. This is particularly handy for models that produce simple numerical scores because it allows us to use what is known as a *q-q* plot to compare the scores in terms of percentiles, which eliminates problems caused by highly variable score calibration. Figure 5-1 shows the result of such a comparison on synthetic data, both in terms of raw scores and as a *q-q* diagram.

With the raw scores, the drastic differences in score calibration prevent any understanding of the correlation of scores below about 2 on the x-axis and obscure the quality of correlation even where scales are comparable. The *q-q* diagram on the right, however, shows that the same two scores are an extremely good match for the top 10 percent of all scores and a very good match for the top 50 percent.
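A q-q comparison like the one in Figure 5-1 can be computed by pairing matched quantiles of the two score distributions; this is a simplified sketch using exact empirical quantiles:

```python
def qq_points(scores_a, scores_b, n=100):
    """Pair the i/n-th empirical quantile of one score distribution
    with the matching quantile of the other. Plotting the pairs gives
    a q-q diagram: points on the diagonal mean the two models agree
    at that part of the distribution."""
    a, b = sorted(scores_a), sorted(scores_b)

    def quantile(xs, q):
        return xs[min(int(q * len(xs)), len(xs) - 1)]

    return [(quantile(a, i / n), quantile(b, i / n)) for i in range(1, n)]

# Two models whose scores differ only in calibration (here, a factor
# of 2) produce q-q points that lie on a straight line.
points = qq_points(range(100), [2 * x for x in range(100)])
```

A straight but off-diagonal line indicates a pure calibration difference; scatter away from any line indicates real disagreement in ranking.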

If this were a fraud model, we would likely be able to declare the risk of using the new model negligible because fraud models typically care only about the largest scores, and the *q-q* plot shows that for the top 5 percent these models are functionally identical, even if their score calibrations differ.

If we have only a few thousand queries, computing quantiles directly in R or Python is a nonissue. Even up to a million scores or so, it isn’t all that big a deal. But what we need to do is much more nuanced. We need to be able to pick subsets of scores selected by arbitrary qualifiers over pretty much arbitrary times and compare the resulting distribution to other distributions. Without some sort of distribution sketch, we would need to read every score of interest and sort all of them to estimate quantiles. To avoid that, we suggest using the *t*-digest.

The *t*-digest is an algorithm that computes a compact, mergeable sketch of the distribution of a set of numbers. Quantiles estimated from the sketch are accurate even at the extreme tails of the distribution, and the sketch can be built and maintained efficiently even at very large scale.

You can combine multiple *t*-digest sketches easily. This means that we can store sketches for score distributions for each, say, 10-minute period for each of the unique combinations of request qualifiers such as time of day, or source geo-location, or client type. Then, we can retrieve and combine sketches for any time period and any combination of qualifiers to get a sketch of just the data we want. This capability lets us compare distributions for different models, for different times in terms of raw values, or in terms of quantiles. Furthermore, we can retrieve the *t*-digest for any period and condition we want and then convert a set of scores into quantiles by using that sketch.
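The workflow of per-window sketches merged on demand can be illustrated with a toy stand-in for the *t*-digest (this toy keeps raw values, whereas a real *t*-digest keeps a small set of centroids; in practice you would use a *t*-digest library rather than this class):

```python
import bisect

class ToySketch:
    """Stand-in for a t-digest sketch: supports update, merge, and
    quantile queries. A real t-digest stores compressed centroids
    instead of raw values, but the workflow is the same."""

    def __init__(self):
        self.values = []

    def update(self, x):
        bisect.insort(self.values, x)

    def merge(self, other):
        merged = ToySketch()
        merged.values = sorted(self.values + other.values)
        return merged

    def quantile(self, q):
        return self.values[min(int(q * len(self.values)), len(self.values) - 1)]

    def cdf(self, x):
        """Quantile of a raw score relative to this sketch."""
        return bisect.bisect_right(self.values, x) / len(self.values)

# One sketch per 10-minute window and qualifier combination...
sketches = {("10:00", "EU"): ToySketch(), ("10:10", "EU"): ToySketch()}
for v in [1, 2, 3]:
    sketches[("10:00", "EU")].update(v)
for v in [4, 5, 6]:
    sketches[("10:10", "EU")].update(v)

# ...combined on demand to cover exactly the period of interest.
combined = sketches[("10:00", "EU")].merge(sketches[("10:10", "EU")])
median = combined.quantile(0.5)
```

The `cdf` method is what converts a fresh score into a quantile relative to whatever historical period and conditions the combined sketch covers.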

Many of the comparisons described here would be very expensive to do with large numbers of measurements, but they can be completed in fractions of a second by storing *t*-digests in a database.

After we are pretty sure that the downside risk of a model is small and that the operational footprint is acceptable, the best way to evaluate it is to put it into service by using its results a fraction of the time, possibly selected based on a hash of the user identity so that each user sees a consistent experience. As we begin to pay attention to the new candidate, we need to watch whatever fast surrogate for performance we have in order to see the effects of the new model.
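Hash-based assignment of users to the candidate model might look like the following sketch; the salt and threshold scheme here are illustrative, not a prescribed design:

```python
import hashlib

def serves_candidate(user_id, fraction=0.1, salt="candidate-rollout-1"):
    """Deterministically decide whether this user is served by the
    candidate model. Hashing the user identity gives each user a
    consistent experience; the salt keeps separate experiments from
    reusing the same split."""
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(h, 16) % 10_000 < fraction * 10_000

# The same user always lands in the same bucket.
consistent = serves_candidate("user-42") == serves_candidate("user-42")  # True
```

Because the assignment is a pure function of the user identity and salt, every service in the system can compute it independently without shared state.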

Offline evaluation of models is a fine thing for qualifying candidates to be tried in production, but you should keep in mind the limits of offline evaluation. The fact is, machine learning systems can be sensitive to data coupling. When these systems make real-world decisions and then take training data from the real world, there is a risk that they couple to themselves, leaving offline evaluation with only marginal value. We have seen a number of cases in which adding noise to a model’s output (which inevitably degrades its current performance) broadens the training data that the system acquires and thus improves performance in the future. This can have a bigger effect than almost any algorithmic choice and can easily make a model that scores worse offline perform much better in the real world. Conversely, a model that scores better offline can couple to other systems and make them perform more poorly.
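As an illustration of deliberately adding noise, an epsilon-greedy style tweak to a model's output could look like this sketch (the function name and parameters are hypothetical, not from any particular system):

```python
import random

def score_with_exploration(model_score, epsilon=0.05, rng=random.Random(7)):
    """With probability epsilon, replace the model's score with a
    random one so that the system occasionally acts on requests it
    would otherwise never select, broadening future training data."""
    if rng.random() < epsilon:
        return rng.random()
    return model_score

# With epsilon set to zero, the model's score passes through unchanged.
score_with_exploration(0.9, epsilon=0.0)  # 0.9
```

The exploration fraction trades a small, bounded loss in current performance for training data the system would otherwise never see.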

As such, rolling a model out to get some real-world data is really the only way to determine whether a model is truly better than the alternatives.