Machine Learning Logistics

by Ted Dunning and Ellen Friedman

Meta Analytics

I know who I WAS when I got up this morning, but I think I must have been changed several times since then.

Alice in Wonderland, by Lewis Carroll

Just as we need techniques to determine whether the data science that we used to create models was soundly applied to produce accurate models, we need additional metrics and analytics to determine whether the models are functioning as intended. The question of whether the models are working breaks down into whether the hardware is working correctly, whether the models are running, and whether the data being fed into the models is as expected. We need metrics and analytics techniques for all of these. We also need to be able to synthesize all of this information into simple alerts that do not waste our time (more than necessary).

The rendezvous architecture is designed to throw off all kinds of diagnostic information about how the models in the system are working. Making sense of all of that information can be difficult, and there are some simple tricks of the trade that are worth knowing. One major simplification is that because we are excluding the data science question of whether the models are actually producing accurate results, we can simplify our problem a bit by assuming that the models were working correctly to begin with—or at least as correctly as they could be expected to work. This means that our problem reduces to the problem of determining whether the models are working like they used to do. That is a considerably easier problem.

Note that in doing this, we are analyzing the behavior of the analytical engines themselves. Because of that flavor of meta analysis, we call our efforts “meta analytics.”

Generally, meta analytics can be divided into data monitoring and operational monitoring. Data monitoring has to do with looking at the input data and output results of models to see how things are going. Operational monitoring typically ignores the content and looks only at things like latencies and request frequency. Typically, data monitoring appeals and makes sense to data scientists and data engineers, whereas operational monitoring makes more sense to operations specialists and software engineers. Regardless, it is important that the entire team takes meta analytics seriously and looks at all the kinds of metrics to get a holistic view of what is working well and what is not.

Basic Tools

There are a few basic techniques that are very useful for both data monitoring and operational monitoring in meta analytics, many of which are surprisingly little known. All of these methods have as their goal the comparison of an estimate of what should be happening (we call that “normal”) with what is happening now. In model meta analytics, two key tools are the ability to look for changes in event rates and the ability to estimate the distribution of a value. We can combine these tools in many ways with surprising results.

Event Rate Change Detection

Events in time are a core concept in meta analytics. Simply realizing when an intermittent event has changed rate unexpectedly is a very useful kind of meta-analytical measure. For instance, you can look at requests coming to a model and mark times when the rate of requests has changed dramatically. If the rate of requests has dropped precipitously, that often indicates an upstream problem. If the rate of outgoing requests for external state changes in a fashion incompatible with the number of incoming requests, that is also a problem.

The good news is that detecting changes in event rates is pretty simple. The best way to detect these changes is typically to look at the time since the last or nth-last event happened. Let’s assume that you have a predicted rate (call this λ) and have seen the times of all events up to now (call these t_1, …, t_i). If you want to detect a complete stoppage in the events, you can just use the time since the last event scaled by the predicted rate, λ(t − t_i), where t is the current time. If your predicted rate is a good estimate, this signal will be as good as you can get in terms of trading off false positives and false negatives against detection time.

If you are looking for an alarm when the rate goes up or down, you can use the nth-order event time difference, λ(t − t_{i−n+1}), as your measure. If n is small, you will be able to detect decreases in rate. With a larger n, you can also detect increases in rate. Detecting small changes in rate requires a large value of n. Figure 7-1 shows how this can work.

Figure 7-1. Detecting shifts in rate is best done using the nth-order difference in event time. The t-digest can help pick a threshold.
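To make the mechanics concrete, here is a small Python sketch (not code from the book) of an nth-order event-time detector. The class name, the default n, and the low and high thresholds are all illustrative assumptions; in practice the thresholds would be picked from the historical distribution of the score, for instance with a t-digest.

    import collections

    class RateChangeDetector:
        """Flag rate changes using the scaled nth-order event-time
        difference lambda * (t - t_{i-n+1})."""

        def __init__(self, expected_rate, n=10, low=2.0, high=30.0):
            self.rate = expected_rate              # predicted events per unit time (lambda)
            self.low = low                         # below this: rate probably went up
            self.high = high                       # above this: rate dropped or stopped
            self.times = collections.deque(maxlen=n)

        def observe(self, t):
            """Record an event that occurred at time t."""
            self.times.append(t)

        def score(self, now):
            """Scaled time spanned by the last n events, measured up to now."""
            if len(self.times) < self.times.maxlen:
                return None                        # not enough history yet
            return self.rate * (now - self.times[0])

        def alarm(self, now):
            s = self.score(now)
            if s is None:
                return None
            if s > self.high:
                return "rate dropped or stopped"
            if s < self.low:
                return "rate increased"
            return None

With n = 1 this reduces to the simple stoppage detector λ(t − t_i); larger values of n trade detection speed for the ability to see smaller changes in rate, which is exactly the trade-off discussed next.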

You may have noticed the pattern: you can’t see changes that are too small too quickly without paying a price in errors. By setting your thresholds, you can trade off detecting ghost changes against missing real changes, but you can’t fundamentally have everything you want. This is a kind of Heisenberg principle that you can’t get around with discrete events.

Similarly, all of the event-time methods discussed here require an estimate of the current rate λ. In some cases, this estimate can be trivial. For instance, if each model evaluation requires a few database lookups, the request rate multiplied by an empirically determined constant is a great estimator of the rate of database lookups. Likewise, the rate of website purchases should predict the rate of frauds detected. These trivial cross-checks between inputs and outputs sound silly but are actually very useful as monitoring signals.

Aside from such trivial rate predictions, more interesting rate predictions based on seasonality patterns that extend over days and weeks can be made by computing hourly counts and building a model for the current hour’s count based on the counts for previous hours over the last week or so. Typically, it is easier to use the log of these counts than the counts themselves, but the principle is basically the same. Using a rate predictor of this sort, we can often predict the number of requests that should be received by a particular model to within 10 to 20 percent relative error.
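As a sketch of this kind of predictor (my illustration, not the authors’ implementation), the following fits an ordinary least-squares model in log space that predicts the next hour’s count from the counts at the same hour on previous days:

    import numpy as np

    def predict_next_hour(hourly_counts, lags=(24, 48, 72, 96, 120, 144, 168)):
        """Predict the next hourly count from log counts at daily lags.

        hourly_counts: 1-D array of historical hourly counts, most recent last
                       (needs a couple of weeks of history for the lags used here).
        lags: offsets in hours used as predictors; daily lags over the last
              week capture day-of-week seasonality.
        """
        y_log = np.log1p(np.asarray(hourly_counts, dtype=float))
        max_lag = max(lags)

        # Design matrix: predict hour t from the hours t - lag, plus an intercept.
        rows = range(max_lag, len(y_log))
        X = np.array([[y_log[t - lag] for lag in lags] + [1.0] for t in rows])
        y = np.array([y_log[t] for t in rows])

        # Least-squares fit in log space, since log counts behave better.
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)

        # Predict the hour immediately after the end of the series.
        t_next = len(y_log)
        x_next = np.array([y_log[t_next - lag] for lag in lags] + [1.0])
        return float(np.expm1(x_next @ coef))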

t-Digest for One-Dimensional Score Binning

If we look into the output of a model, we often see a score (sometimes many scores). We can get some useful insights if we ask ourselves whether the scores we are producing now look like the scores we have produced previously. For instance, the TensorChicken project produced a list of all the potential things that it could see, such as a chicken or an empty nest, along with scores (possibly probabilities) for each possible object. The scores for each kind of thing separately form a distribution that should be roughly constant over time if all is going well. That is, the model should see roughly the same number of chickens or blue jays or open doors over time. This gives us a score distribution for each possible identification.

As an example, Figure 7-2 shows the TensorChicken output scores over time for “Buff Orpington.” There is clearly a huge change in the score distribution partway across the graph, at about sample 120. What happened is that the model was updated in response to somebody noticing that it had been trained incorrectly, so that what it thought were Buff Orpington chickens were actually Plymouth Rocks. At about sample 120, the new model was put into service and the score for Orpingtons went permanently to zero.

Figure 7-2. The recognition scores for Buff Orpington chickens dropped dramatically at sample 120, when the model was updated to correct an error in labeling the training data.

From just the data presented here, it is absolutely clear that a change happened, but it isn’t clear whether the world changed (i.e., Buff Orpington chickens disappeared) or whether the model was changed.

This is exactly the sort of event that one-dimensional distribution testing on output scores can detect and highlight. A canary model can help us distinguish between the two possibilities.

A good way to highlight changes like this is to use a histogramming algorithm to bin the scores by score range. The appearance of a score in a particular bin is an event in time whose rate of occurrence can be analyzed using the rate detection methods described earlier in this chapter. If we were to use bins for each step of 0.1 from 0 to 1 in score, we would see nonzero event counts for all of the bins up to sample 120. From then on, all bins except for the 0.0–0.1 bin would get zero events.

The bins you choose can sometimes be picked based on your domain knowledge, but it is often much handier to use bins that are picked automatically. The t-digest algorithm does exactly this and does it in such a way that the structure of the distribution near the extremes is very well preserved.
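The following sketch shows the idea with equal-count bins computed from historical scores; a t-digest would give you these quantile boundaries incrementally and with particularly good resolution near the extremes, but plain numpy quantiles stand in for it here.

    import numpy as np

    def quantile_bin_edges(historical_scores, n_bins=10):
        """Bin edges chosen so that each bin holds roughly the same share
        of the historical scores (the kind of boundaries a t-digest can
        supply incrementally)."""
        qs = np.linspace(0.0, 1.0, n_bins + 1)
        edges = np.quantile(historical_scores, qs)
        edges[0], edges[-1] = -np.inf, np.inf     # catch out-of-range scores
        return edges

    def bin_counts(new_scores, edges):
        """Count how many new scores land in each historical bin. Each bin's
        count is an event stream that can be watched with the rate-change
        methods described earlier in this chapter."""
        idx = np.searchsorted(edges, new_scores, side="right") - 1
        return np.bincount(idx, minlength=len(edges) - 1)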

K-Means Binning

Returning to the model change in TensorChicken, we can see that not only did the distribution of one of the scores change, but the relationship between the output scores changed as well. Figure 7-3 shows this.

Figure 7-3. The scores before the model change (black) were highly correlated, but after the model change (red) the correlation changed dramatically. K-means clustering can help detect changes in distribution like this.

A very effective way to measure the change in the relationship between scores like this is to cluster the historical data. In this figure, the old data are the black dots. As each new score is received, the distance to the nearest cluster is a one-dimensional indicator of how well the new score fits the historical record. It is clear from the figure that the red points would be nowhere near the clusters found using the black data points, so the distance to the nearest cluster would increase dramatically when the red scores began appearing. The rates for the different clusters would also change dramatically, which can be detected as described earlier for event rates.
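As a sketch of this approach (using scikit-learn’s KMeans, which is an assumption on my part rather than the tooling used in the book), cluster the historical score vectors and then use the distance from each new score vector to its nearest centroid as the anomaly signal:

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_reference_clusters(historical_scores, k=20):
        """Cluster historical score vectors (one row per request) into k
        reference clusters that describe 'normal' output behavior."""
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(historical_scores)

    def nearest_cluster_distance(clusters, new_scores):
        """Distance from each new score vector to its nearest reference
        centroid. A sustained jump in this one-dimensional signal means
        the relationship between output scores has changed."""
        return clusters.transform(np.atleast_2d(new_scores)).min(axis=1)

The distances themselves can then be binned and watched with the same rate-change detectors, so a change in the joint distribution of scores shows up as an ordinary rate alarm.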

Aggregated Metrics

Important metrics aggregated over short periods of time are a relatively low-impact kind of data to collect and store. Values that are aggregated by summing or averaging (such as the number of queries or CPU load averages) can be sampled every 10 seconds or every minute and driven into a time-series database such as OpenTSDB, InfluxDB, or even Elasticsearch.

Other measurements, such as latencies, need to be aggregated in a way that lets you understand the exact distribution of values that you have seen. Simple aggregates like min, max, mean, and standard deviation do not suffice.

The good news is that there are data structures like FloatHistogram (available in the t-digest library) that do exactly what you need. The bad news is that commonly used visualization systems such as Grafana don’t handle distributions well at all. The problem is that understanding latency distributions isn’t as simple as just plotting a histogram. For instance, Figure 7-4 shows a histogram of latencies in which about 1 percent of the results have values roughly three times higher than the rest. Because both axes are linear, and because the bad values are spread across a wider range than the good values, it is nearly impossible to see that something is going wrong.

Figure 7-4. The FloatHistogram uses variable-width bins. Here, we have synthetic data in which one percent of the data has high latency (horizontal axis). A linear scale for frequency (vertical axis) makes it hard to see the high-latency samples.

These problems can be highlighted by switching to nonlinear axes, and the nonuniform bins in the FloatHistogram also help. Figure 7-5 shows the same data with a logarithmic vertical axis.

Figure 7-5. With a logarithmic vertical axis, the anomalous latencies are clearly visible. The black line shows data without slow queries, while the red line shows data with anomalous latencies.

With the logarithmic axis, the small cluster of slow results becomes very obvious and the prevalence of the problem is easy to estimate.

Event rate detectors on the bins of the FloatHistogram could also be used to detect this change automatically, as opposed to visualizing the difference with a graph.
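To illustrate the kind of binning involved (a simplified stand-in, not the actual FloatHistogram class from the t-digest library), the sketch below uses roughly logarithmically spaced bins so that relative accuracy is constant across several decades of latency:

    import numpy as np

    class LogSpacedHistogram:
        """Latency histogram with roughly logarithmically spaced bins,
        in the spirit of FloatHistogram (simplified stand-in only)."""

        def __init__(self, min_value=1e-3, max_value=10.0, bins_per_decade=20):
            decades = np.log10(max_value / min_value)
            n_bins = int(np.ceil(decades * bins_per_decade))
            self.edges = min_value * 10.0 ** (np.arange(n_bins + 1) / bins_per_decade)
            self.counts = np.zeros(n_bins, dtype=int)

        def add(self, latency):
            """Record one latency sample; out-of-range values are clamped
            into the first or last bin."""
            i = np.searchsorted(self.edges, latency, side="right") - 1
            self.counts[int(np.clip(i, 0, len(self.counts) - 1))] += 1

Plotted with a logarithmic vertical axis, or fed bin by bin into the rate detectors, the slow 1 percent in the earlier example stands out immediately.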

Latency Traces

The latency distributions shown in the previous figures don’t provide the specific timing information that we might need to debug certain issues. For instance, is the result selection policy in the rendezvous actually working the way that we think it is?

To answer this kind of question, we need to use a trace-based metrics system in the style of Google’s Dapper (Zipkin and HTrace are open source systems modeled on it). The idea here is that the overall life cycle of a single request is broken down to show exactly what is happening in each phase, producing something roughly like what is shown in Figure 7-6.

Figure 7-6. A latency breakdown of a single request to a rendezvous architecture shows details of overlapping model evaluation.

Trace-based visualization can spark really good insight into the operation of a single request. For instance, the visualization in Figure 7-6 shows how one model continues running even after a response is returned. That might not be good if computational resources are tight, but letting the model run to completion may also provide valuable information as the gbm-2 model is tuned to make a good trade-off between accuracy and performance. The visualization also shows just how much faster the logistic model is. Such a model is often used as a baseline for comparison or as a fallback in case the primary model doesn’t give a result in time. Latency traces are most useful for operational monitoring.

Data Monitoring: Distribution of the Inputs

Now that we have seen how a few basic tools for monitoring work, we can talk about some of the ways that these tools can be applied.

Before that, however, it is good to take a bit of a philosophical pause. All of the examples so far looked at gross characteristics of the input to the model such as arrival rates, or they looked at distributional qualities of the output of the model. Why not look at the distribution of the input?

The answer is that the model outputs are, in some sense, where the important characteristics of the inputs are best exposed. By looking at the seven-dimensional output of the TensorChicken model, for instance, we are effectively looking at the semantics of the original image space, which had hundreds of thousands of dimensions (one for each pixel and color). Moreover, because our team has gone to considerable trouble to build a model that makes sense for our domain and application, it stands to reason that the output will make sense if we want to look for changes in the data that matter in our domain. The enormously lower dimension of the output space, however, makes many analyses much easier. Some care still needs to be taken not to focus exclusively on the output, even of the canary model, because your models may imitate your own blind spots and thus hide important changes.

Methods Specific to Operational Monitoring

Aside from monitoring the inputs and outputs of the model itself, it is also important to monitor various operational metrics. Chief among these is the distribution of latencies of the model evaluations themselves. This is true even if the models are substantially outperforming any latency guarantees that have been made. The reason is that latency is a sensitive leading indicator for a wide variety of ills, so detecting latency changes early can give you enough time to react before customers are impacted.

There are a number of pathological problems that can manifest as latency issues and are almost invisible by other means.1 Competition for resources such as CPU, memory bandwidth, or memory-hierarchy table entries can cause performance problems that are very difficult to diagnose. If you have good warning, however, you can often work around the problems by simply adding resources while continuing to diagnose them.

Latency is a bit special among the measurements that you might make in a production system in that we know a lot about how we want to analyze it. First, true latencies are never negative. Second, it is relative accuracy that we care about, not absolute accuracy and not accuracy in terms of quantiles; 5 to 10 percent relative accuracy is usually quite sufficient. Third, latency often displays a long-tailed distribution. Finally, for our purposes here, latencies are only of interest from roughly a millisecond to about 10 seconds. These characteristics make a FloatHistogram ideal for analyzing latencies.

An important issue to remember when measuring latencies is to avoid what Gil Tene calls “coordinated omission.” The idea is that your system may have back pressure of some form, so that when part of the system gets slow, it inherently handles fewer transactions, which makes the misbehavior seem much rarer than it really is. In a classic example, if a system that normally does 1,000 transactions per second in a 10-way parallel stream is completely paused for 30 seconds during a 10-minute test, there will be 10 transactions that show latencies of about 30 seconds and roughly 570,000 transactions that show latencies of 10 milliseconds or so. It is only at extremely high percentiles (beyond the 99.99th) that the effect of this 30-second outage even shows up. By many measures, the system seems completely fine in spite of having been completely unavailable for five percent of the test time. A common solution for handling this is to measure latency distributions in short time periods of, say, five seconds. These windowed values are then combined after normalizing the counts in each window to a standardized value. Windows that have no counts are assigned counts that are duplicates of the next succeeding nonzero window. This method slightly over-emphasizes good performance in some systems that have slack periods, but it does highlight important pathological cases such as the 30-second hold.
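A sketch of that windowing scheme (my reading of the description above, not code from the book): keep one latency histogram per five-second window, backfill empty windows with a copy of the next nonzero window, normalize every window to the same total count, and sum.

    import numpy as np

    def combine_windowed_histograms(window_counts, standard_total=1000):
        """Combine per-window latency histograms into a single distribution.

        window_counts: array of shape (n_windows, n_bins), one latency
        histogram per short (say, five-second) window. Windows that saw no
        completions (for instance, during a stall) are backfilled with the
        next succeeding nonzero window so that pauses are not silently
        under-weighted, then every window gets equal weight.
        """
        counts = np.array(window_counts, dtype=float)

        # Backfill empty windows from the next nonzero window.
        for i in range(len(counts) - 2, -1, -1):
            if counts[i].sum() == 0:
                counts[i] = counts[i + 1]

        # Normalize each nonempty window to a standard total, then combine.
        totals = counts.sum(axis=1, keepdims=True)
        nonzero = totals[:, 0] > 0
        counts[nonzero] *= standard_total / totals[nonzero]
        return counts.sum(axis=0)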

Combining Alerts

When you have a large number of monitors running against complex systems, the danger shifts from not noticing problems because there are no measurements available to not noticing problems because too many measurements and too many false alarms are competing for your attention.

In highly distributed systems, you also have substantial freedom to ignore certain classes of problems, since systems like the rendezvous architecture or production-grade compute platforms can do a substantial amount of self-repair.

As such, it is useful to approach alerting with a troubleshooting time budget in mind. This is the amount of time that you plan to spend on diagnosing backlog issues or paying down technical debt on the operational side of things. Now devote half of this time to a “false alarm budget.” The idea is to set thresholds on all alarms so that you have things to look at for that much of the time, but you are not otherwise flooded with false alarms. For instance, if you are willing to spend 10 percent of your team’s time fixing things, plan to spend five percent of your time (about two hours a week per person) chasing alarms. There are two virtues to this. The first is that you have a solid criterion for setting thresholds (more about this in a moment). The second is that you automatically have a prioritization scheme that helps you focus on the squeaky wheels. That’s good, because they may be squeaking because they need grease.

The trick, then, is to have a single knob to turn, in terms of hours per week of alarm time, that reasonably prioritizes the different monitoring and alarm systems.

One way to do this is to normalize all of your monitoring signals to a uniform distribution by first estimating a medium-term distribution of values for each signal (say, over a month or three) and then converting current values to quantiles against that distribution. Then combine all of the signals into a single composite by converting each to a log-odds scale and adding them all together. This allows extreme alerts to stand out, or combinations of slightly less urgent alerts to indicate a state of general light havoc. Converting to quantiles before transforming and adding the alert signals together calibrates all signals relative to their recent history, thus quieting wheels that squeak all the time. Another nice property of the log-odds scale is that if you use base-10 logs, the resulting value is the number of 9’s in the quantile (for large values). Thus, log-odds(0.999) is very close to 3 and log-odds(0.001) is almost exactly –3. This makes it easy to set thresholds in terms of desired reliability levels.
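A compact sketch of that recipe (my rendering of the description, with illustrative names): convert each signal’s current value to a quantile against its own recent history, map the quantile to base-10 log-odds, and add the results.

    import numpy as np

    def combined_alert_score(current, history, eps=1e-4):
        """Combine heterogeneous monitoring signals into one alert score.

        current: dict mapping signal name -> latest value
        history: dict mapping signal name -> recent values (a month or so)

        Each value becomes its empirical quantile within its own history and
        then base-10 log-odds, so that log-odds(0.999) is about 3 and
        log-odds(0.001) is about -3. One extreme signal, or several mildly
        elevated ones, can push the sum over an alarm threshold.
        """
        total = 0.0
        for name, value in current.items():
            past = np.asarray(history[name], dtype=float)
            q = float(np.mean(past <= value))      # empirical quantile
            q = min(max(q, eps), 1.0 - eps)        # keep log-odds finite
            total += np.log10(q / (1.0 - q))
        return total

The alarm threshold on this combined score is then the single knob mentioned above: raise or lower it until chasing alarms consumes no more than the false-alarm budget.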

1. This is a good example of a “feature” that can cause serious latency surprises: "Docker operations slowing down on AWS (this time it’s not DNS)". And this is an example of contention that is incredibly hard to see, except through latency: "Container isolation gone wrong".