Anomaly Detection Using Metrics and Exception Logs | Whiteboard Walkthrough

In this week’s Whiteboard Walkthrough, Ted Dunning, Chief Applications Architect at MapR, will talk about how you can use logs containing metrics and exceptions to detect anomalies in the behavior of a micro-service.

Here is the full video transcription:

Hi, I'm Ted Dunning, and I'm going to talk about anomaly detection, particularly anomaly detection in the realm of operations, how systems are running and finding out when they're not running the way they should. I work for MapR. They're my hat sponsor, but we don't need that today. What I want to do is particularly highlight how standard architectures in a microservice environment can be used to really good advantage to build anomaly detectors which can detect when something is wrong with a large distributed system.

If we review just for a moment here, if we have a microservice and it's purposely opaque, we don't know what it does really, we know that it has what we call technically goes-ins, and we know it has goes-outs. In addition to that, a very standard and very, very good practice in microservices is for this microservice to have secondary streams that come out that carry, for instance, exceptions. Every time some exceptional condition is observed, say the input is malformed somehow, say the machine that this is running on is upset somehow, we should produce an exception on the topic that contains exceptions.

We should also quite conventionally, quite ordinarily be producing metrics. If possible, if it's a large amount of work for each bit of input that comes in, we should produce metrics on each element that's processed. If the elements are small and we do a small amount of work, we should probably pre-aggregate those metrics a little bit, so that we put out a modest amount of data on the metrics channel. We obviously don't want more data coming out here than either comes in or goes out of a high volume system. That would be silly. We do want enough metrics to make it easy to spot problems, and that's usually more than people start with. Whatever level of detail you're putting out in your metrics, the rule of thumb is to bump it up a little bit.
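As a rough illustration of pre-aggregating metrics, here is a minimal Python sketch. The record layout (a timestamp and a latency), the 10-second window, and the function name `aggregate_metrics` are all assumptions made for the example, not something the talk specifies.

```python
from collections import defaultdict

def aggregate_metrics(records, window_seconds=10):
    """Summarize per-record (timestamp, latency) pairs into one row per window."""
    windows = defaultdict(list)
    for timestamp, latency in records:
        # Assign each record to the window that contains its timestamp.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start].append(latency)
    return [
        {
            "window_start": start,
            "count": len(latencies),
            "mean_latency": sum(latencies) / len(latencies),
            "max_latency": max(latencies),
        }
        for start, latencies in sorted(windows.items())
    ]

# Hypothetical records: (seconds since start, latency in seconds).
records = [(0.5, 0.010), (3.2, 0.030), (9.9, 0.020), (12.0, 0.050)]
summary = aggregate_metrics(records)
for row in summary:
    print(row)
```

Four input records become two metric rows here, so the metrics channel carries less data than the main stream while still exposing counts and latencies.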

Once we have that, the key question is how do we actually detect anomalies? How do we detect anomalous operation of this microservice even though it's this black box that we have no idea what it's really supposed to be doing? We don't have visibility inside it. All we can see is what's going in, what's coming out. We don't understand what's going in or out or what the correct relationship really is, and we get some of these metrics out, perhaps exceptions out as well. We have to decide whether or not this black box that we don't really understand is operating correctly.

How do we do that? The key is two things. One is we need to convert whatever we're measuring, whatever particular thing we're looking at, into something that looks like the log of a probability. Two, we have to determine whether or not the log of that probability is anomalous. Anomalous means not normal. Implied in the problem of finding anomalies is finding out what normal might be. We have to observe the system for some time, decide that most of what it does is normal, and then the exceptional cases we can mark as anomalies. If those are common enough, we could learn about them and make them part of the new normal. Normal could include a certain level of malfunction, for instance, and hopefully of course we've built the system well enough so that what is rare and odd is the broken part as opposed to some systems that some of us build where the odd and the unusual is when it's working normally. That's an early stage of a system like this. We're going to look at that, and we're going to look at what normal is.

Let's take first a case where the system has either inputs, or outputs, or exceptions which are events in time. These events happen. It's not really a thing that we can measure, just that it did happen. It might be particular exceptions, particular kinds of inputs, or particular kinds of outputs. Those might be the events in time that we measure. If we do that, the only characteristic of these things that we can talk about is the time that they happen. We're not allowed to understand these events.

If we look at time passing, we have events that happen. It turns out that for a wide and very important class of distributions of events in time, the time since the last event or the time since the nth last event, the fifth last event here, is something that is very, very useful statistically. If these things are Poisson distributed, then the time since the last event, the so-called first order difference in time from now back to the time of the last event, is exponentially distributed, at least assuming we have another event there. That means that we can understand what the distribution is of this time, what the longest time should be if we know the rate of events.

There's a slightly more complex relationship to the nth difference, but whereas the first difference is useful for finding when events stop, the nth difference is useful for finding out when events occur too fast or too slow. The big difference though is with the first difference we get an alarm sooner. Because it's trying to solve a simpler problem, it needs less information in order to solve that. If we look at say the times that exceptions occur or the times that outputs are generated by this system and we look at how long it has been since the last one of those has occurred, we can find out when that has stopped happening. We can build an anomaly detector which if it knows how fast it should be happening can set off an alarm when it doesn't happen. For instance, the output might be ‘I have completed a sale.’ Has the system broken? Is it no longer getting sales? How long is the time since that last one? We can now build a system.

The way we would build that, we would be looking at our event times coming in here, we would take the nth difference there, typically the first difference because we're usually worried about when things stop rather than when things start. We look at the nth difference there and we multiply by an estimated rate. The time since the last event will always be quite small if the rate is high, so when we multiply them the product is of order one. When the rate is low, the times will be large but the rate will be low, and multiplying them again gives us a standardized distribution here. When that standardized output becomes large, we can say we have an anomaly, in particular we have an anomaly which represents the stoppage of events. The situation is a little bit more complex when we take the nth difference because upward and downward excursions of this anomaly score mean different things then.
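The first-difference score described above can be sketched in a few lines of Python. This is a minimal sketch assuming Poisson-distributed events, in which case the time since the last event multiplied by the rate follows an exponential distribution with mean 1. The function names and the example numbers are hypothetical.

```python
import math

def stoppage_score(now, last_event_time, expected_rate):
    """Time since the last event, scaled by the expected event rate.

    Assuming a Poisson process with this rate, the score is exponentially
    distributed with mean 1, whatever the rate happens to be.
    """
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    return (now - last_event_time) * expected_rate

def tail_probability(score):
    """Chance of a score at least this large under the unit exponential assumption."""
    return math.exp(-score)

# Hypothetical numbers: 2 events per second expected, 5 seconds of silence.
score = stoppage_score(now=105.0, last_event_time=100.0, expected_rate=2.0)
print(score, tail_probability(score))
```

Because the score is standardized, the same alarm threshold can be reused for a busy service and a quiet one: 5 seconds of silence at 2 events per second scores the same as half a second of silence at 20 events per second.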

Once we have that anomaly score, we can keep a history of what the values have been in the past, and we can then compute when a value beyond the 99.9th percentile has been found, and that can let us set off an alarm. There's a missing piece still. The missing piece is this rate model. The rate model lets us estimate what the rate should be right now. We can't measure what the rate is right now because there may be an anomaly which is making the rate wrong. We have to use our experience somehow to predict, to retrodict, to nowcast what the rate should be right now. Based on evidence in the past, what should the rate be if everything is working correctly? Here, for instance, is a nominal graph. This is time horizontally and say the count of events in some period, probably minutes, 10 minutes, an hour, something like that. Commonly, the rate goes up, it dips, it goes down, and then it comes back up the next day.

We can predict the current rate very well as it turns out, this is a practical thing, very well by predicting instead the log of the count based on recent counts, recent logs of counts, and distant logs of counts by looking at things 24, 25, and 26 hours ago, or 7 times 24, 7 times 24 plus 1, and so on hours ago. That gives us information about the daily pattern that we should expect and the weekly patterns. By looking at information from an hour ago and two hours ago, that lets us set the scale. The earlier measures from a day or a week ago give us the shape, and the current measurements give us the scale. By combining those log counts into a single prediction, we can predict the thing that I've circled there, which is the current log count. If you look at our book on anomaly detection, you can see how this is done using the traffic to the page for Christmas at Wikipedia. You can see that even training over November, we can do a pretty good job of predicting how things go in December when the page for Christmas becomes very popular.
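A rate model of this kind might look roughly like the following sketch, which fits an ordinary least-squares regression on lagged log counts with NumPy. Several things here are assumptions for illustration: the hourly bucket size, the particular lags (1, 2, 24, 25, and 26 hours, leaving out the weekly ones), the use of `log1p` to avoid taking the log of zero, and the synthetic Poisson traffic with a daily cycle.

```python
import numpy as np

LAGS = (1, 2, 24, 25, 26)  # hours back; an assumed subset of the lags mentioned

def build_design(log_counts, lags=LAGS):
    """Rows of [1, lagged log counts...] and the matching target log counts."""
    max_lag = max(lags)
    rows = [[1.0] + [log_counts[t - lag] for lag in lags]
            for t in range(max_lag, len(log_counts))]
    return np.array(rows), np.asarray(log_counts[max_lag:])

def fit_rate_model(counts, lags=LAGS):
    """Least-squares fit of the current log count on lagged log counts."""
    log_counts = np.log1p(np.asarray(counts, dtype=float))
    X, y = build_design(log_counts, lags)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_count(coef, recent_counts, lags=LAGS):
    """Expected count for the hour right after the end of recent_counts."""
    log_counts = np.log1p(np.asarray(recent_counts, dtype=float))
    features = [1.0] + [log_counts[-lag] for lag in lags]
    return float(np.expm1(np.dot(coef, features)))

# Synthetic hourly traffic with a daily cycle (made up for this example).
rng = np.random.default_rng(0)
hours = np.arange(24 * 21 + 1)
true_rate = 1000 + 600 * np.sin(2 * np.pi * hours / 24)
counts = rng.poisson(true_rate)

coef = fit_rate_model(counts[:-1])            # train on all but the last hour
predicted = predict_count(coef, counts[:-1])  # predict the held-out hour
print(f"predicted {predicted:.0f}, noise-free rate {true_rate[-1]:.0f}")
```

The prediction never uses the count for the current hour, so an anomaly happening right now cannot contaminate the estimate of what the rate should be.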

This is how we build a count model, a rate model. The rate model plugs in here, it multiplies times the nth, typically the first difference since the last event, and that gives us an anomaly. Whenever the anomaly is out of the normal range, we have something wrong, either too fast or too slow. That's how we can look at these metrics and exceptions or the times of events coming in and out of that. That's one way of building an anomaly detector which works really well. Don't tell anybody, but it's really pretty easy to build. You have a little bit of linear regression here to do the rate model, you have a little bit of subtraction, PhD level subtraction, and multiplication to get this value. You can use something called the t-digest to build a histogram of what should be considered big in terms of anomaly, or you can just calculate that it should be exponentially distributed. Either way, you've got a very, very simple system going on here.
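Both ways of deciding what counts as a big score can be sketched briefly. In this sketch, `np.quantile` over a stored history stands in for the t-digest (which would give an approximate quantile without keeping every value), and the analytic cutoff assumes the score follows a unit exponential distribution. The function names and the simulated history are hypothetical.

```python
import math
import numpy as np

def exponential_threshold(quantile=0.999):
    """Score exceeded with probability 1 - quantile under a unit exponential."""
    return -math.log(1.0 - quantile)

def empirical_threshold(score_history, quantile=0.999):
    """Same cutoff estimated from observed scores (exact, not a sketch structure)."""
    return float(np.quantile(score_history, quantile))

# Simulated history of scores from a healthy system (an assumption for the demo).
rng = np.random.default_rng(1)
history = rng.exponential(scale=1.0, size=200_000)

analytic = exponential_threshold()
empirical = empirical_threshold(history)
print(f"analytic {analytic:.2f}, empirical {empirical:.2f}")
```

The analytic cutoff needs no history at all, while the empirical one adapts if real scores turn out not to be exactly exponential.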

You can do some more complex things as well. Some systems don't have time-wise repetitive dynamics. Repetitive dynamics usually occur when we have some input from an audience coming to us, clicks, visits, things like that, that drive traffic up and down, down at night, up in the day. The audience has patterns to it, so we can predict that pretty well. Internally, though, a system might run at somewhat irregular intervals. There might be a job that kicks off or something when data is available or pending some other event, and it might not be at regular intervals. We need to be able to look for anomalies in that as well.

What we can do there is we can take the size of input, the size of output, or any other metrics that we can compute, typically small aggregates, and for a system say that is doing aggregation, or transformation, or other simple operations like that, the size of the input should predict the size of the output, or it should predict the latency of a record, or any number of things like that. We can also build somewhat more complex relationships between multiple variables, and that then gives us some distribution of points in history that we can now label as normal. We can build a model literally that says what is normal, what is the normal relationship between the characteristics of the system when it is working. Then if we measure something that is far from that normal, we can again build an anomaly score.

Here you see, for instance, as input size goes up, output size goes up, but here at this anomalous moment here we have large input and almost no output. Something broke. It's like a pipe burst. Water's going in, but water's not coming out. That's the sort of anomaly we can detect with that. The real key here is not the mathematics, is not the algorithms. These are all fairly simple things to do mathematically.
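The burst-pipe case can be illustrated with a small sketch that learns the normal relationship between input size and output size and scores new points by how far they fall from it. A straight-line fit with `np.polyfit`, the residual standard deviation as the yardstick, and the synthetic history are all assumptions chosen for simplicity. A real system might need a different model of what is normal.

```python
import numpy as np

def fit_normal_relationship(input_sizes, output_sizes):
    """Least-squares line output ~ intercept + slope * input, plus residual spread."""
    x = np.asarray(input_sizes, dtype=float)
    y = np.asarray(output_sizes, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    residual_std = float(np.std(y - (intercept + slope * x)))
    return slope, intercept, residual_std

def anomaly_score(model, input_size, output_size):
    """Distance from the fitted line, in units of the normal residual spread."""
    slope, intercept, residual_std = model
    return abs(output_size - (intercept + slope * input_size)) / residual_std

# Synthetic history: output is roughly half the input (made up for the demo).
rng = np.random.default_rng(2)
inputs = rng.uniform(100, 1000, size=500)
outputs = 0.5 * inputs + rng.normal(0, 10, size=500)
model = fit_normal_relationship(inputs, outputs)

normal_score = anomaly_score(model, input_size=900, output_size=455)
burst_score = anomaly_score(model, input_size=900, output_size=5)
print(f"normal {normal_score:.1f}, burst pipe {burst_score:.1f}")
```

A large input with almost no output lands far from the fitted line and gets a very large score, while a point that follows the usual ratio scores close to zero.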

The key is the hard work of planning ahead and building a microservice architecture that puts out metrics and puts out exceptions. It tells you something about what's going on inside. Once you have that tiny bit of visibility, once the black box turns white, you can detect anomalies and you can set off alarms naturally that tell you when the system is not normal, i.e., broken. That's the idea of anomaly detection in the context of a microservice architecture, a very, very powerful concept. Thank you very much.

This blog post was published January 31, 2017.