April 01, 2015 | BY Ted Dunning
Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, gets you up to speed on anomaly detection, a simple and easy to implement technique to "figure out stuff that just happened but shouldn't have".
Here's the transcript:
Hi, this is Ted Dunning at Strata. You can see it happening behind us, it's happening all around us, but I want to talk a little bit about anomaly detection. This is a really powerful technique and it can be implemented really easily, so let's talk about it. Lose the hat.
Okay, so anomaly detection is figuring out stuff that just happened that shouldn't have. That somehow is abnormal, is stuff that you don't really want to have happened that's out of character. In particular right here what I'm going to talk about is how you can tell whether or not there's an anomaly, and how often events are happening in time.
These events very often might be something like a web hit, or a transaction, or a sale, or something like that. These are events in time. Here's a picture. See, time is going across like that. These events occur at irregular times. It's irregular because all of the sources of events are different people, different machines, different things, that are uncorrelated. Now they come in at times that they decide and they decide independently so to us it looks random.
The time between two events is a key factor here, and I'm going to talk first of all about the special case of detecting when these events stop. I want to detect that as soon as possible. You can't just say that any particular time interval means they've stopped. It might just be a bit longer than you've seen before, it could be a lot longer than you've seen before, but there's always some small chance that it was just a momentary lull, a quietness in a café, that sort of thing, so what you need is a probabilistic model, and we use what it use the Poisson model. This governs events in time that are independent.
If we know about the Poisson we know that the time between successive events is exponentially distributed, with a parameter that is the rate at which the events occur on average per unit of time. Exponential means that short intervals are more common than long intervals, but the probability never goes to zero. There's nothing special about this other than it describes how uncorrelated things happen. Now that means if we take the logarithm of this interval, it is uniformly distributed, and that's really cool, because that means that we can do some really nice numbers against it.
Now the problem here, the remaining issue is that there's a rate in here. The average rate has to be estimated somehow. We have to have some idea what rate we expect right now, and if you think about it most of these sorts of events have some sort of diurnal or weekly variation in rate. That means that we have to see this changing rate. It might be busy during the day and then quiet at night. Perhaps not quite zero because it might be another global market that's being served, but up and down every day, possibly ten to one. What we can do is we can take samples from history, perhaps 24 hours ago exactly, perhaps one hour ago and two hours ago, because that gives us kind of a rate at which things are increasing or decreasing.
One day ago gives us the average daily shape. We can combine these by multiplying them by factors, I'll talk about that in a moment, and add those up to get a predicted rate. We might actually predict the logarithm of the actual rate, because we're talking about percentage changes, but this linear combination is a very, very effective way to do that prediction.
Now one of the really cool things is you can derive these constants using linear regression. Now you want a form of linear regression that is biased toward omitting variables, toward throwing things away, toward settings these parameters to zero, but it's built in to Python's NumPy, there's a lot of different implementations of this linear regression with so called regularization. Regularization is a process which prefers simple models. Once you have the rate predictor, using probably some linear model of the log of a number of events, you can generate a surprise statistic, an anomaly score, using the log of the time between these scaled by rate, and sham, you've got an anomaly detector.
All that's left now is to decide how do you know when there's a really big anomaly score? That really big anomaly score means bigger than has been seen in the past in the high percentiles. I'll talk about that in another video on the t-digest. That's an optimal way to estimate these percentiles, so there you go. That's anomaly detection in one whiteboard – thanks!