Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, explains how long tail distributions distort the appearance of data and how to detect them.
Here's the unedited transcript:
I'd like to talk about long tail distributions and some of their practical implications: how they can fool you, how you can fool them, how you can catch the long tail. Let's talk first about a class of phenomena that behave the way you might expect, and that behave the same way no matter what scale you're looking at.
Here is a random walk, the idea is that at each time step the signal goes up or down by a random amount. The random amount happens to be normally distributed here. It could be distributed in lots of different ways and the random walk would look about the same. Here is another, and another and another. You see that there's a kind of a feeling that you get about the shape of these things. They look all roughly of a kind.
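A minimal Python sketch of the walk described here, assuming standard normal steps and 1,000 time steps per walk. The function name `gaussian_walk` and the seeds are illustrative choices, not anything taken from the talk.

```python
import random


def gaussian_walk(n_steps, seed=None):
    """Return the path of a random walk with standard normal steps."""
    rng = random.Random(seed)
    position = 0.0
    path = [position]
    for _ in range(n_steps):
        # At each time step the signal moves up or down by a random amount.
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path


# Four independent walks, like the four curves shown one after another.
walks = [gaussian_walk(1000, seed=s) for s in range(4)]
for i, walk in enumerate(walks):
    print(f"walk {i}: final={walk[-1]:8.2f} min={min(walk):8.2f} max={max(walk):8.2f}")
```

Plotting each path against its index gives curves of the kind described above. The exact values differ from walk to walk, but the overall character is the same.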
But now watch, I'm going to magnify our scale by a factor of 10 and then 100, and you'll see that they look about the same. Here's 10X and again, and again, and again. Notice they look a lot like the original 1X size did. We magnify by another 10X, there's 100X and again, and again, and again. You can see it again: the horizontal scale has now changed by a factor of 100, and the vertical scale has changed by a factor of 10.
The shape, which is the spirit of these curves, stays the same. This is called scale invariance, and this sort of invariance with respect to scaling and sizing is a very very important mathematical property, it is the basis of a large amount of physics. This is how, for instance, universal transition to chaos is computed, by this same process of size re-normalization.
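The 100X horizontal, 10X vertical relationship can be checked numerically. The sketch below is one way to do it, assuming standard normal steps: it measures the spread of the final position over many walks at two time scales 100X apart. The step counts, the number of walks and the helper name `endpoint_spread` are arbitrary choices for illustration.

```python
import random
import statistics


def endpoint_spread(n_steps, n_walks, rng):
    """Standard deviation of the final position over many Gaussian walks."""
    ends = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_steps))
        for _ in range(n_walks)
    ]
    return statistics.pstdev(ends)


rng = random.Random(0)
base_spread = endpoint_spread(20, 400, rng)      # the "1X" time scale
zoomed_spread = endpoint_spread(2000, 400, rng)  # time scale stretched 100X
ratio = zoomed_spread / base_spread

print(f"spread at 1X:   {base_spread:.2f}")
print(f"spread at 100X: {zoomed_spread:.2f}")
print(f"vertical growth: {ratio:.1f}X (the square root of 100 is 10)")
```

For steps with finite variance, the spread grows with the square root of the number of steps, which is why stretching time by 100 stretches the vertical axis by only about 10.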
In the chaos case, the re-normalization goes to smaller and smaller intervals. But it's the invariance with respect to size that gives us key aspects and key intuitions into the whole process. Now, not everything behaves that way. Take what's called a long tail distribution. The normal distribution is not long tailed, it is short tailed; it has tails which are well behaved mathematically.
Here is a long tail distribution. You can see that it sits still and then takes large jumps. It is exhibiting small steps in-between the large jumps and here is another and another and another. They all exhibit this pattern of random walk in-between very large jumps. Those large jumps are on the tail of the size of that distribution and you can see that they qualitatively look very very different than the normally distributed random walk.
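To see how different the two kinds of walk are, one rough measure is how much of the total movement comes from the single largest step. The sketch below assumes Cauchy steps for the long tail walk. The Cauchy is Student's t with one degree of freedom; the talk names the t distribution later but gives no degrees of freedom, so that choice is a guess. The helper names are hypothetical.

```python
import math
import random


def cauchy_step(rng):
    """One draw from the standard Cauchy (Student's t with 1 degree of freedom)."""
    return math.tan(math.pi * (rng.random() - 0.5))


def largest_step_share(steps):
    """Fraction of the total absolute movement contributed by the biggest step."""
    sizes = [abs(s) for s in steps]
    return max(sizes) / sum(sizes)


rng = random.Random(1)
n_steps = 1000
normal_steps = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
heavy_steps = [cauchy_step(rng) for _ in range(n_steps)]

normal_share = largest_step_share(normal_steps)
heavy_share = largest_step_share(heavy_steps)
print(f"normal steps:    largest step is {normal_share:.1%} of all movement")
print(f"long tail steps: largest step is {heavy_share:.1%} of all movement")
```

With normal steps no single step matters much. With long tail steps a handful of jumps dominate, which is the "sits still, then jumps" look described above.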
Random processes that are long tailed have a different feel, a different mathematical intuition behind them than the normally distributed sort of things, and much of classical physics is based on this idea of short tail distributions such as the normal distribution. Now, I'm going to show you an example of how these popped up in real life: how one of our customers ran into a long tail distribution, and how it initially led to some very very wrong conclusions which very much obscured the problem.
Here, on this larger panel, is an example of data plotted across time. Until we get to roughly right here, the samples are quite small, and then where time equals roughly zero, it appears that they spike in magnitude dramatically, and then they subside again. But that's not happening at all. The distribution at every point in time is the same.
What's happening is the rate of sampling is changing. Near time zero, we're sampling at a much higher rate than at time minus five or plus five. What that does is it allows the long tail distribution an opportunity to get a couple of those big samples in there, so just like the random walk would take large steps occasionally, this process gives large values back occasionally, and when you sample it a lot near zero, we get some large samples. When you don't sample it very often you could get a large sample but you probably won't.
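This effect is easy to reproduce. The sketch below assumes a Pareto distribution as a stand-in long tail (the talk does not say which distribution produced the panel) and a sampling rate that peaks sharply at time zero. The tail index `ALPHA` and the sample counts are made up for illustration.

```python
import math
import random

rng = random.Random(2)
ALPHA = 1.5  # assumed tail index; smaller values mean a heavier tail

window_count = {}
window_max = {}
for t in range(-5, 6):
    # The sampling rate peaks at t = 0; the distribution itself never changes.
    n_samples = 10 + int(100_000 * math.exp(-t * t))
    samples = [rng.paretovariate(ALPHA) for _ in range(n_samples)]
    window_count[t] = n_samples
    window_max[t] = max(samples)
    print(f"t={t:+d}  samples={n_samples:6d}  largest={window_max[t]:10.1f}")
```

Every window draws from the identical distribution, yet the largest value seen tends to grow with the number of samples, so a plot of the raw points looks like a spike in magnitude near zero.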
What happened here was one of our customers had a system which exhibited very much this pattern. The latency, the response time of the system, appeared to spike every morning. But what was happening, in fact, was that the number of samples being taken, the number of transactions being observed, would spike very much in the morning, and with lots and lots more transactions than at other times of the day, the extreme values of latency were being observed.
Now, there's a couple of tricks you can use to find these sorts of situations. First, we have to remember that long tails are different. Long tail distributions behave very differently than classical distributions. They're often deceptive, as indeed this one was, and the key to finding this was looking at percentiles. The median, the 50th percentile, was unchanged as we went through the day. The 90th percentile was unchanged as we went through the day, even the 95th percentile was unchanged.
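A sketch of that percentile check, assuming the same long tail distribution in a quiet period and a busy period and only the number of observations differing. The Pareto distribution, the sample counts and the helper name `summarize` are illustrative assumptions, not the customer's data.

```python
import random
import statistics


def summarize(samples):
    """Return the 50th, 90th and 95th percentiles plus the largest value."""
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "max": max(samples)}


rng = random.Random(3)
ALPHA = 1.5  # assumed tail index
quiet = [rng.paretovariate(ALPHA) for _ in range(300)]      # few transactions
busy = [rng.paretovariate(ALPHA) for _ in range(200_000)]   # morning rush

quiet_stats = summarize(quiet)
busy_stats = summarize(busy)
for name in ("p50", "p90", "p95", "max"):
    print(f"{name}: quiet={quiet_stats[name]:10.2f}  busy={busy_stats[name]:10.2f}")
```

Up to sampling noise, the median and the 90th and 95th percentiles should roughly agree between the two periods, while the largest observed value is typically far bigger in the busy period.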
What changed was the extreme percentiles. Those are the tails, those are the long tails in a distribution like this. Now, we can produce a model. This isn't necessarily the correct model in any particular situation, but I can show you how this sort of thing could occur in real life, a real mechanism that this could have come from.
If we wipe that away a little bit, what we could have is a data source with very very high data rates, going through many channels to the destination. If one of these channels misbehaved under load, if it could not handle the full traffic load and suddenly started backing up and clogging, only under high load situations or something like that, that would affect only a small fraction of these transactions. In fact, if all of these things have different characteristics, we can arrange to see what's called a mixture distribution of response times out here, which then can give us this exact sort of long tailed behavior.
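A toy version of that model, with every number invented for illustration: 50 channels, exponentially distributed response times averaging 10 ms, and one channel that slows to a 500 ms average when it clogs. None of these figures come from the customer's system.

```python
import random
import statistics

N_CHANNELS = 50        # hypothetical number of parallel channels
FAST_MEAN_MS = 10.0    # hypothetical mean response time of a healthy channel
SLOW_MEAN_MS = 500.0   # hypothetical mean response time of the clogged channel


def response_time(rng, clogged):
    """One response time in milliseconds from a randomly chosen channel."""
    channel = rng.randrange(N_CHANNELS)
    if clogged and channel == 0:
        return rng.expovariate(1.0 / SLOW_MEAN_MS)
    return rng.expovariate(1.0 / FAST_MEAN_MS)


def percentiles(samples):
    """Return the 50th, 95th and 99th percentiles."""
    cuts = statistics.quantiles(samples, n=100)
    return cuts[49], cuts[94], cuts[98]


rng = random.Random(4)
healthy = [response_time(rng, clogged=False) for _ in range(100_000)]
mixed = [response_time(rng, clogged=True) for _ in range(100_000)]

healthy_p50, healthy_p95, healthy_p99 = percentiles(healthy)
mixed_p50, mixed_p95, mixed_p99 = percentiles(mixed)
print(f"all channels healthy: p50={healthy_p50:.1f} p95={healthy_p95:.1f} p99={healthy_p99:.1f}")
print(f"one channel clogged:  p50={mixed_p50:.1f} p95={mixed_p95:.1f} p99={mixed_p99:.1f}")
```

Because only about 2% of transactions pass through the clogged channel, the median barely moves, while the 99th percentile is dominated by the slow channel. That is the signature of a mixture: the bulk looks healthy and the tail does not.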
Mixture distributions are often the way that these happen: we have some process that is changing the average latency or the average frequency of something, and then we see a mixture of multiple processes. The long tail random walk that we saw earlier used a t distribution, and a t distribution is a mixture, an infinite mixture of normal distributions with different standard deviations.
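The statement about the t distribution can be made concrete. In the sketch below, each sample first draws a random variance and then draws a normal value with that variance. Choosing the variance as the degrees of freedom divided by a chi-square draw yields Student's t. The 3 degrees of freedom and the threshold of 4 are arbitrary choices for illustration.

```python
import math
import random


def t_via_normal_mixture(df, rng):
    """Draw from Student's t by first drawing a random standard deviation."""
    # variance = df / chi-square(df); chi-square(df) is Gamma(shape=df/2, scale=2).
    variance = df / rng.gammavariate(df / 2.0, 2.0)
    return rng.gauss(0.0, math.sqrt(variance))


def tail_fraction(samples, threshold):
    """Fraction of samples whose absolute value exceeds the threshold."""
    return sum(abs(x) > threshold for x in samples) / len(samples)


rng = random.Random(5)
DF = 3       # assumed degrees of freedom
N = 50_000
mixture = [t_via_normal_mixture(DF, rng) for _ in range(N)]
plain = [rng.gauss(0.0, 1.0) for _ in range(N)]

mixture_tail = tail_fraction(mixture, 4.0)
plain_tail = tail_fraction(plain, 4.0)
print(f"mixture of normals: {mixture_tail:.4%} of samples beyond 4")
print(f"single normal:      {plain_tail:.4%} of samples beyond 4")
```

Every individual draw is normal, but mixing over many standard deviations puts far more mass in the tails than any single normal distribution does.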
The response times that we saw could, in fact, be response times taken from a mixture distribution. That's one way that this could have arisen. The moral is don't forget about long tails. They abound in the real world, and when they appear you may be deceived by them unless you're ready. Look at percentiles, especially the high percentiles; that's where you'll find the evidence of the long tail. Thanks.