August 12, 2015 | BY Ted Dunning
Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, shows you how to find out what's popular tomorrow by using Zipf's Law.
Here's the unedited transcription:
Hi, I want to talk about a problem that comes up all the time in trying to analyze and understand things that occur at variable rates. The problem is what's hot and of course, what's not. The problem is to determine not what's popular. This is typically in media or web pages or all kinds of things where you have lots of different kinds of things that occur or interacted with at different rates, so the problem is not to find what's popular. That's the frequency that something has, that's easy. That's also yesterday's news. What we want to do is find out what's going to be popular tomorrow, not yesterday. We want to know what's hot, what's rising in popularity.
One very, very cool aspect of this is we get to use something called Zipf's Law. Zipf's Law is one of the most amazing coordinates in the universe that's ever been noticed but people complain because it's not very useful. Today it is. The idea is that the most common, the largest, the most popular things are relatively rare and the least common, the least populous, the least popular are much more common.
This was noticed in the 40s by this guy named Zipf. It's an amazing name right there, it sounds like science fiction right off the bat. He noticed it with city sizes. There are very very few of the largest cities, there are very, very many of the smaller cities. He also noticed it with word lengths, word frequencies. In fact, he began to notice it everywhere. It is indeed almost a ubiquitous property of things we observe that have frequencies. This is true for content items on the web, or in all kinds of situations.
The idea is that the frequency of something is inversely proportional to it's rank when we plot this on a log-log scale. That's logs, and that's logs. Now, if we were to plot it without the log scale, we would barely be able to distinguish it from the axes. It would come down so sharply and go out so flatly, but on the log-log, we see this characteristic linear presentation. This relationship between rank and frequency is key to finding what's hot.
What we want to find is the things that are changing in rank, changing in a rank by some proportion. Something that goes from tenth to fifth is reasonably said to be about as significant as something that goes from hundredth to fiftieth. Certainly, going from tenth to fifth is much more exciting than going from hundredth to ninety-fifth. It can't just be how many places in rank we change, it has to be something like how much of the rank did we change.
Proportional change in rank is the intuition that we're going to go after. Rank is proportional to one over frequency, that's what we saw over there. There's some element of proportionality, some constant there. That's the power law exponent, but we'll see that that doesn't really matter in just a moment. What we can do, we could measure rank. The way we would measure rank of course is we would count the frequency of everything, sort it, and then look at the rank, but that's an inefficient operation because it needs the data for everything to compute anything. What we'd much rather have is the ability to say how hot is something just based on it's own statistics. It's own statistics are it's own frequency. Instead of the ratio of ranks, we can talk about hotness being proportional to the log of the ratio of frequencies.
Here, these are the ranks, this is proportionally the rank of the old measurement, here is the rank of the new measurement. How much that changes is how hot this thing is. The constants here, the alphas, can go away because we're just dividing these logs. We can keep them if we'd like or we can ignore them if we'd like. What we wind up with is a very simple formula here where hotness is the log of the new frequency, plus some fudge factor, divided by the old frequency plus the same fudge factor. The fudge factor has actually got a deep philosophical and mathematical rationale underneath it. For the moment, you can just look at it as a way to avoid ever dividing by zero.
What it also does is says that things that go from zero frequency to two, times in a month say, are not infinitely exciting, they are moderately exciting. You have to get some minimal frequency before you rise above the significance that "k" gives you right off the bat. If you have zero before and zero after, you'll get a ratio of one, which gives you zero hotness, if that's what you'd like. If your frequency stays the same in two succeeding time periods, you'll also get zero hotness. If it goes up by a lot and if the frequencies are much larger than "k", then you'd be truly hot. That's how we use Zipf's Law to calculate hotness. It's really pretty easy. This is amazingly effective and very simple to do. Thanks!