**18 min read**

Ted Dunning, Chief Applications Architect for MapR, presented a session titled: “Some Important Streaming Algorithms You Should Know About” at the Spark Summit 2015 conference held in San Francisco. During the session, he highlighted some newer streaming algorithms such as t-digest and streaming k-means. This article is adapted from his talk.

**Streaming algorithms have to be approximate**

The key thing about streaming algorithms is they have to be approximate algorithms. There are a few things that you can compute exactly in a streaming fashion, but there are lots of important things that you can't compute that way, so we have to approximate.

Most important aggregates can be approximated online. Many of these approximate aggregates can be computed online and all the algorithms mentioned here are that way.

There are two tricks that I'm going to highlight here. One is * hashing,* which turns a beautiful identity function into hash. The second trick is

Let's talk very quickly about HyperLogLog. The basic idea is if I have n samples that are hashed and inserted into a [0, 1) interval, those n samples are going to make n+1 intervals. Therefore, the average size of the n+1 intervals has to be 1/(n+1). By symmetry, that means the average distance to the minimum of those hashed versions is also going to be 1/(n+1). In addition, duplicates values will go exactly on top of previous values, so they won't change the picture at all, so the n that we're talking about is the number of unique values we have inserted. That means that if we have ten samples, the minimum is going to be right around 1/11. I said .1 on the slide here for brevity, but you get the idea.

Here it is as a much larger aggregate. Here we have the same 100 samples, and I've hashed them 10,000 times. This gives lots of different values for the minimum hashed result (and different samples have the minimum hashed value in different trials), but key observation is that the mean value there is 0.0099, or 1/101. That tells us how many unique values there were. What we need to do is store many, many hashed minimum values and average them. HyperLogLog does exactly this, and it accomplishes this by cleverly storing the minimums in a very small space. So the result is that you get a sketch that helps you compute a very useful aggregate. Remember that the Hyper-log-log is good for computing the number of distinct values.

**Count min sketch**

Count Min Sketch is another, different algorithm, where the goal is to know the frequency of popular items, but we don't have enough room for a full table of counters for all possible items. The basic idea is we can hash each incoming item several different ways, and increment a count for that item in a lot of different places, one place for each different hash. Clearly, because each array that we use is much smaller than the number of unique items that we see, it will be common for more than one item to has to a particular location. The trick is that for the any of most common items, it is very likely that at least one of the hashed locations for that item will only have collisions with less common items. That means that the count in that location will be mostly driven by that item. The problem is how to find the cell that only has collisions with less popular items.

The answer is that when we try to find out the count for this popular item, we look in all the places it hashed to and take the minimum count that we find in any of them, hoping that no more popular item collided with it in all of those locations. If at least one of those has only less popular items, we'll get a pretty good count. Because we take the minimum of all of the counts that we find, we know that we have found the least impacted count. That's the Count Min Sketch which is a kind of a sketch that is really useful for counting number of occurrences of the most popular items we have seen.

We’ve discussed two sketches so far. Leaky counters are another interesting thing. This graph, which is actually very hard to see because it drops so rapidly, is a count of word frequencies in English. This is what Zipf's law looks like if you don't do a log-log scale, but the key point about this is because it drops so fast, the frequency of anything that's rare is very, very low compared to the things that are more common. This is the same observation that makes the count-min sketch work, but it is also what makes the leaky counter idea work as well. This kind of distribution of counts means we can keep a table of counters for just the things that have the highest count, then periodically cull the ones that have the smallest counts. The advantage of the leaky counter over the count-min sketch is that the leaky counter remembers *which* items were the most popular as well as approximating their count. The count-min sketch tells us the approximate count for any item we ask about and gives good answers for popular items, but it doesn’t remember which items we should ask about.

If we use this trick of culling counters that have small counts, eventually some of them are going to reappear in our input and thus get put into the table. At that point, we won’t remember what their count was when we discarded them, so what we can do is we can reinsert them with the average count based on how long we have been running the algorithm and thus what the counts probably are for things we have forgotten. The cool thing is that we will tend to forget things that are rare so that the counter table can stay small. Things will drop out of it, but when they come back, we know about how much count they should have on average and we know how much error there is based on what the minimum count is in the table; it’s a clever, clever trick. So this abbreviated counter table is effectively a sketch that helps us remember which items are popular and how many times we have seen them.

**Streaming _k_-means**

Now, here we move to algorithms that are not very well known. The first of these little known algorithms is streaming _k_-means. The problem to solve is that k-means clustering requires multiple tries to get a good clustering and each try involves going through the input data several times. The graph above shows what can happen if you don’t do this. What happened there is that the initial conditions were chosen with two points in the upper left cluster which resulted in a poor clustering that can’t be fixed with k-means iterations. Now, to avoid confusion, what I am talking about here is really streaming k-means, not the kind that's in MLlib. MLlib has a streaming k-means assigner; it doesn't do the clustering in a single pass, and it doesn't do the clustering in an online fashion. This algorithm will—it will do what is normally a multi-pass algorithm in exactly one pass. Now the problem in k-means, typically, is that you wind up with clusterings like in the graph above. Because of bad initial conditions, you will split some cluster and some other two clusters will be joined together as one. You can see red and yellow here are splitting that upper-left cluster, and blue is sharing the two clusters in the middle—this is very typical. This is why you restart k-means. K-means is not only multi-pass, but you often have to do restarts and run it again. Even worse, with complex data in many dimensions, and with many clusters, this kind of failure becomes almost inevitable so even with many restarts, you get bad results.

But if we could come up with a sketch (a small representation of the data) that would avoid this sort of problem , then we could do the clustering on the *sketch*, not on the data. Even better, if we can create the sketch in a single fast pass through the data, we have effectively converted k-means into a single pass algorithm. The graphic above shows the picture of some data (that's the red, green and the blueish dots). Superimposed on that is a k-means clustering of that set of data, except that I've put in way more clusters in the that clustering than I would like to have in the final clustering result. This clustering with too many clusters is the idea behind the streaming k-means sketch. The point is that all of the actual clusters in the original data have several sketch centroids in them, and that means that I will almost always have something in every interesting feature of the data, so I can cluster the sketch instead of the data.

The other interesting thing is that you can do a very slimy job of clustering to get the sketch centroids. You can use an approximate distance, and you can do an approximate search and slam those babies in there. You can also adapt how many you need on the fly. It's really kind of a cool, one-pass algorithm to get the sketch.

The sketch can represent all kinds of amazing distributions if you have enough clusters. This spiral, shown above, can be approximated with 20 clusters as shown above. So any sort of clustering you'd like to do on the original data can be done on the sketch.

**Clustering Summary**

- Sketch is just lots of clusters
- Sketching can be done very approximately in one pass
- High quality clustering of the sketch is a high quality clustering of the data

That's what's behind the proof, and that's what allows us to take a one-pass through the data with a very, very fast algorithm. That gives me a high quality sketch at the end with many clusters compared to how many I want in the end and that, then, is what I cluster in memory on one machine—piece of cake! Sketching, again, is the key here. Very often, with high dimensional clustering, you also want to hash down to a lower dimension, especially when building the sketch.

**_T_-digest**

The final algorithm I want to talk about is t-digest. The idea here is that I want to compute the quantiles of a vast number of samples. In fact, I may want to keep millions of these quantiles for different kinds of subsets of everything I'm looking at. It’s an OLAP cube sort of arrangement, but for all the distributions. I'm going to get an approximation of the entire one dimensional distribution of my data.

The basic idea here is that we can do an adaptive k-means algorithm in one dimension. The idea is we find the nearest cluster. If that cluster is too full, we'll start a new cluster. If it's not too full, we add the new point to the existing cluster. Now, the cool thing about that is we can choose the meaning of “too full” to give us very, very cool accuracy bounds on our estimate of the quantile. If we choose to have small centroids (small clusters, near Q=0 and Q=1), what we can do is bound the accuracy, so the accuracy is relative to the distance to the nearer end.

If you think about it, the 50th percentile (plus or minus .1 percent), it's still pretty much the medium, but the 99.99th percentile (plus or minus .1 percent) isn't the 99.99th percentile at all. We want a much finer accuracy at the ends with lots of nines and lots of zeros. With 99.99 percent, we probably want .001 or less error. So we want relative accuracy, not absolute accuracy. Small clusters and big clusters let us trade that off and have the smallest distribution in memory.

The way we do that is that we can interpolate in the quantiles space, versus back to the sample space. The X-axis here is the sample space and you can see that's a piece wise linear approximation of the cumulative distribution function. You can see how linear interpolation there follows the actual distribution quite nicely.

Now we might want to turn this a little bit. The way that these sizes are limited in the t-digest is we translate from q (which is now on the horizontal axis) to a centroid number. We do that by this non-linear curve. What that does, because it's steep near the ends, is the centroids are close to together in q space near the ends, so they have to be small. They're far apart in the middle, so they can be large. That means our accuracy is very, very fine at the ends and course in the middle.

As we're inserting new samples in there, we determine for every cluster whether the k-value at the beginning of the cluster, before the cluster and the k-value after (the k2 and k1 in the graph) have a difference of greater than one or not. If the difference is less than one, then the cluster is small enough to have a sample added.

The size bound on this data structure follows directly from the form here— the arcsin function that we use there and the accuracy on real data. These are two kinds of data: uniform and Gamma 0.1-0.1. The Gamma function is so disturbed that the mean is 1, but the median is 10 to the -6th. The .1 percentile is 10 to the -30th. So it's very, very skewed, but you can see the accuracy is virtually the same as in the uniform case. And near the ends, at .001 or 99.9 percent, the accuracy is within a few parts per million. This is with a finite data structure at about 200 nanoseconds per sample.

- Hashing and Sketching
- Hyper log log = count distinct
- Count min = count(s)
- Streaming k-means
- Quantiles via t-digest

In summary, hashing and sketching are two important tricks for streaming algorithms. HyperLogLog is one well-known algorithm for count distinct. The count-min sketch is a good one for doing counts, and there’s also the leaky counter for the heavy hitters. Truly streaming k-means is done with sketches and the quantiles via the t-digest algorithm for quantile summaries.

**Take Advantage of Free, On-Demand Training**

I'd like to point out that we have free online Hadoop training that includes helpful information on streaming algorithms.

**Short Books by Ted Dunning & Ellen Friedman**

- Published by O'Reilly in 2014 and 2015
- For sale from Amazon or O'Reilly
- Free e-books currently available courtesy of MapR

In addition, I have a series of short books (written with my co-author, Ellen Friedman) on practical techniques. The first book, titled “Practical Machine Learning: Innovations in Recommendation” sheds light on a simplified approach for building effective recommender systems. The second book is about anomaly detection, titled < a href="/practical-machine-learning-new-look-anomaly-detection" target="_blank">“Practical Machine Learning: A New Look at Anomaly Detection.” The third book, titled “Time Series Databases: New Ways to Store and Access Data” shows you effective ways to collect, persist, and access large-scale time series data for analysis. The final book, titled “Real World Hadoop,” examines several use cases of the Hadoop ecosystem, including Spark.

**Want to learn more? Check out these resources:**

- Apache Spark Streaming
- Official Apache page for Spark Streaming
- Download Sandbox for Hadoop
- GitHub - MapR
- MapR Developer Central

This blog post was published August 17, 2015.

Share

Share

Share