I had the chance to co-present recently with my friend Chao Yuan, of American Express at the New York City Open Analytics Meet Up last month. We discussed lightning-fast nearest neighbor algorithms and associated clustering capabilities. The clustering components are particularly interesting in terms of how much they speed up conventional algorithms.
The overall topic was k-nearest neighbor (knn) models which are just about the simplest kind of model possible. Typically, knn models are considered infeasible at large scale because they involve enormous amounts of computation. There are algorithms to accelerate these searches, but at very large scale, the data preparation for these search algorithms becomes prohibitive. One simple algorithm for scaling search involves clustering the data ahead of time, but clustering vast amounts of data isn’t feasible either. Or at least, it wasn’t until recently.
The knn project uses the Mahout math library and Hadoop to speed up k-means clustering to the point that knn can be usefully applied to real problems.
This is a big deal.
It is a big deal because it has the potential to be a fundamental change in what is possible when building models.
By fundamental, what I mean is that the new clustering algorithms speed up clustering on really big data so much that they are essentially a new capability. Every time an algorithm changes more than a thousand fold it really needs to be viewed as fundamentally different from a practical perspective. The new clustering algorithm in the knn package is about 30,000 times faster than conventional k-means and has even a bigger advantage over the current Mahout implementation because of the cost of iteration in Hadoop’s map-reduce. That’s like going around the world twice in the same time it takes to drive to the corner store and back. When applied to big data mining, this is going to change clustering into something it has never been before.
It isn’t clear exactly where this is going to lead. It is clear that new horizons are opening up for these techniques. Ping me for more information if you want to try these new implementations out.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.