Ted Dunning, MapR's Chief Applications Architect, recently presented an invited talk titled "Which Algorithms Really Matter?" at the CIKM conference in San Francisco on October 30th, and it's generated a lot of discussion. In less than a week after the talk, over 4,500 people viewed the slides posted online.

Why is there so much interest? One reason is that people increasingly want to go beyond large-scale data storage and explore applications at scale. They want to know how to get the best value from their data, and how best to do big data analytics, including machine learning, in ways that are accessible, productive and cost-effective in business settings.

For success in big data analytics, choosing the right data is paramount. Data extraction and feature engineering generally consume a large portion of a project's development time, and if done properly, that time is well spent. But even with good data, you still need effective algorithms that will produce the results you're after in real-world situations. While there is a huge range of cleverly constructed algorithms to choose from, some are more practical than others for getting successful business results.

In his "Which Algorithms Really Matter?" talk, Ted offered his thoughts not only about which particular algorithms he recommends but also guidelines for how you should make your own choices. During a Twitter discussion after Ted's talk, one person drew particular attention to a slide that listed five characteristics important for machine learning algorithms to be of value in practical settings.
Excerpt of tweets discussing Ted's talk "Which Algorithms Really Matter?"
Let's look a little deeper into the ideas from this slide and why it resonates so strongly with people.
A key to success is bridging the divide between theory and practice. I talked to Ted about the criteria he listed on slide 7 for choosing practical algorithms, and here is what I learned.
A flash of creativity may result in a cleverly designed prototype, which is really cool, but for it to have business value you must capture that genius in a form that scales. A shooting-star sort of approach is spectacular, but in the end what matters is to build or choose an algorithm and application design that "goes to work every day." Often it's not the elaborate design that wins; streamlined design and simplicity in algorithms can be the right combination for success. But whatever version of Hadoop you use, remember that the best algorithms for large-scale analytics are those that scale well enough for successful deployment.
A handy frame of mind in planning a machine-learning project is to imagine you are "hiring" the algorithms you will use – and that you've posted the job announcement:
An algorithm that succeeds in production must be able to tolerate some degree of rough handling. For example, good candidates take machine portability into account, can handle different input formats, and are robust enough to survive configuration and operational errors.
Machine learning programs are intended to change as they gain experience, and with luck, they improve as they learn. But this is not the only way in which they change. Errors may occur, or changes in the environment may cause previously useful steps to become problematic. Degradation is always a threat, and when it happens, adjustments to the data, the algorithm or the design of the project should be made. But a larger question looms: will the degradation be obvious? It's not always easy to know what your program isn't showing you. An example involves the processing of text documents. One day, one of the feeds starts delivering unencoded documents, and the indexer processes them as though they were "just fine." They aren't fine, of course; they produce gibberish, and so none of the documents from that collection are likely to appear in the results. Will you know an error is occurring?
Transparency is important: A document classification application won't select defective documents, but this omission may not be obvious if the other results mask the absence of the ill-formatted documents.
With even a moderate number of collections, you may not realize that some are consistently missing from the results, especially if the documents that are chosen by the application are fairly well matched. Yet your results are much less effective than they could be if the faulty collection of documents were properly processed and included in the output.

How can you make degradation transparent? One way is through internal metrics that have expected values and limits. In the example illustrated here, you could do something as simple as building in an automated test that looks at the top ten words of each document and checks average word length against what is expected for your language. A simple automated test such as this can set off an alarm to signal faulty extraction.

Remember: evaluate early and often. Building and maintaining a successful machine-learning program is not like running a race to a finish line and then checking your final score. A successful application requires ongoing evaluation: first in cycles while training and adjusting a model to make it good enough to deploy, and again during production to check for degradation.
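To make the idea concrete, here is a minimal sketch of such an internal metric in Python. The function names and the expected-length range are illustrative assumptions, not part of Ted's talk; the range would need tuning for your own language and corpus.

```python
from collections import Counter

def average_word_length(text, top_n=10):
    # Average length of the top-N most frequent words in a document.
    counts = Counter(text.split())
    common = [word for word, _ in counts.most_common(top_n)]
    if not common:
        return 0.0
    return sum(len(word) for word in common) / len(common)

def looks_garbled(text, expected_range=(2.0, 12.0)):
    # Frequent words in well-encoded English text are short;
    # mis-encoded or unencoded input tends to produce abnormally
    # long "words" (or none at all), pushing the metric out of range.
    avg = average_word_length(text)
    return not (expected_range[0] <= avg <= expected_range[1])
```

Wired into an indexing pipeline, a check like this can raise an alarm the day a feed starts delivering unencoded documents, rather than letting an entire collection silently vanish from the results.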
The choice of a practical versus an elaborate approach has implications at the human level, too. Simple, accessible algorithms and project designs pay off in their impact on personnel. The expertise and effort needed to install, implement and maintain fancy programs may require additional, and possibly unnecessary, training, and if the plan is too complicated, the implementation may not follow it. It's often better to make a system that is built on brilliant innovation but that runs with reliable and predictable operation. Consistent delivery is the kind of brilliance you need in production. Inventors need room to innovate, and to fail, and thus must make optimistic lists: a hope-to-do list. Ops guys make a will-do list. Who do you want running your system? Conversely, which kind of list are you giving your ops team when you try to deploy a new algorithm?
Sometimes an elegant modification or addition to an already successful algorithm can improve performance, but often the effort required is quite large and the resulting improvement very small. On the other hand, some simple modifications can make a big difference. For example, including internal metrics in a variety of applications can prevent degradation or identify problems early so that they can be addressed. Another example is the simple but powerful technique known as dithering, which deliberately shuffles the order of recommendations slightly so that users see varied results and the recommender gathers broader feedback; it can improve recommendation performance in production. These efforts do not require a great deal of specialized expertise or huge amounts of development time. The basic idea is to ask yourself, "Where is the highest value per minute of effort?" Remember, the goal for these practical applications is real-world performance rather than fame through sophisticated nuance.

To see which algorithms Ted recommends, check out his slides for the CIKM talk and watch for additional articles here that go into more depth: http://www.slideshare.net/tdunning/which-algorithms-really-matter

Meet Ted at Strata London, 11-12 November 2013, at MapR Booth 310.
Hear Ted's talk on recommendation at Devoxx Antwerp, 14 November 2013.
You can also follow Ted on Twitter: @ted_dunning
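The dithering technique mentioned above can be sketched in a few lines. This is a minimal illustration, not Ted's exact formulation: it perturbs each item's log(rank) with Gaussian noise whose spread is controlled by an assumed epsilon parameter, so top items usually stay near the top while deeper items occasionally surface.

```python
import math
import random

def dither(items, epsilon=2.0, rng=None):
    # Re-rank a recommendation list by sorting on log(rank) plus
    # Gaussian noise. With epsilon equal to 1 the noise vanishes
    # and the original order is preserved; larger epsilon mixes
    # deeper items toward the top more aggressively.
    rng = rng or random.Random()
    sigma = math.sqrt(math.log(epsilon))
    scored = [(math.log(rank + 1) + rng.gauss(0.0, sigma), item)
              for rank, item in enumerate(items)]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]

# Each request gets a freshly dithered view of the same ranking.
recommendations = ["item%d" % i for i in range(10)]
shuffled = dither(recommendations, epsilon=3.0)
```

Because the noise is resampled on every call, repeated visits expose different portions of the ranked list, which both keeps results fresh for users and feeds the learning loop with clicks on items it would otherwise never show.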