Recommender Algorithms: How to Appropriately Use Offline Testing to Prequalify Real-Time Testing

Contributed by Ted Dunning

I was recently asked if I was aware of any papers that talk about systematic exploration of parameters for Matrix Factorization (MF).

Automatic grid search is pretty standard in this sort of area. In general, however, offline evaluation of recommender algorithms is extremely dangerous and potentially quite misleading. Modern practitioners believe that real-time testing is producing better results.

The recent Google paper on click-through prediction also described their method of doing online learning with cross-validation for multiple regularization parameters. Mahout also has an automatic online learning system for regularization in the AdaptiveLogisticRegression, which uses the recorded-step meta-evolutionary algorithm described here.

A simple thought experiment can demonstrate why offline evaluation is so dangerous. Take a recommendation algorithm that orders the results it gets according to the best possible estimate of potential customer interest. Then add dithering noise to the algorithm, which permutes the results in such a way that the top few results have a high probability of remaining in the top few spots, but as one goes deeper into the results, the degree of mixing increases dramatically. One can easily see that in an offline evaluation, the system with dithering will perform worse than the system without, because the dithering itself and permutation on the best known ordering will result in an ordering that isn’t as good.

In practice, what happens on the first day is that the system with dithering does worse. On the first day, the system puts a few second- or third-page items on the first page, which gives the system a broader range of training data to work with after the first day’s data is recorded. In my experience, this broader range of training data quickly makes the system with dithering perform much better than the system without dithering. This effect is so pronounced that I have had people tell me that in all the tweaking they have done with their recommendation systems, nothing has had as dramatic a positive effect on their results as the addition of dithering. Simply put, the single most important change one can make to a recommendation system is one that will be scored as much worse on the offline evaluation method.

This is a difficult quandary to resolve. The value of offline evaluation is that we can try hundreds of algorithms without compromising production results. The cost is that offline evaluations will inevitably veto some algorithms that would actually produce better results.

Given this fact, I think that it is best to view offline evaluation of recommenders as no more than a smoke test that will eliminate any algorithm that is too laughably bad to consider. Any algorithm that passes this basic screening should not be ranked based on offline performance, but should instead be incorporated into real-time testing. That real-time testing should, in turn, be based on something like Thompson sampling (Bayesian Bandit), which is applied not to the models themselves, but to meta-model parameters that combine the scores and ranks assigned to options by all live algorithms as features.

Ironically, however, even this live testing has to be handled very carefully. Imagine that we have two recommendation engines in live test. I will refer to these basic models as “R” and the model plus dithering as “R+D.” If these two models share all training data, then R+D will cause the overall system to gather wider data, thus allowing R to benefit from this wider training data. R will be declared the winner, R+D will be killed, and the system will no longer get the wider training data and will stagnate.

I would argue that it’s much better to use a more sophisticated method, in which the training data for each system is limited to those items that they would have recommended. So if R is our current champion, most of our decisions will be made based on it. R+D, as the challenger, will get some of its recommendations added to the mix, which will broaden the data set that we gather. Going forward, R will only see training data stemming from recommendations that it agreed with in the first place. This means that R will see results from impressions based on R's recommendations, as well as data from R+D's recommendations, only where the dithering had little effect. R+D will see most of R's impressions, as well as cases where the dithering caused some non-R results to be included. R will improve as much as it normally would, and R+D will do likewise, although more slowly than if it saw all of the impressions.

This is a subtle and deep topic.

I’d love to hear any thoughts anybody has on this.

This blog post was published October 17, 2013.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.