7 min read
Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.
Recommendation is a class of machine learning that uses data to predict a user's preference for or rating of an item. Recommender systems are used in industry to recommend:
The architecture of the recommendation engine is shown below:
A Mahout-based collaborative filtering engine looks at what users have historically done and tries to estimate what they might likely do in the future, if given a chance. This is accomplished by looking at a history of which items users have interacted with. In particular, Mahout looks at how items co-occur in user histories. Co-occurrence is a simple basis that allows Apache Mahout to compute significant indicators of what should be recommended. Suppose that Ted likes movie A, B, and C. Carol likes movie A and B. To recommend a movie to Bob, we can note that since he likes movie B and since Ted and Carol also liked movie B, movie A is a possible recommendation. Of course, this is a tiny example. In real situations, we would have vastly more data to work with.
In order to get useful indicators for recommendation, Mahout’s ItemSimilarity program builds three matrices from the user history:
1. History matrix: contains the interactions between users and items as a user-by-item binary matrix.
2. Co-occurrence matrix: transforms the history matrix into an item-by-item matrix, recording which items co-occur or appear together in user histories.
In this example movie A and movie B co-occur once, while movie A and movie C co-occur twice. The co-occurrence matrix cannot be used directly as recommendation indicators because very common items will tend to occur with lots of other items simply because they are common.
3. Indicator matrix: The indicator matrix retains only the anomalous (interesting) co-occurrences that will serve as clues for recommendation. Some items (in this case, movies) are so popular that almost everyone likes them, meaning they will co-occur with almost every item, which makes them less interesting (anomalous) for recommendations. Co-occurrences that are too sparse to understand are also not anomalous and thus are not retained. In this example, movie A is an indicator for movie B.
Mahout runs multiple MapReduce jobs to calculate the co-occurrences of items in parallel. (Mahout 1.0 runs on Apache Spark). Mahout’s ItemSimilarityJob uses the log likelihood ratio test (LLR) to determine which co-occurrences are sufficiently anomalous to be of interest as indicators. The output gives pairs of items with a similarity greater than the threshold you provide.
The output of the Mahout ItemSimilarity job gives items which identify interesting co-occurrences, or which indicate recommendation, for each item. For example, the Movie B row shows Movie A is indicated, and this means that liking Movie A is an indicator that you will like Movie B.
Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library. Full-text search uses precision and recall to evaluate search results:
Elasticsearch stores documents, which are made up of different fields. Each field has a name and content. Fields can be indexed and stored to allow documents to be found by searching for content found in fields.
For our recommendation engine, we store movie meta data such as id, title, genre, and also movie recommendation indicators, in a JSON document:
The output row from the indicator matrix that identified significant or interesting co-occurrence is stored in the Elasticsearch movie document indicator field. For example, since Movie A is an indicator for Movie B, we will store Movie A in the indicator field in the document for Movie B. That means that when we search for movies with Movie A as an indicator, we will find Movie B and present it as a recommendation.
Search engines are optimized to find a collection of fields by similarity to a query. We will use the search engine to find movies with the most similar indicator fields to a query.
For more resources on building a recommendation engine, we recommend checking out the resources below:
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.