Matthew Ulmer and Jonathan Ye
A recommendation system leverages each individual's unique preferences to expose users to the items they are most likely to enjoy.
The generic nature of ratings allows recommender systems to be flexibly implemented in a variety of domains, such as music, news, books, and movies. Providing quality recommendations benefits not just users, who discover new items of interest at a faster rate, but also the company providing the service, as happy customers are loyal customers. A common approach to generating these recommendations is collaborative filtering.
Collaborative filtering relies on a matrix whose rows represent users and whose columns represent items, such as purchases or previously visited content. Historically, there have been two avenues of approach. Memory-based collaborative filtering uses user ratings to compute similarities between users or items, which are then used to make recommendations. Model-based collaborative filtering makes recommendations using statistical methods and models such as singular value decomposition, stochastic gradient descent, and clustering, among others.
This case study focuses on the implementation of recommendation systems using user and item based collaborative filtering to predict how users rate movies on a 1-5 star scale, given their previous rating history.
User and item based collaborative filtering follow an identical process: the user based version uses similarity between users to collaboratively generate rating predictions, while the item based version uses similarity between items. For the purposes of this article, we will describe only user based collaborative filtering.
In user based collaborative filtering, a similarity score is generated between each user and every other user. These similarity scores can be viewed as weights, quantifying how much the ratings of one user should influence the rating predictions for another user. Users with high similarity affect the rating predictions more than those with low similarity, and as such, we decided to select only the top-K most similar users, or neighbors, to generate the rating predictions.
However, just as preferences are unique to each user, so too are rating scales. A rating of 5 stars might not mean as much if it comes from someone who rates mostly everything a 5 rather than someone who only gives 5 star ratings 10% of the time. As such, ratings are normalized per user by subtracting each user's average movie rating from their individual ratings.
Additionally, not all of the neighbors have necessarily rated the movie in question, and those who have not are excluded from the calculation. The weighted average of the normalized ratings from the similar users who have rated the movie is then added to the target user's average movie rating, producing a rating prediction for that movie and user. This process is repeated for each movie for that user and then for all of the other users in the set.
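The prediction procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the `ratings` matrix (with zeros marking missing ratings), the precomputed similarity matrix `sim`, and the neighborhood size `k` are all assumptions:

```python
import numpy as np

def predict_rating(ratings, sim, user, movie, k=100):
    """Predict ratings[user, movie] via user based collaborative
    filtering: mean-center each neighbor's rating, weight it by
    similarity, and add the target user's mean back in.
    Zeros in `ratings` denote missing ratings."""
    user_mean = ratings[user][ratings[user] > 0].mean()
    # Candidate neighbors: other users who actually rated this movie.
    others = np.arange(len(ratings)) != user
    rated = np.where((ratings[:, movie] > 0) & others)[0]
    if rated.size == 0:
        return user_mean  # fall back to the user's own average
    # Keep only the top-k most similar neighbors.
    top = rated[np.argsort(sim[user, rated])[::-1][:k]]
    # Normalize each neighbor's rating by their own average.
    means = np.array([ratings[n][ratings[n] > 0].mean() for n in top])
    deviations = ratings[top, movie] - means
    weights = sim[user, top]
    return user_mean + weights @ deviations / np.abs(weights).sum()
```

When no neighbor has rated the movie, the sketch simply falls back to the user's own average rating, one of several reasonable defaults.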
Exploring and Transforming the Data
The dataset being used for this project is a modified version of the MovieLens<sup>1</sup> dataset that contains 500,100 ratings from 3,255 users for 3,551 movies. The distribution of these ratings is displayed below.
These values were then placed into a matrix with rows representing users and columns representing movies, resulting in a sparse matrix of 11,558,505 cells with approximately 4.3% of them having values. Non-rated cells are represented by zeros.
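Building such a matrix from rating triples can be sketched with pandas. The column names (`userId`, `movieId`, `rating`) and the toy data below are hypothetical stand-ins; the real dataset has roughly 500K such rows:

```python
import pandas as pd

# Hypothetical (userId, movieId, rating) triples standing in for the
# real ~500K-row ratings file.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [5, 3, 4, 2, 1],
})

# Pivot into a user x movie matrix; unrated cells become 0.
matrix = (ratings.pivot(index="userId", columns="movieId", values="rating")
                 .fillna(0)
                 .to_numpy())

# Fraction of cells holding an actual rating (~4.3% in the article's data).
sparsity = (matrix > 0).mean()
```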
In order to split the data into training and testing sets, 20% of the nonzero cells were randomly selected and set to zero, after their indices and values were stored for later comparison against the predicted values. Similar user and similar item matrices were then calculated using cosine similarity, which uses only the movies that both users have rated to calculate the angle between their vectors.
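A minimal sketch of this split and of the co-rated cosine similarity, assuming a dense NumPy matrix with zeros for missing ratings (the function names and the fixed random seed are our own):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, our own choice

def train_test_split(matrix, frac=0.2):
    """Hide `frac` of the nonzero cells: zero them in a training copy
    and remember their (row, col, value) triples for evaluation."""
    train = matrix.astype(float).copy()
    rows, cols = np.nonzero(matrix)
    hide = rng.choice(len(rows), size=int(frac * len(rows)), replace=False)
    held_out = [(rows[i], cols[i], matrix[rows[i], cols[i]]) for i in hide]
    train[rows[hide], cols[hide]] = 0
    return train, held_out

def cosine_sim(u, v):
    """Cosine similarity computed only over movies rated by both users."""
    both = (u > 0) & (v > 0)
    if not both.any():
        return 0.0
    u, v = u[both], v[both]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```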
Using these similarity matrices, we predicted raw ratings, which were first processed by clipping any prediction outside the rating scale to the nearest allowed rating, i.e. 1 or 5. Simple rounding was then used to discretize the remaining predictions.
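The clipping-then-rounding step amounts to two NumPy calls (a small sketch; `discretize` is our own name):

```python
import numpy as np

def discretize(raw):
    """Clip raw predictions to the 1-5 scale, then round each to the
    nearest whole star."""
    return np.rint(np.clip(raw, 1, 5)).astype(int)
```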
The predictions were then evaluated using root mean squared error, and it became apparent that item based collaborative filtering outperformed the user based method, and that the effect of increasing the maximal neighborhood size became increasingly negligible.
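RMSE itself is straightforward to compute over the held-out ratings; a small sketch for completeness:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over the held-out ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```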
With 100 maximal neighbors, the confusion matrix of the predictions of item based collaborative filtering is shown below.
Despite the RMSE being low, the confusion matrix shows the results are much poorer than first imagined.
When looking at the distribution of the raw ratings, it becomes apparent why this is so and that simple rounding of the raw predictions is not ideal.
The data is centered heavily around 3 and 4 (the overall average movie rating) which will cause a lot of misclassification if standard 0.5 thresholds are used for rounding.
This is likely because either there are not enough neighbors who have watched the same movie, or the assumption that 'similar' users behave similarly is wrong. The former can be solved by collecting more data, whilst the latter needs a more thorough analysis of users to throw out spurious ratings.
Such differences in ratings may come about for a variety of reasons; for example, a significant amount of time may have passed between the ratings made by two users, which is potentially a problem as cultural tastes may shift over time. By better understanding and profiling incorrectly classified movies, we can correct for those behaviors.
It is relatively straightforward to maximize the separation by implementing a custom threshold per movie that maximizes accuracy. After adjusting these classification thresholds, the confusion matrices for the training (LHS) and test (RHS) sets were recomputed using the predictions made from item based collaborative filtering with 100 maximal neighbors.
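One hedged way to realize such per-movie thresholding is to search, for each movie, over a grid of rounding offsets and keep the one that maximizes training accuracy. The offset formulation and the grid are our own simplification of the article's custom thresholds:

```python
import numpy as np

def best_offset(raw, true, offsets=np.linspace(-0.5, 0.5, 21)):
    """For one movie, pick the rounding offset that maximizes training
    accuracy; predictions become round(raw + offset), clipped to 1-5."""
    def accuracy(off):
        pred = np.clip(np.rint(raw + off), 1, 5)
        return (pred == true).mean()
    return max(offsets, key=accuracy)
```

Shifting the rounding boundary per movie lets predictions bunched around the overall average (3-4 stars) be pushed toward the class that dominates that movie's training ratings.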
The predictions have much better (although still far from ideal) confusion matrices. Below the recall delta confusion matrix and recall-precision-count plot are displayed for the test set predictions made from item (red) and user (green) methods with 100 maximal neighbors.
These figures show that the item based method achieved better recall and precision for all of the ratings, except for movies rated 5, for which the user based method has better recall.
The goal of a recommender system is to suggest items that users will like, which in this case means movies they would rate 5 stars. This method classifies 5 star movies correctly 59% of the time on the test set, which is by no means optimal. In terms of classifying the ratings, we would expect histograms of the predictions like those above; what limits us here is that the spread of the histograms is too large, leading to significant overlap between the classes. A less sparse rating matrix would have performed better, but in reality access to extra data is not always possible. This algorithm seems more general than specific, and it could perform better if the movies were classified as either liked (4-5) or other (1-3), with recommendations then made from the predicted likes.
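The liked (4-5) versus other (1-3) reformulation could be sketched as a simple relabeling (a hypothetical encoding of our own; 0 still marks a missing rating):

```python
import numpy as np

def binarize(ratings):
    """Collapse the 1-5 scale to liked (4-5) = 1 and other (1-3) = -1;
    0 still marks a missing rating."""
    r = np.asarray(ratings)
    return np.where(r >= 4, 1, np.where(r > 0, -1, 0))
```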
The next steps for expanding this analysis would be to try and take advantage of the various metadata present for each user (gender, age, occupation, zip-code) and movie (genre) to help make more personalized predictions for user movie ratings.
We can also potentially cluster this demographic data to find even more similarity among users before performing collaborative filtering.
Finally, we could combine different model-based collaborative filtering methods with memory-based collaborative filtering methods to try to improve our model.