A Simple Recommender System in C#
Posted by Joao CoelhoRecommender systems are based on fairly sophisticated machine learning algorithms, usually expressed in the language of linear algebra. But the basic intuitions behind these systems are not complicated. And because of the sophistication of available open-source libraries, it is possible to implement such a system without mastering all the details of machine learning.
So let's envisage a really simple scenario and try to understand it thoroughly. We have a website that sells books. Some of these have been rated by individual users, others not. Based on the ratings that a user has given, we would like to recommend three new books to that user, when he lands on the home page of the site.
To do that, we have to create a model that predict this user's hypothetical ratings for the books he has not rated, and out of those predicted ratings, select the three highest ones and recommend the corresponding books to our user. So what we are looking to do is build a model of our user's preferences, and use it to make predictions about how he would rate new books.
How is our data organized? Well, each book has a certain number of characteristics, called "features", that give it a specific profile. In our case, we have defined 5 features: action, comedy, crime, war, historical, and for each book that we have in stock, we could give each of these features a score between 0 and 1000. For the purposes of this post, we are only going to consider a single user.
We are going to represent our stock of rated books as a matrix X of n items having m features (in this case, m=5) that is to say, and m*n grid or matrix. For our purposes, we don't want thinks such as the title or author as part of our features, or even any kind of identifier; we will know which book we are referring to just from its position in the list, or its index in the matrix.
action | comedy | crime | war | history | |
---|---|---|---|---|---|
Book 1 | 45 | 330 | 200 | 110 | 20 |
Book 2 | 500 | 210 | 575 | 700 | 980 |
Next, we will build a vector Y, containing the ratings given by the user to each book in our stock. This will have again n rows, one for each book, and a single column, since we are dealing with just one user, so it's going to be an n*1 vector. We will store the unrated books in another list.
rating | |
---|---|
Book 1 | 4 |
Book 2 | 1 |
Now, we have to create a model for our user's preferences. For any possible model, the assumption we are going to make is that the user's preferences are not just random, but that they somehow depend on the features of the book. For example, one user's tastes tend more to romance/erotic, another's to comedy, while another prefers war/action. While our user is not going to explain his preferences, we can deduce them from the ratings he has already given, and use the resulting model to predict how he would rate books that he has never read.
This kind of machine learning is called supervised learning. Supervised means that we already have some of the "answers" (in this case, past user ratings), and that we can use that "labelled" training data to help our model learn about the user's preferences. There are many different algorithms for supervised learning. Why so many? Because they define different types of relationships between the features and the labels, or in this case, between the book descriptions and the ratings.
The simple algorithm we will use for our example is called Linear Regression, which means that we are going to assume that each rating that the user has given to a book is a linear combination of the feature values for that book. Let's explain.
We have a book in stock, "The Dogs of War", which has the following feature scores in matrix X:
action | comedy | crime | war | history | |
---|---|---|---|---|---|
"The Dogs of War" | 800 | 100 | 200 | 800 | 300 |
and our user has given this book a rating of 4 stars:
rating | |
---|---|
"The Dogs of War" | 4 |
So when we say that we want the score to be a linear combination of the features, what we mean in this case is that there are some weights w such that:
action*action_weight + comedy_feature*comedy_Weight + ... + hist_feature*hist_weight = rating
which for this example would work out to:
800*w1 + 100*w2 + 200*w3 + 100*w4 + 300*w5 = 2
which is what we call a linear combination. What our model will do is work out the weights W, and use these to make future predictions, by plugging different feature sets into this equation.
predictedRating = x1*w1 + x2*w2 + x3*w3 + x4*w4 + x5*w5
where (w1,...,w5) are the optimal weights given by our model for this user, after training it, and (x1,...,x5) are the feature values for the unrated movie. So the big question now is, how do we train our linear regression model so that it gives us a weight matrix that is a good fit with our existing ratings, and will therefore predict our user's future preferences as accurately as possible? At this stage we will simply apply the linear regression algorithm. I'm not going to explain the theory (check out the details on Wikipedia if you're interested), but here is the implementation. We are using an open-source machine learning library called Numl(http://numl.net). It's made for developers lacking a strong background in CS theory, but who are nevertheless good at coding. So, let's create a new console application project, then add a reference to Numl via the NuGet package manager. Next, let's create some labelled data. We're going to create 200 rated examples.
// number of training examples (200 books with ratings) var n = 200; // number of features (categories according to which books are rated) var m = 5; // initialize feature matrix (how each film is weighted by category) // and label vector (how one specific user rated each film) // with random values Matrix x = new Matrix(n, m); Vector y = new Vector(n); Random rnd = new Random(); for (int i = 0; i < n; i++) { // ratings are in range 0-5 y[i] = rnd.Next(0, m); // feature weights are in range 0-1000 for (var j = 0; j < m; j++) { x[i, j] = rnd.Next(0, 1000); } };
Now, let's set up a linear regression model (usually a lot of work, but in this library a one-liner):
// create supervised learning model that uses linear regression algorithm LinearRegressionGenerator generator = new LinearRegressionGenerator() { LearningRate = 0.1, // big steps - fast but may diverge; small steps - may run too slowly MaxIterations = 400, // how many times do you want it to run Lambda = 0 // compensation factor to prevent overfitting of training data - just ignore it for now };
And let's train the model:
LinearRegressionModel model = (LinearRegressionModel)generator.Generate(x.Copy(), y.Copy()); var finalSettings = model.Theta;
The optimized weight matrix will be given by model.Theta. Finally, let's assume we have another set of books, this time unlabelled, called x_unrated. We are going to use our model to predict how our user would rate them,
Vector y_prediction = new Vector(n); for (int i = 0; i < n; i++) { // make a prediction for the rating var unratedBook = new Vector(new double[] { x_unrated[i, 0], x_unrated[i, 1], x_unrated[i, 2], x_unrated[i, 3], x_unrated[i, 4] }); y_prediction[i] = model.Predict(unratedBook); };
and then pick the three highest-scored predictions as recommendations to display on the front page.
// now find the indexes of the 3 highest ratings in the predictions var recommendations = y_prediction.Top(3);
I hope that makes sense, and that it makes you feel like playing around with Numl and going deeper into machine learning. By the way, we don't need all this code for a production system. All we need is the finalSettings matrix, which we will use to initialize a linear regression model, allowing us to make predictions on the fly. Of course, as we accumulate new ratings data, which should re-run our training process at regular intervals so as to keep our model accurate and up-to-date.
Source Download
LRExample.7z
Advertising
Agile
Branding
Business
Cloud
Computers
Mobile
Open Source
Product Management
Programming
Project Management
SaaS
Software
Web