•over 5 years ago
The problem of content discovery and recommendation is very common in many machine learning applications: social networks, news aggregators and search engines are constantly updating and tweaking their algorithms to give individual users a unique experience.
Personalization engines suggest relevant content with the objective of maximizing a specific metric. For example: a news website might want to increase the number of clicks in a session; on the other hand, for an e-commerce app it is very important to identify visitors that are more likely to buy a product in order to target them with special offers.
In this post I will explore some techniques that can be used to generate recommendations and predictions using the amazingly fast Vowpal Wabbit library.
Install Vowpal Wabbit
Download the Dataset
You can find our sample dataset here: https://stream-machinelearning.s3.amazonaws.com/vw-files/RecSys_vw.zip
For most applications, collaborative filtering yields satisfactory results for item recommendations; there are however several issues that arise that might make it difficult to scale up a recommender system.
- The number of features can grow quite large, and given the usual sparsity of consumption datasets, collaborative filtering needs every single feature and datapoint available.
- For new data points, the whole model has to be re-trained
Vowpal Wabbit’s matrix factorization capabilities can be used to build a recommender that is similar in spirit to collaborative filtering but that avoids the pitfalls that we mentioned before. After installing, I recommend that you quickly go check out it’s input format and main command line arguments.
We preprocessed the RecSys data to produce a file that is already compatible with VW’s format. For ease of explanation we decided to use a pretty basic feature set, but if you are interested in finding a precise solution to this problem I suggest that you check out the winning solution for the RecSys 2015 challenge.
We will use the information included in buys.vw to fit a model, every data point in this file represents a purchase with a quantity, session id and item id:
The --rank K argument enables matrix factorization mode, where K denotes the number of latent features. In order to use matrix factorization you need to specify at least one pair of variable interactions between namespaces, in this case --interactions is represents interactions between namespace i and namespace s.
The previous command fits a model with quadratic interactions for 20 latent features. Vowpal Wabbit does not output the matrix factorization weights by default, but we can use the gd_mf_weights script included in the library directory to dump all the information that we need:
If you have trouble finding the path for gd_mf_weights:
The file i.quadratic should be among the files that gd_mf_weights writes out, this a compressed representation of every item and can be used to find pairs of items that are similar to each other and recommend those to users given their past browsing and purchasing history.
I like to use scikit learn’s kd-tree nearest neighbours implementation, but you can choose any other nearest neighbor search algorithm:
Vowpal Wabbit also includes its own recommendation script implemented in recommend, in order to use it you need to specify the following parameters:
--topk: the number of items to recommend
-U: a list of the subset of all users for which you want to output a recommendation
-I</b: a list of items for which to recommend from
-B: a list of user-item pairs that we should not recommend
The file items.vw contains a list of all items and blacklist.vw is empty by default. If we want to recommend 5 items to session number 420471 amongst all possible items with no blacklisted pairs:
Outputs the following recommendations:
If you want to read more about the mathematics of matrix factorizations for item recommendation I suggest that you check out Matrix Factorization Techniques for Recommender Systems by Koren, Bell & Volinksy.
You can also use Vowpal Wabbit to predict whether a session will end up with a buy event. The file labeled_clicks.vw contains the following features:
- Whether the session ended in a buy event or not: labeled 2 or 1, respectively
- Importance weight
- Session duration in seconds
- Total number of clicks
- Id number of all items visited during that session
Where every line is in Vowpal Wabbit compatible format:
For instance, the first datapoint:
Represents a session that ended in no buys (label=1), had a total of 4 clicks, a duration of 352.029 seconds and included the items with ID: 214577561, 214536506, 214536500 and 214536502.
We decided to add importance weights to counteract the fact that around 95% of sessions end without an item being bought, this makes our training set highly unbalanced and training without any weights would result in a highly skewed predictor. All sessions that ended with a buy were assigned an importance score of 10, these weights are arbitrary and I recommend that you play with different configurations to achieve optimal performance.
Let’s fit a the model with the out-of-the box VW parameters:
You can always save the current state of your Vowpal Wabbit model using the -f [file] argument and retrain your model with more recent data without having to go through the whole dataset again:
Load it back and update it with a datapoint:
If you want to output predictions you need to pass the -t flag or input an unlabeled sample
We have barely explored the tip of the iceberg of what can be done with Vowpal Wabbit, I invite you to play with your own datasets and tweak some of the more advanced parameters that can be used in the library. Stay tuned for further updates, feel free to post comments and questions below.