Content discovery and recommendation is a common problem in machine learning applications: social networks, news aggregators and search engines constantly tweak their algorithms to give each user a personalized experience.
Personalization engines suggest relevant content with the objective of maximizing a specific metric. For example, a news website might want to increase the number of clicks per session, while an e-commerce app needs to identify the visitors who are most likely to buy a product in order to target them with special offers.
In this post I will explore some techniques that can be used to generate recommendations and predictions using the amazingly fast Vowpal Wabbit library.
Install Vowpal Wabbit
$ git clone https://github.com/JohnLangford/vowpal_wabbit.git
$ cd vowpal_wabbit
$ make
$ make install
Download the Dataset
You can find our sample dataset here: https://stream-machinelearning.s3.amazonaws.com/vw-files/RecSys_vw.zip
For most applications, collaborative filtering yields satisfactory item recommendations. However, several issues can make it difficult to scale up a recommender system:
- The number of features can grow quite large, and given the usual sparsity of consumption datasets, collaborative filtering needs every single feature and data point available.
- For new data points, the whole model has to be retrained.
Vowpal Wabbit's matrix factorization capabilities can be used to build a recommender that is similar in spirit to collaborative filtering but avoids the pitfalls mentioned above. After installing it, I recommend that you take a quick look at its input format and main command-line arguments.
We preprocessed the RecSys data to produce a file that is already compatible with VW’s format. For ease of explanation we decided to use a pretty basic feature set, but if you are interested in finding a precise solution to this problem I suggest that you check out the winning solution for the RecSys 2015 challenge.
We will use the information in buys.vw to fit a model. Every data point in this file represents a purchase with a quantity, session ID and item ID:
[quantity] |s [session id] |i [item id]
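As a rough illustration of this format, here is a minimal Python sketch that turns a purchase record into a line for buys.vw. The helper name format_purchase is hypothetical and not part of Vowpal Wabbit; the layout simply mirrors the template above.

```python
def format_purchase(quantity, session_id, item_id):
    """Build one VW-formatted line: label, then namespace s and namespace i.

    Hypothetical helper for illustration; mirrors the template
    "[quantity] |s [session id] |i [item id]".
    """
    return f"{quantity} |s {session_id} |i {item_id}"


# Example: session 420471 bought one unit of item 3391236
print(format_purchase(1, 420471, 3391236))
```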
The --rank K argument enables matrix factorization mode, where K denotes the number of latent features. To use matrix factorization you need to specify at least one pair of interacting namespaces; in this case, --interactions is declares interactions between namespace i and namespace s.
$ vw -d buys.vw --rank 20 --interactions is
The previous command fits a model with quadratic interactions for 20 latent features. Vowpal Wabbit does not output the matrix factorization weights by default, but we can use the gd_mf_weights script included in the library directory to dump all the information that we need:
$ /library/gd_mf_weights \
    -I buys.vw --vwparams '-d buys.vw --rank 20 --interactions is'
If you have trouble finding the path for gd_mf_weights:
$ find ~/ -name gd_mf_weights
The file i.quadratic should be among the files that gd_mf_weights writes out. It holds a compressed representation of every item, which can be used to find pairs of items that are similar to each other and to recommend those to users given their past browsing and purchasing history.
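To make the idea of a compressed representation more concrete: in a rank-K factorization, every session and every item gets a vector of K latent weights, and the interaction part of the prediction is the dot product of the two vectors. The sketch below assumes a single feature per namespace, as in buys.vw. The function mf_score and its bias arguments are hypothetical names chosen for illustration, not Vowpal Wabbit's API.

```python
def mf_score(session_vec, item_vec, session_bias=0.0, item_bias=0.0, constant=0.0):
    """Predicted quantity for one (session, item) pair under a rank-K model.

    Assumes one active feature per namespace, so the interaction term
    reduces to a plain dot product of the two latent vectors.
    """
    if len(session_vec) != len(item_vec):
        raise ValueError("latent vectors must have the same rank K")
    interaction = sum(s * i for s, i in zip(session_vec, item_vec))
    return constant + session_bias + item_bias + interaction


# Toy rank-2 example with made-up weights
print(mf_score([1.0, 2.0], [0.5, 0.25], session_bias=0.1, item_bias=0.2, constant=1.0))
```

Items whose latent vectors point in similar directions receive similar scores for the same session, which is why nearest-neighbor search over these vectors is a reasonable way to find related items.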
I like to use scikit-learn's k-d tree nearest neighbors implementation, but you can choose any other nearest neighbor search algorithm:
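from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np

# Load the item latent factors written by gd_mf_weights (tab-separated, no header)
items_quadratic = pd.read_csv("i.quadratic", sep="\t", header=None)

# Index the items in a k-d tree and look up the 5 nearest neighbors of each one
nbrs = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(items_quadratic)
distances, indices = nbrs.kneighbors(items_quadratic)
print(indices)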
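When the same rows are used both to build the index and to query it, each item typically comes back as its own closest neighbor, so that self-match usually needs to be dropped before recommending anything. Here is a brute-force sketch in plain Python that does this explicitly. It is not the scikit-learn implementation, the function name nearest_items is hypothetical, and it is only practical for small catalogs.

```python
import math


def nearest_items(vectors, query_index, k):
    """Return the indices of the k rows closest to vectors[query_index].

    Brute-force Euclidean search that skips the query row itself.
    """
    query = vectors[query_index]
    candidates = [
        (math.dist(query, vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    candidates.sort()  # closest first; ties broken by row index
    return [idx for _, idx in candidates[:k]]


# Toy latent vectors for four items
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.0]]
print(nearest_items(toy_vectors, 0, 2))
```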
Vowpal Wabbit also includes its own recommendation script, recommend. To use it you need to specify the following parameters:
--topk: the number of items to recommend
-U: the list of users for whom you want to output recommendations
-I: the list of candidate items to recommend from
-B: a list of user-item pairs that should not be recommended
The file items.vw contains a list of all items and blacklist.vw is empty by default. If we want to recommend 5 items to session number 420471 amongst all possible items with no blacklisted pairs:
$ echo '|s 420471' | /library/./recommend --topk 5 \
    -U /dev/stdin -I items.vw -B blacklist.vw \
    --vwparams '-d buys.vw --rank 20 --interactions is --quiet'
This outputs the following recommendations:
0.271379 |s 420471|i 3391236
0.271379 |s 420471|i 3915524
0.271506 |s 420471|i 3095836
0.279096 |s 420471|i 2531796
0.279096 |s 420471|i 5677524
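Conceptually, this kind of recommendation boils down to scoring every candidate item for a user, skipping blacklisted user-item pairs and keeping the k best. The Python sketch below shows that logic under stated assumptions. It is not the source of the recommend script, and recommend_top_k and its score callback are hypothetical names.

```python
import heapq


def recommend_top_k(user, items, score, blacklist, k):
    """Return the k highest-scoring items for `user`, best first.

    `score` is any callable (user, item) -> float, for example a model
    prediction; `blacklist` is a set of (user, item) pairs to exclude.
    """
    candidates = (item for item in items if (user, item) not in blacklist)
    return heapq.nlargest(k, candidates, key=lambda item: score(user, item))


# Toy example with made-up scores standing in for model predictions
toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
print(recommend_top_k("u1", ["a", "b", "c"], lambda u, i: toy_scores[i], {("u1", "b")}, 2))
```

Using heapq.nlargest keeps only k candidates in memory at a time, which matters when the item catalog is large.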
If you want to read more about the mathematics of matrix factorization for item recommendation, I suggest that you check out Matrix Factorization Techniques for Recommender Systems by Koren, Bell & Volinsky.
You can also use Vowpal Wabbit to predict whether a session will end up with a buy event. The file labeled_clicks.vw contains the following features:
- Whether the session ended in a buy event or not: labeled 2 or 1, respectively
- Importance weight
- Session duration in seconds
- Total number of clicks
- Id number of all items visited during that session
Every line is in Vowpal Wabbit-compatible format:
[label] [label weight] |len [session duration] |cli [number of clicks] |it [item 1] [item 2] ...
For instance, the first datapoint:
1 1|len 352.029 |cli 4 |it 214577561 214536506 214536500 214536502
This represents a session that ended without a purchase (label=1), had a total of 4 clicks, lasted 352.029 seconds and included the items with IDs 214577561, 214536506, 214536500 and 214536502.
We decided to add importance weights to counteract the fact that around 95% of sessions end without an item being bought. This makes our training set highly unbalanced, and training without any weights would result in a highly skewed predictor. All sessions that ended with a buy were assigned an importance weight of 10. These weights are arbitrary, and I recommend that you experiment with different configurations to find the best performance.
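The labeling and weighting scheme described above can be sketched as a small Python helper that writes one line of labeled_clicks.vw. The name format_session and the buy_weight parameter are hypothetical and only mirror the format shown in this post (labels 2 and 1, weight 10 for buys and 1 otherwise).

```python
def format_session(bought, duration, clicks, item_ids, buy_weight=10):
    """Build one labeled, weighted line in the layout used by labeled_clicks.vw.

    Label is 2 for a session that ended in a buy and 1 otherwise; buy
    sessions get `buy_weight` as importance weight, all others get 1.
    """
    label = 2 if bought else 1
    weight = buy_weight if bought else 1
    items = " ".join(str(item_id) for item_id in item_ids)
    return f"{label} {weight} |len {duration} |cli {clicks} |it {items}"


# Reproduces the first data point discussed above
print(format_session(False, 352.029, 4, [214577561, 214536506, 214536500, 214536502]))
```

Keeping the weight as a parameter makes it easy to regenerate the training file when trying other weightings.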
Let's fit a model with the out-of-the-box VW parameters:
$ vw -d labeled_clicks.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = labeled_clicks.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000   0.0000        7
0.900698 0.801397            2            2.0   1.0000   0.1048        8
0.774933 0.649168            4            4.0   1.0000   0.2232        5
0.522392 0.269851            8            8.0   1.0000   0.4725        4
1.658445 2.567287            9           18.0   2.0000   0.3977       12
1.355825 1.146318           17           44.0   2.0000   0.7707        5
0.935035 0.514245           43           88.0   1.0000   1.0372        5
0.632303 0.351878          111          183.0   2.0000   0.8911        7
0.476271 0.320239          231          366.0   1.0000   1.4677        6
0.443583 0.410895          462          732.0   1.0000   1.0966        5
0.402748 0.362024          890         1466.0   2.0000   1.3384        5
0.344143 0.285539         1870         2932.0   1.0000   0.9615        5
0.316211 0.288279         3704         5864.0   1.0000   1.2017       10
0.284272 0.252332         7318        11728.0   1.0000   1.0498        6
0.262397 0.240522        14870        23456.0   1.0000   1.1891        5
0.249489 0.236585        29520        46917.0   2.0000   1.3768        4
0.240956 0.232423        58779        93843.0   2.0000   1.1645        8
0.232603 0.224250       118013       187691.0   2.0000   1.3749        6
0.224553 0.216503       235639       375382.0   2.0000   1.7727        9
0.220508 0.216464       472363       750769.0   2.0000   1.3269        5
0.218877 0.217246       953501      1501538.0   2.0000   1.4717        6
0.219469 0.220061      1952128      3003076.0   1.0000   1.6532        5
0.222089 0.224710      3959696      6006161.0   2.0000   1.5566        4
0.215883 0.209676      8018878     12012322.0   1.0000   0.9819        5

finished run
number of examples per pass = 9249681
passes used = 1
weighted example sum = 13836918.000000
weighted label sum = 18933848.000000
average loss = 0.214053
best constant = 1.368357
best constant's loss = 0.232670
total feature number = 54364504
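The summary at the end of this output can be cross-checked with a little arithmetic. Assuming every buy session carries weight 10 and every other session weight 1, as described earlier, the class counts can be recovered from the example count and the weighted example sum. The Python sketch below does that and also compares the model's average loss with the loss of the best constant prediction. The variable names are mine; the numbers are copied from the log.

```python
# Figures copied from the training summary printed by vw
examples = 9249681                 # number of examples per pass
weighted_example_sum = 13836918.0  # sum of importance weights
weighted_label_sum = 18933848.0    # sum of weight * label
average_loss = 0.214053
best_constant_loss = 0.232670

BUY_WEIGHT = 10  # assumed weight of buy sessions (label 2); all others weigh 1

# examples = no_buy + buy and weighted_example_sum = no_buy + BUY_WEIGHT * buy
buy_sessions = round((weighted_example_sum - examples) / (BUY_WEIGHT - 1))
no_buy_sessions = examples - buy_sessions
buy_rate = buy_sessions / examples

# Consistency check: rebuild the weighted label sum from the class counts
reconstructed_label_sum = 1 * no_buy_sessions + 2 * BUY_WEIGHT * buy_sessions

# The best constant is the weighted mean of the labels
best_constant = weighted_label_sum / weighted_example_sum

# How much better the model is than always predicting the best constant
relative_gain = (best_constant_loss - average_loss) / best_constant_loss

print(buy_sessions, no_buy_sessions, round(buy_rate, 4))
print(reconstructed_label_sum, round(best_constant, 6), round(relative_gain, 4))
```

Under these assumptions about 5.5% of sessions end in a purchase, which agrees with the roughly 95% of non-buy sessions mentioned above, and the fitted model's average loss is about 8% lower than that of the best constant predictor.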
You can always save the current state of your Vowpal Wabbit model using the -f [file] argument and retrain your model with more recent data without having to go through the whole dataset again:
$ vw -d labeled_clicks.vw -f serialized.model
Load it back and update it with a datapoint:
$ echo '2 10 |len 35.029 |cli 4 |it 214577561 214536506 214536500 2145365027' | \
    vw -d /dev/stdin -i serialized.model -f updated.model
To output predictions, pass the -t flag or feed in an unlabeled sample:
$ echo '|len 1.029 |cli 1 |it 214759961' | vw -d /dev/stdin -i serialized.model \
    -p /dev/stdout --quiet
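The model was trained as a regression on the labels 1 and 2, so each prediction is a real number rather than a class. One simple way to turn it into a buy/no-buy decision is to apply a cut-off. The sketch below assumes a threshold of 1.5, halfway between the two labels. That value and the function names are my assumptions, and because the importance weights push predictions upward, the threshold is worth tuning on held-out data.

```python
def to_label(raw_prediction, threshold=1.5):
    """Map a raw prediction to the post's labels: 2 = buy, 1 = no buy."""
    return 2 if raw_prediction >= threshold else 1


def labels_from_output(text, threshold=1.5):
    """Parse prediction output with one prediction per line.

    Only the first whitespace-separated token of each line is read;
    blank lines are skipped.
    """
    return [
        to_label(float(line.split()[0]), threshold)
        for line in text.splitlines()
        if line.strip()
    ]


# Toy prediction output with made-up values
print(labels_from_output("1.2\n1.8\n"))
```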
We have barely scratched the surface of what can be done with Vowpal Wabbit. I invite you to play with your own datasets and tweak some of the more advanced parameters available in the library. Stay tuned for further updates, and feel free to post comments and questions below.