In this post I discuss the Multi-Armed Bandit problem and its applications to feed personalization. First, I use a simple synthetic example to visualize arm selection with bandit algorithms; then I evaluate the performance of some of the best-known algorithms on a dataset for musical genre recommendations.

## What is a Multi-Armed Bandit?

Imagine you have a bag full of biased coins. How would you go about finding the one that gives you the highest reward in expectation? You could use your favorite statistical tool by running enough independent trials to get a good estimate of the expected reward for each coin. This, of course, wouldn't be realistic if you only had access to a limited number of trials, or if you had to pay a penalty every time you tossed a coin and got a bad outcome. If bias exploration is costly, you would need to be smarter about how you carry out your experiments, somehow learning on the go while making sure that you explore all possibilities.

The biased coin scenario essentially captures the Multi-Armed Bandit (MAB) problem: a repeated game where the player chooses amongst a set of available arms, and at every point of the game can only see the outcome of the action that was chosen. Although highly restrictive, the assumption that only rewards for chosen actions can be observed is much closer to many real-life situations:

- Clinical trials, where there is no way to know what the results would have been if a patient had received a different treatment
- Sponsored ad placement on a website, since it is always difficult to estimate what the clickthrough rate would have been if we had chosen a different ad
- Choosing to eat chicken noodle soup instead of tomato soup

The MAB has been successfully used to make personalized news recommendations, test image placement on websites, and optimize random allocation in clinical trials. In most machine learning applications, bandit algorithms are used for making smart choices in highly dynamic settings where the pool of available options is rapidly changing and the set of actions to choose from has a limited lifespan.

## Arm Exploration vs Arm Exploitation

I used John Myles White's implementation of bandit algorithms to illustrate how we can tackle the biased coin problem using a MAB. For illustration purposes, we will use Normally distributed rewards instead of Bernoulli trials. I am assuming that we can choose between eight arms, where each arm draws from a Normal distribution with the following parameters:

|          | Arm 1 | Arm 2 | Arm 3 | Arm 4 | Arm 5 | Arm 6 | Arm 7 | Arm 8 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| Mean     | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   | 0.6   | 0.7   | 0.8   |
| Variance | 2     | 2     | 2     | 2     | 2     | 2     | 2     | 2     |

In this case, Arm **8** gives the highest reward; we want to test whether the algorithms correctly identify it. We will test the performance of two of the most well-known MAB algorithms: ε-greedy and UCB1.

The image above shows how the average reward and arm selection progress over time for two MAB algorithms: ε-greedy and UCB1. There is an initial period of exploration, and after a while the algorithms converge to their preferred strategy. In this case, ε-greedy quickly converges to the second-best solution (Arm **7**) while UCB1 slowly converges to the optimal strategy. Note how both algorithms tend to concentrate on a small subset of the available arms, but they never stop exploring all possible options. By keeping the exploration step alive we make sure that we are still choosing the best action, but in the process we fail to exploit the optimal strategy to its fullest. This captures the *Exploration-vs-Exploitation* dilemma that is often found in bandit problems.
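If you want a feel for how such a run works, here is a minimal ε-greedy sketch for the eight-arm setup above. This is illustrative code, not the library used for the plots; the parameter values (ε, number of trials) are assumptions chosen for the example.

```python
import numpy as np

# Eight Gaussian arms with the means and variance from the table above.
rng = np.random.default_rng(0)
means = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
sigma = np.sqrt(2.0)  # every arm has variance 2
n_trials, epsilon = 5000, 0.1  # assumed values for this sketch

counts = np.zeros(8, dtype=int)   # how often each arm was pulled
estimates = np.zeros(8)           # running mean reward per arm
for t in range(n_trials):
    if rng.random() < epsilon:
        arm = int(rng.integers(8))         # explore: pick a uniformly random arm
    else:
        arm = int(np.argmax(estimates))    # exploit: pick the best current estimate
    reward = rng.normal(means[arm], sigma)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
```

Because exploration never stops, every arm keeps accumulating pulls even after the algorithm has locked onto a favorite.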

Why is exploration useful even when we have identified an optimal arm? In the next experiment, I replicated the same scenario as before, but I included a distributional shift at *t = 2500* where the optimal choice becomes Arm **1**. The shift in distributions is summarized in the following table:

|          | Arm 1 | Arm 2 | Arm 3 | Arm 4 | Arm 5 | Arm 6 | Arm 7 | Arm 8 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| Mean     | 0.8   | 0.6   | 0.7   | 0.4   | 0.3   | 0.2   | 0.1   | 0.5   |
| Variance | 2     | 2     | 2     | 2     | 2     | 2     | 2     | 2     |

The graph shows how both ε-greedy and UCB1 adapt to a change in reward distribution. For this specific simulation, UCB1 quickly adapted to the new optimal arm, whereas ε-greedy was a bit slower to adapt (note how the Average Reward curve dips a little bit after the change in distribution takes place). As always, we should take these results with a grain of salt and not try to draw too many conclusions from a single simulation; if you are interested in experimenting with different parameters, feel free to tweak the script that I used to generate these results.
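One simple way to inject this kind of shift into a simulation (a hypothetical helper, not the actual script behind the plots) is a reward function that swaps mean vectors at *t = 2500*:

```python
import numpy as np

rng = np.random.default_rng(1)
means_before = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
means_after  = np.array([0.8, 0.6, 0.7, 0.4, 0.3, 0.2, 0.1, 0.5])

def reward(arm, t, shift_at=2500, sigma=np.sqrt(2.0)):
    """Draw a Gaussian reward whose distribution changes at t = shift_at."""
    means = means_before if t < shift_at else means_after
    return rng.normal(means[arm], sigma)
```

Any bandit loop that calls `reward(arm, t)` will see Arm 1 become the best choice after the shift, so an algorithm that stopped exploring at *t = 2500* would keep pulling a stale favorite.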

## Contextual Bandits

In most real-life applications, we have access to information that can be used to make a better decision when choosing amongst all actions in a MAB setting; this extra information is what gives Contextual Bandits their name. For the ad-placement example, having access to historical data about the user's buying habits can be highly informative of what type of products or promotions they will engage with in the future.

For really large datasets, the highly optimized Contextual Bandit algorithms in Vowpal Wabbit (VW) are the way to go. There are four bandit algorithms ready to use in VW:

- ε-greedy: Exploit the best strategy with probability 1-ε, and keep exploring uniformly over all the other actions with probability ε.

- Explore-first: Starts with a phase of pure exploration in the first *K* trials; after the exploration phase, the algorithm exploits the leading strategy.

- Bagging: Different policies are trained using bootstrapping (sampling with replacement).

- Online cover: Explores all available actions by keeping only a small subset of policies active; this approach is fully described in the 2014 paper by Agarwal et al.
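To make the first two exploration schemes concrete, here is a minimal Python sketch of their selection rules (illustrative only; these helpers are my own names, not VW's internals):

```python
import numpy as np

rng = np.random.default_rng(2)

def epsilon_greedy(estimates, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(estimates)))
    return int(np.argmax(estimates))

def explore_first(estimates, t, k=100):
    """Pure uniform exploration for the first k trials, then always exploit."""
    if t < k:
        return int(rng.integers(len(estimates)))
    return int(np.argmax(estimates))
```

The key difference is when exploration happens: ε-greedy spreads it across the whole run, while explore-first spends its entire exploration budget up front, which is why it suffers if the reward distribution changes later.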

Bandit datasets are highly proprietary and hard to come by. In order to test some of the bandit algorithms implemented in VW, we can use the *--cbify* option to transform any multiclass classification training set into a contextual bandit training set.
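Conceptually, the *--cbify* reduction works like this: each round the learner sees the features, commits to one of the *K* labels, and only observes a 0/1 reward for that single pick. The following toy loop (with a random stand-in policy and synthetic data, not VW's actual implementation) illustrates the kind of feedback the bandit learner receives:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 10  # number of classes / arms, as in the genre dataset

# Toy multiclass stream: (features, label) pairs standing in for the VW file.
stream = [(rng.normal(size=5), int(rng.integers(K))) for _ in range(100)]

total_reward = 0.0
for features, label in stream:
    chosen = int(rng.integers(K))             # stand-in for the bandit policy
    reward = 1.0 if chosen == label else 0.0  # only this one reward is revealed
    total_reward += reward                    # the other K-1 rewards stay hidden
```

A uniformly random policy like this one earns a reward of about 1/K on average, which is the baseline the bandit algorithms below should beat.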

I will use a dataset that was created using the Stream Analytics client. Each line in stream_labeled_data.vw is a VW-format labeled sample with one hundred different features:

5 |features w1:5 w2:0 w3:0 w4:37 w5:6 w6:0 w7:0 w8:0 w9:0 w10:25 w11:0 w12:0 …

These features were extracted from a large and sparse dataset using Principal Component Analysis; the original features were related to individual users' listening habits, but are not directly related to a genre. We have also preprocessed the dataset to maintain privacy by tokenizing each of the features. In the setting of a Contextual Bandit, we can think of the features of each sample as the context, and rewards are either **1** or **0** depending on whether or not we predict the class label correctly.

I have added the option *--progress 1* to all my VW calls in order to save the average loss into a file. I also trained a model using the original dataset with VW's multiclass classifier (*--oaa*) to see how the performance compares to a full-information setting.

`vw -d stream_labeled_data.vw --cbify 10 --cb_explore 10 --cover 6 --progress 1`

I made the following graph using the average validation error reported by VW:

Unsurprisingly, under full information we get a really good predictor with an average error rate of 1.5% (remember, in this case we can observe the reward associated with each arm at each step). On the other hand, *--first K* performs rather poorly, but it is still an improvement over a predictor that outputs random labels (the expected error rate from randomly choosing among ten equally likely choices should be around 0.9 if we have a training set with an equal proportion of labels). Cover converges to an error rate of 56.91% and ε-greedy gets the average loss down to 24.06%. Bag has the best performance overall with a 6.71% average loss, which is very impressive considering that it can only see one arm-reward value per trial, as opposed to the ten arm-reward values that can be observed under the full-information setting.

If you are interested in experimenting with Contextual Bandit algorithms to personalize your feeds, feel free to download the dataset and try your own models. You could do some feature engineering or create a model where each arm is associated with a different bandit classifier.

Feel free to contact us if you are interested in learning more about how Personalization can help improve retention and conversion in your app.