There's hardly a developer who doesn’t use GitHub. With all those stars, pulls, pushes and merges, GitHub has a plethora of data available describing the developer universe.
As a Data Scientist at Stream, my job is to develop recommender systems for our clients so that they can provide a better user experience for their customers. With that said, I wanted to see if I could build a recommendation system for a product that I use daily (similar to the tool I built for Instagram), as well as try out some new deep learning architectures and “big data” processing tools that I’ve been wanting to play with.
I chose to use Dask (which provides advanced parallelism for analytics, enabling performance at scale for the tools you love) for my “Big Data” processing needs. I like to think of it as out-of-core, parallel, Numpy, and Pandas. What’s not to love? For building the deep learning architectures, I decided to use PyTorch. PyTorch provides “Tensors and Dynamic neural networks in Python with strong GPU acceleration”. Its easy to use interface and superior debugging capabilities make PyTorch amazingly pleasant to work with.
In my mind, there were 3 main parts of building this recommender system:1) Downloading and processing data, 2) Building a recommender system, and 3) putting that system into a production environment.
First things first. Check out the demo!
Downloading and Processing Data
I know that data munging isn’t always fun or sexy. However, since it is such a large part of many data science and machine learning workflows, I wanted to go through how I handled processing over 600 Million events.
Downloading Data from the GitHub Archive
“In 2012, the community led project, GitHub Archive was launched, providing a glimpse into the ways people build software on GitHub, This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains activity data for more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files” .
That’s a good chunk of data to play with. This dataset includes over 20 different event types, everything from commits, comments, and starts for public event data. However, it does not contain private repo information (thankfully), no “‘view” or “clone” data. While this is awesome for privacy, it does provide a little bit of a handicap in providing recommendations, as viewing and cloning repos would provide excellent indications of interest. That said, there should still be plenty of interaction data to provide some cool insights. I ended up using data only created after 1/1/2017 to get at least a year of interaction data, and because it sounded like a nice number.
So….. why not use only stars? It does seem like a great indicator of what a user is interested in, however, within both academia and industry, use of explicit data has fallen out of favor and the gold standard is to now use implicit data (any engagement event, that could signal a user’s interest in an activity). To keep it simple, we’re going to use all of the data that GitHub gives us and treat it as implicit data. That means that all 20+ events (including stars) will be treated the same as an implicit event. This ends up being about half a billion analytic events per year. Not too shabby. My computer, unfortunately, can’t store all that data into memory, which is where Dask comes in.
GitHub Archive updates once per hour and allows the end user to download .gz files of JSON data for each hour. After a bit of consideration, I ended up storing the data as parquet files, as it seemed like a natural fit, plus it’s the suggested file format for storing Dask Dataframes. It’s also almost as fast as HDF5 for reading into memory (just as fast with multiple cores) and compresses WAY better on disk.
The functions I used to iteratively download data can be seen below. For ongoing updates, I simply wrapped the “update_data” function into a simple cronjob that runs once a day before my model is re-trained.
Alright, now that we have lots of data to play with, we need to do some serious preprocessing before we can dump it into any sort of model.
The end goal is to have to have an array of data with each interaction in a normalized integer form.
|User Id||Repo Id|
The workstation I was using has 32 cores and 64 GB of memory, however, the below should all be doable on a standard laptop. I did run it on my MacBook Pro with 16GB of memory and 8 cores. Some steps just took a lot longer due to the parallelization that Dask takes advantage of across multiple cores. One extremely nice thing about Dask is the monitoring that it provides on your tasks. An example visualization seen below is what helped me debug the following steps.
The steps were as follows:
Set up a local compute cluster for Dask, and define a computation graph to strip out user and repo names from JSON strings within a Dask Dataframe.
Turn Dask DataFrame into Dask array to take advantage of slicing capabilities and store to disk as Numpy stack to force freezing of current state of the computation.
Iterate through those stacks to find all unique repos and users to create user and item to id dictionaries. I was having some memory troubles on the final aggregation step using Dask to do the unique count entirely out of the core, so I ended up just iterating through it in chunks, then, mapping users and items and storing to disk one more time.
To reduce noise, repos with low engagement were removed from the training set (any repo with less than 50 associated interactions). I then mapped each user and item to a normalized index to get the format that we were striving for above.
Whew, that was a lot of data munging. I know it’s not glamorous, but I end up spending a lot of my time doing this sort of work, so it seemed worth covering the plumbing instead of the just the cool shiny stuff. Now, on to the fun stuff!
Building a Recommender System Model
Collaborative filtering and matrix factorization approaches have been king in the recommender system space ever since the Netflix challenge. Today, sequence-based models have started to become increasingly prevalent. Fortunately, deep learning techniques can be applied to both.
Collaborative Filtering using Neural Matrix Factorization.
Neural Matrix Factorization is an approach to collaborative filtering introduced last year that tries to take advantage of some of the non-linearities the neural networks provides while keeping the generalization that matrix factorization provides. This is done by concatenating the two feature vectors extracted from a multilayer perceptron with an element-wise multiplication from item and user feature vectors.
A simple illustration can be seen below:
PyTorch was used as the building blocks for this network, and many ideas were taken from here:
Making our network look like so:
Sequence-based models have been extremely popular in recommender systems lately. The main idea behind them is that instead of modeling a user as a unique identifier, users are modeled as their past x interactions. This provides a couple of very nice properties. New interactions on a user don’t need to trigger a new model rebuild to generate up-to-date recommendations, as it is all based on the past x item interactions. Additionally, they can generalize immediately to new users once they start clicking around. In situations like personalizing recommendations for e-commerce, where all the data you have is based on a single session, this is essential. Many of these ideas have been adapted from natural language processing where language models are used to predict the next character or word in a sentence.
A Mixture-of-tastes model tries to represent the diverse interests that users may have by trying to rank a user's interest in an item using separate taste vectors. It does this by representing these taste vectors as different feature maps using a CNN layer with a stride of one and a depth output of the number of taste vectors. For a nice breakdown of the CNN architectures, I highly recommend reading Convolutional Neural Networks for Visual Recognition
This idea seemed like it would work well here, as developers tend to have multiple interests. For instance, I’m mostly interested in machine learning related repos, however, I’m also interested in big data processing tools and backend web development, and it would be really nice if the model could somehow treat these as different subpopulations.
I highly recommend taking a look at the paper Maciej Kula wrote, as well as his implementation of the model. The only tweaks I made for my own model were to add some dropout layers and some ReLU activation functions between layers. This is the model that ended up being put into production for this project.
Due to the number of items updating, each embedding during every backward pass can become computationally expensive, if all negative samples are taken into account. Thankfully, we can take some tricks from natural language processing and take advantage of some of those negative sampling techniques. Since any data point that isn’t part of a user’s interaction history is considered to be implicitly negative, and the sample size (user’s interaction history) is much smaller than the number of items in our population (all the repos), chances are, a randomly sampled item will be implicitly negative. To help our network sample things more efficiently, we can also take the sampling distribution idea from word2vec where the probability of selecting something is:
As this is a learning to rank problem with the use of implicit data points, I ended up using Bayesian Personalized Loss (which is a variant of pairwise loss) for my loss metric. In PyTorch this ends up looking like
Where positive predictions are taken from the forward pass through our network with the given batch of training data, and negative predictions are randomly sampled values from our interactions.
Following standard best practices, I split my data based on users into 80% for training 10% for validation and another 10% for testing.
After 15 epochs the model achieved a train loss of .0085 and a validation loss of .0213.
I thought the model had severely overfitted. But it appeared that, after the first couple steps, the validation and training losses were decreasing at the same rate, so I chalked it up to generalization error plus or minus standard deviation in estimating a population.
Using Mean Reciprocal Rank as an evaluation metric, the test set was set up to use all but the last interaction for any given user, and all items were ranked to see if it could rate the next item in the list the highest. On my test set (as of writing this blog post). I achieved an MRR of .058. Which means that, out of predicting scores for about 1.4 million items, the last item that the user interacted with was within about 17 of the highest ranked items that the model thought they would interact with. [u]
Typically in production environments, it is too expensive and slow to rank every item within a strict response cycle of 10’s of ms. The most common way around this is the break to response into two steps: (1) candidate generation and then (2) ranking using every feature available.
It’s impossible to rank millions of repos within a strict response cycle (at least on a CPU). One standard way of dealing with this problem is to provide a subset of candidates (in the range of hundreds) that may be relevant, and then rank those candidates, using our model above.
I took an approach similar to YouTube, where an approximate nearest neighbors approach is used on top of a Neural Net to find candidates based on the average of their last interactions. However, instead of building another Neural Network, I figured I could just use the computed item feature vectors that were calculated from the mixture model above for a given user.
I was slightly concerned that computing the cosine similarity between different repos could produce poor results based on the fact that the embeddings don’t exactly have a linear relationship with one another. However, simply based off of some empirical evidence, nearest neighbors seemed to generate some good candidates.
Nearest neighbors to https://github.com/pytorch/pytorch
- "https://github.com/pytorch/pytorch", <- sanity check
Nearest neighbors to https://github.com/facebook/react
- "https://github.com/facebook/react", <- sanity check
Using Spotify’s annoy library to calculate Approximate Nearest Neighbors, to generate 1000 candidates and ranking those instead of ranking over 1.4 Million candidates cut my response time from seconds to 10’s of milliseconds.
Building the index is rather simple:
Now, I just need a vector to query off of. By taking the average of the past 90 days worth of interactions, it is possible to quickly generate 1,000 reasonable candidates to rank.
Now that I have candidates, I can generate predictions for each one by passing them through my network, making the final response cycle look like:
- (90+% of the time is spent here) Query GitHub to get users past 90 days of interactions
- Get average embedding for those interactions
- Generate 1000 candidates using approximate nearest neighbors based on average embedding
- Rank those 1000 candidates using the mixture model
Putting PyTorch in Production
I wanted the model to run outside of a strict file structure and on the CPU (more so for economic reasons), so I serialized the state dictionary of the model instead of the whole thing.
I just needed to make sure to call `model.eval()` to get out of training mode and into evaluation mode. This is important, as it takes care of things such as ignoring dropout for inference.
I am very excited about the recent announcement of PyTorch 1.0, which has a very large emphasis on production environments. However, until its official release, I’m keeping production a little more simple at the cost of some efficiency. So, I kept everything in Python and ran it off of Django using the Django Rest Framework to handle API response.
Now when a user sends a request for recommendations, we get their last 90 days of interactions through the GitHub API, map them to our normalized id’s and run them through our Neural Network. The response is a ranked list of what the model thinks you may want to interact with next. As long as there's at least one interaction with a known repo, it can give recommendations. However, if no public information is available it defaults to GitHub’s own discovery page.
I’ve mentioned this before, but sequence-based models can be relevant up to your last interaction without having to retrain the whole model. Which means that you could go star/commit/open issues (don’t tell the OSS people I said that) and see how your recommendations change in real time.
- Download 0.6+ billion events from GH Archive.
- Use Dask to process all events on a single machine.
- Build sequence neural network model to predict user interactions.
- Use embeddings from the model and approximate nearest neighbors to generate candidates.
- Serve ranked list of repos and help you find your new favorite one!
Of course, (BTW thanks for reading this far!) if you have anything that you would like me to try/write about, make sure to comment!