Building Your Own Instagram Discovery Engine: A Step-By-Step Tutorial

Isn’t it great how Instagram’s “Explore” section displays content that matches your interests? When you open the application, the content and recommendations shown are almost always relevant to your specific likes, interests, and connections. While it may be fun to think we’re the center of the Instagram universe, the reality is that personalized, relevant content is also curated for 400 million other people every day. With 400M active users and 80M photos posted daily, how does Instagram decide what to put in your Explore section? Let’s look at the key factors Instagram uses to score posts in your timeline and Explore section.

Before we get into the nitty-gritty, here are some features Instagram uses to determine what content to serve up:

  • Timing: the more recent the post, the higher the score.
  • Engagement: could be determined by the number of likes, comments and/or views. If a user engages with certain tags more often, such as snowboarding, that user will be shown more images of snowboarding.
  • Previous Interactions: how often you have interacted with this user in the past. The more you engage with a user, the more relevant their posts are likely to be to you.
  • Affinity: how you are related to this person. Are they a friend of a friend, a friend that you haven’t connected with yet, or someone you don’t know?

Now, let’s use these features to build our own Instagram discovery engine. To query data from Instagram, I am going to use the very cool, yet unofficial, Instagram API written by Pasha Lev. The setup commands below are written for Mac users. All other libraries are pip installable, and all Python code was run within a Jupyter notebook.

To get up and running, run the following in your terminal:

https://gist.github.com/nparsons08/4d4c9f568eb6cb0ff541a6b6499da300

Then run jupyter notebook in your terminal, which will open in your default browser. I would also recommend verifying your Instagram phone number before continuing. This will prevent some unexpected redirects.

Now on to the good stuff. Let’s start with finding my social network and a bit of graph analysis.

https://gist.github.com/nparsons08/8d78b4e365401d0e453f2c42a345d20d

If all goes well you should get a ‘Login success!’ response.

We can now build a true social network by finding everyone I follow as well as everyone they follow. For a quick intro on social network analysis and personalized pagerank, take a look at this blog post.
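To make the idea of personalized PageRank concrete, here is a minimal sketch in plain Python. The toy follow graph, the `personalized_pagerank` helper, the damping value of 0.85, and the choice to send dangling nodes back to the source are all illustrative assumptions, not the code used in this post. A graph library such as NetworkX ships its own implementation.

```python
def personalized_pagerank(follows, source, damping=0.85, iterations=50):
    """Power-iteration PageRank where every random restart jumps back to `source`.

    `follows` maps each user to the list of users they follow (toy data here).
    """
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)

    # Start with all of the probability mass on the source user.
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0

    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                # Spread this node's rank evenly over the accounts it follows.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (follows nobody): send its rank back to the source.
                new_rank[source] += damping * rank[node]
        # Restart step: the remaining mass always returns to the source.
        new_rank[source] += 1.0 - damping
        rank = new_rank
    return rank


# Hypothetical network: "me" follows alice and bob, who follow others in turn.
follows = {"me": ["alice", "bob"], "alice": ["carol"], "bob": ["carol", "dave"]}
scores = personalized_pagerank(follows, "me")
for user, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{user}: {score:.3f}")
```

In this toy graph, carol ends up ranked above dave because both of my friends follow her, which is the kind of "how connected is this person to me" signal we are after.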

Before stepping into the code, let’s take a look at my own profile to see what we’re trying to analyze.

As you can see, I follow 42 people, which isn’t too many; they make up my immediate network. Once we include 2nd degree connections, that number grows quickly: in my case, the number of nodes reaches over 24,000. A nice visualization of this can be seen in step 2.
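The two-hop crawl described above can be sketched independently of any Instagram client. In this sketch, `get_following` is a hypothetical callable standing in for whatever API call returns the accounts a user follows, and the toy data is made up.

```python
def build_edges(get_following, me):
    """Return (src, dst) follow edges for `me` and for everyone `me` follows."""
    edges = []
    for friend in get_following(me):
        # First hop: an account I follow directly.
        edges.append((me, friend))
        # Second hop: everyone this friend follows.
        for friend_of_friend in get_following(friend):
            edges.append((friend, friend_of_friend))
    return edges


# Toy stand-in for the real API: a dict lookup instead of a network request.
toy_follows = {"me": ["alice", "bob"], "alice": ["carol", "bob"], "bob": ["dave"]}
edges = build_edges(lambda user: toy_follows.get(user, []), "me")
nodes = {user for edge in edges for user in edge}
print(len(edges), "edges,", len(nodes), "nodes")
```

Each tuple is a source/destination pair, which is the shape most graph and visualization libraries expect as input.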

https://gist.github.com/nparsons08/dd13ac69e9ac33f9e54595875ac48b6f

Cool, now let's get that into a nicely formatted pandas DataFrame.

https://gist.github.com/nparsons08/0d055e185697c1696a6bc39dde468700

While it’s not essential to visualize your network in order to build your own discovery engine, it is pretty interesting and may help with understanding personalized PageRank. I’m going to use one of my new favorite graph visualization libraries, Graphistry (check them out sometime). However, if you don’t want to wait around for an API key (though I got a same-day response), there are lots of other good libraries such as Lightning and NetworkX.

https://gist.github.com/nparsons08/124af23e2bfe9f58f87dc78519ec00f3

For this example, I’m going to display only src_id and dst_id to give my friends a bit of privacy, though it is pretty fun to display usernames (which is what the code below will do). The first graph only displays edges that are sourced from me and filtered using the built-in tools in Graphistry.

The second graph shows all of my extended network.

Isn’t that cool? You can already see a couple of interesting features, such as the few external centroids and how they interact with the rest of my social network.

It’s now time to grab the most recent images from everyone and rate them by how relevant they are to me. Since there are about 24,000 nodes, it may take a while to download all the data.

Let’s do a quick trial run of only the 44 people I immediately follow to make sure we’re on the right track.

Based on what I thought might determine the relative score of Instagram posts, we need to grab the # of likes, the # of comments, and the time the photo was taken for all recent photos of people I follow (in this example I considered recent to mean within one week and cut off photos older than that). It would also be useful to grab how many times I’ve ‘liked’ that user's posts and how connected that person is to me. Everything besides how “connected” that user is to me is a simple sum. To calculate the “connected” piece, we’ll use a personalized PageRank. Once we’ve compiled that information, we can define an importance metric like:
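As a minimal sketch of how these signals could be combined into one number, the function below multiplies engagement, recency, past interaction, and connectedness. The functional form, the log damping, the two-day half-life, and the sample values are all assumptions made for illustration, not necessarily the metric used in this post. Only the one-week cutoff comes from the text above.

```python
import math
from datetime import datetime, timedelta


def importance(likes, comments, taken_at, my_likes_of_user, connectedness,
               now, half_life_days=2.0, max_age_days=7):
    """Hypothetical score: engagement x recency x past interaction x connectedness."""
    age_days = (now - taken_at).total_seconds() / 86400
    if age_days > max_age_days:
        return 0.0  # older than the one-week cutoff
    recency = 0.5 ** (age_days / half_life_days)  # halves every `half_life_days`
    engagement = math.log1p(likes + comments)     # dampen very popular posts
    interaction = 1 + my_likes_of_user            # stays positive for new accounts
    return engagement * recency * interaction * connectedness


# Arbitrary reference time and made-up post statistics.
now = datetime(2016, 1, 8, 12, 0)
fresh = importance(120, 10, now - timedelta(hours=3), 4, 0.05, now)
stale = importance(120, 10, now - timedelta(days=5), 4, 0.05, now)
expired = importance(120, 10, now - timedelta(days=8), 4, 0.05, now)
print(fresh, stale, expired)
```

Here `connectedness` is meant to be the personalized PageRank value of the post's author, and `my_likes_of_user` is the count of that user's posts I have liked.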

Alright, now that we have that defined, let’s see how it works! I apologize for the big chunk of code coming up, but don’t you worry...there is a picture of my new puppy at the end!

https://gist.github.com/nparsons08/12de0cb53d0a2e859c8fa70bc48b7e5b

Which gives me:

This actually looks very similar to my personal timeline - cool! Now that we know we're onto something, let's tackle the discovery section.

We can take the same approach as before by calculating the relative score of each photo of friends of friends. To do this, we’d start with the first social graph that we calculated...but that has over 24K nodes and I’m too lazy to wait for all the data. Instead, let's grab photos of friends of friends whose posts I’ve ‘liked’. This drops the number of nodes down to just over 1,500 which, depending on your internet speed, is the perfect amount of time for a coffee break.

There are a couple of minor tweaks to the above code that are needed to deal with the extended user base, but most of the code is the same.

https://gist.github.com/nparsons08/cfebc5a762db42f4f67ba65de1e251a7

The results ended up showing a lot of images from National Geographic and Red Bull, which I currently don’t follow, but might start following now!

Interests haven’t been taken into account yet. A nice aspect of Instagram is its rich set of #hashtags used to describe photos. Let’s see if we can discover my interests by using the hashtags of photos I’ve ‘liked’ and photos that I’ve been tagged in. While Instagram most likely uses click data alongside ‘like’ data, we don’t have access to clicks, so we’re going to stick with likes only.
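The hashtag tally can be sketched with the standard library alone. The captions below are made up, and `top_hashtags` is a hypothetical helper that assumes hashtags are available as plain caption text. The real API response format may differ.

```python
import re
from collections import Counter

# A hashtag is "#" followed by word characters (letters, digits, underscore).
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(captions, n=5):
    """Count hashtags across captions and return the `n` most frequent."""
    counts = Counter()
    for caption in captions:
        # Lowercase so #Snowboarding and #snowboarding count as the same tag.
        counts.update(tag.lower() for tag in HASHTAG.findall(caption))
    return counts.most_common(n)


captions = [
    "Powder day! #snowboarding #mountains",
    "#Snowboarding trip with friends",
    "New puppy #dog",
]
top = top_hashtags(captions, n=2)
print(top)
```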

https://gist.github.com/nparsons08/45560a6a4587333a9a59e1acf78edd61

Which gives:

https://gist.github.com/nparsons08/59c9fe8ce48d91edf5192a2015793993

Now let's grab the most popular images for each of those tags:

https://gist.github.com/nparsons08/34c5b33a231fbd5f21473781670a670c

Now that we have the most popular image from each hashtag feed, we can display them.

https://gist.github.com/nparsons08/5d7f0058429d30932039f44d5876db25

Now let’s combine these two techniques.

You may have noticed I was saving all the collected image data to top_graph_img and images_top_tags. Let’s combine them using a fairly naive technique, random sampling:
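A minimal sketch of that random-sampling merge is shown here. The list names `top_graph_img` and `images_top_tags` come from the text above, while their contents, the `mix_feeds` helper, and the de-duplication step are assumptions for illustration.

```python
import random


def mix_feeds(graph_images, tag_images, k, seed=None):
    """Randomly pick up to `k` distinct images from the two candidate lists."""
    rng = random.Random(seed)
    # De-duplicate while keeping order, so an image in both lists is not favored.
    pool = list(dict.fromkeys(graph_images + tag_images))
    return rng.sample(pool, min(k, len(pool)))


# Placeholder image identifiers; "g1" appears in both lists on purpose.
top_graph_img = ["g1", "g2", "g3"]
images_top_tags = ["t1", "t2", "g1"]
feed = mix_feeds(top_graph_img, images_top_tags, k=4, seed=0)
print(feed)
```

Passing a `seed` makes the selection repeatable, which helps when comparing results between runs.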

https://gist.github.com/nparsons08/c02c138fe66ea116f41197e719f198f2

That’s not too shabby! I personally find some of those photos pretty cool, but it definitely could be better.

Ways to improve the discovery engine:

  • With access to the entire social graph, we could run a similar analysis with weights between nodes, determined by the number of likes and comments.
  • Combine click data alongside ‘like’ data to take advantage of implicit feedback and engagement metrics. This can be extremely useful for downgrading more clickbait-style posts that don’t have very many likes, and showing interests of users who don’t tend to ‘like’ very often.
  • Calculate image features using Convolutional Neural Nets. Remove the final dense layers, then calculate and display similar images to those the user has liked based on those features.
  • Integrate Facebook’s social network to display images of people you're connected with.
  • Use Matrix Factorization to see if we can recommend content. You could even use image features and hashtags to construct feature vectors for hybrid techniques.
  • Use natural language processing (NLP) and clustering techniques to find similar hashtags (even ones with emojis!). This provides normalization of hashtags (bike vs biking) and similarity metrics (nature vs mountain).
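As a rough illustration of the image-similarity idea in the list above, cosine similarity is one common way to compare feature vectors. The three-dimensional vectors below are made up, since real CNN features would have hundreds or thousands of dimensions, and both helper functions are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(liked_vector, candidates):
    """Return the name of the candidate whose features best match `liked_vector`."""
    return max(candidates,
               key=lambda name: cosine_similarity(liked_vector, candidates[name]))


# Made-up feature vectors for one liked photo and two candidate photos.
liked = [0.9, 0.1, 0.0]
candidates = {
    "mountain_photo": [0.8, 0.2, 0.1],
    "food_photo": [0.0, 0.1, 0.9],
}
best = most_similar(liked, candidates)
print(best)
```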

This is by no means an exhaustive list, so if you have any other ideas please let me know!

For more information on discovery engines, check out our personalization page or schedule a demo to learn more about Stream’s personalized feeds.

Happy building!
