I wanted to highlight some of the best articles out there about newsfeed and activity stream development. As an author of Stream-Framework and the hosted getstream.io, I’ve read most of them. Here are my favourites:
Twitter 2010 - by Nick Kallen (@nk)
Starting from slide 24, the presentation explains the problems with Twitter’s original approach. It also goes over the tradeoffs between memory and disk storage.
Twitter’s timelines at scale 2013 - by Raffi Krikorian (@raffi)
It’s nice to see how they’ve progressed over the years. The 2013 article describes a pure fanout approach using Redis. At the time they had 150 million active users. It’s interesting to see how they’ve optimized the Redis memory usage. I still wonder, though, how much they were spending on this approach. It seems quite expensive to operate.
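To make the term concrete, here is a minimal in-memory sketch of fanout on write. It is not Twitter’s code: the dict of bounded deques only stands in for per-user Redis lists, and every name in it (`follow`, `publish`, `read_timeline`, `MAX_TIMELINE_LENGTH`) is made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical cap on stored entries per timeline; kept tiny for illustration.
MAX_TIMELINE_LENGTH = 3

# producer id -> set of follower ids
followers = defaultdict(set)
# user id -> bounded timeline of activity ids, newest first
timelines = defaultdict(lambda: deque(maxlen=MAX_TIMELINE_LENGTH))


def follow(follower, producer):
    followers[producer].add(follower)


def publish(producer, activity_id):
    """Fan out on write: copy the activity id into every follower's timeline."""
    for follower in followers[producer]:
        # appendleft on a full deque drops the oldest entry at the right end.
        timelines[follower].appendleft(activity_id)
    return len(followers[producer])  # number of writes this publish caused


def read_timeline(user, limit=10):
    """Reads are cheap: the timeline is already materialized."""
    return list(timelines[user])[:limit]
```

The write cost grows with the follower count while reads stay cheap, which is why this approach uses so much memory at scale.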
Yahoo Research & Princeton 2010 - by Adam Silberstein et al (@YahooResearch)
One of the issues with a pure fanout approach is that it’s expensive. It also gets tricky when you have extremely popular users on your platform. In Twitter’s 2013 approach every tweet by Katy Perry leads to 65 million updates. The Yahoo paper discusses how you can reduce the overall system cost (at the expense of performance) by combining a push and a pull approach. The decision to push or pull is based on the producer/consumer relation. In theory it’s a very interesting approach, but I haven’t seen any app use it in production.
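A rough sketch of what such a hybrid could look like is below. The push/pull rule is my own simplification (push when the consumer reads at least as often as the producer posts); the paper’s actual cost model is more detailed. The rates, user names and function names are all invented for the example.

```python
from collections import defaultdict

# Hypothetical per-user rates (events per day); the numbers are made up.
post_rate = {"celebrity": 50.0, "friend": 1.0}
read_rate = {"casual_reader": 2.0, "heavy_reader": 100.0}

follows = defaultdict(set)        # consumer -> producers they follow
materialized = defaultdict(list)  # consumer -> events pushed at write time
outbox = defaultdict(list)        # producer -> the producer's own events


def should_push(producer, consumer):
    """Assumed rule: push if the consumer reads at least as often as the producer posts."""
    return read_rate[consumer] >= post_rate[producer]


def publish(producer, timestamp, text):
    event = (timestamp, producer, text)
    outbox[producer].append(event)
    # Push only to the consumers for whom materializing is worth it.
    for consumer, producers in follows.items():
        if producer in producers and should_push(producer, consumer):
            materialized[consumer].append(event)


def read_feed(consumer):
    """Merge pushed events with events pulled from the remaining producers."""
    events = list(materialized[consumer])
    for producer in follows[consumer]:
        if not should_push(producer, consumer):
            events.extend(outbox[producer])
    return sorted(events, reverse=True)  # newest first
```

With these numbers, the celebrity’s posts are pushed to the heavy reader but only pulled at read time for the casual reader, so a rarely read feed no longer pays for every post.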
Etsy’s feed architecture 2011 - by Dan McKinley (@mcfunley)
Etsy supports a very rich feature set for their newsfeed. They mix in popular content into the experience and they also do aggregation (rollups).
The presentation describes how they try to preselect a subset of follow relations based on affinity. From the presentation it looks as though they build the feed in the background using Gearman and subsequently store it in Memcached.
I have the feeling their approach has changed over the years, though, and I would love to see a post explaining the current architecture.
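For readers new to aggregation, a rollup simply groups similar activities into one feed entry. The sketch below groups by actor, verb and day. That grouping key is my assumption for illustration, not something taken from Etsy’s presentation, and the `aggregate` function is hypothetical.

```python
from collections import defaultdict


def aggregate(activities):
    """Roll up activities that share an actor, verb and day into one entry.

    Each activity is an (actor, verb, object, day) tuple. The grouping key
    is an assumption for this sketch, not Etsy's actual rule.
    """
    groups = defaultdict(list)
    for actor, verb, obj, day in activities:
        groups[(actor, verb, day)].append(obj)
    # Dicts keep insertion order, so rollups appear in first-seen order.
    return [
        {"actor": actor, "verb": verb, "day": day, "objects": objects}
        for (actor, verb, day), objects in groups.items()
    ]
```

Three favourites by the same person on the same day then show up as a single entry listing three objects, instead of three separate feed items.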
Instagram & Cassandra 2014 - by Rick Branson (@rbranson)
Instagram also uses a pure fanout approach, but it’s based on Cassandra instead of Redis. They use their own tombstone type of approach to prevent race conditions when adding and removing activities. I hope Rick will post a more detailed write-up of their setup soon.
Also see messaging at scale and Cassandra at Instagram.
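The general idea behind a tombstone can be sketched in a few lines. This is only my illustration of the concept, not Instagram’s implementation: the `Feed` class and its methods are hypothetical, and a real system would also need to expire tombstones eventually.

```python
class Feed:
    """Toy feed where removals are recorded as tombstones instead of deletes."""

    def __init__(self):
        self.entries = {}        # activity_id -> timestamp
        self.tombstones = set()  # activity ids that have been removed

    def add(self, activity_id, timestamp):
        self.entries[activity_id] = timestamp

    def remove(self, activity_id):
        # Record the removal rather than deleting the entry, so an add
        # that arrives late cannot bring the activity back.
        self.tombstones.add(activity_id)

    def read(self):
        """Return live activity ids, newest first, skipping tombstoned ones."""
        live = [
            (timestamp, activity_id)
            for activity_id, timestamp in self.entries.items()
            if activity_id not in self.tombstones
        ]
        return [activity_id for _, activity_id in sorted(live, reverse=True)]
```

If the remove for an activity is processed before its add (which can happen when fanout tasks run out of order), the activity still stays hidden at read time.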
Our own infrastructure at getstream.io is similar to Instagram’s approach and is based on Cassandra 2.0. If you think I’ve missed a great article, be sure to add it in the comments! In the coming weeks we’ll write a post about when to use Cassandra and when to go for Redis when building your newsfeed.