Using DynamoDB for Activity Feeds
DynamoDB is a fully managed, distributed key-value store whose capacity can be increased or reduced on demand. Many of our customers have switched from DynamoDB-based feeds to Stream. In this post we’ll describe a few common challenges we see users encounter when building feeds on DynamoDB.
News Feeds on DynamoDB using full fanout-on-write
DynamoDB is a NoSQL store with scalable write capacity, which makes it a good fit for a fanout-on-write feed. For those not familiar with the terminology, news feed infrastructure is usually divided into three groups: full pull, full push, and push/pull.
Full push news feed infrastructure is very similar to materialized views or indexes in SQL databases. In a full push news feed, every feed is pre-computed and stored in memory or on disk. For this reason, reads are extremely fast and writes are very heavy. Every time an activity is added to a feed, it’s written to that feed and then propagated to the feeds of all its followers (the fanout operation). Databases like Redis, Cassandra, and DynamoDB are often used as primary storage for feeds because they can handle intensive write traffic very efficiently.
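To make the fanout operation concrete, here is a minimal in-memory sketch of full push. The class name, the feed ids, and the activity shape are all hypothetical and chosen only for illustration; a real system would persist each write to a database rather than a Python list.

```python
from collections import defaultdict


class FanoutOnWriteFeeds:
    """In-memory stand-in for a full push feed store (illustrative only)."""

    def __init__(self):
        self.followers = defaultdict(set)  # feed id -> ids of feeds following it
        self.feeds = defaultdict(list)     # feed id -> activities, newest first

    def follow(self, follower, target):
        self.followers[target].add(follower)

    def add_activity(self, feed_id, activity):
        """Write to the origin feed, then copy to every follower feed.

        Returns the number of writes performed (followers + 1).
        """
        targets = [feed_id, *sorted(self.followers[feed_id])]
        for target in targets:
            self.feeds[target].insert(0, activity)
        return len(targets)

    def read(self, feed_id, limit=10):
        # Reads are a plain lookup because the feed is already materialized.
        return self.feeds[feed_id][:limit]


store = FanoutOnWriteFeeds()
store.follow("timeline:alice", "user:bob")
store.follow("timeline:carol", "user:bob")
writes = store.add_activity("user:bob", {"verb": "post", "object": "picture:1"})
print(writes)  # 3
```

The read path does no work beyond a lookup, while a single post costs one write per follower plus one for the origin feed. That asymmetry is what the rest of this post is about.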
Challenges with DynamoDB
API clients for DynamoDB are not as user-friendly as you would expect, especially in comparison to a SQL database. There is little abstraction available and understanding how DynamoDB works is essential for building a working integration.
While AWS manages the DynamoDB server infrastructure, users still have to design and set up tables and access permissions, configure monitoring, and automate scaling up and down. This is not a trivial task, and the initial design will have a huge impact later. It is important to pick the right partitioning and sorting logic (the partition key and sort key of each table) from the beginning. In addition, because DynamoDB challenges only arise at scale, it’s also important to validate that the initial setup will work with real-world traffic.
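As an illustration of the kind of design decision involved, the sketch below shows one possible table layout, with the feed id as partition key and a time-prefixed activity id as sort key. The table name, attribute names, capacity numbers, and the `feed_item` helper are assumptions made for this example, not a recommended schema. The dictionary uses the parameter names accepted by boto3's `create_table`, but nothing here calls AWS.

```python
# Hypothetical layout: one item per (feed, activity) pair.
FEED_TABLE = {
    "TableName": "feed_activities",  # assumed name
    "AttributeDefinitions": [
        {"AttributeName": "feed_id", "AttributeType": "S"},
        {"AttributeName": "activity_id", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "feed_id", "KeyType": "HASH"},       # partition key
        {"AttributeName": "activity_id", "KeyType": "RANGE"},  # sort key
    ],
    # Placeholder numbers; real values depend on your traffic.
    "ProvisionedThroughput": {"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
}


def feed_item(feed_id, timestamp_ms, activity_uuid, payload):
    """Build an item whose sort key orders activities by time within a feed."""
    # Zero-padding keeps string ordering consistent with numeric ordering.
    activity_id = f"{timestamp_ms:013d}#{activity_uuid}"
    return {"feed_id": feed_id, "activity_id": activity_id, **payload}


older = feed_item("timeline:alice", 1_500_000_000_000, "a1", {"verb": "post"})
newer = feed_item("timeline:alice", 1_500_000_000_500, "b2", {"verb": "like"})
print(older["activity_id"] < newer["activity_id"])  # True

# With boto3 installed and credentials configured, the table could be created with:
# boto3.client("dynamodb").create_table(**FEED_TABLE)
```

With this layout all activities of one feed live under one partition key, so a very popular feed concentrates reads on a single partition. That is the kind of trade-off that is hard to change after launch.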
DynamoDB capacity can be modified to meet the user’s current traffic requirements. Both read and write capacity can be changed to best match the traffic on the feed infrastructure. It’s important to note that DynamoDB charges based on the provisioned capacity and not on the capacity in use. Changes to provisioned capacity are also not immediate and can take several minutes to take effect.
In real life, your feed fanout traffic is not constant over time. Social networks are great examples. Most are highly connected networks where a tiny fraction of the user base accounts for the vast majority of content creation. In a full-fanout system, this means that every time one of these “super connected feeds” posts something new, a fanout operation with a large number of targets is spawned, which can create serious trouble for your feed infrastructure. Single events such as “Justin Bieber posted a new picture” can have a massive cost impact or even require a capacity increase.
In most cases your fanout infrastructure is made up of a pool of workers that each consume publishing tasks from a durable message queue. The async nature of processing these tasks typically introduces a few problems you’ll want to work around:
- Race conditions between adds and removes
- Race conditions between follow and unfollow
- Keeping write usage below a watermark (say 80% of provisioned capacity) requires coordinating fanout workers
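The watermark item in the list above can be sketched as a token bucket. This is a minimal single-process illustration under stated assumptions: `WriteBudget` is a hypothetical helper, the clock is passed in explicitly to keep the example deterministic, and a real pool of workers would have to share this budget (or split it between them), which is exactly the coordination problem described above. It does not address the two race conditions.

```python
class WriteBudget:
    """Token bucket that caps writes at a fraction of provisioned capacity."""

    def __init__(self, provisioned_units, watermark=0.8):
        self.rate = provisioned_units * watermark  # allowed writes per second
        self.tokens = self.rate                    # start with one second of budget
        self.last = 0.0                            # time of the last refill

    def try_acquire(self, now, units=1):
        """Refill for the elapsed time, then spend `units` if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False  # caller should leave the task on the queue and retry later


budget = WriteBudget(provisioned_units=100, watermark=0.8)
accepted_first = sum(budget.try_acquire(now=0.0) for _ in range(100))
accepted_later = sum(budget.try_acquire(now=0.5) for _ in range(100))
print(accepted_first, accepted_later)  # 80 40
```

A worker that fails to acquire budget would typically put the task back on the queue instead of issuing a write that is likely to be throttled.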
The best way to explain pricing is with another example. Let’s assume you have a user, topic, or editorial feed that has 100,000 followers. Every time an activity is added to that feed, you need to process 100,001 writes.
If your DynamoDB table has 25,000 write units, it will take at least 4 seconds (100,001 writes at 25,000 writes per second), and in practice around 5 seconds, for the update to be visible to everyone. During this interval, because the fanout demands roughly 4x your provisioned capacity, most writes will fail and have to be retried later. In case you are wondering, 100,000 write units cost north of $50,000/month and 25,000 units cost around $13,000/month.
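The arithmetic behind this example can be checked with a few lines of Python. The sketch assumes each fanout write consumes exactly one write unit, which holds for items of up to 1 KB, and it ignores throttling and retries, so the result is a lower bound. The function name is made up for this example.

```python
def fanout_estimate(followers, write_units_per_second):
    """Return (total writes, minimum seconds) for one activity."""
    writes = followers + 1  # one write per follower plus the origin feed
    return writes, writes / write_units_per_second


writes, seconds = fanout_estimate(100_000, 25_000)
print(writes, round(seconds, 2))  # 100001 4.0
```

Retries of throttled writes add to that lower bound, and larger items consume more than one write unit each.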
You can work around those numbers a little bit by scaling capacity up and down - for instance, maybe you decrease capacity during the night. Note that typically you’ll also need a message broker (like SQS), worker instances (on EC2), and some sort of real-time infrastructure (Socket.IO) to send out the updates.
Even with optimizations in place, you’ll probably spend $4k-$5k a month on DynamoDB. That’s roughly 5x the price of Stream’s PRO plan (and Stream’s feeds typically update within 500ms). The more popular your social network is, the bigger these spikes become.
The pricing gap becomes even larger when you start to look at Stream’s enterprise plans since we specialize in powering the feeds for apps with millions and even hundreds of millions of users.
Concluding: DynamoDB & News Feeds
DynamoDB is a very impressive database and can be a cost-effective solution for consistent and predictable traffic.
News feeds are, by nature, highly connected graphs and cause tremendous spikes in writes. This makes DynamoDB a very expensive option for building feeds. When provisioning extra capacity is too slow, your choices are to accept either degradation of service (e.g. updates taking minutes or even hours to propagate) or very high costs due to over-provisioned capacity.
In addition, with DynamoDB you also have to build and maintain more code to power the feeds. Features such as aggregation, notifications, ranking, real-time, analytics, and personalization will take time to get right.
While we may be slightly biased, the fact is that building feeds on Stream is a much simpler and more cost-effective process, especially in comparison to the other options out there today. Want to see for yourself? Well, you’re in luck. We have a free, quick, 5-minute tutorial that will help you wrap your head around Stream’s API technology.