We’ve all been there. Our incredibly witty, insightful post being shared worldwide by millions, or our painstakingly created meme being fought over by competing celebrity accounts.
Wait, what? You haven't found online celebrity success? OK, so you might not have caused the “fail whale” on X back in the day, but we’ve all seen it happen. Every feed service faces the same nightmare scenario: a celebrity posts, a news story breaks, or a game-winning play happens, and millions of users hit refresh at once. When that happens, how do systems stop their feeds from falling over?
The engineering required to keep feeds responsive under that kind of load is substantial, spanning fan-out strategies, caching architectures, queue designs, and graceful degradation.
Here's how it actually works.
What Happens When Someone With Millions of Followers Posts?
Posts like this cause every feed engineer a headache. When MrBeast publishes a tweet, the platform needs to place it into the feeds of his 300+ million followers. That process, called fan-out, is the central scaling challenge of any activity feed service: one action by one user creates hundreds of millions of units of work.
The system has to decide when to do that work. Up front, the moment the post is created? Or later, when each follower actually opens their feed? That choice defines the two fundamental fan-out strategies, and each comes with a sharp tradeoff:
| | Fan-out on write (push) | Fan-out on read (pull) |
|---|---|---|
| How it works | Pre-compute every follower's feed at post time | Build each user's feed on demand at read time |
| Read speed | Very fast (~5ms), just a cache lookup | Slower, requires querying + ranking at request time |
| Write cost | Enormous for high-follower accounts | Minimal, store once |
| Viral spike behavior | Massive write amplification under load | Read load increases, but writes stay flat |
| Ranking flexibility | Hard to re-rank after materialization | Easy to A/B test new ranking models |
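The tradeoff in the table can be made concrete in a few lines. This is a minimal sketch, not any platform's real implementation; the `PushFeed`/`PullFeed` classes and the `followers` data are illustrative:

```python
# Sketch of the two fan-out strategies. All names (PushFeed, PullFeed,
# followers) are illustrative, not any platform's actual API.
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "mrbeast": ["bob", "dave"]}

class PushFeed:
    """Fan-out on write: copy the post into every follower's feed at post time."""
    def __init__(self):
        self.feeds = defaultdict(list)        # follower -> materialized feed

    def post(self, author, text):
        for f in followers[author]:           # O(followers) writes per post
            self.feeds[f].append((author, text))

    def read(self, user):
        return self.feeds[user]               # O(1), cache-style lookup

class PullFeed:
    """Fan-out on read: store the post once, assemble each feed on demand."""
    def __init__(self):
        self.posts = defaultdict(list)        # author -> their posts

    def post(self, author, text):
        self.posts[author].append((author, text))   # a single write

    def read(self, user):
        # The expensive part moves to read time: find everyone the user
        # follows, then gather (and in production, rank) their posts.
        following = [a for a, fs in followers.items() if user in fs]
        return [p for a in following for p in self.posts[a]]

push, pull = PushFeed(), PullFeed()
push.post("mrbeast", "new video!")
pull.post("mrbeast", "new video!")
```

With a high-follower author, `PushFeed.post` is the loop that explodes into hundreds of millions of writes, while `PullFeed.post` stays a single write and `PullFeed.read` pays the cost instead.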
When Twitter was initially scaling, fan-out on write was the default. A single tweet from a user with 31 million followers triggered roughly 93 million Redis writes when replicated, and could take up to 5 minutes to fully propagate. Scale that to MrBeast's 300 million followers and the problem is an order of magnitude worse.
X's current architecture looks very different. It is built around Thunder, an in-memory store that tracks recent posts from all users and enables sub-millisecond lookups without hitting an external database. When you open your feed, X pulls posts from accounts you follow out of Thunder and ranks them in real time using a Grok-based transformer called Phoenix, a design much closer to fan-out on read than the old push model.
So Which Approach Wins?
Neither, on its own. Fan-out on write is fast to read but devastating to write at celebrity scale. Fan-out on read avoids the write amplification but adds latency to every feed load.
The practical solution is a hybrid:
- Fan-out on write for regular users (fast reads, manageable write cost)
- Fan-out on read for high-follower accounts (avoids the MrBeast problem)
At read time, the system merges precomputed results with on-demand celebrity content. LinkedIn took the read-side approach even further with their FollowFeed system, computing feeds entirely at request time using a two-pass ML ranking architecture, because it made relevance experimentation dramatically simpler.
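The read-time merge is the heart of the hybrid. Here is a hedged sketch under assumed data shapes; `CELEBRITY_THRESHOLD`, `precomputed`, and `fetch_celebrity_posts` are hypothetical names, and real systems merge ranked, paginated streams rather than whole lists:

```python
# Hybrid read path: merge the push-side materialized feed with pull-side
# celebrity content. Everything here is illustrative.
CELEBRITY_THRESHOLD = 1_000_000   # hypothetical follower cutoff for the split

# Push side: this user's precomputed feed, as (timestamp, author, text)
precomputed = [(100, "friend", "lunch pic"), (300, "coworker", "shipped it")]

def fetch_celebrity_posts(followed_celebs):
    """Pull side: fetch high-follower accounts' recent posts at read time."""
    store = {"mrbeast": [(200, "mrbeast", "new video!")]}  # stand-in datastore
    return [p for c in followed_celebs for p in store.get(c, [])]

def read_feed(followed_celebs):
    # Merge both paths and order newest-first; production systems would
    # rank here instead of sorting purely by timestamp.
    celeb = fetch_celebrity_posts(followed_celebs)
    return sorted(precomputed + celeb, key=lambda p: p[0], reverse=True)
```

The follower threshold that routes an account to the pull path is exactly the tuning knob the next paragraph describes.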
Getting this split right is an ongoing operational problem, not a one-time architecture decision. It requires tuning follower thresholds, managing merge logic across both paths, and handling edge cases like inactive user eviction that quietly accumulate technical debt.
How Do Feeds Load in Milliseconds When There Are Billions of Posts?
The most important layer of a feed service isn’t the database; it is the cache. In these architectures, caching is the primary serving infrastructure, not an optimization layer.
Meta's TAO system serves over 10 billion requests per second with a historically measured 96.4% cache hit rate. Twitter's old timeline cache consumed roughly 40TB of Redis heap at 30 million QPS across 6,000+ instances in a single datacenter. At this scale, the database is the persistence layer of last resort, not the read path.
The architecture typically looks like this:
User request
→ L1: Application-level hot-key cache (local memory)
→ L2: Distributed cache (Redis / Memcached cluster)
→ L3: Database (only on cache miss, ~3-5% of requests)
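The lookup path above can be sketched with plain dictionaries standing in for each tier. This is an illustration of the tiering logic only; `l1`, `l2`, `db`, and the key format are all made up:

```python
# Three-tier read path: local hot-key cache -> distributed cache -> database.
l1 = {}                                   # L1: application-local memory
l2 = {}                                   # L2: stands in for Redis/Memcached
db = {"feed:42": ["post-a", "post-b"]}    # L3: persistence layer of last resort
stats = {"db_hits": 0}

def get_feed(key):
    if key in l1:                 # L1 hit: fastest, no network
        return l1[key]
    if key in l2:                 # L2 hit: one network round trip
        l1[key] = l2[key]         # promote to L1 for subsequent requests
        return l2[key]
    stats["db_hits"] += 1         # full miss: only ~3-5% of requests
    value = db[key]
    l2[key] = value               # backfill both cache tiers
    l1[key] = value
    return value
```

Note that only the first request for a key ever reaches the database; every later request is absorbed by a cache tier.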
Standard caching isn't enough during a traffic surge, though. Two specific failure modes emerge:
- The thundering herd/cache stampede. A popular cache entry expires, and thousands of simultaneous requests notice the missing key and all rush to the database to regenerate it. Meta solved this with a lease mechanism in which Memcache issues only one fill token per key every 10 seconds. Other requestors get a slightly stale value or a "retry shortly" signal. A hot key that would generate thousands of database queries without leases generates exactly one with leases.
- The identical-request flood. 50,000 users load the feed at the exact same second during a major event, and the cache entry just expired. Request collapsing (sometimes called "singleflight") lets the first request hit the database while the other 49,999 wait. Once it returns, all 50,000 are served from a single result. Probabilistic early expiration (the X-Fetch algorithm) complements this by re-fetching cache entries in the background before they expire, so the stampede never triggers in the first place.
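Request collapsing can be sketched compactly. This is a hedged illustration in Python threads, not Meta's lease implementation or Go's `singleflight` package; the `SingleFlight` class and `slow_fetch` are invented for the demo:

```python
import threading
import time

class SingleFlight:
    """Collapse concurrent identical requests into one backend call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}                  # key -> (done event, result holder)

    def do(self, key, fn):
        with self._lock:
            if key in self._calls:        # a fetch is already in flight
                event, holder = self._calls[key]
                leader = False
            else:
                event, holder = threading.Event(), {}
                self._calls[key] = (event, holder)
                leader = True
        if leader:
            holder["value"] = fn()        # only the leader hits the database
            with self._lock:
                del self._calls[key]
            event.set()                   # wake every waiting follower
        else:
            event.wait()                  # followers reuse the leader's result
        return holder["value"]

# Demo: 50 simultaneous requests for the same expired key.
db_queries = []
results = []
barrier = threading.Barrier(50)

def slow_fetch():
    db_queries.append(1)                  # each entry is one database hit
    time.sleep(0.2)                       # simulate feed regeneration latency
    return "fresh feed"

sf = SingleFlight()

def request():
    barrier.wait()                        # all 50 arrive at the same instant
    results.append(sf.do("feed:hot", slow_fetch))

threads = [threading.Thread(target=request) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All 50 requests return the same value, but the backing store is queried once: the stampede becomes a single fill.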
Building and operating a multi-tier caching layer with hot-key detection, lease mechanisms, request collapsing, and cross-region consistency (Meta achieves ten nines of cache consistency using a dedicated monitoring system called Polaris) is one of the most operationally intensive parts of running a feed system at scale.
How Does the System Absorb a Traffic Spike Without Dropping Posts?
Message queues decouple the synchronous write path (accept the post, return success to the user) from the asynchronous work (fan out to followers, update search indexes, send notifications).
In Twitter's original architecture, a tweet was validated and persisted within 145 milliseconds, then placed in a queue, and the client connection was released. The async pathway then forked to independent pipelines for home timeline fan-out, search indexing, push notifications (iOS, Android, SMS), and interest-based email. This separation means a spike in posts doesn't directly spike feed reads, and vice versa. Each pipeline scales independently.
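The decoupling above reduces to a simple shape: the write path enqueues and returns, and workers drain the queue on their own schedule. A minimal single-process sketch, with Python's `queue.Queue` standing in for Kafka and one worker standing in for all the async pipelines (`accept_post` and `fanout_worker` are invented names):

```python
import queue
import threading

inbox = queue.Queue()        # stands in for the durable message queue
delivered = []               # stands in for fanned-out follower timelines

def accept_post(author, text):
    # Synchronous path: validate, persist, enqueue, release the client.
    post = {"author": author, "text": text}
    inbox.put(post)                  # hand off to the async pipelines
    return {"status": "ok"}          # returned before fan-out happens

def fanout_worker():
    # Async path: drains independently of the write path's pace.
    while True:
        post = inbox.get()
        if post is None:             # shutdown sentinel for this demo
            break
        delivered.append(post)       # stand-in for timeline fan-out,
        inbox.task_done()            # search indexing, notifications...

worker = threading.Thread(target=fanout_worker)
worker.start()
resp = accept_post("mrbeast", "new video!")
inbox.put(None)
worker.join()
```

In production each pipeline (timeline fan-out, search, notifications) is its own consumer group over the same stream, which is what lets them scale independently.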
Kafka is the dominant choice here because of its retention-based model. With this model, messages persist for days regardless of consumption rate, so queues can't overflow during surges. They accumulate consumer lag instead. The operational playbook during a viral event is straightforward:
- Watch consumer lag. This is the primary health metric during a spike. If lag is growing, consumers are falling behind. If it's stable or shrinking, the system is keeping up.
- Scale consumers horizontally up to the partition count. Each partition can only be read by one consumer in a group, so the partition count is your ceiling. Best practice is to over-partition topics at 1.5–2× expected peak so you have headroom to add consumers without repartitioning.
- Let them drain. Unlike traditional message queues, Kafka doesn't lose messages when consumers fall behind. Once capacity catches up, the backlog clears itself.
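Consumer lag itself is just arithmetic: the partition's latest offset minus the consumer group's committed offset. A sketch of the health check from the playbook above, with made-up offsets (in production these come from the Kafka admin API or a monitoring tool, not hardcoded dicts):

```python
# Per-partition lag = end offset (newest message) - committed offset.
# All figures below are invented for illustration.
end_offsets = {0: 1_500_000, 1: 1_480_000, 2: 1_510_000}
committed   = {0: 1_499_800, 1: 1_210_000, 2: 1_509_950}

def partition_lag(end, done):
    return {p: end[p] - done[p] for p in end}

def is_falling_behind(lag_now, lag_before):
    # The trend matters more than the absolute number: growing total lag
    # means consumers can't keep up; stable or shrinking means they can.
    return sum(lag_now.values()) > sum(lag_before.values())
```

Here partition 1 is the hot spot (270,000 messages behind), which is the cue to add consumers, up to the partition-count ceiling.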
LinkedIn processes over 7 trillion messages per day through Kafka using this pattern. Activities are partitioned to guarantee per-user ordering, so each user's feed stays consistent even as the system absorbs uneven load across partitions.
What Stops a Viral Moment From Crashing Everything?
Even with caching and queues, traffic can exceed capacity. The most important principle here: if you don't decide what to turn off during a spike, the system will decide for you, usually by turning off everything.
Surviving means defining your "degraded mode" in advance and layering two defense mechanisms:
First, traffic tiering and load shedding. Not all requests are equal. At the gateway layer, traffic gets classified into priority tiers:
| Tier | Feed system examples | During a spike... |
|---|---|---|
| Tier 1: Critical | Serving cached feeds, loading core timeline | Must succeed. Never shed. |
| Tier 2: Degradable | Computing fresh personalized feeds, search | Serve from stale caches if needed. |
| Tier 3: Non-essential | Recommendations, engagement counts, analytics | Can fail silently. Shed first. |
The key is making this automatic. Adaptive concurrency limits monitor downstream latency, and as soon as response times cross a threshold (e.g., 50ms), the system stops calling Tier 3 services entirely. Users see a slightly less personalized feed, but the core experience works.
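The tier logic reads naturally as a latency-gated switch. A hedged sketch of the idea, with invented thresholds and response fields (real systems measure a rolling p99 and gate each downstream call, not one function):

```python
# Latency-gated degradation: shed Tier 3 first, then degrade Tier 2.
# Thresholds and field names are illustrative.
SHED_TIER3_MS = 50          # stop calling non-essential services here
DEGRADE_TIER2_MS = 150      # fall back to stale caches here

def build_feed(p99_ms):
    feed = {"timeline": "cached timeline"}              # Tier 1: never shed
    if p99_ms < DEGRADE_TIER2_MS:
        feed["personalized"] = "fresh ranked feed"      # Tier 2: full quality
    else:
        feed["personalized"] = "stale cached ranking"   # Tier 2: degraded
    if p99_ms < SHED_TIER3_MS:
        feed["recommendations"] = "who to follow"       # Tier 3: shed first
    return feed
```

Under load the feed quietly loses recommendations, then freshness, but the core timeline is always returned.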
Second, dependency isolation. In a feed service, the timeline service, user profile service, engagement counts, and recommendation engine often share infrastructure. Without isolation, a minor dependency can take down the core path.
Two patterns help here:
- Bulkheads give each dependency its own isolated resource pool, with concurrency limits that reject excess traffic instantly. If the recommendation engine gets overwhelmed, it can't consume the connections needed to serve cached timelines.
- Circuit breakers complement this: when a dependency's failure rate crosses a threshold, the system stops calling it entirely and falls back to cached or stale data instead of waiting for responses that won't come.
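The breaker pattern is small enough to sketch in full. This is a simplified illustration (the `CircuitBreaker` class is invented; production breakers like those in Hystrix-style libraries also add a half-open state that periodically probes for recovery):

```python
class CircuitBreaker:
    """Trips open after consecutive failures; then serves the fallback."""
    def __init__(self, max_failures=3, fallback="stale cached data"):
        self.max_failures = max_failures
        self.fallback = fallback
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            return self.fallback      # stop calling the sick dependency at all
        try:
            result = fn()
            self.failures = 0         # any success resets the count
            return result
        except Exception:
            self.failures += 1        # count the failure, serve the fallback
            return self.fallback
```

Once open, the breaker never even invokes the dependency, so callers stop burning threads waiting on responses that won't come.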
Feature flags tie both layers together. An on-call engineer can disable personalization, engagement counts, or real-time updates independently without deploying code, and each toggle defaults to the safe state if flag evaluation itself fails.
Can You Just Auto-Scale Your Way Through a Spike?
No. This is one of the most counterintuitive lessons from operating feeds at scale.
Kubernetes's Horizontal Pod Autoscaler checks metrics every 15 seconds. At scale, this auto-scaling is too reactive. By the time your cloud provider spins up new instances, your database connection pool is already exhausted.
What actually works for predictable events, such as the Super Bowl, election night, or a product launch, is:
- Pre-scale compute and cache capacity days or weeks in advance.
- Run game-day exercises simulating 150%+ of projected peak with injected failures. Test specific disaster scenarios: What happens if the primary Redis cluster vanishes? What if 5 million users log in within 60 seconds?
- Tier services by importance so non-essential features shed load automatically when core feed serving needs the capacity.
- Reclaim capacity after the event. Pinterest's platform reclaims 80% of capacity during non-peak hours, achieving 30% fewer instance-hours per day vs. static clusters.
For unpredictable viral events, the system relies on the caching, queueing, and degradation layers described above. These are what buy time for auto-scaling to catch up, or for the spike to pass. If you don't know the exact request rate that breaks your system, you aren't ready for the event.
This combination of hybrid fan-out, multi-tier caching with request collapsing, queue-based decoupling, progressive degradation with bulkhead isolation, and capacity planning is what keeps feeds responsive when traffic surges. Each layer is a substantial engineering investment on its own, and the real complexity is in making them work together across the edge cases that only surface under real production load.
Stream's Activity Feeds infrastructure handles these patterns as a managed service. The underlying architecture uses a Materialized Feed — a hybrid fan-out approach that prioritizes recently active feeds and skips inactive ones — with built-in cache stampede protection via singleflight, and a 99.999% uptime SLA. The hybrid fan-out logic, ranked feed delivery, and scaling infrastructure are all managed, so your engineering team can focus on the product experience rather than the distributed systems plumbing.