Architecture & Benchmark

Stream's video API is designed to scale WebRTC-based video calling to massive audiences while maintaining low latency and high quality. Here we explain the architecture that allows the video API to scale to 1 million+ participants with excellent performance.

Scaling WebRTC

Why WebRTC Has a Reputation for Not Scaling

WebRTC was originally designed for peer-to-peer communication. In a naive mesh implementation, each participant sends their video directly to every other participant, creating an O(n²) scaling problem: with just 10 participants, each one sends a stream to the other 9, for 90 directed media streams across 45 peer connections. This approach quickly becomes impractical.
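
To make the mesh arithmetic concrete, here is a minimal Go sketch (the function names are ours, for illustration only):

```go
package main

import "fmt"

// meshStreams returns the total number of directed media streams in a
// full-mesh call: each of n participants sends one stream to the n-1 others.
func meshStreams(n int) int { return n * (n - 1) }

// sfuUplinks: with an SFU, each participant uploads exactly one stream,
// so total uplinks grow linearly instead of quadratically.
func sfuUplinks(n int) int { return n }

func main() {
	fmt.Println(meshStreams(10), sfuUplinks(10))   // 90 1 stream per participant
	fmt.Println(meshStreams(100), sfuUplinks(100)) // quadratic vs. linear growth
}
```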

Additionally, WebRTC's real-time nature means you can't rely on buffering to hide network issues, making it challenging to maintain quality at scale.

How We Scale to 1M Participants

We've overcome these limitations through a combination of architectural decisions and optimizations:

SFU + SFU Cascading

Instead of peer-to-peer connections, we use Selective Forwarding Units (SFUs). An SFU receives media from each participant and selectively forwards it to others, reducing the connection complexity from O(n²) to O(n).
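
The core forwarding idea can be sketched in a few lines of Go. This is a simplified model, not Stream's actual SFU: `Packet` stands in for an RTP packet, and real SFUs use far smarter congestion policies than dropping on a full buffer.

```go
package main

import "fmt"

// Packet is a stand-in for an RTP packet (simplified).
type Packet struct {
	From string
	Seq  int
}

// SFU forwards each incoming packet only to the participants subscribed
// to the sender, never back to the sender itself.
type SFU struct {
	// subscribers[sender][receiver] = delivery channel for that receiver
	subscribers map[string]map[string]chan Packet
}

func NewSFU() *SFU {
	return &SFU{subscribers: map[string]map[string]chan Packet{}}
}

// Subscribe registers receiver for sender's stream and returns its channel.
func (s *SFU) Subscribe(sender, receiver string) chan Packet {
	ch := make(chan Packet, 16)
	if s.subscribers[sender] == nil {
		s.subscribers[sender] = map[string]chan Packet{}
	}
	s.subscribers[sender][receiver] = ch
	return ch
}

// Forward delivers p to every subscriber of p.From.
func (s *SFU) Forward(p Packet) {
	for _, ch := range s.subscribers[p.From] {
		select {
		case ch <- p: // forward
		default: // drop if the receiver is slow (real SFUs do better)
		}
	}
}

func main() {
	sfu := NewSFU()
	bob := sfu.Subscribe("alice", "bob")
	sfu.Forward(Packet{From: "alice", Seq: 1})
	fmt.Println((<-bob).Seq) // 1
}
```

Each client keeps a single connection to the SFU, so total client connections grow linearly with participants.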

For very large calls, we cascade multiple SFUs together. This allows us to distribute participants across multiple servers while maintaining real-time communication between them. The cascading layer handles:

  • Forwarding video and audio streams between SFUs
  • Synchronizing call state across all instances
  • Optimizing routing to minimize latency

Automatic Subscription Management

Our SDKs automatically handle subscribing to the right video streams. If a participant isn't visible on screen, we don't download their video. This dramatically reduces bandwidth usage in large calls where you might only see a grid of 9-25 participants at a time.
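
A simplified model of what the SDKs do automatically: compute the subscription set from what is actually rendered. The function below is illustrative, not the SDK API.

```go
package main

import "fmt"

// desiredSubscriptions returns the streams worth downloading: only the
// participants currently visible on screen, excluding yourself.
func desiredSubscriptions(visible []string, self string) []string {
	var subs []string
	for _, id := range visible {
		if id != self { // never subscribe to your own stream
			subs = append(subs, id)
		}
	}
	return subs
}

func main() {
	// A call might have thousands of participants, but only a 3x3 grid
	// is on screen, so only 8 remote streams are downloaded.
	visible := []string{"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"}
	fmt.Println(len(desiredSubscriptions(visible, "p1"))) // 8
}
```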

Go for Performance

Like our chat and feeds infrastructure, our video backend is written in Go. Go's excellent concurrency primitives and low memory footprint make it ideal for handling thousands of simultaneous WebRTC connections per server.
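
The pattern Go makes cheap is one goroutine per connection: each goroutine starts with only a few kilobytes of stack, so tens of thousands per server are routine. A minimal sketch (the media handling is stubbed out):

```go
package main

import (
	"fmt"
	"sync"
)

// serveAll spawns one goroutine per simulated WebRTC connection and
// reports how many were handled.
func serveAll(conns int) int {
	results := make(chan int, conns)
	var wg sync.WaitGroup
	for i := 0; i < conns; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- id // stand-in for handling this connection's media
		}(i)
	}
	wg.Wait()
	close(results)
	handled := 0
	for range results {
		handled++
	}
	return handled
}

func main() {
	fmt.Println(serveAll(10000)) // 10000
}
```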

Auto-Scaling & Performance

Thundering Herd Prevention

When a large event starts, thousands of users may join simultaneously. We've built protections against thundering herd problems that could overwhelm the system during these spikes.

Hotspot Prevention

Similar to our activity feeds architecture, we prevent database hotspots when updating timestamps and other frequently-changing data. This ensures that high-traffic calls don't create performance bottlenecks.

Redis Client-Side Caching

We use Redis client-side caching for optimal performance: hot keys are cached in the application process, which saves roundtrips to Redis and helps maintain low latency even under heavy load.
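
The idea in miniature: repeated reads of a hot key are served from process memory, and the local copy is dropped when the key changes. This sketch models the invalidation that Redis 6+ pushes to clients over RESP3 with an explicit `Invalidate` method; `fetchFromRedis` is a hypothetical loader, not a real client call.

```go
package main

import (
	"fmt"
	"sync"
)

// cachedClient keeps a local copy of hot keys so repeated reads skip the
// network roundtrip to Redis.
type cachedClient struct {
	mu             sync.RWMutex
	local          map[string]string
	roundtrips     int // roundtrips actually made (for illustration)
	fetchFromRedis func(key string) string
}

func (c *cachedClient) Get(key string) string {
	c.mu.RLock()
	v, ok := c.local[key]
	c.mu.RUnlock()
	if ok {
		return v // served locally: no roundtrip
	}
	v = c.fetchFromRedis(key)
	c.mu.Lock()
	c.roundtrips++
	c.local[key] = v
	c.mu.Unlock()
	return v
}

// Invalidate is what a RESP3 invalidation push message would trigger.
func (c *cachedClient) Invalidate(key string) {
	c.mu.Lock()
	delete(c.local, key)
	c.mu.Unlock()
}

func main() {
	c := &cachedClient{
		local:          map[string]string{},
		fetchFromRedis: func(key string) string { return "v:" + key },
	}
	for i := 0; i < 100; i++ {
		c.Get("call:123:state")
	}
	fmt.Println(c.roundtrips) // 1
}
```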

SFU Cascading Deep Dive

When a call grows beyond what a single SFU can handle, or when participants are geographically distributed, we cascade multiple SFUs together.

Video & State Forwarding

The cascading layer efficiently forwards:

  • Video streams: Only the streams that are needed on each SFU are forwarded
  • Audio streams: Mixed or selectively forwarded based on who's speaking
  • Call state: Participant lists, reactions, and other state are synchronized across all SFUs in real-time
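
The "only the streams that are needed" rule can be sketched as a set intersection: an SFU pulls a remote stream from a peer SFU only if at least one local participant subscribes to it. This is our simplified illustration, not Stream's routing protocol.

```go
package main

import "fmt"

// needsFromPeer returns which remote streams this SFU must pull from a
// peer SFU: only those with at least one local subscriber.
func needsFromPeer(remoteStreams []string, localSubs map[string]int) []string {
	var needed []string
	for _, s := range remoteStreams {
		if localSubs[s] > 0 {
			needed = append(needed, s)
		}
	}
	return needed
}

func main() {
	remote := []string{"alice", "bob", "carol"}
	localSubs := map[string]int{"alice": 12, "carol": 1} // nobody here watches bob
	fmt.Println(needsFromPeer(remote, localSubs)) // [alice carol]
}
```

Streams nobody watches locally never cross the inter-SFU link, which is what keeps cascading bandwidth proportional to demand rather than to call size.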

Redundancy & Reliability

Infrastructure as Code

All infrastructure is defined in code, ensuring consistent deployments and easy disaster recovery. This approach allows us to:

  • Quickly spin up new capacity when needed
  • Maintain identical configurations across environments
  • Audit and version control all infrastructure changes

Multi-Datacenter & Multi-Provider

We run across multiple datacenters and hosting providers. This provides:

  • Geographic redundancy for disaster recovery
  • Lower latency by routing users to nearby servers
  • Protection against provider-specific outages

Ensuring High Quality Audio/Video

Audio Optimization

  • DTX (Discontinuous Transmission): Reduces bandwidth by not transmitting during silence
  • Opus RED (Redundant Encoding): Adds redundancy to audio packets, making audio more resilient to packet loss

Simulcast + Automatic Codec & Resolution Selection

With simulcast, participants upload high-, medium-, and low-quality encodings of their video when needed. The system automatically selects the optimal codec and resolution for each subscriber based on:

  • Network conditions
  • Device capabilities
  • Number of participants
  • Available bandwidth
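
Selection can be reduced to its simplest form: pick the highest simulcast layer the subscriber's bandwidth can sustain. The thresholds below are illustrative, not Stream's actual values, and a real selector also weighs device capabilities and participant count as listed above.

```go
package main

import "fmt"

// pickLayer chooses a simulcast layer from the estimated downlink
// bandwidth in kbps.
func pickLayer(bandwidthKbps int) string {
	switch {
	case bandwidthKbps >= 1500:
		return "high" // e.g. 720p
	case bandwidthKbps >= 500:
		return "medium" // e.g. 360p
	default:
		return "low" // e.g. 180p
	}
}

func main() {
	fmt.Println(pickLayer(2000), pickLayer(800), pickLayer(200)) // high medium low
}
```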

TURN on SFU

We run TURN directly on our SFU servers, including on port 443/TCP. This ensures connectivity even in restrictive network environments where UDP might be blocked.
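
In ICE server terms, the fallback chain looks roughly like this (TURN URI syntax per RFC 7065; the hostname is hypothetical). The final `turns:...:443?transport=tcp` entry is indistinguishable from HTTPS traffic to most firewalls, which is what makes it the reliable last resort.

```go
package main

import "fmt"

// iceServers returns the TURN URLs a client might try, in order of
// preference: UDP first for latency, then TCP, then TLS on 443.
func iceServers() []string {
	return []string{
		"turn:sfu.example.com:3478?transport=udp",  // preferred: lowest latency
		"turn:sfu.example.com:3478?transport=tcp",  // UDP blocked
		"turns:sfu.example.com:443?transport=tcp",  // last resort: looks like HTTPS
	}
}

func main() {
	fmt.Println(len(iceServers())) // 3
}
```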

UI Best Practices for Quality

Our SDKs include UI components that follow best practices for video quality:

  • Bad network indicator: Shows users when their connection quality is poor
  • Speaking while muted detection: Alerts users when they're trying to speak with their microphone muted
  • Mic input volume indicator: Visual feedback showing microphone input levels while talking
  • Speaker test button: Allows users to test their speaker selection before joining a call

Benchmark Results