“This is a very tough problem.”
That’s from the top answer on Stack Overflow for this question. Granted, the answer is over 15 years old, but the sentiment is still true. This is a very tough problem.
The problem stems from the fact that audio and video travel through completely separate pipelines in a real-time streaming system. They're captured by different hardware, encoded with different codecs, packetized into different RTP streams, and reassembled independently on the receiving end. That they ever arrive in sync at all comes down to deliberate engineering choices at every layer of the stack.
Why Do Audio and Video Fall Out of Sync in the First Place?
Three categories of problems cause desynchronization.
1. Capture-Side Clock Differences
Cameras and microphones run on independent hardware clocks. Even on the same device, the audio sample clock and the video frame clock are not derived from the same oscillator.
In WebRTC, audio is typically sampled at 48kHz (48,000 samples per second), while video Real-time Transport Protocol (RTP) packets use a 90kHz clock rate (a legacy convention from MPEG systems, not the actual capture frame rate). These clock rates aren't the issue on their own. The problem is that the physical oscillators driving them are imperfect.
Consumer-grade oscillators (literal quartz on older or desktop hardware, MEMS on most modern phones and laptops) have a frequency accuracy of roughly 20 to 100 parts per million (ppm). At 50ppm, one clock runs 50 microseconds fast (or slow) for every second of real time.
Over a 10-minute call:
drift = 50 × 10⁻⁶ × 600 seconds = 30ms
That's 30ms of drift from a single clock. If the audio oscillator drifts +20ppm and the video oscillator drifts -30ppm in opposite directions, the relative drift between the two is 50ppm, and you accumulate 30ms of AV offset in 10 minutes without any network involvement at all. Over an hour, that becomes 180ms, well past the threshold where users notice.
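The drift arithmetic above can be sketched as a small helper (function names are hypothetical, for illustration only):

```javascript
// Accumulated drift (in ms) for a clock that is off by `ppm` parts per
// million after `seconds` of real time.
// ppm * 1e-6 (fractional error) * seconds * 1000 (s -> ms) simplifies to
// ppm * seconds / 1000, which also avoids float noise.
function driftMs(ppm, seconds) {
  return (ppm * seconds) / 1000;
}

// Relative drift between two oscillators is the difference of their errors.
function relativeDriftMs(ppmA, ppmB, seconds) {
  return driftMs(Math.abs(ppmA - ppmB), seconds);
}

console.log(driftMs(50, 600));               // 30  (ms over a 10-minute call)
console.log(relativeDriftMs(20, -30, 3600)); // 180 (ms over an hour)
```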
This is why RTCP Sender Reports exist: they periodically re-anchor both streams to a shared NTP wall-clock reference, preventing drift from accumulating indefinitely (more on this in the next section).
2. Asymmetric Encoding Pipelines
Audio and video codecs operate on fundamentally different timescales, and the gap between them creates a timing asymmetry at the sender before packets even reach the network.
Opus (the standard WebRTC audio codec) encodes 20ms frames. Each frame compresses a fixed-size chunk of PCM samples, and the computational cost is predictable. A single Opus frame encodes in well under 1ms on modern hardware, and the output size is relatively stable (roughly 80 to 160 bytes per frame at typical bitrates). The result is a steady, metronomic stream of small packets leaving the encoder every 20ms.
Video encoding, on the other hand, is positively erratic. Consider a VP8 encoder producing 720p at 30fps. Each frame has a 33ms budget, but the actual encoding time and output size depend on what's in the frame. A static talking head compresses down to a few kilobytes per frame. A scene cut or sudden motion spike forces the encoder to spend more bits, producing frames that are 5 to 10x larger and take longer to encode.
Video codecs use two main frame types:
- Keyframes (I-frames) are large and expensive because they encode the full image independently, but are necessary for decoder recovery and periodic random access.
- P-frames, or predicted frames, are small and fast to encode because they only encode the differences from the previous frame.
Here's a concrete example of the problem. Imagine audio and video are captured simultaneously at T=0:
| Event | Audio | Video |
|---|---|---|
| Capture | T = 0ms | T = 0ms |
| Encode duration | ~0.5ms | ~8ms (P-frame) |
| Packet ready | T = 0.5ms | T = 8ms |
| Sender-side offset | | 7.5ms |
That 7.5ms gap is the baseline for a normal P-frame. Now, suppose a scene change triggers a keyframe:
| Event | Audio | Video (keyframe) |
|---|---|---|
| Capture | T = 0ms | T = 0ms |
| Encode duration | ~0.5ms | ~22ms |
| Packet ready | T = 0.5ms | T = 22ms |
| Sender-side offset | | 21.5ms |
The offset nearly tripled, and, again, we haven't even touched the network. Variable encoding latency gives the video stream a jittery departure pattern, while the audio stream departs at a constant cadence.
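Since both media are captured at T=0 in these tables, the sender-side offset is just the difference in encode durations. A minimal sketch using the illustrative numbers above:

```javascript
// Sender-side AV offset when audio and video are captured simultaneously:
// the gap is simply the difference in encode completion times.
// Durations are illustrative, not measured.
function senderOffsetMs(audioEncodeMs, videoEncodeMs) {
  return videoEncodeMs - audioEncodeMs;
}

console.log(senderOffsetMs(0.5, 8));  // 7.5  (normal P-frame)
console.log(senderOffsetMs(0.5, 22)); // 21.5 (keyframe after a scene change)
```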
3. Network Jitter
Once packets hit the network, audio and video compete for bandwidth on unequal terms.
- Audio packets are small (80 to 160 bytes per Opus frame) and arrive at a steady 50 packets per second.
- Video packets are far larger and burstier. A single encoded frame often exceeds the Maximum Transmission Unit (MTU), typically ~1200 bytes for WebRTC, so it is fragmented across multiple RTP packets. A 720p P-frame might split into 3-5 packets. A keyframe can be 10 to 50x larger and fragment into 30 to 100+ packets that arrive as a burst.
These bursts temporarily fill router queues, causing two problems: packets within the burst experience different delays, and the burst can push concurrent audio packets into later queue positions. The net effect is that video experiences more variable delay than audio, and that variability spikes at every keyframe interval.
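The fragmentation arithmetic is straightforward; a sketch with illustrative frame sizes (the ~1200-byte payload budget comes from the MTU figure above):

```javascript
// Number of RTP packets needed to carry one encoded frame, assuming a
// ~1200-byte payload budget per packet. Frame sizes are illustrative.
const MTU_PAYLOAD_BYTES = 1200;

function packetsForFrame(frameBytes) {
  return Math.ceil(frameBytes / MTU_PAYLOAD_BYTES);
}

console.log(packetsForFrame(5_000));  // 5  (a ~5KB 720p P-frame)
console.log(packetsForFrame(60_000)); // 50 (a ~60KB keyframe arrives as a burst)
```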
All three sources compound. Clock drift creates a slow, steady pull. Encoding asymmetry adds variable, per-frame offset at the sender. Network jitter scrambles arrival times at the receiver. Without active synchronization, a stream that starts perfectly aligned will drift out of sync within minutes.
How Does WebRTC Handle AV Sync Internally?
WebRTC synchronizes audio and video using a combination of RTP timestamps and RTCP Sender Reports.
Each RTP packet carries a timestamp derived from the media clock (90kHz for video, 48kHz for typical Opus audio). These timestamps tell the receiver the relative ordering of packets within a single stream, but they don't directly relate audio timestamps to video timestamps because the two streams use independent clocks.
RTCP Sender Reports bridge this gap. Periodically, the sender transmits an SR for each media stream that contains two pieces of information:
- An NTP wall-clock timestamp (absolute time)
- The corresponding RTP timestamp at that moment
The receiver uses these pairs to build a mapping: for any given RTP timestamp in either stream, it can compute the equivalent NTP wall-clock time. Once both audio and video packets can be mapped to the same time reference, the receiver knows which audio sample corresponds to which video frame.
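A sketch of that mapping, assuming one (NTP, RTP) pair per stream from the most recent Sender Report; object shape and field names are hypothetical, and 32-bit RTP timestamp wraparound is ignored for brevity:

```javascript
// Map an RTP timestamp to NTP wall-clock time using the (NTP, RTP) pair
// from the latest RTCP Sender Report for that stream.
function rtpToNtpMs(sr, rtpTimestamp) {
  const elapsedTicks = rtpTimestamp - sr.rtpTimestamp;    // ticks since the SR
  const elapsedMs = (elapsedTicks / sr.clockRate) * 1000; // ticks -> ms
  return sr.ntpMs + elapsedMs;
}

// One SR per stream, each on its own media clock (90kHz video, 48kHz audio).
const videoSr = { ntpMs: 1_700_000_000_000, rtpTimestamp: 90_000, clockRate: 90_000 };
const audioSr = { ntpMs: 1_700_000_000_000, rtpTimestamp: 48_000, clockRate: 48_000 };

// Both packets below are 1 second past their SR, so they map to the same
// wall-clock instant and should render together.
const videoNtp = rtpToNtpMs(videoSr, 180_000);
const audioNtp = rtpToNtpMs(audioSr, 96_000);
console.log(videoNtp - audioNtp); // 0 (ms of AV offset)
```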
The relevant code path in a WebRTC implementation looks something like this conceptually:
```
audio_ntp = audio_rtp_to_ntp(audio_packet.rtp_timestamp)
video_ntp = video_rtp_to_ntp(video_packet.rtp_timestamp)
offset = video_ntp - audio_ntp
// Adjust playout timing to minimize offset
```
In practice, this is handled entirely within the browser's media pipeline. The application code doesn't need to parse RTCP SRs directly, but understanding the mechanism is important for diagnosing sync issues.
What Are the Common Sync Strategies?
Three approaches exist, but most WebRTC implementations use a variant of the third:
| Strategy | Reference Clock | Tradeoff |
|---|---|---|
| Audio Master | Audio playout | Video frames are dropped or delayed to match audio timing. Prioritizes audio continuity. |
| Video Master | Video playout | Audio samples are stretched, compressed, or dropped to match video timing. Rarely used. |
| Adaptive | NTP-derived shared clock | Both streams adjust toward a common reference. Default WebRTC behavior. |
Audio Master is the most common strategy in media players and older streaming systems. The reasoning is physiological: humans are more sensitive to audio timing irregularities (clicks, gaps, pitch shifts) than to visual stutter. Dropping or repeating a video frame is less noticeable than a 10ms gap in audio.
Adaptive is what WebRTC actually does. Both streams are aligned to the NTP wall-clock reference derived from RTCP Sender Reports. The jitter buffer on each stream absorbs arrival-time variation independently, and the playout scheduler ensures that audio sample N and video frame N both render at their correct NTP-derived wall-clock time.
What Is a Jitter Buffer and Why Does It Matter for Sync?
Network jitter means packets don't arrive at perfectly regular intervals. A packet expected every 20ms might arrive at 18ms, then 25ms, then 15ms. Without buffering, this variation results in playback glitches.
The jitter buffer sits between the network receive path and the decoder/renderer. It holds packets temporarily to smooth out arrival time variations. When a packet arrives early, it waits in the buffer. When a packet arrives slightly late, it's still available because the buffer hasn't consumed it yet.
Both audio and video have independent jitter buffers, which is where the sync implications arise. If the audio jitter buffer adds 40ms of delay and the video jitter buffer adds 120ms, there's an 80ms offset at the playout stage, even if the packets were perfectly synchronized when they left the sender.
| Network condition | Audio JB depth | Video JB depth | Playout offset |
|---|---|---|---|
| Stable (low jitter) | 20ms | 30ms | 10ms |
| Moderate congestion | 40ms | 80ms | 40ms |
| Bursty video loss + retransmission | 40ms | 120ms | 80ms |
| Asymmetric route change | 25ms | 150ms | 125ms |
The offset is the gap between the two buffers, not their absolute size. A network that adds equal jitter to both streams won't cause sync problems, even if both buffers are large. The problem is asymmetry.
WebRTC's jitter buffer is adaptive. It grows when jitter increases and shrinks when the network stabilizes. You can observe this in real time through getStats():
```javascript
const report = await peerConnection.getStats();
report.forEach(stat => {
  if (stat.type === 'inbound-rtp' && stat.kind === 'audio') {
    // Differential computation for average JB delay
    const delayDelta = stat.jitterBufferDelay - previousDelay;
    const countDelta = stat.jitterBufferEmittedCount - previousCount;
    if (countDelta > 0) {
      const avgJBDelay = (delayDelta / countDelta) * 1000; // ms
      console.log(`Audio JB delay: ${avgJBDelay.toFixed(1)}ms`);
    }
  }
});
```
Note the differential computation. jitterBufferDelay is a cumulative value (total seconds of delay added across all emitted frames), not an instantaneous measurement. You need to subtract the previous reading and divide by the difference in jitterBufferEmittedCount to get the current average delay per frame.
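Extending that differential computation to both streams yields a rough playout-offset proxy. A sketch with synthetic numbers; the per-kind bookkeeping is hypothetical, and the stat values here are fabricated for illustration, not real getStats() output:

```javascript
// Track previous cumulative readings per media kind and compute the average
// jitter buffer delay per emitted frame since the last poll.
const prev = { audio: { delay: 0, count: 0 }, video: { delay: 0, count: 0 } };

function avgJBDelayMs(kind, stat) {
  const delayDelta = stat.jitterBufferDelay - prev[kind].delay;       // seconds
  const countDelta = stat.jitterBufferEmittedCount - prev[kind].count;
  prev[kind] = { delay: stat.jitterBufferDelay, count: stat.jitterBufferEmittedCount };
  return countDelta > 0 ? (delayDelta / countDelta) * 1000 : 0;       // ms
}

// Synthetic readings: audio averages 40ms/frame, video 120ms/frame.
const audioMs = avgJBDelayMs('audio', { jitterBufferDelay: 4.0, jitterBufferEmittedCount: 100 });
const videoMs = avgJBDelayMs('video', { jitterBufferDelay: 3.6, jitterBufferEmittedCount: 30 });
console.log(videoMs - audioMs); // ~80ms playout-offset proxy
```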
What getStats() fields should you monitor?
Everything described in this article (jitter buffering, packet loss, frame drops, playout timing) happens inside the browser's media pipeline, where you have no direct access. getStats() is your only window into it.
For sync diagnosis, these are the fields that matter:
| Field | What It Tells You |
|---|---|
| `jitter` | Inter-packet arrival variation (seconds; multiply by 1000 for ms). Compare audio vs. video: a large gap means asymmetric network conditions. |
| `jitterBufferDelay` / `jitterBufferEmittedCount` | Cumulative jitter buffer delay. Use the differential computation described above to get per-frame averages. The difference between audio and video JB delay is your best proxy for playout offset. |
| `packetsLost` | Audio loss causes clicks and gaps. Video loss triggers keyframe requests, which create the bursts described earlier and further disrupt timing. |
| `framesDropped` | Frames the renderer dropped because they arrived too late. A rising count means the video pipeline is consistently behind. |
When diagnosing a sync issue, start by comparing jitter and JB delay across both streams. If framesDropped is climbing, the video pipeline is under pressure and sync will degrade regardless of network conditions.
How Do SFUs Affect AV Sync?
Most production WebRTC deployments route media through a Selective Forwarding Unit (SFU) rather than using direct peer-to-peer connections. SFUs don't transcode, but they terminate and regenerate RTCP sessions, which has two consequences for sync.
First, the receiver never sees the original sender's RTCP Sender Reports. The SFU generates its own SRs with its own NTP timestamps, so the NTP-to-RTP mapping the receiver builds is based on the SFU's clock and processing behavior, not the sender's. Any asymmetry in how the SFU handles audio vs. video internally gets baked into that mapping.
Second, simulcast layer switches cause timestamp discontinuities. Each simulcast layer is a separate RTP stream with its own timestamp sequence. When the SFU switches a receiver from high to low resolution (or back), the receiver sees a jump in video RTP timestamps and must re-converge sync using the next Sender Report. That re-convergence window can last several seconds, during which AV offset may drift. If you're seeing offset spikes that correlate with resolution changes, this is likely the cause.
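One way to spot such layer switches on the receiving side is to watch for implausible jumps between consecutive video RTP timestamps. A heuristic sketch; the 1-second threshold is an assumption, not a standard value:

```javascript
// Flag a suspected timestamp discontinuity when consecutive video RTP
// timestamps jump far more than one frame interval. At the 90kHz video
// clock, a 30fps frame advances ~3,000 ticks; the 90,000-tick (1-second)
// threshold here is an assumed heuristic.
const VIDEO_CLOCK_RATE = 90_000;
const JUMP_THRESHOLD_TICKS = VIDEO_CLOCK_RATE;

function isDiscontinuity(prevRtp, currRtp) {
  // Unsigned 32-bit difference handles normal timestamp wraparound.
  const delta = (currRtp - prevRtp) >>> 0;
  return delta > JUMP_THRESHOLD_TICKS;
}

console.log(isDiscontinuity(90_000, 93_000));    // false: one 33ms frame
console.log(isDiscontinuity(90_000, 5_000_000)); // true: likely a layer switch
```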
What Keeps It All in Sync (And What Breaks It)
AV sync in real-time streams depends on correct behavior at every stage: oscillator clocks, encoding pipelines, network transport, jitter buffering, and playout scheduling. WebRTC handles most of this internally through RTCP Sender Reports and adaptive jitter buffers, but that machinery becomes harder to reason about when SFUs, simulcast switching, and long-running sessions enter the picture.
The key takeaways:
- Clock drift, encoding asymmetry, and network jitter all compound to pull audio and video apart.
- RTCP Sender Reports map RTP timestamps to NTP wall-clock time, giving the receiver a shared reference for both streams.
- Jitter buffer asymmetry between audio and video causes sync problems, not absolute buffer size.
- Human perception tolerates roughly ±40ms of offset. Audio leading video is less tolerable than audio lagging.
- getStats() is your only observability into the browser's media pipeline. Compare fields across both streams.
- SFU deployments introduce SR rewriting and simulcast layer switches that don't exist in peer-to-peer connections.