“This is a very tough problem.”
That’s from the top answer on Stack Overflow for this question. Granted, the answer is over 15 years old, but the sentiment is still true. This is a very tough problem.
The problem stems from the fact that audio and video travel through completely separate pipelines in a real-time streaming system. They're captured by different hardware, encoded with different codecs, packetized into different RTP streams, and reassembled independently on the receiving end. That they ever arrive in sync at all comes down to deliberate engineering choices at every layer of the stack.
Why Do Audio and Video Fall Out of Sync in the First Place?
Three categories of problems cause desynchronization.
1. Capture-Side Clock Differences
Cameras and microphones run on independent hardware clocks. Even on the same device, the audio sample clock and the video frame clock are not derived from the same oscillator.
In WebRTC, audio is typically sampled at 48kHz (48,000 samples per second), while video Real-time Transport Protocol (RTP) packets use a 90kHz clock rate (a legacy convention from MPEG systems, not the actual capture frame rate). These clock rates aren't the issue on their own. The problem is that the physical oscillators driving them are imperfect.
Consumer-grade oscillators (literal quartz on older or desktop hardware, MEMS on most modern phones and laptops) have a frequency accuracy of roughly 20 to 100 parts per million (ppm). At 50ppm, one clock runs 50 microseconds fast (or slow) for every second of real time.
Over a 10-minute call:
drift = 50 × 10⁻⁶ × 600 seconds = 30ms
That's 30ms of drift from a single clock. If the audio oscillator drifts +20ppm and the video oscillator drifts -30ppm in opposite directions, the relative drift between the two is 50ppm, and you accumulate 30ms of AV offset in 10 minutes without any network involvement at all. Over an hour, that becomes 180ms, well past the threshold where users notice.
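The drift arithmetic above can be sketched as a small helper (function names are hypothetical, for illustration only):

```javascript
// Accumulated drift (in ms) for a clock that is off by `ppm` parts per
// million after `seconds` of real time.
// ppm * 1e-6 (fractional error) * seconds * 1000 (s -> ms) simplifies to
// ppm * seconds / 1000, which also avoids float noise.
function driftMs(ppm, seconds) {
  return (ppm * seconds) / 1000;
}

// Relative drift between two oscillators is the difference of their errors.
function relativeDriftMs(ppmA, ppmB, seconds) {
  return driftMs(Math.abs(ppmA - ppmB), seconds);
}

console.log(driftMs(50, 600));               // 30  (ms over a 10-minute call)
console.log(relativeDriftMs(20, -30, 3600)); // 180 (ms over an hour)
```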
This is why RTCP Sender Reports exist: they periodically re-anchor both streams to a shared NTP wall-clock reference, preventing drift from accumulating indefinitely (more on this in the next section).
2. Asymmetric Encoding Pipelines
Audio and video codecs operate on fundamentally different timescales, and the gap between them creates a timing asymmetry at the sender before packets even reach the network.
Opus (the standard WebRTC audio codec) encodes 20ms frames. Each frame compresses a fixed-size chunk of PCM samples, and the computational cost is predictable. A single Opus frame encodes in well under 1ms on modern hardware, and the output size is relatively stable (roughly 80 to 160 bytes per frame at typical bitrates). The result is a steady, metronomic stream of small packets leaving the encoder every 20ms.
Video encoding, on the other hand, is positively erratic. Consider a VP8 encoder producing 720p at 30fps. Each frame has a 33ms budget, but the actual encoding time and output size depend on what's in the frame. A static talking head compresses down to a few kilobytes per frame. A scene cut or sudden motion spike forces the encoder to spend more bits, producing frames that are 5 to 10x larger and take longer to encode.
Video codecs use two main frame types:
- Keyframes (I-frames) are large and expensive because they encode the full image independently, but are necessary for decoder recovery and periodic random access.
- P-frames, or predicted frames, are small and fast to encode because they only encode the differences from the previous frame.
Here's a concrete example of the problem. Imagine audio and video are captured simultaneously at T=0:
| Event | Audio | Video |
|---|---|---|
| Capture | T = 0ms | T = 0ms |
| Encode duration | ~0.5ms | ~8ms (P-frame) |
| Packet ready | T = 0.5ms | T = 8ms |
| Sender-side offset | | 7.5ms |
That 7.5ms gap is the baseline for a normal P-frame. Now, suppose a scene change triggers a keyframe:
| Event | Audio | Video (keyframe) |
|---|---|---|
| Capture | T = 0ms | T = 0ms |
| Encode duration | ~0.5ms | ~22ms |
| Packet ready | T = 0.5ms | T = 22ms |
| Sender-side offset | | 21.5ms |
The offset nearly tripled, and, again, we haven't even touched the network. Variable encoding latency gives the video stream a jittery departure pattern, while the audio stream departs at a constant cadence.
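Since both media are captured at T=0 in these tables, the sender-side offset is just the difference in encode durations. A minimal sketch using the illustrative numbers above:

```javascript
// Sender-side AV offset when audio and video are captured simultaneously:
// the gap is simply the difference in encode completion times.
// Durations are illustrative, not measured.
function senderOffsetMs(audioEncodeMs, videoEncodeMs) {
  return videoEncodeMs - audioEncodeMs;
}

console.log(senderOffsetMs(0.5, 8));  // 7.5  (normal P-frame)
console.log(senderOffsetMs(0.5, 22)); // 21.5 (keyframe after a scene change)
```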
3. Network Jitter
Once packets hit the network, audio and video compete for bandwidth on unequal terms.
- Audio packets are small (80 to 160 bytes per Opus frame) and arrive at a steady 50 packets per second.
- Video packets are far larger and burstier. A single encoded frame often exceeds the Maximum Transmission Unit (MTU), typically ~1200 bytes for WebRTC, so it is fragmented across multiple RTP packets. A 720p P-frame might split into 3-5 packets. A keyframe can be 10 to 50x larger and fragment into 30 to 100+ packets that arrive as a burst.
These bursts temporarily fill router queues, causing two problems: packets within the burst experience different delays, and the burst can push concurrent audio packets into later queue positions. The net effect is that video experiences more variable delay than audio, and that variability spikes at every keyframe interval.
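The fragmentation arithmetic is straightforward; a sketch with illustrative frame sizes (the ~1200-byte payload budget comes from the MTU figure above):

```javascript
// Number of RTP packets needed to carry one encoded frame, assuming a
// ~1200-byte payload budget per packet. Frame sizes are illustrative.
const MTU_PAYLOAD_BYTES = 1200;

function packetsForFrame(frameBytes) {
  return Math.ceil(frameBytes / MTU_PAYLOAD_BYTES);
}

console.log(packetsForFrame(5_000));  // 5  (a ~5KB 720p P-frame)
console.log(packetsForFrame(60_000)); // 50 (a ~60KB keyframe arrives as a burst)
```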
All three sources compound. Clock drift creates a slow, steady pull. Encoding asymmetry adds variable, per-frame offset at the sender. Network jitter scrambles arrival times at the receiver. Without active synchronization, a stream that starts perfectly aligned will drift out of sync within minutes.
How Does WebRTC Handle AV Sync Internally?
WebRTC synchronizes audio and video using a combination of RTP timestamps and RTCP Sender Reports.
Each RTP packet carries a timestamp derived from the media clock (90kHz for video, 48kHz for typical Opus audio). These timestamps tell the receiver the relative ordering of packets within a single stream, but they don't directly relate audio timestamps to video timestamps because the two streams use independent clocks.
RTCP Sender Reports bridge this gap. Periodically, the sender transmits an SR for each media stream that contains two pieces of information:
- An NTP wall-clock timestamp (absolute time)
- The corresponding RTP timestamp at that moment
The receiver uses these pairs to build a mapping: for any given RTP timestamp in either stream, it can compute the equivalent NTP wall-clock time. Once both audio and video packets can be mapped to the same time reference, the receiver knows which audio sample corresponds to which video frame.
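A sketch of that mapping, assuming one (NTP, RTP) pair per stream from the most recent Sender Report; object shape and field names are hypothetical, and 32-bit RTP timestamp wraparound is ignored for brevity:

```javascript
// Map an RTP timestamp to NTP wall-clock time using the (NTP, RTP) pair
// from the latest RTCP Sender Report for that stream.
function rtpToNtpMs(sr, rtpTimestamp) {
  const elapsedTicks = rtpTimestamp - sr.rtpTimestamp;    // ticks since the SR
  const elapsedMs = (elapsedTicks / sr.clockRate) * 1000; // ticks -> ms
  return sr.ntpMs + elapsedMs;
}

// One SR per stream, each on its own media clock (90kHz video, 48kHz audio).
const videoSr = { ntpMs: 1_700_000_000_000, rtpTimestamp: 90_000, clockRate: 90_000 };
const audioSr = { ntpMs: 1_700_000_000_000, rtpTimestamp: 48_000, clockRate: 48_000 };

// Both packets below are 1 second past their SR, so they map to the same
// wall-clock instant and should render together.
const videoNtp = rtpToNtpMs(videoSr, 180_000);
const audioNtp = rtpToNtpMs(audioSr, 96_000);
console.log(videoNtp - audioNtp); // 0 (ms of AV offset)
```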
The relevant code path in a WebRTC implementation looks something like this conceptually:
```
audio_ntp = audio_rtp_to_ntp(audio_packet.rtp_timestamp)
video_ntp = video_rtp_to_ntp(video_packet.rtp_timestamp)
offset = video_ntp - audio_ntp
// Adjust playout timing to minimize offset
```
In practice, this is handled entirely within the browser's media pipeline. The application code doesn't need to parse RTCP SRs directly, but understanding the mechanism is important for diagnosing sync issues.
What Are the Common Sync Strategies?
Three approaches exist, but most WebRTC implementations use a variant of the third:
| Strategy | Reference Clock | Tradeoff |
|---|---|---|
| Audio Master | Audio playout | Video frames are dropped or delayed to match audio timing. Prioritizes audio continuity. |
| Video Master | Video playout | Audio samples are stretched, compressed, or dropped to match video timing. Rarely used. |
| Adaptive | NTP-derived shared clock | Both streams adjust toward a common reference. Default WebRTC behavior. |
Audio Master is the most common strategy in media players and older streaming systems. The reasoning is physiological: humans are more sensitive to audio timing irregularities (clicks, gaps, pitch shifts) than to visual stutter. Dropping or repeating a video frame is less noticeable than a 10ms gap in audio.
Adaptive is what WebRTC actually does. Both streams are aligned to the NTP wall-clock reference derived from RTCP Sender Reports. The jitter buffer on each stream absorbs arrival-time variation independently, and the playout scheduler ensures that audio sample N and video frame N both render at their correct NTP-derived wall-clock time.
What Is a Jitter Buffer and Why Does It Matter for Sync?
Network jitter means packets don't arrive at perfectly regular intervals. A packet expected every 20ms might arrive at 18ms, then 25ms, then 15ms. Without buffering, this variation results in playback glitches.
The jitter buffer sits between the network receive path and the decoder/renderer. It holds packets temporarily to smooth out arrival time variations. When a packet arrives early, it waits in the buffer. When a packet arrives slightly late, it's still available because the buffer hasn't consumed it yet.
Both audio and video have independent jitter buffers, which is where the sync implications arise. If the audio jitter buffer adds 40ms of delay and the video jitter buffer adds 120ms, there's an 80ms offset at the playout stage, even if the packets were perfectly synchronized when they left the sender.
| Network condition | Audio JB depth | Video JB depth | Playout offset |
|---|---|---|---|
| Stable (low jitter) | 20ms | 30ms | 10ms |
| Moderate congestion | 40ms | 80ms | 40ms |
| Bursty video loss + retransmission | 40ms | 120ms | 80ms |
| Asymmetric route change | 25ms | 150ms | 125ms |
The offset is the gap between the two buffers, not their absolute size. A network that adds equal jitter to both streams won't cause sync problems, even if both buffers are large. The problem is asymmetry.
WebRTC's jitter buffer is adaptive. It grows when jitter increases and shrinks when the network stabilizes. You can observe this in real time through getStats():
```javascript
const report = await peerConnection.getStats();
report.forEach(stat => {
  if (stat.type === 'inbound-rtp' && stat.kind === 'audio') {
    // Differential computation for average JB delay
    const delayDelta = stat.jitterBufferDelay - previousDelay;
    const countDelta = stat.jitterBufferEmittedCount - previousCount;
    if (countDelta > 0) {
      const avgJBDelay = (delayDelta / countDelta) * 1000; // ms
      console.log(`Audio JB delay: ${avgJBDelay.toFixed(1)}ms`);
    }
  }
});
```
Note the differential computation. jitterBufferDelay is a cumulative value (total seconds of delay added across all emitted frames), not an instantaneous measurement. You need to subtract the previous reading and divide by the difference in jitterBufferEmittedCount to get the current average delay per frame.
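Extending that differential computation to both streams yields a rough playout-offset proxy. A sketch with synthetic numbers; the per-kind bookkeeping is hypothetical, and the stat values here are fabricated for illustration, not real getStats() output:

```javascript
// Track previous cumulative readings per media kind and compute the average
// jitter buffer delay per emitted frame since the last poll.
const prev = { audio: { delay: 0, count: 0 }, video: { delay: 0, count: 0 } };

function avgJBDelayMs(kind, stat) {
  const delayDelta = stat.jitterBufferDelay - prev[kind].delay;       // seconds
  const countDelta = stat.jitterBufferEmittedCount - prev[kind].count;
  prev[kind] = { delay: stat.jitterBufferDelay, count: stat.jitterBufferEmittedCount };
  return countDelta > 0 ? (delayDelta / countDelta) * 1000 : 0;       // ms
}

// Synthetic readings: audio averages 40ms/frame, video 120ms/frame.
const audioMs = avgJBDelayMs('audio', { jitterBufferDelay: 4.0, jitterBufferEmittedCount: 100 });
const videoMs = avgJBDelayMs('video', { jitterBufferDelay: 3.6, jitterBufferEmittedCount: 30 });
console.log(videoMs - audioMs); // ~80ms playout-offset proxy
```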
What getStats() fields should you monitor?
Everything described in this article (jitter buffering, packet loss, frame drops, playout timing) happens inside the browser's media pipeline, where you have no direct access. getStats() is your only window into it.
For sync diagnosis, these are the fields that matter:
| Field | What It Tells You |
|---|---|
| `jitter` | Inter-packet arrival variation (seconds; multiply by 1000 for ms). Compare audio vs. video: a large gap means asymmetric network conditions. |
| `jitterBufferDelay` / `jitterBufferEmittedCount` | Cumulative jitter buffer delay. Use the differential computation described above to get per-frame averages. The difference between audio and video JB delay is your best proxy for playout offset. |
| `packetsLost` | Audio loss causes clicks and gaps. Video loss triggers keyframe requests, which create the bursts described earlier and further disrupt timing. |
| `framesDropped` | Frames the renderer dropped because they arrived too late. A rising count means the video pipeline is consistently behind. |
When diagnosing a sync issue, start by comparing jitter and JB delay across both streams. If framesDropped is climbing, the video pipeline is under pressure and sync will degrade regardless of network conditions.
How Do SFUs Affect AV Sync?
Most production WebRTC deployments route media through a Selective Forwarding Unit (SFU) rather than using direct peer-to-peer connections. SFUs don't transcode, but they terminate and regenerate RTCP sessions, which has two consequences for sync.
First, the receiver never sees the original sender's RTCP Sender Reports. The SFU generates its own SRs with its own NTP timestamps, so the NTP-to-RTP mapping the receiver builds is based on the SFU's clock and processing behavior, not the sender's. Any asymmetry in how the SFU handles audio vs. video internally gets baked into that mapping.
Second, simulcast layer switches cause timestamp discontinuities. Each simulcast layer is a separate RTP stream with its own timestamp sequence. When the SFU switches a receiver from high to low resolution (or back), the receiver sees a jump in video RTP timestamps and must re-converge sync using the next Sender Report. That re-convergence window can last several seconds, during which AV offset may drift. If you're seeing offset spikes that correlate with resolution changes, this is likely the cause.
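One way to spot such layer switches on the receiving side is to watch for implausible jumps between consecutive video RTP timestamps. A heuristic sketch; the 1-second threshold is an assumption, not a standard value:

```javascript
// Flag a suspected timestamp discontinuity when consecutive video RTP
// timestamps jump far more than one frame interval. At the 90kHz video
// clock, a 30fps frame advances ~3,000 ticks; the 90,000-tick (1-second)
// threshold here is an assumed heuristic.
const VIDEO_CLOCK_RATE = 90_000;
const JUMP_THRESHOLD_TICKS = VIDEO_CLOCK_RATE;

function isDiscontinuity(prevRtp, currRtp) {
  // Unsigned 32-bit difference handles normal timestamp wraparound.
  const delta = (currRtp - prevRtp) >>> 0;
  return delta > JUMP_THRESHOLD_TICKS;
}

console.log(isDiscontinuity(90_000, 93_000));    // false: one 33ms frame
console.log(isDiscontinuity(90_000, 5_000_000)); // true: likely a layer switch
```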
What Keeps It All in Sync (And What Breaks It)
AV sync in real-time streams depends on correct behavior at every stage: oscillator clocks, encoding pipelines, network transport, jitter buffering, and playout scheduling. WebRTC handles most of this internally through RTCP Sender Reports and adaptive jitter buffers, but that machinery becomes harder to reason about when SFUs, simulcast switching, and long-running sessions enter the picture.
The key takeaways:
- Clock drift, encoding asymmetry, and network jitter all compound to pull audio and video apart.
- RTCP Sender Reports map RTP timestamps to NTP wall-clock time, giving the receiver a shared reference for both streams.
- Jitter buffer asymmetry between audio and video causes sync problems, not absolute buffer size.
- Human perception tolerates roughly ±40ms of offset. Audio leading video is less tolerable than audio lagging.
- getStats() is your only observability into the browser's media pipeline. Compare fields across both streams.
- SFU deployments introduce SR rewriting and simulcast layer switches that don't exist in peer-to-peer connections.