
How Is WebRTC Used for Bi-Directional Voice and Video Streaming in AI Agents?

Raymond F
Published January 7, 2026

WebRTC has become the standard transport layer for AI agents requiring real-time voice and video.

Originally designed for browser-to-browser video calls, WebRTC is a protocol stack that enables real-time audio and video communication over UDP. Because it prioritizes low latency over guaranteed delivery, it is ideal for the sub-500ms response times that natural conversation requires.

Here, we explain how WebRTC handles the hard problems of streaming voice and video to and from AI agents in real time: adaptive buffering, echo cancellation, encryption, and synchronization between audio and video streams.

Why WebRTC Instead of WebSockets?

The core constraint in voice AI is Total Turn-Around Time (T-TAT): the gap between a user finishing an utterance and the agent responding.

WebSockets use TCP, which guarantees ordered delivery. When a packet is lost, TCP halts all subsequent packets until retransmission completes. This Head-of-Line blocking introduces unpredictable latency. A 200ms wait to recover a lost syllable fragment creates silence that disrupts conversation far more than a brief audio glitch would.

WebRTC uses UDP instead, accepting packet loss in exchange for consistent low latency. But raw UDP is insufficient for intelligible communication, so WebRTC adds:

  • RTP encapsulates media with sequence numbers and timestamps for reordering and synchronization

  • Adaptive jitter buffers dynamically resize based on network conditions, expanding during instability and shrinking to minimize latency when stable

  • Congestion control estimates bandwidth continuously and signals encoders to reduce bitrate before packet loss occurs

  • Echo cancellation at the browser level subtracts the agent's output from microphone input, preventing feedback loops and false barge-in triggers (see the capture snippet after this list)
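
On the capture side, the browser's audio processing is requested through standard getUserMedia constraints. A minimal client-side snippet (browser APIs only, nothing library-specific):

javascript
// Ask the browser for a microphone track with echo cancellation,
// noise suppression, and auto gain control applied before WebRTC sees it.
async function getProcessedMicrophone() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true
    }
  });
  // These tracks are what get added to the RTCPeerConnection later on.
  return stream.getAudioTracks();
}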

How Does an AI Agent Join a WebRTC Session?

WebRTC was designed for browser-to-browser communication. Making an AI agent participate requires it to terminate the connection server-side as a "robot peer."

This creates a challenge: WebRTC sessions are stateful (encryption keys, sequence numbers, buffer state must persist in memory), while ML inference is typically stateless. The solution is a dedicated worker process per session that holds the PeerConnection state, decodes incoming RTP to raw frames, and bridges to stateless inference APIs.

Server-side WebRTC libraries include pion (Go), aiortc (Python/asyncio), and werift (Node.js/TypeScript). These expose the raw media loop, letting you intercept RTP packets and extract PCM audio or video frames for ML processing.
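
As a rough sketch of that per-session worker pattern, assuming werift on the server (the SessionWorker class and the inferenceUrl endpoint are illustrative, not part of any library):

javascript
const { RTCPeerConnection } = require('werift');

// One worker per user session: it owns the stateful PeerConnection
// (keys, sequence numbers, buffers) and turns incoming media into
// independent, stateless calls to an inference API.
class SessionWorker {
  constructor(sessionId, inferenceUrl) {
    this.sessionId = sessionId;
    this.inferenceUrl = inferenceUrl; // hypothetical stateless inference endpoint
    this.pc = new RTCPeerConnection();

    this.pc.ontrack = (event) => {
      event.track.onReceiveRtp.subscribe((rtp) => {
        // Stateful work (reordering, buffering, decoding) stays in this process;
        // each decoded chunk then becomes a self-contained inference request.
        this.forwardToInference(rtp.payload);
      });
    };
  }

  async forwardToInference(payload) {
    await fetch(this.inferenceUrl, { method: 'POST', body: payload });
  }
}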

Before media can flow, peers must exchange connection metadata through a signaling server. This typically happens over a WebSocket connection and involves two types of messages:

  • SDP (Session Description Protocol) describes each peer's capabilities: supported codecs, media types, and connection parameters. The client creates an "offer," and the agent responds with an "answer."

  • ICE candidates contain network path information for NAT traversal. Each peer discovers its possible network routes and shares them with the other.

Here's the client-side flow:

javascript
// Client creates offer and sends to agent via signaling server
async createOffer(peerId) {
  const pc = this.createPeerConnection(peerId);

  // Add local media tracks to the connection
  if (this.localStream) {
    this.localStream.getTracks().forEach(track => {
      pc.addTrack(track, this.localStream);
    });
  }

  // Create SDP offer
  const offer = await pc.createOffer({
    offerToReceiveAudio: true,
    offerToReceiveVideo: true
  });
  await pc.setLocalDescription(offer);

  // Send offer to peer via signaling server
  this.send({
    type: 'offer',
    targetId: peerId,
    sdp: pc.localDescription
  });
}

The agent receives the offer and responds:

javascript
// Agent handles incoming offer (using werift)
async handleOffer(clientId, sdp) {
  const pc = await this.createPeerConnection(clientId);
  await pc.setRemoteDescription(sdp);

  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);

  // Send answer back to client
  this.send({
    type: 'answer',
    targetId: clientId,
    sdp: {
      type: pc.localDescription.type,
      sdp: pc.localDescription.sdp
    }
  });
}

ICE candidates are exchanged as they're discovered:

javascript
pc.onicecandidate = ({ candidate }) => {
  if (candidate) {
    this.send({
      type: 'ice_candidate',
      targetId: peerId,
      candidate: candidate.toJSON()
    });
  }
};

Production deployments typically use a Selective Forwarding Unit (SFU) as an intermediary rather than direct P2P connections. The user connects to the SFU, and the AI agent joins the same room as another participant. The SFU handles bandwidth estimation, simulcast, and NAT traversal, shielding the agent from last-mile network instability. This also allows independent scaling: an SFU node might handle 500 connections while a GPU-heavy agent node handles 10.

What Does the Audio Pipeline Look Like?

Ingress transforms RTP packets into ML-ready data:

  1. Parse RTP headers and reorder out-of-sequence packets using sequence numbers

  2. Buffer packets in a jitter buffer (AI agents tune aggressively low, ~20-30ms versus 60-100ms for standard VoIP)

  3. Decode Opus to raw PCM (typically 16-bit at 16-48kHz)

  4. Run Voice Activity Detection to gate STT processing and detect end-of-turn

With werift, you subscribe to RTP packets on the incoming track:

javascript
// Agent receives audio track when client connects
pc.ontrack = (event) => {
  if (event.track.kind === 'audio') {
    this.processAudioTrack(clientId, event.track);
  }
};

// Subscribe to raw RTP packets
processAudioTrack(clientId, track) {
  track.onReceiveRtp.subscribe((rtp) => {
    // rtp.payload contains Opus-encoded audio
    // In production: decode Opus to PCM, run VAD,
    // buffer until speech ends, then send to STT
    if (this.aiProcessor) {
      this.aiProcessor.onAudioPacket(clientId, rtp.payload);
    }
  });
}

The code above shows how to access raw RTP packets, but you'd need to add Opus decoding, Voice Activity Detection, and buffering logic.


Here's what that looks like:

javascript
const { OpusEncoder } = require('@discordjs/opus');

// @discordjs/opus exposes a single OpusEncoder class that both encodes
// and decodes. 48kHz sample rate, 1 channel (mono).
const decoder = new OpusEncoder(48000, 1);

// Buffer to accumulate audio between VAD events
let audioBuffer = [];
let silenceFrames = 0;
const SILENCE_THRESHOLD = 25; // ~500ms at 20ms frames

processAudioTrack(clientId, track) {
  track.onReceiveRtp.subscribe((rtp) => {
    // Decode Opus to PCM (16-bit signed integers)
    const pcm = decoder.decode(rtp.payload);

    // Simple energy-based VAD (production: use Silero VAD)
    const energy = calculateEnergy(pcm);
    const isSpeech = energy > 0.01;

    if (isSpeech) {
      audioBuffer.push(pcm);
      silenceFrames = 0;
    } else {
      silenceFrames++;

      // End of utterance: silence exceeded threshold
      if (audioBuffer.length > 0 && silenceFrames > SILENCE_THRESHOLD) {
        const fullUtterance = Buffer.concat(audioBuffer);
        this.sendToSTT(clientId, fullUtterance);
        audioBuffer = [];
      }
    }
  });
}

function calculateEnergy(pcm) {
  let sum = 0;
  for (let i = 0; i < pcm.length; i += 2) {
    const sample = pcm.readInt16LE(i) / 32768;
    sum += sample * sample;
  }
  return sum / (pcm.length / 2);
}

For production VAD, use Silero VAD, which runs as an ONNX model and provides much more accurate speech detection than energy-based approaches.

The processed audio feeds into either a cascaded pipeline (VAD → STT → LLM → TTS) or a native speech-to-speech model like GPT-4o that accepts audio tokens directly. Cascaded pipelines accumulate latency at each stage, often exceeding 1.5 seconds total. Native S2S models eliminate intermediate text conversion, reducing latency and preserving emotional nuance.
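
As a rough sketch of why the cascaded stages add up (stt, llm, and tts here stand in for whichever providers you use; they are not specific APIs):

javascript
// Cascaded pipeline: each stage must complete (or at least start streaming)
// before the next one can begin, so per-stage latencies accumulate.
async function respondCascaded(utterancePcm) {
  const transcript = await stt(utterancePcm); // speech-to-text latency
  const replyText = await llm(transcript);    // LLM time-to-first-token plus generation
  const replyPcm = await tts(replyText);      // text-to-speech synthesis latency
  return replyPcm; // PCM to encode and send via the egress path below
}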

Egress requires encoding audio and managing RTP timestamps correctly:

javascript
const crypto = require('crypto');
const { OpusEncoder } = require('@discordjs/opus');

// Opus encoder: 48kHz, 1 channel, 20ms frames
const encoder = new OpusEncoder(48000, 1);
const SAMPLES_PER_FRAME = 960; // 48000 Hz * 0.02s

class AudioEgress {
  constructor(track) {
    this.track = track;
    this.sequenceNumber = 0;
    this.timestamp = 0;
    this.ssrc = crypto.randomBytes(4).readUInt32BE(0); // Synchronization source ID (random 32-bit value)
  }

  // Send PCM audio back to client
  sendAudio(pcmBuffer) {
    // Process in 20ms chunks (16-bit mono: 2 bytes per sample)
    for (let offset = 0; offset < pcmBuffer.length; offset += SAMPLES_PER_FRAME * 2) {
      const frame = pcmBuffer.slice(offset, offset + SAMPLES_PER_FRAME * 2);
      const encoded = encoder.encode(frame);

      // Build RTP packet
      const rtp = this.buildRtpPacket(encoded);
      this.track.writeRtp(rtp);

      // Increment for next packet
      this.sequenceNumber = (this.sequenceNumber + 1) % 65536;
      this.timestamp += SAMPLES_PER_FRAME; // 960 samples per 20ms frame
    }
  }

  buildRtpPacket(payload) {
    // RTP header: 12 bytes fixed
    const header = Buffer.alloc(12);
    header[0] = 0x80; // Version 2, no padding, no extension
    header[1] = 111;  // Payload type for Opus
    header.writeUInt16BE(this.sequenceNumber, 2);
    header.writeUInt32BE(this.timestamp, 4);
    header.writeUInt32BE(this.ssrc, 8);
    return Buffer.concat([header, payload]);
  }
}

For Opus at 48kHz with 20ms packets, each packet's timestamp must increment by exactly 960 (48000 × 0.02). Incorrect timestamps cause playback drift or gaps. If video is also generated, RTCP Sender Reports link both streams' timestamps to a common wall-clock time for lip-sync.

How Does Barge-In Work?

Full-duplex WebRTC enables natural interruption, but implementation requires coordinated state changes within ~300ms:

  1. VAD detects user speech on echo-cancelled audio (critical: without AEC, the agent's output triggers VAD, causing self-interruption)

  2. Server halts generation and clears its outgoing buffer

  3. Server sends {"type": "interrupt"} via RTCDataChannel

  4. Client flushes its local audio buffer (browsers buffer 200-500ms for smoothness)

  5. Conversation context truncates to reflect only what was actually played

Advanced systems add semantic filtering to distinguish genuine interruption ("wait, stop") from backchanneling ("uh-huh").
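
Here's a minimal sketch of the server-side half of that flow. The method and field names (ttsStream, outgoingAudioQueue, truncateContextToPlayedAudio) are illustrative, and the data channel map mirrors the one used later in this article:

javascript
// Called when VAD fires on echo-cancelled user audio while the agent is speaking
handleBargeIn(clientId) {
  // Stop generation and discard any audio still queued for egress
  if (this.ttsStream) {
    this.ttsStream.abort();
  }
  this.outgoingAudioQueue = [];

  // Tell the client to flush its local playback buffer
  const dataChannel = this.dataChannels.get(clientId);
  if (dataChannel?.readyState === 'open') {
    dataChannel.send(JSON.stringify({ type: 'interrupt' }));
  }

  // Truncate conversation context to what was actually played
  this.truncateContextToPlayedAudio(clientId);
}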

How Is Video Handled?

Sending every frame of a 30fps stream to a vision model is impractical. Agents typically decimate to ~1fps or trigger analysis on motion or scene changes.

Here's how to extract frames using FFmpeg:

javascript
const ffmpeg = require('fluent-ffmpeg');

class VideoProcessor {
  constructor(visionEndpoint) {
    this.visionEndpoint = visionEndpoint;
    this.lastFrameTime = 0;
    this.frameInterval = 1000; // Extract 1 frame per second
    this.rtpBuffer = [];
  }

  processVideoTrack(clientId, track) {
    track.onReceiveRtp.subscribe((rtp) => {
      this.rtpBuffer.push(rtp.payload);

      const now = Date.now();
      if (now - this.lastFrameTime > this.frameInterval) {
        this.extractAndAnalyzeFrame(clientId);
        this.lastFrameTime = now;
      }
    });
  }

  async extractAndAnalyzeFrame(clientId) {
    // Decode VP8/H.264 to a raw frame using FFmpeg
    // In production: pipe RTP packets to FFmpeg stdin
    // (decodeFrame and resizeFrame are placeholders for that pipeline)
    const frameBuffer = await this.decodeFrame(this.rtpBuffer);
    this.rtpBuffer = [];

    // Resize to model input (e.g., 512x512)
    const resized = await this.resizeFrame(frameBuffer, 512, 512);

    // Send to vision model
    const base64Image = resized.toString('base64');
    const analysis = await this.analyzeWithVision(base64Image);
    return analysis;
  }

  async analyzeWithVision(base64Image) {
    const response = await fetch(this.visionEndpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'gpt-4o',
        messages: [{
          role: 'user',
          content: [
            { type: 'text', text: 'Describe what you see.' },
            { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${base64Image}` } }
          ]
        }]
      })
    });
    return response.json();
  }
}

RTP packets contain fragmented video data that must be reassembled before decoding. Libraries like werift provide some of this, but full implementation typically uses GStreamer or FFmpeg pipelines.
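
One practical shortcut is to hand the stream to FFmpeg via an SDP description and let it handle depacketization and decoding. A hedged sketch (the session.sdp path and the downstream frame handling are placeholders):

javascript
const { spawn } = require('child_process');

// session.sdp (placeholder) describes the RTP video stream being forwarded
// to a local UDP port: codec, payload type, and port number.
const ffmpegProcess = spawn('ffmpeg', [
  '-protocol_whitelist', 'file,udp,rtp',
  '-i', 'session.sdp',
  '-vf', 'fps=1',        // decimate to 1 frame per second
  '-f', 'image2pipe',
  '-vcodec', 'mjpeg',
  'pipe:1'               // JPEG frames are written to stdout
]);

ffmpegProcess.stdout.on('data', (chunk) => {
  // In practice, reassemble complete JPEGs from the byte stream here
  // before base64-encoding them for the vision model.
});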

What About the Data Channel?

RTCDataChannel provides a bi-directional pipe using SCTP over DTLS, with configurable reliability:

  • Reliable mode for critical signals (session termination, function calls)

  • Unreliable mode for high-frequency ephemeral data (real-time transcription, detection bounding boxes); the sketch after this list shows how each mode is configured
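
Each mode maps directly onto the options passed to createDataChannel (the channel names here are arbitrary):

javascript
// Reliable, ordered channel (the default) for control messages and tool calls
const controlChannel = pc.createDataChannel('control');

// Unreliable, unordered channel for ephemeral data: stale packets are
// dropped instead of retransmitted, so they never block fresh ones
const transcriptChannel = pc.createDataChannel('transcripts', {
  ordered: false,
  maxRetransmits: 0
});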

The client (as the offerer) creates the data channel before generating the SDP offer:

javascript
// Client creates data channel before offer
const dataChannel = pc.createDataChannel('ai-responses');

dataChannel.onmessage = (e) => {
  const response = JSON.parse(e.data);
  if (response.type === 'transcript') {
    displayTranscript(response.text);
  }
};

The agent receives it via the ondatachannel event:

javascript
// Agent receives data channel
pc.ondatachannel = (event) => {
  const dataChannel = event.channel;
  this.dataChannels.set(clientId, dataChannel);

  dataChannel.onopen = () => {
    dataChannel.send(JSON.stringify({
      type: 'status',
      message: 'AI Agent connected'
    }));
  };
};

// Send AI responses back to client
sendAIResponse(clientId, response) {
  const dataChannel = this.dataChannels.get(clientId);
  if (dataChannel?.readyState === 'open') {
    dataChannel.send(JSON.stringify(response));
  }
}

Function calling typically flows through the data channel: the LLM emits a tool call, it's serialized to JSON and sent reliably, executed client or server-side, and the result feeds back into context.
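
A minimal sketch of that round trip, with illustrative message shapes and a hypothetical executeTool helper on the client:

javascript
// Agent side: the LLM emitted a tool call; forward it reliably to the client
sendToolCall(clientId, toolCall) {
  const dataChannel = this.dataChannels.get(clientId);
  dataChannel.send(JSON.stringify({
    type: 'tool_call',
    id: toolCall.id,
    name: toolCall.name,           // e.g. 'lookup_order' (illustrative)
    arguments: toolCall.arguments
  }));
}

// Client side: execute the tool and return the result on the same channel
dataChannel.onmessage = async (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === 'tool_call') {
    const result = await executeTool(msg.name, msg.arguments); // hypothetical helper
    dataChannel.send(JSON.stringify({ type: 'tool_result', id: msg.id, result }));
  }
};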

The Core Tradeoff

WebRTC enables real-time AI agents by trading TCP's reliability for UDP's immediacy, then reconstructing synchronization at the application layer through RTP timestamps, adaptive jitter buffers, and RTCP reports. The agent terminates WebRTC server-side, bridging stateful media sessions to stateless inference.

The result is sub-500ms response times that match human conversational expectations.
