
How Is WebRTC Used for Bi-Directional Voice and Video Streaming in AI Agents?

Raymond F
Published January 7, 2026

WebRTC has become the standard transport layer for AI agents requiring real-time voice and video.

Originally designed for browser-to-browser video calls, WebRTC is a protocol stack that enables real-time audio and video communication over UDP. Because it prioritizes low latency over guaranteed delivery, it is ideal for the sub-500ms response times that natural conversation requires.

Here, we explain how WebRTC handles the hard problems of streaming voice and video to and from AI agents in real time: adaptive buffering, echo cancellation, encryption, and synchronization between audio and video streams.

Why WebRTC Instead of WebSockets?

The core constraint in voice AI is Total Turn-Around Time (T-TAT): the gap between a user finishing an utterance and the agent responding.

WebSockets use TCP, which guarantees ordered delivery. When a packet is lost, TCP halts all subsequent packets until retransmission completes. This Head-of-Line blocking introduces unpredictable latency. A 200ms wait to recover a lost syllable fragment creates silence that disrupts conversation far more than a brief audio glitch would.

WebRTC uses UDP instead, accepting packet loss in exchange for consistent low latency. But raw UDP is insufficient for intelligible communication, so WebRTC adds:

  • RTP encapsulates media with sequence numbers and timestamps for reordering and synchronization

  • Adaptive jitter buffers dynamically resize based on network conditions, expanding during instability and shrinking to minimize latency when stable

  • Congestion control estimates bandwidth continuously and signals encoders to reduce bitrate before packet loss occurs

  • Echo cancellation at the browser level subtracts the agent's output from microphone input, preventing feedback loops and false barge-in triggers (see the capture snippet after this list)
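
On the capture side, the browser's audio processing is requested through standard getUserMedia constraints. A minimal client-side snippet (browser APIs only, nothing library-specific):

javascript
// Ask the browser for a microphone track with echo cancellation,
// noise suppression, and auto gain control applied before WebRTC sees it.
async function getProcessedMicrophone() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true
    }
  });
  // These tracks are what get added to the RTCPeerConnection later on.
  return stream.getAudioTracks();
}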

How Does an AI Agent Join a WebRTC Session?

WebRTC was designed for browser-to-browser communication. Making an AI agent participate requires it to terminate the connection server-side as a "robot peer."

This creates a challenge: WebRTC sessions are stateful (encryption keys, sequence numbers, buffer state must persist in memory), while ML inference is typically stateless. The solution is a dedicated worker process per session that holds the PeerConnection state, decodes incoming RTP to raw frames, and bridges to stateless inference APIs.

Server-side WebRTC libraries include pion (Go), aiortc (Python/asyncio), and werift (Node.js/TypeScript). These expose the raw media loop, letting you intercept RTP packets and extract PCM audio or video frames for ML processing.
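
As a rough sketch of that per-session worker pattern, assuming werift on the server (the SessionWorker class and the inferenceUrl endpoint are illustrative, not part of any library):

javascript
const { RTCPeerConnection } = require('werift');

// One worker per user session: it owns the stateful PeerConnection
// (keys, sequence numbers, buffers) and turns incoming media into
// independent, stateless calls to an inference API.
class SessionWorker {
  constructor(sessionId, inferenceUrl) {
    this.sessionId = sessionId;
    this.inferenceUrl = inferenceUrl; // hypothetical stateless inference endpoint
    this.pc = new RTCPeerConnection();

    this.pc.ontrack = (event) => {
      event.track.onReceiveRtp.subscribe((rtp) => {
        // Stateful work (reordering, buffering, decoding) stays in this process;
        // each decoded chunk then becomes a self-contained inference request.
        this.forwardToInference(rtp.payload);
      });
    };
  }

  async forwardToInference(payload) {
    await fetch(this.inferenceUrl, { method: 'POST', body: payload });
  }
}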

Before media can flow, peers must exchange connection metadata through a signaling server. This typically happens over a WebSocket connection and involves two types of messages:

  • SDP (Session Description Protocol) describes each peer's capabilities: supported codecs, media types, and connection parameters. The client creates an "offer," and the agent responds with an "answer."

  • ICE candidates contain network path information for NAT traversal. Each peer discovers its possible network routes and shares them with the other.

Here's the client-side flow:

javascript
// Client creates offer and sends to agent via signaling server
async createOffer(peerId) {
  const pc = this.createPeerConnection(peerId);

  // Add local media tracks to the connection
  if (this.localStream) {
    this.localStream.getTracks().forEach(track => {
      pc.addTrack(track, this.localStream);
    });
  }

  // Create SDP offer
  const offer = await pc.createOffer({
    offerToReceiveAudio: true,
    offerToReceiveVideo: true
  });
  await pc.setLocalDescription(offer);

  // Send offer to peer via signaling server
  this.send({
    type: 'offer',
    targetId: peerId,
    sdp: pc.localDescription
  });
}

The agent receives the offer and responds:

javascript
// Agent handles incoming offer (using werift)
async handleOffer(clientId, sdp) {
  const pc = await this.createPeerConnection(clientId);
  await pc.setRemoteDescription(sdp);

  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);

  // Send answer back to client
  this.send({
    type: 'answer',
    targetId: clientId,
    sdp: {
      type: pc.localDescription.type,
      sdp: pc.localDescription.sdp
    }
  });
}

ICE candidates are exchanged as they're discovered:

javascript
pc.onicecandidate = ({ candidate }) => {
  if (candidate) {
    this.send({
      type: 'ice_candidate',
      targetId: peerId,
      candidate: candidate.toJSON()
    });
  }
};

Production deployments typically use a Selective Forwarding Unit (SFU) as an intermediary rather than direct P2P connections. The user connects to the SFU, and the AI agent joins the same room as another participant. The SFU handles bandwidth estimation, simulcast, and NAT traversal, shielding the agent from last-mile network instability. This also allows independent scaling: an SFU node might handle 500 connections while a GPU-heavy agent node handles 10.

What Does the Audio Pipeline Look Like?

Ingress transforms RTP packets into ML-ready data:

  1. Parse RTP headers and reorder out-of-sequence packets using sequence numbers

  2. Buffer packets in a jitter buffer (AI agents tune aggressively low, ~20-30ms versus 60-100ms for standard VoIP)

  3. Decode Opus to raw PCM (typically 16-bit at 16-48kHz)

  4. Run Voice Activity Detection to gate STT processing and detect end-of-turn

With werift, you subscribe to RTP packets on the incoming track:

javascript
// Agent receives audio track when client connects
pc.ontrack = (event) => {
  if (event.track.kind === 'audio') {
    this.processAudioTrack(clientId, event.track);
  }
};

// Subscribe to raw RTP packets
processAudioTrack(clientId, track) {
  track.onReceiveRtp.subscribe((rtp) => {
    // rtp.payload contains Opus-encoded audio
    // In production: decode Opus to PCM, run VAD,
    // buffer until speech ends, then send to STT
    if (this.aiProcessor) {
      this.aiProcessor.onAudioPacket(clientId, rtp.payload);
    }
  });
}

The code above shows how to access raw RTP packets, but you'd need to add Opus decoding, Voice Activity Detection, and buffering logic.


Here's what that looks like:

javascript
const { OpusEncoder } = require('@discordjs/opus');

// @discordjs/opus exposes a single OpusEncoder class that both encodes
// and decodes. 48kHz sample rate, 1 channel (mono).
const decoder = new OpusEncoder(48000, 1);

// Buffer to accumulate audio between VAD events
let audioBuffer = [];
let silenceFrames = 0;
const SILENCE_THRESHOLD = 25; // ~500ms at 20ms frames

processAudioTrack(clientId, track) {
  track.onReceiveRtp.subscribe((rtp) => {
    // Decode Opus to PCM (16-bit signed integers)
    const pcm = decoder.decode(rtp.payload);

    // Simple energy-based VAD (production: use Silero VAD)
    const energy = calculateEnergy(pcm);
    const isSpeech = energy > 0.01;

    if (isSpeech) {
      audioBuffer.push(pcm);
      silenceFrames = 0;
    } else {
      silenceFrames++;

      // End of utterance: silence exceeded threshold
      if (audioBuffer.length > 0 && silenceFrames > SILENCE_THRESHOLD) {
        const fullUtterance = Buffer.concat(audioBuffer);
        this.sendToSTT(clientId, fullUtterance);
        audioBuffer = [];
      }
    }
  });
}

function calculateEnergy(pcm) {
  let sum = 0;
  for (let i = 0; i < pcm.length; i += 2) {
    const sample = pcm.readInt16LE(i) / 32768;
    sum += sample * sample;
  }
  return sum / (pcm.length / 2);
}

For production VAD, use Silero VAD, which runs as an ONNX model and provides much more accurate speech detection than energy-based approaches.

The processed audio feeds into either a cascaded pipeline (VAD → STT → LLM → TTS) or a native speech-to-speech model like GPT-4o that accepts audio tokens directly. Cascaded pipelines accumulate latency at each stage, often exceeding 1.5 seconds total. Native S2S models eliminate intermediate text conversion, reducing latency and preserving emotional nuance.
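
As a rough sketch of why the cascaded stages add up (stt, llm, and tts here stand in for whichever providers you use; they are not specific APIs):

javascript
// Cascaded pipeline: each stage must complete (or at least start streaming)
// before the next one can begin, so per-stage latencies accumulate.
async function respondCascaded(utterancePcm) {
  const transcript = await stt(utterancePcm); // speech-to-text latency
  const replyText = await llm(transcript);    // LLM time-to-first-token plus generation
  const replyPcm = await tts(replyText);      // text-to-speech synthesis latency
  return replyPcm; // PCM to encode and send via the egress path below
}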

Egress requires encoding audio and managing RTP timestamps correctly:

javascript
const crypto = require('crypto');
const { OpusEncoder } = require('@discordjs/opus');

// Opus encoder: 48kHz, 1 channel, 20ms frames
const encoder = new OpusEncoder(48000, 1);
const SAMPLES_PER_FRAME = 960; // 48000 Hz * 0.02s

class AudioEgress {
  constructor(track) {
    this.track = track;
    this.sequenceNumber = 0;
    this.timestamp = 0;
    this.ssrc = crypto.randomBytes(4).readUInt32BE(0); // Synchronization source ID (random 32-bit value)
  }

  // Send PCM audio back to client
  sendAudio(pcmBuffer) {
    // Process in 20ms chunks (16-bit mono: 2 bytes per sample)
    for (let offset = 0; offset < pcmBuffer.length; offset += SAMPLES_PER_FRAME * 2) {
      const frame = pcmBuffer.slice(offset, offset + SAMPLES_PER_FRAME * 2);
      const encoded = encoder.encode(frame);

      // Build RTP packet
      const rtp = this.buildRtpPacket(encoded);
      this.track.writeRtp(rtp);

      // Increment for next packet
      this.sequenceNumber = (this.sequenceNumber + 1) % 65536;
      this.timestamp += SAMPLES_PER_FRAME; // 960 samples per 20ms frame
    }
  }

  buildRtpPacket(payload) {
    // RTP header: 12 bytes fixed
    const header = Buffer.alloc(12);
    header[0] = 0x80; // Version 2, no padding, no extension
    header[1] = 111;  // Payload type for Opus
    header.writeUInt16BE(this.sequenceNumber, 2);
    header.writeUInt32BE(this.timestamp, 4);
    header.writeUInt32BE(this.ssrc, 8);
    return Buffer.concat([header, payload]);
  }
}

For Opus at 48kHz with 20ms packets, each packet's timestamp must increment by exactly 960 (48000 × 0.02). Incorrect timestamps cause playback drift or gaps. If video is also generated, RTCP Sender Reports link both streams' timestamps to a common wall-clock time for lip-sync.

How Does Barge-In Work?

Full-duplex WebRTC enables natural interruption, but implementation requires coordinated state changes within ~300ms:

  1. VAD detects user speech on echo-cancelled audio (critical: without AEC, the agent's output triggers VAD, causing self-interruption)

  2. Server halts generation and clears its outgoing buffer

  3. Server sends {"type": "interrupt"} via RTCDataChannel

  4. Client flushes its local audio buffer (browsers buffer 200-500ms for smoothness)

  5. Conversation context truncates to reflect only what was actually played

Advanced systems add semantic filtering to distinguish genuine interruption ("wait, stop") from backchanneling ("uh-huh").
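
Here's a minimal sketch of the server-side half of that flow. The method and field names (ttsStream, outgoingAudioQueue, truncateContextToPlayedAudio) are illustrative, and the data channel map mirrors the one used later in this article:

javascript
// Called when VAD fires on echo-cancelled user audio while the agent is speaking
handleBargeIn(clientId) {
  // Stop generation and discard any audio still queued for egress
  if (this.ttsStream) {
    this.ttsStream.abort();
  }
  this.outgoingAudioQueue = [];

  // Tell the client to flush its local playback buffer
  const dataChannel = this.dataChannels.get(clientId);
  if (dataChannel?.readyState === 'open') {
    dataChannel.send(JSON.stringify({ type: 'interrupt' }));
  }

  // Truncate conversation context to what was actually played
  this.truncateContextToPlayedAudio(clientId);
}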

How Is Video Handled?

Sending every frame of a 30fps stream to a vision model is impractical. Agents typically decimate to ~1fps or trigger analysis on motion or scene changes.

Here's how to extract frames using FFmpeg:

javascript
const ffmpeg = require('fluent-ffmpeg');

class VideoProcessor {
  constructor(visionEndpoint) {
    this.visionEndpoint = visionEndpoint;
    this.lastFrameTime = 0;
    this.frameInterval = 1000; // Extract 1 frame per second
    this.rtpBuffer = [];
  }

  processVideoTrack(clientId, track) {
    track.onReceiveRtp.subscribe((rtp) => {
      this.rtpBuffer.push(rtp.payload);

      const now = Date.now();
      if (now - this.lastFrameTime > this.frameInterval) {
        this.extractAndAnalyzeFrame(clientId);
        this.lastFrameTime = now;
      }
    });
  }

  async extractAndAnalyzeFrame(clientId) {
    // Decode VP8/H.264 to a raw frame using FFmpeg
    // In production: pipe RTP packets to FFmpeg stdin
    // (decodeFrame and resizeFrame are placeholders for that pipeline)
    const frameBuffer = await this.decodeFrame(this.rtpBuffer);
    this.rtpBuffer = [];

    // Resize to model input (e.g., 512x512)
    const resized = await this.resizeFrame(frameBuffer, 512, 512);

    // Send to vision model
    const base64Image = resized.toString('base64');
    const analysis = await this.analyzeWithVision(base64Image);
    return analysis;
  }

  async analyzeWithVision(base64Image) {
    const response = await fetch(this.visionEndpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'gpt-4o',
        messages: [{
          role: 'user',
          content: [
            { type: 'text', text: 'Describe what you see.' },
            { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${base64Image}` } }
          ]
        }]
      })
    });
    return response.json();
  }
}

RTP packets contain fragmented video data that must be reassembled before decoding. Libraries like werift provide some of this, but full implementation typically uses GStreamer or FFmpeg pipelines.
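
One practical shortcut is to hand the stream to FFmpeg via an SDP description and let it handle depacketization and decoding. A hedged sketch (the session.sdp path and the downstream frame handling are placeholders):

javascript
const { spawn } = require('child_process');

// session.sdp (placeholder) describes the RTP video stream being forwarded
// to a local UDP port: codec, payload type, and port number.
const ffmpegProcess = spawn('ffmpeg', [
  '-protocol_whitelist', 'file,udp,rtp',
  '-i', 'session.sdp',
  '-vf', 'fps=1',        // decimate to 1 frame per second
  '-f', 'image2pipe',
  '-vcodec', 'mjpeg',
  'pipe:1'               // JPEG frames are written to stdout
]);

ffmpegProcess.stdout.on('data', (chunk) => {
  // In practice, reassemble complete JPEGs from the byte stream here
  // before base64-encoding them for the vision model.
});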

What About the Data Channel?

RTCDataChannel provides a bi-directional pipe using SCTP over DTLS, with configurable reliability:

  • Reliable mode for critical signals (session termination, function calls)

  • Unreliable mode for high-frequency ephemeral data (real-time transcription, detection bounding boxes); the sketch after this list shows how each mode is configured
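
Each mode maps directly onto the options passed to createDataChannel (the channel names here are arbitrary):

javascript
// Reliable, ordered channel (the default) for control messages and tool calls
const controlChannel = pc.createDataChannel('control');

// Unreliable, unordered channel for ephemeral data: stale packets are
// dropped instead of retransmitted, so they never block fresh ones
const transcriptChannel = pc.createDataChannel('transcripts', {
  ordered: false,
  maxRetransmits: 0
});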

The client (as the offerer) creates the data channel before generating the SDP offer:

javascript
// Client creates data channel before offer
const dataChannel = pc.createDataChannel('ai-responses');

dataChannel.onmessage = (e) => {
  const response = JSON.parse(e.data);
  if (response.type === 'transcript') {
    displayTranscript(response.text);
  }
};

The agent receives it via the ondatachannel event:

javascript
// Agent receives data channel
pc.ondatachannel = (event) => {
  const dataChannel = event.channel;
  this.dataChannels.set(clientId, dataChannel);

  dataChannel.onopen = () => {
    dataChannel.send(JSON.stringify({
      type: 'status',
      message: 'AI Agent connected'
    }));
  };
};

// Send AI responses back to client
sendAIResponse(clientId, response) {
  const dataChannel = this.dataChannels.get(clientId);
  if (dataChannel?.readyState === 'open') {
    dataChannel.send(JSON.stringify(response));
  }
}

Function calling typically flows through the data channel: the LLM emits a tool call, it's serialized to JSON and sent reliably, executed client or server-side, and the result feeds back into context.
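
A minimal sketch of that round trip, with illustrative message shapes and a hypothetical executeTool helper on the client:

javascript
// Agent side: the LLM emitted a tool call; forward it reliably to the client
sendToolCall(clientId, toolCall) {
  const dataChannel = this.dataChannels.get(clientId);
  dataChannel.send(JSON.stringify({
    type: 'tool_call',
    id: toolCall.id,
    name: toolCall.name,           // e.g. 'lookup_order' (illustrative)
    arguments: toolCall.arguments
  }));
}

// Client side: execute the tool and return the result on the same channel
dataChannel.onmessage = async (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === 'tool_call') {
    const result = await executeTool(msg.name, msg.arguments); // hypothetical helper
    dataChannel.send(JSON.stringify({ type: 'tool_result', id: msg.id, result }));
  }
};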

The Core Tradeoff

WebRTC enables real-time AI agents by trading TCP's reliability for UDP's immediacy, then reconstructing synchronization at the application layer through RTP timestamps, adaptive jitter buffers, and RTCP reports. The agent terminates WebRTC server-side, bridging stateful media sessions to stateless inference.

The result is sub-500ms response times that match human conversational expectations.
