We recently open-sourced the Stream Video ESP32 SDK — an SDK that lets an ESP32-S3 or ESP32-P4 join a Stream Video call, capture camera and microphone input, encode H.264 + Opus in real time, and publish it over WebRTC.
Someone on a browser or mobile device can then see and hear the ESP32 live. If you're building a video doorbell, a baby monitor, a security camera, or anything else that needs to stream video from an ESP32 to a web or mobile app — this is what the SDK is for.
Getting there was not straightforward. This post covers the real challenges we ran into during development and the techniques we used to debug and solve them. Whether you're an experienced ESP32 developer or new to embedded systems, we hope there's something useful here.
How Video Calls Work (A Quick Primer)
Before diving into the challenges, here's a brief overview of how real-time video calling works under the hood. If you're already familiar with WebRTC and SFUs, feel free to skip ahead.
WebRTC (Web Real-Time Communication) is the standard protocol for real-time audio and video on the internet. It's what powers video calls in browsers and most modern calling apps. A WebRTC connection involves several layers:
- SDP (Session Description Protocol) — Before two devices can exchange media, they need to agree on what to send: which codecs, what resolution, how many audio channels. SDP is the format used to describe these capabilities. Each side creates an SDP "offer" or "answer" and exchanges it with the other.
- ICE (Interactive Connectivity Establishment) — Devices on the internet are often behind NATs and firewalls. ICE is the process of discovering network paths between two endpoints. It uses STUN servers (which help a device discover its public IP) and optionally TURN servers (which relay traffic when direct connections aren't possible).
- DTLS / SRTP — Once a network path is established, DTLS handles the encryption key exchange, and SRTP encrypts the actual audio/video packets. All WebRTC media is encrypted by default.
- SFU (Selective Forwarding Unit) — In a group call, you don't want every device sending media to every other device directly (that's O(n^2) connections). Instead, each device sends its media to a central server — the SFU — which then forwards it to the other participants. The SFU doesn't decode or re-encode the media; it just routes packets. Stream runs a global network of SFUs.
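To make the O(n^2) point concrete, here is the connection count for both topologies (a back-of-the-envelope sketch, not SDK code):

```c
// Full mesh: every pair of participants maintains a direct connection.
static int mesh_connections(int n) { return n * (n - 1) / 2; }

// SFU: each participant maintains exactly one connection, to the server.
static int sfu_connections(int n) { return n; }
```

For a 10-person call that's 45 pairwise mesh connections versus 10 connections to the SFU, and each device's uplink work stays constant as the call grows.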
In a typical video call flow:
- A signalling server (often over WebSocket) coordinates the session — authentication, room management, and exchanging SDP offers/answers and ICE candidates between participants and the SFU.
- Once signalling is complete, media flows directly between each device and the SFU over WebRTC.
With that context, here's what our SDK does.
What We Built
At a high level, the SDK does this:
ESP32 → WiFi → Coordinator (WebSocket/JSON) → REST joinCall
→ SFU (WebSocket/protobuf) → WebRTC (DTLS-SRTP) → Browser
It connects to Stream's signaling infrastructure in two phases. First, a coordinator WebSocket authenticates the device and gets the SFU endpoint. Then an SFU WebSocket carries binary protobuf for WebRTC negotiation — SDP exchange, ICE candidates, health checks. Once negotiated, media flows directly over WebRTC to Stream's Selective Forwarding Unit (SFU).
All of this — JSON parsing, protobuf encoding/decoding, TLS, WebRTC, H.264 encoding, Opus encoding, camera DMA, I2S microphone capture — runs on a dual-core 240 MHz chip with 512 KB of internal SRAM and (thankfully) a few MB of PSRAM.
Challenge 1: Memory Is the Constraint, Not CPU
If you're coming from backend or mobile development, the first thing that hits you on ESP32 is memory. Not disk, not CPU — RAM. The ESP32-S3 has about 512 KB of internal SRAM. A single WebSocket message from our coordinator can be 8 KB. An SDP offer is easily 2 KB. The H.264 encoder's working memory dwarfs everything else on the system.
How We Solved It
PSRAM is non-negotiable. We require boards with PSRAM (typically 8 MB of external SPI RAM) and configure ESP-IDF to use it aggressively:
CONFIG_SPIRAM=y
CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL=256
CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL=8192
CONFIG_SPIRAM_TRY_ALLOCATE_WIFI_LWIP=y
CONFIG_MBEDTLS_EXTERNAL_MEM_ALLOC=y
The key insight is what goes in PSRAM vs. internal RAM. Small, latency-sensitive allocations (WiFi buffers, FreeRTOS internals) stay internal. Large, throughput-oriented allocations (TLS working memory, encoder stacks) go to PSRAM. We enforce this per-task:
if (strcmp(name, "venc_0") == 0) {
    cfg->stack_size = CONFIG_STREAM_VIDEO_VENC_TASK_STACK;  // 128-192 KB
    cfg->stack_in_ext = true;  // allocate stack in PSRAM
    cfg->priority = 2;
    cfg->core_id = 0;
}
The video encoder task needs 128-192 KB of stack (yes, really — H.264 encoding has deep call chains). Putting that stack in PSRAM is the difference between the system running and a stack overflow on boot.
Cap everything. We set hard limits on every receive buffer: 8 KB for coordinator WebSocket messages, 8 KB for SFU HTTP responses, 2 KB for SDP strings. If something exceeds the cap, we drop it and log a warning rather than blowing up:
if (payload_len > 8192) {
    ESP_LOGW(TAG, "Payload too large (%d), skipping", payload_len);
    return;
}
Grow buffers adaptively. For protobuf encoding, we start small and double the buffer up to 3 times rather than pre-allocating worst-case:
size_t buffer_size = 2048;
for (int attempt = 0; attempt < 3; ++attempt) {
    buffer = malloc(buffer_size);
    stream = pb_ostream_from_buffer(buffer, buffer_size);
    if (pb_encode(&stream, SfuRequest_fields, &request)) break;
    free(buffer);  // too small: double and retry
    buffer_size *= 2;
}
This matters because a join request with a short SDP might be 1.5 KB, but one with many ICE candidates could be 6 KB. Pre-allocating 8 KB every time wastes memory that the encoder needs.
Challenge 2: Protobuf on a Device That Barely Has malloc
Stream's SFU protocol uses Protocol Buffers (protobuf) — Google's binary serialization format for structured data. On a server, you'd use the standard protobuf library and not think twice. On ESP32, the standard library is too heavy — it uses dynamic allocation everywhere and has a code size that would consume a significant chunk of flash.
We use nanopb, a C implementation designed for embedded systems. But nanopb's sweet spot is small, fixed-size messages. Our SFU messages contain variable-length SDP strings that can be kilobytes long.
How We Solved It
Static allocation with sane defaults. Our .nanopb options file sets global limits:
*.*:max_size:256
*.*:max_count:10
*.*:allocation_mode:STATIC
This means every string field gets a 256-byte static buffer in the generated struct. For most fields (session IDs, track names) that's plenty. For SDP offers, it's way too small.
Callback-based decoding for large fields. For the SFU event stream, we use a two-pass decode pattern. First, a probe decode with default (capped) buffers to figure out which event type we received. Then a full decode with nanopb callbacks that write large string fields into stack-allocated buffers:
// Stack buffers for large fields
char sdp_buf[2048];
char ice_buf[1024];

// Wire up callbacks before decoding
event.subscriber_offer.sdp.funcs.decode = &decode_string_cb;
event.subscriber_offer.sdp.arg = sdp_buf;
This avoids having 2 KB buffers baked into every protobuf struct instance (there are many), while still handling the few fields that actually need them.
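The capping behavior itself can be illustrated without nanopb. A minimal stand-in (the real callback reads from a pb_istream_t; the function name and signature here are ours, for illustration only):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Copy a wire-format string into a caller-provided buffer, truncating at
// the cap rather than overflowing. Returns false if truncation occurred.
static bool decode_string_capped(const char *wire, size_t wire_len,
                                 char *dst, size_t cap) {
    if (cap == 0) return false;
    size_t n = wire_len < cap - 1 ? wire_len : cap - 1;  // leave room for NUL
    memcpy(dst, wire, n);
    dst[n] = '\0';
    return wire_len < cap;
}
```

The same bound-then-copy discipline is what keeps an oversized SDP from corrupting a stack buffer.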
Challenge 3: WebRTC on a Microcontroller
WebRTC is a complex protocol stack: ICE for connectivity, DTLS for key exchange, SRTP for encrypted media, SDP for capability negotiation. Desktop browsers implement this in millions of lines of code. We needed it to work in a few hundred KB of flash and RAM.
We build on esp_peer, Espressif's WebRTC library for ESP32. It handles the core protocol machinery — but integrating it with a production SFU revealed several challenges.
TURN Doesn't Work (Yet)
TURN relay servers allow WebRTC connections to work behind restrictive NATs and firewalls. But we found that esp_peer's ICE agent has issues with TURN relay candidate prioritization — it would sometimes prefer a relay path over a direct one, or fail to negotiate altogether.
Our solution: default to STUN-only and make it a configurable toggle:
config STREAM_VIDEO_STUN_ONLY
bool "Use STUN only (ignore TURN servers)"
default y
help
Discard TURN URLs from the SFU ICE list and use only STUN.
Enabled by default because esp_peer's ICE agent has issues
with TURN relay candidate prioritization.
This works for most direct connections (same network or non-restrictive NAT). We expose the option to re-enable TURN for users who need it and can handle the quirks.
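The filtering itself is just string matching on the URL scheme of each ICE server entry. A minimal sketch (the function name and in-place compaction are ours, not the SDK's API):

```c
#include <string.h>

// Keep only "stun:"/"stuns:" URLs; drop "turn:"/"turns:" entries in place.
// Returns the number of servers kept.
static int filter_stun_only(const char *urls[], int count) {
    int kept = 0;
    for (int i = 0; i < count; ++i) {
        if (strncmp(urls[i], "stun:", 5) == 0 ||
            strncmp(urls[i], "stuns:", 6) == 0) {
            urls[kept++] = urls[i];
        }
    }
    return kept;
}
```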
SDP Rewriting
The SFU sends SDP offers formatted for browser-class clients. Some assumptions don't hold on ESP32:
- Opus stereo: The SFU offers opus/48000/2 (stereo), but most ESP32 boards have a single mono microphone. When configured for mono, we detect opus/48000/2 in a=rtpmap: lines and reconstruct the line with opus/48000/1:
if (current_is_audio && STREAM_SFU_AUDIO_CHANNELS == 1 &&
    strncmp(line, "a=rtpmap:", 9) == 0 &&
    strstr(line, "opus/48000/2")) {
    char mono_line[128];
    // Reconstruct: keep "a=rtpmap:<pt>" prefix, replace codec string
    snprintf(mono_line + prefix_len, sizeof(mono_line) - prefix_len,
             " opus/48000/1");
    sfu_builder_append(&section_builder, mono_line);
}
- Direction: The SFU offers sendrecv tracks, but ESP32 is publish-only (no screen to display received video). We match on the exact SDP attribute and substitute it:
if (strcmp(line, "a=sendrecv") == 0) {
    sfu_builder_append(&section_builder, "a=sendonly\r\n");
    continue;
}
These are surgical per-line operations on the SDP — we parse it line by line, rewrite what's needed, and rebuild the result.
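The rewrite loop reduces to a simple pattern: split on line boundaries, copy most lines through, substitute the few that need it. A standalone sketch (using '\n' separators for brevity, where real SDP uses "\r\n"; the caller must size the output buffer generously):

```c
#include <stddef.h>
#include <string.h>

// Rewrite an SDP string line by line into `out`, replacing "a=sendrecv"
// with "a=sendonly" and passing everything else through unchanged.
static void rewrite_sdp(const char *sdp, char *out, size_t cap) {
    out[0] = '\0';
    const char *p = sdp;
    while (*p) {
        const char *nl = strchr(p, '\n');
        size_t len = nl ? (size_t)(nl - p) : strlen(p);
        if (len == 10 && strncmp(p, "a=sendrecv", 10) == 0) {
            strncat(out, "a=sendonly", cap - strlen(out) - 1);
        } else {
            strncat(out, p, len);  // copy the line through verbatim
        }
        strncat(out, "\n", cap - strlen(out) - 1);
        p = nl ? nl + 1 : p + len;
    }
}
```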
Dual Peer Connections
Our SFU protocol requires separate peer connections for subscribing and publishing. The subscriber peer receives the SFU's offer and provides an answer. The publisher peer creates its own offer to declare what it's sending. Each peer has its own DTLS session, SRTP keys, and ICE state — meaning two separate esp_peer instances, each with its own FreeRTOS task and stack:
#define SFU_PEER_TASK_STACK_BYTES 16384  // subscriber
#define SFU_PUB_TASK_STACK_BYTES  24576  // publisher (larger: encoding path)
Challenge 4: Keeping Two WebSockets Alive
The SDK maintains two concurrent WebSocket connections: one to the coordinator (JSON text frames) and one to the SFU (binary protobuf frames). On a server, that's trivial. On ESP32 over WiFi, connections drop, the device might roam, and there's no operating system catching errors for you.
Health Monitoring
Both connections run background tasks that send periodic health checks and detect silence:
// Coordinator: 10s health interval, 30s silence threshold
static void health_monitor_task(void *pvParameters) {
    const TickType_t check_interval = pdMS_TO_TICKS(10000);
    const TickType_t no_event_threshold = pdMS_TO_TICKS(30000);

    while (client->should_reconnect) {
        vTaskDelay(check_interval);
        if (client->state == STREAM_SIGNALING_STATE_CONNECTED) {
            send_health_check(client);
            TickType_t now = xTaskGetTickCount();
            if ((now - client->last_event_tick) > no_event_threshold) {
                ESP_LOGW(TAG, "No coordinator events for 30s, reconnecting...");
                esp_websocket_client_stop(client->ws_client);
                xEventGroupSetBits(client->event_group, WS_ERROR_BIT);
            }
        }
    }
    vTaskDelete(NULL);
}
If 30 seconds pass with no events, we assume the connection is dead (even if TCP hasn't noticed yet) and force a reconnect. The SFU side has its own 5-second health interval with protobuf-encoded health requests.
We also set disable_pingpong_discon = true on both WebSocket clients. The default ESP-IDF behavior disconnects on missed ping/pong, but in our testing, this caused spurious disconnects on congested WiFi. We rely on our own application-level health checks instead.
Reconnection
Dedicated reconnect tasks wait on FreeRTOS event bits (disconnect or error), then retry with backoff:
static void reconnect_task(void *pvParameters) {
    while (client->should_reconnect) {
        xEventGroupWaitBits(client->event_group,
                            WS_DISCONNECTED_BIT | WS_ERROR_BIT,
                            pdTRUE, pdFALSE, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(client->reconnect_interval_ms));
        // attempt reconnect...
    }
}
For the SFU, reconnects use a FAST WebSocket reconnect strategy — the server keeps state for a short window, so the device can resume its session without re-negotiating WebRTC from scratch.
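A capped exponential backoff keeps retries from hammering a congested network. A sketch of a typical policy (illustrative values; the SDK's actual interval is configuration-driven, as shown above):

```c
#include <stdint.h>

// Capped exponential backoff: base * 2^attempt, clamped to max_ms.
static uint32_t backoff_ms(uint32_t base_ms, int attempt, uint32_t max_ms) {
    // Clamp the shift so the 64-bit intermediate can't overflow.
    uint64_t v = (uint64_t)base_ms << (attempt < 20 ? attempt : 20);
    return v > max_ms ? max_ms : (uint32_t)v;
}
```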
Challenge 5: Debugging on Embedded (Our Approach)
This is where we want to spend extra time, because debugging techniques for embedded are fundamentally different from what most developers are used to. There's no attaching a debugger to a production call. There's no console.log() you can check in browser DevTools. When something goes wrong, you need to have already built the instrumentation in.
Philosophy: Menuconfig-Driven Debug Flags
Our core debugging philosophy: every debug feature is a Kconfig toggle. Kconfig is ESP-IDF's build configuration system — you run idf.py menuconfig to open a terminal UI where you can enable or disable features, set buffer sizes, choose hardware options, and toggle debug flags. These choices are resolved at compile time, so disabled features add zero overhead to the final binary. This means:
- Debug code is compiled out in production builds — zero runtime overhead
- Developers enable exactly the diagnostics they need via idf.py menuconfig
- Each flag is documented with what it does and any setup required
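The compile-out pattern looks like this in practice (a sketch with a hypothetical audio-monitor flag; in the real SDK such macros come from sdkconfig.h, generated by menuconfig):

```c
#include <stddef.h>
#include <stdio.h>

// Hypothetical flag for illustration; normally defined (or not) by Kconfig.
// #define CONFIG_STREAM_AUDIO_DEBUG_MONITOR 1

static int frames_seen = 0;

static void on_audio_frame(size_t bytes) {
    frames_seen++;
#ifdef CONFIG_STREAM_AUDIO_DEBUG_MONITOR
    // Compiled in only when the toggle is enabled: zero cost otherwise.
    printf("audio frame %d: %zu bytes\n", frames_seen, bytes);
#else
    (void)bytes;  // debug monitor disabled: no logging, no overhead
#endif
}
```

Because the branch is resolved by the preprocessor, the disabled path contributes neither code size nor runtime cost.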
Here's our debug menu:
menu "Debug"
STREAM_VIDEO_UDP_STUN_TRACE — Log STUN UDP traffic
STREAM_VIDEO_RUN_RES_TEST — Run resolution/encoder test on boot
STREAM_AUDIO_LEVEL_PROBE — Probe mic level before publish
STREAM_AUDIO_DEBUG_MONITOR — Enable audio frame monitor
STREAM_AUDIO_DUMP_OPUS — Dump Opus frames to RAM
endmenu
Let's walk through each one and why it exists.
STUN UDP Trace: Seeing Through the Network Stack
When ICE negotiation fails, you're blind. Did the STUN binding request go out? Did a response come back? Was it from the right server?
Our UDP trace utility wraps lwIP's sendto and recvfrom at the linker level using GCC's --wrap feature:
ssize_t __wrap_lwip_sendto(int s, const void *dataptr, size_t size, int flags,
                           const struct sockaddr *to, socklen_t tolen) {
    if (size >= 20) {
        const uint8_t *b = (const uint8_t *)dataptr;
        // Check for STUN magic cookie (0x2112A442)
        if (b[0] == 0x00 && b[1] == 0x01 && b[4] == 0x21 && b[5] == 0x12) {
            log_sockaddr("STUN send", to, tolen, size);
            log_stun_txid("STUN send txid", b, size);
        }
    }
    return __real_lwip_sendto(s, dataptr, size, flags, to, tolen);
}
This intercepts every UDP send and receive on the system, checks if it looks like a STUN packet (by matching the magic cookie 0x2112A442), and logs the destination IP, port, packet size, and STUN transaction ID. The original function still runs — this is pure observation.
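The detection heuristic factors into a tiny predicate. A standalone version that checks the full magic cookie (the wire layout — 2-byte message type, 2-byte length, 4-byte cookie — is standard STUN per RFC 5389; the function name is ours):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Heuristic STUN detector: Binding Request type (0x0001) plus the
// magic cookie 0x2112A442 at byte offset 4 of the 20-byte header.
static bool looks_like_stun(const uint8_t *b, size_t len) {
    return len >= 20 &&
           b[0] == 0x00 && b[1] == 0x01 &&
           b[4] == 0x21 && b[5] == 0x12 &&
           b[6] == 0xA4 && b[7] == 0x42;
}
```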
The linker wrap approach is powerful: it works at a level below the WebRTC library, so you see exactly what goes on the wire without modifying esp_peer's source code. The trade-off is that the application must add linker flags:
target_link_options(${PROJECT_NAME}.elf PRIVATE
"-Wl,--wrap=lwip_sendto"
"-Wl,--wrap=lwip_recvfrom"
)
We document this in the Kconfig help text so developers don't wonder why nothing shows up after enabling the flag.
Audio Level Probe: Is the Microphone Even Working?
When you're getting silence from the audio pipeline, the first question is: is the microphone capturing anything at all? Or is it capturing, but the data is getting lost in the encode/publish pipeline?
The mic level probe reads raw I2S samples before any processing and computes RMS and peak levels. (I2S — Inter-IC Sound — is the hardware protocol ESP32 uses to receive digital audio data from a microphone.)
config STREAM_AUDIO_LEVEL_PROBE
bool "Probe mic level before publish"
default n
help
Reads raw mic samples briefly and logs RMS/peak level.
When enabled, before starting the publish flow, the SDK reads raw microphone data for a configurable window (default 500 ms) and logs the signal level. If RMS is near zero, you know the I2S configuration or microphone hardware is the problem — not the Opus encoder or WebRTC stack.
Opus Frame Dump: Validating Encoded Audio
Sometimes the microphone works, encoding appears to succeed, but the remote side hears nothing or garbled audio. The Opus dump flag captures encoded frames to a RAM buffer for later inspection:
config STREAM_AUDIO_DUMP_OPUS
bool "Dump Opus frames to RAM"
default n
config STREAM_AUDIO_DUMP_SECONDS
int "Opus dump duration (seconds)"
default 30
config STREAM_AUDIO_DUMP_MAX_BYTES
int "Opus dump RAM cap (bytes)"
default 262144
Each frame is stored with a 4-byte length prefix. This lets you verify frame sizes are reasonable (Opus at 32 kbps with 10 ms frames should produce ~40-byte frames), detect if the encoder is emitting empty or oversized frames, and even extract the dump over JTAG for offline analysis.
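Offline, the dump can be walked with that same 4-byte length-prefix convention. A sketch of a sanity-checking parser (assumptions: little-endian prefixes, matching the ESP32's native byte order; 1275 bytes is Opus's maximum packet size):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Walk a buffer of [4-byte LE length][payload] records and count frames,
// rejecting obviously bogus sizes (empty, oversized, or truncated).
static int count_opus_frames(const uint8_t *buf, size_t len) {
    int frames = 0;
    size_t off = 0;
    while (off + 4 <= len) {
        uint32_t flen;
        memcpy(&flen, buf + off, 4);  // little-endian prefix
        off += 4;
        if (flen == 0 || flen > 1275 || off + flen > len) return -1;
        off += flen;
        frames++;
    }
    return frames;
}
```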
Audio Frame Monitor: Is Data Flowing?
A lightweight probe that logs frame counts and byte counts for a few seconds after capture starts. It answers the question "is the pipeline running and producing data at the expected rate?" without the overhead of capturing actual frame content.
Resolution Test: Finding Stable Encoder Settings
Different ESP32 boards have different cameras with different sensor capabilities. What works on an OV2640 at 640x480 might fail on an OV3660, or succeed but only at 10 fps instead of 15. The resolution test iterates through common resolutions at boot, attempts capture + encode for each, and logs PASS/FAIL:
PASS: 320x240 @ 20fps
PASS: 640x480 @ 15fps
FAIL: 800x600 @ 15fps (encode timeout)
FAIL: 1280x720 @ 10fps (capture init failed)
Run this once on a new board to know exactly what your hardware can handle before spending hours debugging why a call drops frames.
Challenge 6: Audio Tuning on Constrained Hardware
Getting audio right on ESP32 was harder than video, surprisingly. The H.264 encoder "just works" once you give it enough stack. Audio requires careful tuning across the entire pipeline: I2S configuration, gain staging, encoder parameters, and FreeRTOS task priorities.
Task Priority Starvation
Early on, we gave every task a similar priority (around 5). Under load, the video encoder and WebRTC tasks would consume all available CPU time, and the audio source task — which reads raw samples from an I2S DMA buffer — wouldn't get scheduled in time. The DMA buffer would fill up, unread samples would be lost, and the result was audible glitches or silence on the remote side.
The fix was to raise the audio source task to priority 15 (in FreeRTOS, higher number = higher priority):
} else if (strcmp(name, "AUD_SRC") == 0) {
    // Keep audio source responsive; low priority can starve frames.
    cfg->priority = 15;
}
This ensures the audio source always preempts lower-priority tasks when it needs to run. It's safe because the task does very little work per invocation — read a chunk from the DMA buffer, hand it to the encode pipeline, yield. It runs for microseconds each time, so it doesn't starve the video encoder or WebRTC tasks despite its high priority.
Opus Tuning for Embedded
The default Opus encoder settings are designed for desktop/mobile VoIP. On ESP32, we had to adjust:
opus_cfg.frame_duration = ESP_OPUS_ENC_FRAME_DURATION_10_MS;
opus_cfg.application_mode = ESP_OPUS_ENC_APPLICATION_VOIP;
opus_cfg.complexity = 7;
opus_cfg.enable_dtx = false;
opus_cfg.enable_vbr = false;  // force CBR to avoid sparse frames
- 10 ms frames instead of 20 ms: lower latency, critical for real-time calls.
- Complexity 7 (out of 10): better quality than default 0 with acceptable CPU cost on ESP32-S3.
- DTX disabled: Discontinuous transmission saves bandwidth during silence, but in our testing it proved unreliable, so we keep the encoder transmitting continuously.
- CBR (constant bitrate): VBR can produce very small frames during silence that confuse downstream RTP packetization. CBR keeps frame sizes predictable.
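With CBR, the frame size follows directly from bitrate and frame duration. The arithmetic as a tiny helper (trivial, but handy when sanity-checking Opus dumps):

```c
#include <stdint.h>

// CBR Opus frame size in bytes: bitrate (bits/s) * duration (ms) / 8000.
static uint32_t cbr_frame_bytes(uint32_t bitrate_bps, uint32_t frame_ms) {
    return bitrate_bps * frame_ms / 8000;
}
```

For example, 32 kbps with 10 ms frames yields 40-byte frames, which is the figure to expect when inspecting a frame dump.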
Challenge 7: TLS Everywhere, On a Microcontroller
Every connection uses TLS: coordinator WebSocket (WSS), SFU WebSocket (WSS), coordinator REST API (HTTPS), SFU HTTP signaling (HTTPS). That's four TLS handshakes, each requiring certificate validation, key exchange, and symmetric cipher setup.
We use ESP-IDF's certificate bundle (esp_crt_bundle_attach) which embeds a curated set of root CA certificates in flash. This avoids needing to ship individual certificates per server but adds ~64 KB to the binary.
The big memory optimization is routing mbedTLS allocations to PSRAM:
CONFIG_MBEDTLS_EXTERNAL_MEM_ALLOC=y
A single TLS handshake can temporarily allocate 30-40 KB. With four connections and the encoder running, this would exhaust internal RAM. PSRAM is slower, but the handshake is a one-time cost.
We also added detailed TLS error logging on WebSocket connection failures, because "TLS handshake failed" is the most common and least informative error in embedded development:
case WEBSOCKET_EVENT_ERROR:
    if (data->error_handle.esp_tls_last_esp_err) {
        ESP_LOGE(TAG, "TLS error: 0x%x",
                 data->error_handle.esp_tls_last_esp_err);
    }
    if (data->error_handle.esp_tls_stack_err) {
        ESP_LOGE(TAG, "TLS stack error: 0x%x",
                 data->error_handle.esp_tls_stack_err);
    }
Lessons Learned
1. Build debug instrumentation from day one. Our Kconfig debug menu wasn't an afterthought — it grew alongside the SDK. Every time we spent more than an hour debugging something, we asked: "What toggle would have made this a 5-minute problem?" Then we built it.
2. Memory budgets, not memory hopes. On ESP32, you need to know exactly where every kilobyte goes. We cap buffers, we put stacks in PSRAM, and we log free heap at startup. A vague "it works on my board" is not good enough when PSRAM sizes and internal SRAM availability vary across boards.
3. Don't fight the hardware. TURN doesn't work well with esp_peer? Default to STUN-only. Stereo Opus doesn't make sense with a mono microphone? Rewrite the SDP. The ESP32 is powerful for its size, but it's not a laptop. Meet it where it is.
4. Health checks beat TCP keepalives. WiFi connections can be half-open — TCP thinks it's connected, but packets are being silently dropped. Application-level health checks with explicit timeouts caught problems that TCP never would.
5. Test the pipeline in pieces. Our debug flags exist because we learned the hard way: when audio doesn't work, the problem could be in I2S config, gain, encoding, packetization, WebRTC, or the SFU. The mic level probe, Opus dump, and frame monitor let you test each stage independently.
What's Next
The SDK currently supports publish-only — the ESP32 can send video and audio, but can't yet receive and render remote participants. Subscribe support, better reconnect behavior across WiFi roaming, and broader board support are on the roadmap.
The SDK is available on GitHub under the Stream License. Full documentation — installation, quickstart, SDK configuration, and API reference — is on the Stream Video ESP32 docs.
The Stream Video ESP32 SDK is built on ESP-IDF v5.4+ and targets ESP32-S3 and ESP32-P4. It uses esp_peer for WebRTC, nanopb for protobuf, and Espressif's capture/codec components for hardware-accelerated H.264 and Opus encoding.