
The Building Blocks of Live Streaming Infrastructure

Raymond F
Published February 23, 2026

Live streaming used to be simple, if not technically, then at least conceptually. It was really only about streaming live video, such as a sports game or concert, to an audience watching passively. Video went in, video came out.

Now, people expect much more from a live stream. It is a programmable, real-time experience. A shopping platform lets viewers tap products mid-stream to purchase. An auction house runs bidding where a 3-second delay means missed bids and angry customers. A social app lets viewers "raise their hand" and join the broadcast as a guest.

These use cases share a common trait: the “video stream” itself is just one component. The core infrastructure for “low-latency video” still exists, but the product now includes much more: authentication, interactive UI, moderation tools, commerce hooks, analytics, and tight synchronization between what the host sees and what thousands of viewers experience.

What are the composable infrastructure building blocks that make up live streaming? The goal of this guide is to provide product teams with a mental model for making informed trade-offs: latency vs. scale, interactivity vs. simplicity, build vs. buy. We'll start with the fundamentals and then work through the key architectural decisions.

Live Streaming in Modern Apps

The phrase "live streaming" now covers a spectrum of experiences with different technical requirements.

  • One-to-many broadcast is the classic model: a single publisher, many viewers. Latency requirements vary. A few seconds is fine for passive viewing, but tighter sync matters for scheduled reveals or countdown moments.
  • Interactive live adds bidirectional elements. Viewers send chat messages, reactions, and poll responses. Hosts respond in real-time. The stream becomes a conversation, and latency directly affects how natural that conversation feels. When a host says, "share your answer in the chat," and responses arrive 15 seconds later, the experience breaks down.
  • Participatory live goes further. Viewers can join as guests, appearing on-screen alongside the host. This is common in podcasts, interviews, live shopping with guest experts, and social apps with "stages." The technical requirements shift dramatically: guests need sub-second latency and bidirectional media, while the broader audience might watch via a higher-latency delivery method.
  • Transactional live ties real-time events to business logic. Auctions, flash sales, live shopping, gambling, and betting all require tight synchronization. If viewer A sees a bid two seconds before viewer B, the auction is unfair. If a "buy now" button appears at different times for different viewers, conversion suffers.

The media pipeline is integral, but no longer sufficient. The product is media plus real-time data plus platform operations, and the architecture must account for all three.

The architecture also determines latency. Different use cases have different latency tolerances. Understanding these requirements upfront shapes every architectural decision that follows.

| Use case | Acceptable latency | Why |
|---|---|---|
| Passive broadcast (news, sports) | 5-30 seconds | Viewers are watching, not participating |
| Chat-driven interaction | 2-5 seconds | Responses feel connected to the stream |
| Live shopping / product demos | 1-3 seconds | "Buy now" must align with what viewers see |
| Auctions and bidding | < 500ms | Fairness requires tight synchronization |
| Call-ins and guest appearances | 150-300ms | Conversation must feel natural |
| Cloud gaming | 60-100ms | Input lag destroys playability |

These numbers matter because they determine protocol choices. You generally can't achieve 200ms latency with HLS. You generally can't scale to millions with pure WebRTC. Knowing your target helps you pick the right architecture.
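As a rough sketch, that decision can be encoded as a chooser over your latency and audience targets. The thresholds below are illustrative rules of thumb, not hard protocol limits:

```typescript
// Illustrative heuristic mapping a latency target and audience size to a
// delivery pattern. Thresholds are rules of thumb, not protocol guarantees.
function suggestPattern(targetLatencyMs: number, expectedViewers: number): string {
  if (targetLatencyMs < 1_000) {
    // Sub-second requires WebRTC; very large rooms usually add an HLS tier
    return expectedViewers > 10_000 ? 'hybrid (WebRTC + HLS)' : 'WebRTC end-to-end';
  }
  if (targetLatencyMs < 5_000) return 'LL-HLS or hybrid';
  // Double-digit-second budgets let you use plain segmented delivery at scale
  return 'RTMP -> HLS via CDN';
}
```

A chooser like this is only a starting point; real decisions also weigh cost, existing infrastructure, and how interactive the experience needs to feel.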

Three Plane Model of Live Streaming Infrastructure: Media, Control, and Data

A useful mental model separates live streaming infrastructure into three planes:

| Plane | What it handles | Examples |
|---|---|---|
| Media plane | Audio/video capture, encoding, transport, and playback | WebRTC, HLS, RTMP ingest, transcoders, CDNs, SFUs |
| Control plane | Who can do what, when, and how | Authentication, signaling, session state, permissions, lifecycle |
| Data plane | Interactive features and real-time events | Chat, reactions, polls, viewer counts, commerce events |

The control plane orchestrates the other two. It doesn't move bytes or messages itself, but it decides whether and how the media and data planes operate.

When a viewer joins a stream, the control plane authenticates them and checks permissions. If they're allowed in, it provides what they need to connect: WebRTC signaling information or an HLS signed URL. Only then does the media plane start delivering video. Simultaneously, the data plane opens a channel for chat and reactions, also gated by the control plane's auth.

When the host calls goLive(), that's a control plane state change. The media plane responds by starting delivery to viewers. The data plane enables chat. When the host mutes a disruptive viewer, the control plane updates permissions, and the media plane enforces it by dropping that viewer's audio track. When the stream ends, both planes tear down in response to the lifecycle change.


Each plane scales, fails, and evolves differently.

The media plane is bandwidth-intensive and latency-sensitive. Adding a poll feature shouldn't require touching video infrastructure. The control plane is about state and coordination. Changing who can publish shouldn't require re-architecting media delivery. The data plane handles high-frequency, low-bandwidth events. Chat messages have different reliability requirements than video packets.

When teams conflate these planes, they end up with systems where adding a poll requires modifying the video pipeline, or where authentication changes risk playback stability. Clean separation means you can debug access issues without touching media infrastructure, add features to the data plane without risking video delivery, and scale each plane according to its actual bottlenecks.

Media Plane: Ingest, Processing, Delivery

The media plane is where video actually moves from camera to screen. It's also where you'll encounter an alphabet soup of protocols and technologies. Before diving in, let’s get the vocabulary out of the way.

| Term | What it is | Role in the pipeline |
|---|---|---|
| RTMP | Real-Time Messaging Protocol | Ingest from encoders (OBS, vMix) to servers. TCP-based, mature, everywhere. |
| SRT | Secure Reliable Transport | Ingest over unreliable networks. UDP with error correction. |
| WHIP | WebRTC-HTTP Ingestion Protocol | Ingest with WebRTC-level latency. HTTP signaling, UDP media. |
| WebRTC | Web Real-Time Communication | End-to-end real-time delivery. UDP, sub-second latency. |
| HLS | HTTP Live Streaming | Delivery via HTTP. Video as segments, CDN-cacheable, 10-30s latency. |
| LL-HLS | Low-Latency HLS | HLS with partial segments. 2-5s latency. |
| DASH | Dynamic Adaptive Streaming over HTTP | Like HLS, different manifest format. Similar tradeoffs. |
| SFU | Selective Forwarding Unit | Server that routes WebRTC packets without transcoding. |
| CDN | Content Delivery Network | Edge servers that cache and serve HLS/DASH segments. |
| GOP | Group of Pictures | Frames between keyframes. Shorter = faster joins, higher bitrate. |
| OBS | Open Broadcaster Software | Free, popular streaming/recording app. Supports RTMP, SRT, WHIP. |

You don't use all of these together. They combine in specific patterns depending on your latency and scale requirements. Most live streaming systems follow one of a few patterns:

Classic Broadcast: RTMP → HLS


The broadcaster uses OBS or similar software to send RTMP to your ingest server. The stream is transcoded into multiple bitrates, packaged into HLS segments, and distributed via CDN. Latency is long, but it scales to millions of viewers because CDNs are designed for this.

Low-Latency Broadcast: WHIP → WebRTC


The broadcaster uses WHIP (now supported in OBS) to send WebRTC directly to an SFU. The SFU forwards packets to viewers without transcoding. Latency drops to hundreds of milliseconds, but each viewer needs a persistent connection. Scaling requires adding SFU capacity and cascading between instances.

In-App Real-Time: WebRTC End-to-End


For in-app publishing (mobile apps, web apps), the SDK captures media and sends it via WebRTC directly to the SFU. No protocol translation, lowest possible latency. This is the path for interactive features like "bring a viewer on stage."

Hybrid: WebRTC + HLS


The best of both worlds. Interactive participants (hosts, guests, engaged viewers) connect via WebRTC for sub-second latency. The SFU also feeds a transcoder that produces HLS for the mass audience. Everyone sees the same content; latency varies based on their connection.

Each stage in the pipeline adds delay, and all delays depend on encoding settings, segment duration, player buffering, and network conditions.

Video travels through several stages:

  • Capture + encode: your device captures and encodes the raw footage.
  • Network transit: it travels across the network to your server.
  • Server processing: your server either forwards packets via an SFU (WebRTC) or creates segments (HLS).
  • Delivery: directly to viewers for WebRTC, or via CDN for HLS.
  • Player buffer: the player buffers enough to ensure smooth playback.

| Stage | WebRTC path | HLS path |
|---|---|---|
| Capture + encode | 30-100ms | 30-100ms |
| Network transit | 50-150ms | 50-150ms |
| Server processing | ~10ms (SFU forwarding) | 2-10s (segment creation) |
| Delivery | ~50ms | ~1s (CDN propagation) |
| Player buffer | 50-150ms (jitter buffer) | 6-20s (2-3 segments) |
| Total | 200-500ms | 10-30s |

The difference is structural. HLS latency comes from segmentation and buffering, which are features, not bugs. They enable CDN caching and resilient playback. WebRTC achieves sub-second latency because it avoids segmentation and multi-segment buffering, though this requires stateful infrastructure.

These architectures fail differently, too. When a CDN edge node goes down, traffic routes around it; the system handles partial failure gracefully. When an SFU instance dies, all connected viewers disconnect and must reconnect, often losing several seconds of content. Publisher failures (encoder crashes, uplink losses) affect everyone, regardless of architecture. And client-side failures happen constantly: network switches, backgrounded apps, device sleep. Build reconnection with exponential backoff into your client from the start.
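A minimal sketch of that reconnection policy, using exponential backoff with full jitter. The base delay, cap, and attempt count are illustrative:

```typescript
// Exponential backoff with a cap: attempt 0 waits baseMs, doubling each retry.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// `connect` is whatever function re-establishes your WebRTC or WebSocket
// session; the name is illustrative, not a specific library API.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 8,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return true; // connected
    } catch {
      // Full jitter: wait a random slice of the backoff window so that
      // thousands of clients don't reconnect in lockstep after an outage.
      const wait = Math.random() * backoffDelayMs(attempt);
      await new Promise((r) => setTimeout(r, wait));
    }
  }
  return false; // give up and surface an error to the UI
}
```

The jitter matters as much as the backoff: after an SFU restart, synchronized retries from every disconnected viewer can look like a self-inflicted DDoS.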

The architecture patterns above describe the flow conceptually. To make this concrete, here's what WebRTC publishing looks like in code. This example assumes you have a signaling server and SFU running (managed services like Stream handle this infrastructure, or you can self-host using open source SFUs like Janus or mediasoup).

For the WebRTC path, the publisher captures media, creates a peer connection, and sends tracks to the SFU:

```typescript
async function startPublishing(signalUrl: string) {
  // Capture camera and mic
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { width: 1280, height: 720 },
    audio: true,
  });

  // Connect to SFU
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
  });

  // Add tracks
  stream.getTracks().forEach(track => pc.addTrack(track, stream));

  // Exchange signaling
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const response = await fetch(signalUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sdp: offer.sdp }),
  });

  const { sdp } = await response.json();
  await pc.setRemoteDescription({ type: 'answer', sdp });

  return { pc, stream };
}
```

The SFU handles distribution to viewers. For HLS, you'd configure your encoder (OBS, FFmpeg) to push RTMP to your ingest URL, and the transcoding/packaging pipeline takes over from there.

You can think about it like this when you need to choose your configuration:

| If you need... | Use this pattern |
|---|---|
| Maximum reach, latency doesn't matter | RTMP → HLS |
| Low latency from broadcast tools | WHIP → WebRTC |
| In-app publishing, real-time interaction | WebRTC end-to-end |
| Interactive hosts + large audience | Hybrid (WebRTC + HLS) |
| Professional contribution over bad networks | SRT → (then HLS or WebRTC) |

Most production systems end up hybrid. The question is where to draw the line between your WebRTC tier (low latency, higher infrastructure cost) and your HLS tier (higher latency, lower cost at scale).

When planning capacity, size for peak concurrent viewers, not average. Know your SFU's per-instance viewer limit (it varies with video resolution and simulcast layers) and have a plan to spill excess viewers to HLS. Monitor utilization as you approach capacity, not after you've hit it.
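A sketch of that spill-over rule. The per-SFU viewer limit below is illustrative; real limits depend on resolution, bitrate, and simulcast layers:

```typescript
// Keep interactive viewers on WebRTC until SFU capacity is reached,
// then send the overflow to the HLS tier. `viewersPerSfu` is a made-up
// placeholder; measure your own per-instance limit under real load.
function assignTier(
  currentWebrtcViewers: number,
  sfuInstances: number,
  viewersPerSfu = 1_000,
): 'webrtc' | 'hls' {
  const capacity = sfuInstances * viewersPerSfu;
  return currentWebrtcViewers < capacity ? 'webrtc' : 'hls';
}
```

In production you would also want headroom (spill before 100% utilization) and hysteresis so viewers near the boundary aren't bounced between tiers.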

Control Plane: Identity, Signaling, Lifecycle

The control plane coordinates everything that isn't raw media. It determines who can connect, establishes the connections, and tracks the stream's state. While the media plane moves bytes, the control plane decides whether those bytes should flow at all.

Authentication

How auth works depends on your delivery method.

For HLS, authentication typically happens once upfront. The viewer calls your API with a token (usually a JWT), your server validates it and checks permissions, then returns a signed URL with an expiration timestamp and HMAC signature. The CDN validates the signature before serving segments. No persistent connection needed; auth is baked into the URL.
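To make the signed-URL idea concrete, here's a minimal sketch using Node's crypto module. The query parameter names (`expires`, `sig`) and the path-plus-expiry payload are illustrative; real CDNs each define their own signing format:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sign: append an expiry timestamp and an HMAC over path + expiry.
function signPlaybackUrl(
  path: string,
  secret: string,
  ttlSeconds: number,
  nowMs = Date.now(),
): string {
  const expires = Math.floor(nowMs / 1000) + ttlSeconds;
  const sig = createHmac('sha256', secret).update(`${path}:${expires}`).digest('hex');
  return `${path}?expires=${expires}&sig=${sig}`;
}

// Verify: reject expired URLs, then recompute and compare the signature
// in constant time so the check doesn't leak timing information.
function verifyPlaybackUrl(url: string, secret: string, nowMs = Date.now()): boolean {
  const [path, query] = url.split('?');
  const params = new URLSearchParams(query);
  const expires = Number(params.get('expires'));
  const sig = params.get('sig') ?? '';
  if (!expires || expires < Math.floor(nowMs / 1000)) return false; // expired
  const expected = createHmac('sha256', secret).update(`${path}:${expires}`).digest('hex');
  return sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
}
```

The point of the pattern: the CDN (or edge worker) can validate requests with nothing but the shared secret, so no per-viewer state or persistent connection is needed.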

Authentication Step in Control Plane: Identity, Signaling, Lifecycle

For WebRTC, authentication happens during signaling. The viewer connects to your signaling server (usually over WebSocket), and the token is validated before the server negotiates a connection. Auth and signaling are intertwined.

Signaling

Before any WebRTC media flows, participants exchange signaling messages. This involves a few protocols working together: SDP (Session Description Protocol) describes codec capabilities and connection parameters, while ICE (Interactive Connectivity Establishment) handles NAT traversal to find a network path between peers. ICE uses STUN servers to discover public IPs and falls back to TURN relay servers when a direct connection fails (often around 15-30% of connections, varying by network).

Signaling Step in Control Plane: Identity, Signaling, Lifecycle

Signaling complexity scales with your feature set. A simple broadcast just needs a basic offer/answer exchange. Add "bring a viewer on stage" and signaling must handle permission grants, track negotiation, and state changes:

Signaling Step 2 in Control Plane: Identity, Signaling, Lifecycle

The signaling server often becomes the coordination point for everything that isn't raw media. Here's a signaling client that handles auth, SDP exchange, and ICE negotiation:

```typescript
class SignalingClient {
  private ws: WebSocket;
  private pc: RTCPeerConnection;

  async connect(url: string, token: string): Promise<MediaStream> {
    this.ws = new WebSocket(`${url}?token=${token}`);
    await new Promise((res) => (this.ws.onopen = res));

    this.pc = new RTCPeerConnection({
      iceServers: [
        { urls: 'stun:stun.l.google.com:19302' },
        { urls: 'turn:turn.example.com', username: 'user', credential: 'pass' },
      ],
    });

    const stream = new MediaStream();
    this.pc.ontrack = (e) => stream.addTrack(e.track);

    this.pc.onicecandidate = (e) => {
      if (e.candidate) {
        this.ws.send(JSON.stringify({ type: 'ice', candidate: e.candidate }));
      }
    };

    this.pc.addTransceiver('video', { direction: 'recvonly' });
    this.pc.addTransceiver('audio', { direction: 'recvonly' });

    const offer = await this.pc.createOffer();
    await this.pc.setLocalDescription(offer);
    this.ws.send(JSON.stringify({ type: 'offer', sdp: offer.sdp }));

    this.ws.onmessage = async (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'answer') {
        await this.pc.setRemoteDescription({ type: 'answer', sdp: msg.sdp });
      } else if (msg.type === 'ice') {
        await this.pc.addIceCandidate(msg.candidate);
      }
    };

    return stream;
  }
}
```

The code shows how auth (token in WebSocket URL), signaling (SDP offer/answer), and ICE all flow through the same connection. This is the control plane in action: coordinating everything needed before media can flow.

Lifecycle

Streams move through states: scheduled, backstage, live, paused, and ended. Lifecycle state gates what's allowed.

Diagram showing how Stream livestreaming moves through states: scheduled, backstage, live, paused, ended

In backstage, the host can preview their camera, adjust settings, and invite guests, but viewers either can't join or see a waiting screen. In live, media flows to viewers, and interactive features are enabled. In paused, the stream is temporarily stopped, but the session remains active. Ended closes connections and triggers recording processing if enabled.

The state machine itself is simple. The hard part is enforcing it consistently: what happens if a host calls goLive() while the transcoder is still initializing? What if a viewer joins during the transition? These edge cases are where lifecycle management earns its complexity.
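The state machine above can be sketched as a single transition table, so every caller of goLive(), pause, and end agrees on what's legal (enforcement details like the transcoder race are deliberately left out):

```typescript
// Illustrative lifecycle gate: one table of legal transitions,
// checked in one place by the control plane.
type StreamState = 'scheduled' | 'backstage' | 'live' | 'paused' | 'ended';

const transitions: Record<StreamState, StreamState[]> = {
  scheduled: ['backstage'],
  backstage: ['live', 'ended'],
  live: ['paused', 'ended'],
  paused: ['live', 'ended'],
  ended: [], // terminal: nothing restarts an ended stream
};

function canTransition(from: StreamState, to: StreamState): boolean {
  return transitions[from].includes(to);
}
```

Centralizing the table is what makes the edge cases tractable: goLive() during transcoder startup becomes "the transition is legal, but the media plane isn't ready yet", which you can model as an extra guard rather than scattered if-statements.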

Data Plane: Realtime Events, Chat, Reactions

Most features users associate with "live streaming" aren't media at all. Chat, reactions, polls, viewer counts, product pins, and "raise hand" are real-time data layered on top of video. They travel through a separate path from media, with different infrastructure and different requirements.

Not All Features Are Created Equal

These features look similar from the UI, but they have different needs:

| Feature | Reliability | Latency | Persistence | Ordering |
|---|---|---|---|---|
| Chat messages | High (no lost messages) | Seconds OK | Yes (history) | Strict |
| Reactions/emoji | Low (losing some is fine) | Sub-second | No | None |
| Polls | High (votes must count) | Seconds OK | Yes (results) | None |
| Viewer count | Low (approximate OK) | Seconds OK | No | None |
| Product pins | High | Must sync with video | Yes | Timed |
| Q&A | High | Seconds OK | Yes | Moderated |

Pushing all of this through a single channel with uniform guarantees means either over-engineering for reactions or under-engineering for chat. A well-designed data plane treats these differently: fire-and-forget for reactions, persistent ordered delivery for chat, and request-response for poll votes.
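As a sketch, that routing decision can live in one lookup, so each feature gets delivery guarantees matching its requirements. The event and channel names here are illustrative, to be mapped onto whatever pub/sub system you use:

```typescript
// Illustrative mapping from feature to delivery semantics.
type Delivery = 'persistent-ordered' | 'fire-and-forget' | 'request-response';
type EventKind = 'chat' | 'reaction' | 'poll-vote' | 'viewer-count';

function deliveryFor(event: EventKind): Delivery {
  switch (event) {
    case 'chat':
      return 'persistent-ordered'; // history + strict ordering
    case 'poll-vote':
      return 'request-response'; // every vote must be acknowledged
    case 'reaction':
    case 'viewer-count':
      return 'fire-and-forget'; // losing some is fine
  }
}
```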

The data plane runs parallel to media, usually over WebSocket:

Data Plane: Realtime Events, Chat, Reactions

This separation means chat outages don't affect video playback, reaction spikes don't compete with media bandwidth, and you can scale each path independently.

You can build a data plane with WebSockets and a pub/sub system, but the edge cases add up: message ordering, delivery guarantees, presence tracking, reconnection handling, moderation hooks, and the video-sync problem discussed below. Most teams find this is a "buy" decision, using a managed real-time service and focusing their engineering on the product experience rather than the plumbing.

The tricky part is keeping data events synchronized with video. Data plane events travel via WebSocket and arrive in near-real-time, but the video itself is delayed by each viewer's latency. Data arrives ahead of the video it's supposed to accompany.

Consider this scenario: a host holds up a product and says, "Tap the screen to buy this now," then triggers a product pin. The pin travels via the data plane and arrives at all viewers within 100ms. But the video of the host saying those words is still in transit:

  • The WebRTC viewer (300ms video latency) sees the pin, then 200ms later hears the host say "tap now."
  • The HLS viewer (15s video latency) sees the pin, then waits 15 seconds before the host says anything about it.

For WebRTC viewers, 200ms of skew is barely noticeable. For HLS viewers, a pin appearing 15 seconds before the host mentions it is confusing at best, and breaks the experience at worst.

Approaches to handle this:

  • Timestamp and delay. Attach a stream timestamp to each event. The client compares it to the current playback position and delays the display until the viewer catches up. Works, but requires tight coordination between the data plane and player.
  • Design for skew. Accept that sync won't be perfect and design interactions that tolerate it. "React to this segment" instead of "react now." Polls that stay open for 30 seconds rather than five. Less precise, but simpler.
  • Separate tiers by latency. WebRTC viewers get tightly synced interactions. HLS viewers get a more passive experience with less time-sensitive features. Match the interaction model to the delivery method.
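The timestamp-and-delay approach boils down to a small calculation on the client. This sketch assumes the player exposes its current playback position on the same stream-time clock the events are tagged with:

```typescript
// How long to hold an event before showing it, so it lines up with the
// video the viewer is actually watching. Both arguments are in stream time.
function displayDelayMs(eventStreamTimeMs: number, playbackPositionMs: number): number {
  // If playback is still behind the event's stream time, wait the difference;
  // if the viewer has already passed that moment, show it immediately.
  return Math.max(0, eventStreamTimeMs - playbackPositionMs);
}
```

The hard part in practice is the clock itself: the data plane and the player must agree on stream time, which is exactly the tight coordination the first approach demands.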

That third approach is worth expanding into a design principle: don't fight the latency, design around it.

For WebRTC viewers (sub-second latency), you can build features that feel instantaneous: real-time reactions tied to specific moments, "tap now" prompts that sync with host actions, call-ins and go-on-stage flows, time-sensitive commerce like auctions and flash sales.

For HLS viewers (5-30 second latency), design features that don't depend on precise timing: chat that references the stream broadly rather than specific moments, polls with longer windows, reactions that aggregate into counts rather than showing individually, product links that persist rather than flash.

The key insight is to understand that HLS viewers aren't getting a worse experience; they're getting a different experience. Design features appropriate to each tier rather than trying to force real-time interactivity over a non-real-time transport.

Latency and the Transport Tradeoffs

It's tempting to treat latency as a number to minimize. Lower is better, right? But latency in live streaming is a tradeoff. The features that make HLS "slow" are the same features that make it scale. The choices that make WebRTC fast are the same choices that make it expensive to operate at scale.

Why HLS Scales

HLS latency comes from deliberate design choices, not technical limitations.

  • Segmentation enables caching. HLS packages video into discrete files (segments) served over HTTP. This means standard CDN infrastructure can cache and serve them. A segment requested by one viewer is already cached for the next thousand. But a segment can't be served until it's complete, so a 6-second segment means at least six seconds of delay before the first viewer can receive it.
  • Buffering enables resilience. HLS players typically buffer 2-3 segments ahead. If the network hiccups, playback continues from the buffer while the player recovers. This makes HLS remarkably tolerant of variable network conditions, but it adds 12-18 seconds of delay on top of segmentation.
  • HTTP enables compatibility. HLS works over standard HTTPS, which passes through virtually any firewall, proxy, or corporate network. No special ports, no UDP, no NAT traversal headaches. This ubiquity has a cost: TCP's reliability guarantees add latency under packet loss, and there's no way to skip ahead when congested.

These are the tradeoffs that enable HLS to scale to millions of viewers on commodity infrastructure. LL-HLS claws back some latency with partial segments and smarter playlist updates, but it's still fundamentally segment-based and buffer-dependent.
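The arithmetic behind those numbers is simple enough to sketch: at least one full segment of delay before the first byte can be served, plus however many segments the player buffers. This is a back-of-envelope floor that ignores encoding, CDN propagation, and network time:

```typescript
// Rough latency floor for segmented delivery: the current segment must
// complete before it can be served, and the player holds `bufferedSegments`
// more behind the live edge.
function hlsLatencyFloorSeconds(segmentSeconds: number, bufferedSegments: number): number {
  return segmentSeconds + segmentSeconds * bufferedSegments;
}
```

With classic 6-second segments and a 3-segment buffer, the floor is already 24 seconds, which is why shrinking segment duration (and, in LL-HLS, serving partial segments) is the main lever for reducing HLS latency.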

Why WebRTC Feels Real-Time

WebRTC takes the opposite approach at every level.

  • No segmentation. Media is transmitted as a continuous stream of packets over UDP. There's no "wait for the segment to complete" step. Packets are sent as soon as they're encoded and forwarded as soon as they arrive.
  • Minimal buffering. WebRTC jitter buffers are typically 50-150ms, just enough to smooth out packet timing variations. Compare this to HLS buffers measured in seconds. Less buffer means lower latency, but also less tolerance for network variability. A bad network moment that HLS would sail through might cause visible glitches in WebRTC.
  • UDP enables speed (mostly). UDP doesn't wait for lost packets. If a packet doesn't arrive, the decoder handles the gap, usually by concealing the error or waiting for a keyframe. This avoids TCP's head-of-line blocking, where a single lost packet holds up everything behind it. But UDP doesn't traverse all networks cleanly. About 20% of connections need TURN relay servers to work at all.
  • Stateful connections. Each WebRTC viewer maintains a persistent connection to an SFU. The server tracks each viewer's state, handles their ICE negotiation, and makes per-viewer forwarding decisions. This enables features like per-viewer quality adaptation, but it means infrastructure that scales with viewer count, not cache hits.

The choice between HLS and WebRTC is more about what kind of infrastructure you want to operate and what kind of experience you're building than it is about latency.

HLS trades latency for stateless scale. Your origin produces segments, CDNs cache them, and adding viewers is nearly free. The infrastructure is simple to reason about and resilient to failure. But you can't build real-time interactions on 15 seconds of delay.

WebRTC trades scale simplicity for interactivity. Sub-second latency enables conversations, auctions, and "tap now" moments. But you need SFU infrastructure that scales with connections, not cache hits. Costs grow differently, failure modes are different, and operational complexity is higher.

Most production systems end up with both: WebRTC for participants who need interactivity, HLS for audiences who just need to watch. The question isn't "which is better" but "where do you draw the line between them."

How to Build with Stream’s Live Streaming SDK

Stream's live streaming SDK maps directly to the three-plane model. The Video SDK handles the media plane with WebRTC-based delivery. Authentication and call lifecycle form the control plane. Chat and custom events provide the data plane. You'll use Stream's server-side SDK (backend) to generate auth tokens and the Video React SDK (frontend) to build the streaming UI.

Here's how to build a complete livestream experience.

Setting up Authentication

Stream uses JWTs for authentication. Your backend generates tokens using a Stream server-side SDK, and clients use these tokens to connect. This keeps your API secret secure while giving clients the credentials they need.

```javascript
import { StreamClient } from '@stream-io/node-sdk';

const client = new StreamClient(API_KEY, API_SECRET);

app.get('/token', async (req, res) => {
  const userId = req.query.user_id;
  const token = client.generateUserToken({ user_id: userId });
  res.json({ token, userId });
});
```

On the client, you initialize the video client with this token. The tokenProvider callback handles automatic token refresh when needed:

```typescript
import { StreamVideoClient, type User } from '@stream-io/video-react-sdk';

const user: User = {
  id: 'host-user',
  name: 'Livestream Host',
  image: 'https://example.com/avatar.png',
};

const client = new StreamVideoClient({
  apiKey: API_KEY,
  user,
  token: await getToken(user.id),
  tokenProvider: () => getToken(user.id),
});
```

Creating and Joining a Livestream (Host)

The host creates a call with the livestream type. The create: true flag creates the call if it doesn't exist. Members and custom metadata can be set during creation:

```typescript
const call = client.call('livestream', 'my-stream-id');

await call.join({
  create: true,
  data: {
    members: [{ user_id: 'host-user', role: 'host' }],
    custom: {
      title: 'My Livestream',
      description: 'A demo livestream powered by Stream',
    },
  },
});
```

After joining, the host is in backstage mode. They can preview their camera, adjust settings, and prepare before going live. Camera and microphone are enabled explicitly:

```typescript
await call.camera.enable();
await call.microphone.enable();
```

Going Live

The transition from backstage to live is a single API call. The start_hls: true option enables broadcasting via HLS output for viewers who need it:

```typescript
await call.goLive({ start_hls: true });
```

This is the control plane in action. The goLive() call changes the stream's lifecycle state, which triggers the media plane to start delivering video to viewers and enables data plane features like chat.

Host UI with State Hooks

Stream's React SDK provides hooks that automatically update when the call state changes. This eliminates manual state synchronization:

```tsx
import { useCallStateHooks, ParticipantView } from '@stream-io/video-react-sdk';

function HostUI() {
  const {
    useIsCallLive,
    useParticipantCount,
    useLocalParticipant,
    useCameraState,
    useMicrophoneState,
  } = useCallStateHooks();

  const isLive = useIsCallLive();
  const participantCount = useParticipantCount();
  const localParticipant = useLocalParticipant();
  const { camera, isMute: isCameraMute } = useCameraState();
  const { microphone, isMute: isMicMute } = useMicrophoneState();

  return (
    <div>
      {isLive ? <span>LIVE</span> : <span>BACKSTAGE</span>}
      <span>{participantCount - 1} viewers</span>
      {localParticipant && (
        <ParticipantView participant={localParticipant} trackType="videoTrack" />
      )}
      <button onClick={() => camera.toggle()}>
        {isCameraMute ? 'Camera Off' : 'Camera On'}
      </button>
      <button onClick={() => microphone.toggle()}>
        {isMicMute ? 'Mic Off' : 'Mic On'}
      </button>
    </div>
  );
}
```

Viewer Join Flow

Viewers have a different flow. They need to handle the case where the stream hasn't started yet. The pattern is to check if the call is live before joining:

```typescript
const streamCall = client.call('livestream', callId);

// Get call state without joining
const callState = await streamCall.get();
const isLive = callState.call?.session?.live_started_at != null;

if (isLive) {
  await streamCall.join();
} else {
  // Show waiting screen, poll for live status
}
```

For streams that aren't live yet, poll periodically until the host goes live:

```typescript
useEffect(() => {
  if (viewerState !== 'waiting') return;

  const interval = setInterval(async () => {
    const callState = await streamCall.get();
    const isLive = callState.call?.session?.live_started_at != null;
    if (isLive) {
      await streamCall.join();
      setViewerState('joined');
    }
  }, 3000);

  return () => clearInterval(interval);
}, [viewerState]);
```

Displaying the Host Video

Once joined, viewers find the host participant and render their video:

```tsx
function ViewerUI() {
  const { useParticipants } = useCallStateHooks();
  const participants = useParticipants();

  const host = participants.find(
    (p) => p.roles?.includes('host') || p.userId === 'host-user'
  );

  return (
    <div className="video-container">
      {host ? (
        <ParticipantView participant={host} trackType="videoTrack" />
      ) : (
        <div>Waiting for host video...</div>
      )}
    </div>
  );
}
```

Adding Chat From the Data Plane

Chat runs parallel to video through Stream's Chat SDK. Initialize it alongside the video client:

```typescript
import { StreamChat } from 'stream-chat';
import { Chat, Channel, MessageList, MessageInput } from 'stream-chat-react';

const chatClient = StreamChat.getInstance(API_KEY);
await chatClient.connectUser(
  { id: userId, name: userName },
  token
);

// Create or join the chat channel
const channel = chatClient.channel('livestream', callId, {
  name: 'Livestream Chat',
});
await channel.watch();
```

The chat channel is independent of the video call. This separation means chat continues working even if video encounters issues, and you can scale each independently.

```tsx
<Chat client={chatClient} theme="str-chat__theme-dark">
  <Channel channel={channel}>
    <MessageList />
    <MessageInput placeholder="Send a message..." />
  </Channel>
</Chat>
```

Custom Events for Reactions

For features like emoji reactions that don't need persistence, use custom events. They're fire-and-forget with minimal latency:

```typescript
// Send a reaction
await call.sendCustomEvent({
  type: 'reaction',
  emoji: '❤️',
});

// Listen for reactions
useEffect(() => {
  const unsubscribe = call.on('custom', (event) => {
    if (event.custom?.type === 'reaction') {
      showFloatingEmoji(event.custom.emoji);
    }
  });
  return () => unsubscribe();
}, [call]);
```

This maps to the data plane's different reliability requirements. Chat messages use persistent delivery with history. Reactions use ephemeral custom events, where losing a few doesn't matter.

Ending the Stream

When the host is done, they stop the live session and leave:

```typescript
await call.stopLive();
await call.leave();
```

Viewers automatically see the stream end through the state hooks. useIsCallLive() returns false, and the UI can show an "ended" state.

Most Systems End Up Hybrid

Live streaming infrastructure has evolved from "push video and hope" to a set of composable building blocks spanning media, control, and data planes. Understanding these building blocks lets you make informed tradeoffs.

The core decisions come down to:

  • Latency vs. scale: WebRTC for interactivity, HLS for massive audiences, hybrid for both
  • Build vs. buy: Infrastructure is complex and operational; managed solutions let you focus on your app
  • Interactivity architecture: Separate media from real-time features, design for synchronization

For most teams, the right approach is to use a managed platform that handles the infrastructure complexity while providing the flexibility to build differentiated experiences. Stream's live streaming is designed for exactly this: low-latency, scalable, interactive live video with the APIs and SDKs to build whatever your product requires.
