Building real-time video is the single hardest infrastructure decision most product teams will face, and the one they're most likely to underestimate.
Why? The underlying technology behind most real-time video, WebRTC, is well documented and well supported. A basic peer-to-peer video demo can be hacked together in a morning with AI, with nothing more than a single prompt and a few clarifying questions.
But this creates a dangerous illusion of simplicity.
In this guide, we provide the framework to think through the build vs. buy decision, the costs involved, and the concepts you need to understand to make it well.
Why Video Is Harder Than It Looks
Maybe your Claude-coded video app can connect your laptop and phone, but can it connect thousands of simultaneous users across the globe? Maybe it holds a stable connection on your office WiFi, but can it gracefully adapt when a user on a train in Mumbai drops to 3G mid-call? Maybe it has a camera toggle and a mute button, but can it serve as the basis for a live-shopping experience?
Fundamentally, there is more to video than compressed frames, and even delivering frames correctly is hard.
In production, video infrastructure demands understanding codec negotiation, adaptive bitrate algorithms, and echo cancellation across thousands of device combinations. You need to think about TURN and STUN servers, ICE candidates, NATs, and recording pipelines, and the operational burden never stops. Every Chrome release can break your implementation.
Is Video Your Product, or a Feature of Your Product?
If you're building a Zoom competitor, your requirements are fundamentally different from those of a medtech app or a live shopping experience. If video is the product, full ownership of the video stack may be justified, because every millisecond of latency and every quality optimization directly impacts your competitive position.
This isn’t the reality for the vast majority of applications. For most teams, video is a feature. A medtech app adds video consultations. A SaaS platform adds video onboarding. An edtech product adds live classes. A social app adds video calls. In all these cases, and more, video needs to work reliably, but it's not what users are paying for. It enables the experience without being the experience.
(And even when video is the product, full ownership is often too complex for most businesses.)
Here are the questions that matter:
- Will users choose your product over a competitor specifically because of your video quality?
- Does your product require video capabilities that aren't available in any API?
- Do you have WebRTC specialists on staff today, or would you need to hire them?
- Is your team prepared to maintain video infrastructure indefinitely, including on-call rotations?
If you answered "no" to most of these, the build path carries significantly more risk than reward. Here's what teams are really choosing between.
| Dimension | Building In-House (+ AI Tools) | Using a Third-Party API |
|---|---|---|
| Primary strength | Full architectural control | Production-grade quality from day one |
| Best for | Products where video IS the product | Products where video is a feature |
| Time to v1 | 6-8 months | Days to weeks |
| Time to production-grade | 12-18 months | Days to weeks |
| Scalability | Must be engineered and maintained, case-by-case | Built in |
| Recording | Separate infrastructure project | Included or add-on |
| AI features (noise cancellation, background blur) | Requires a dedicated ML team | Included |
| Cross-platform support | Each platform is a separate build | SDKs for all major platforms |
| Cost profile | Low infrastructure cost, high engineering cost | Predictable, usage-based |
| Ongoing maintenance | 3-5 dedicated engineers minimum | Largely outsourced |
What Makes Video Infrastructure Uniquely Hard To Build
Video is not "just another real-time feature." It requires simultaneously solving problems in networking, signal processing, machine learning, codec engineering, and distributed systems, all under hard real-time constraints measured in milliseconds.
Here’s a complete video pipeline for real-time video at scale:
Even this diagram is simplified; it shows only the baseline of what a robust video pipeline requires.
We can break this down into four specific problems:
1. The Signaling Problem
WebRTC is the foundation of real-time video on the web. The spec handles media capture, encoding, and transport, but it intentionally leaves signaling (how peers discover and connect) entirely to the application developer. You have to build that yourself.
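To make that gap concrete, here is a minimal sketch of the signaling layer an application has to provide, modeled as an in-memory message relay. The class and message names are illustrative; a production version would run over WebSockets or another transport.

```python
from collections import defaultdict, deque

class SignalingRelay:
    """Toy signaling relay: routes SDP offers/answers and ICE candidates
    between peers. WebRTC deliberately leaves this layer to you."""

    def __init__(self):
        self.inboxes = defaultdict(deque)  # peer_id -> pending messages

    def send(self, sender, recipient, msg_type, payload):
        # msg_type is "offer", "answer", or "ice-candidate"
        self.inboxes[recipient].append(
            {"from": sender, "type": msg_type, "payload": payload}
        )

    def poll(self, peer_id):
        # A real client would receive these as push events, not polls
        inbox = self.inboxes[peer_id]
        return inbox.popleft() if inbox else None

relay = SignalingRelay()
relay.send("alice", "bob", "offer", "v=0 ...")           # Alice's SDP offer
relay.send("bob", "alice", "answer", "v=0 ...")          # Bob's SDP answer
relay.send("alice", "bob", "ice-candidate", "candidate:...")
```

Even this toy version hints at the real work: message ordering, reconnection, authentication, and presence all land on your plate.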
Connection establishment requires ICE negotiation, which orchestrates STUN and TURN servers to find the best network path between participants. Roughly 20-25% of real-world connections can't be established peer-to-peer and require TURN relay servers to forward all media traffic, adding both latency and significant bandwidth cost.
Peer-to-peer connections grow quadratically:
A 5-person call requires 10 connections. A 10-person call requires 45, with every device encoding and uploading 9 simultaneous streams. At 25 participants, you're at 300 connections. At 1000, almost half a million connections.
| Participants | Connections required | Streams each device sends |
|---|---|---|
| 2 | 1 | 1 |
| 3 | 3 | 2 |
| 5 | 10 | 4 |
| 10 | 45 | 9 |
| 25 | 300 | 24 |
| 100 | 4,950 | 99 |
| 1,000 | 499,500 | 999 |
This is the point at which teams are forced into an SFU (Selective Forwarding Unit) architecture, in which a central server receives one stream from each participant and selectively forwards it to the others. The SFU solves the bandwidth problem on the client, but introduces server-side media routing, global load balancing, and distributed state management on the backend.
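The table's numbers fall out of a simple formula, and comparing per-device upload load shows why the SFU matters. This is a back-of-envelope sketch, not production code:

```python
def mesh_connections(n):
    # Every pair of participants holds a direct peer connection: n(n-1)/2
    return n * (n - 1) // 2

def mesh_uploads_per_device(n):
    # In a mesh, each device encodes and uploads one stream per peer
    return n - 1

def sfu_uploads_per_device(n):
    # With an SFU, each device uploads a single stream; the server fans out
    return 1

for n in (5, 10, 25, 1000):
    print(n, mesh_connections(n), mesh_uploads_per_device(n), sfu_uploads_per_device(n))
```

At 1,000 participants the mesh needs 499,500 connections with 999 uploads per device; the SFU still needs exactly one upload per device.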
2. The Codec Problem
Video codecs are a moving target. H.264 has universal browser support and hardware acceleration, but it's royalty-encumbered. VP9 delivers better compression with SVC support for WebRTC. AV1 achieves 20-30% better compression than H.265 but is 5-10x slower in software encoding, limiting real-time use to devices with hardware encoders (Apple M3+, recent NVIDIA and Intel GPUs).
Safari only supports AV1 on M3+ Macs and iPhone 15 Pro+, meaning you always need H.264 fallback for older Apple devices. Every codec transition requires testing across your full device matrix, and you need to support multiple codecs simultaneously.
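The fallback logic can be sketched as a simple preference walk. The preference order and capability lists below are illustrative; in practice this negotiation happens through SDP in the browser.

```python
PREFERENCE = ["AV1", "VP9", "H264"]  # best compression first

def pick_codec(local, remote):
    """Pick the best codec both ends support; H.264 is the universal floor."""
    common = set(local) & set(remote)
    for codec in PREFERENCE:
        if codec in common:
            return codec
    raise RuntimeError("no mutually supported codec")

# A sender with AV1 hardware encoding vs. an older Apple device:
# AV1 is unavailable on one side, so the call falls back to H.264.
print(pick_codec(["AV1", "VP9", "H264"], ["H264"]))
```

The hard part isn't this function; it's knowing, per browser and per device, which entries belong in each capability list, and keeping that knowledge current.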
3. The Audio Problem
Echo cancellation is the hardest audio problem in real-time communication. Google invested years developing AEC3, their delay-agnostic echo cancellation algorithm. Each browser implements audio processing differently. Chrome, Firefox, and Safari all handle echo cancellation, noise suppression, and automatic gain control with different approaches and different failure modes.
On Android, audio hardware abstraction layer implementations vary by device vendor, meaning echo cancellation only works correctly if the manufacturer supported it properly. Most product teams discover this problem after launch when users report echo on specific headset and browser combinations they never tested.
4. The Recording Problem
Recording seems simple until you actually build it. Client-side recording has fundamental limitations: you don't know the user's available storage, a one-hour session produces roughly a gigabyte of data, and reliably synchronizing recordings across multiple clients is notoriously difficult.
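The gigabyte-per-hour figure is easy to sanity-check. The 2.5 Mbps combined audio/video bitrate below is an assumption for a 720p call, not a measured value:

```python
bitrate_mbps = 2.5                  # assumed combined A/V bitrate for a 720p call
seconds = 60 * 60                   # one hour
# Mbps -> MB/s (divide by 8), then MB -> GB (decimal)
size_gb = bitrate_mbps / 8 * seconds / 1000
print(size_gb)                      # a little over 1 GB per recorded hour
```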
Server-side recording requires either composite recording (MCU-based, which mixes all participants into a single file but is computationally intensive) or individual recording (SFU-based, which captures each participant separately but requires post-processing). Both approaches reduce your media server's concurrent session capacity and incur substantial storage and CDN costs that scale with the number of participants, duration, and resolution.
The takeaway from all of this: you must, at least in part, rely on established infrastructure and technologies. Unless you face a specific audio, video, or networking challenge that existing tools can't solve, off-the-shelf is almost always the best way forward.
The Ongoing Costs After You Build
Even if you get a working video feature shipped, the operational burden is what teams most consistently underestimate. Video infrastructure is never "done."
Video infrastructure has an unusually large number of external dependencies you don't control. Your video feature sits atop browser engines, operating systems, network conditions, device hardware, codec standards, and user expectations, all of which change on their own schedules. Each change can break something. And unlike most software, the failure mode is immediately visible: the call drops mid-sentence, audio echoes back at the speaker, video freezes on an unflattering frame, or a reconnect spinner appears at exactly the wrong moment.
This creates three compounding pressures that grow over the lifetime of your video feature.
1. The Platform Keeps Moving Underneath You
Chrome ships a new version roughly every four weeks. Each release can change how WebRTC handles statistics, SDP negotiation, audio processing, or media capture. Some of these changes are documented. Many aren't. Firefox and Safari follow their own schedules with their own implementation decisions. Android device manufacturers each implement audio hardware abstraction differently.
You can't control any of this. You can only react, test, and patch, indefinitely.
| What changed | What broke | Engineering cost |
|---|---|---|
| Chrome M107 changed WebRTC statistics identifiers from descriptive to random | Every application that parsed stats by identifier type | Rewrite stats collection across all clients |
| Chrome M109 deprecated track/stream report removal | Twilio's SDK broke, forced Chrome to roll back | Months of workaround and migration |
| Plan B → Unified Plan SDP transition | Every WebRTC implementation's signaling layer | Refactoring effort industry-wide |
| Firefox DC offset bug in microphone signal | Persistent echo on specific headset + browser combos | Debugging across hardware/software matrix |
These are a few representative examples from the last few years. A telehealth startup engineer building on raw WebRTC described the experience: frequent refactoring with every Chrome release, and after 10 hours of unsuccessful Android library compilation, “a mental breakdown.”
2. User Expectations Grow Faster Than Your Feature Set
When you launch video, users compare it to whatever they used last: Zoom, Google Meet, FaceTime. Those products have hundreds of engineers working on quality. The baseline for "acceptable" keeps rising.
- What users expected in 2020: The call connects. Audio and video work.
- What users expect in 2026: Noise cancellation filters out background sounds automatically. Background blur hides their messy room. Transcription runs in real time. The call adapts seamlessly when bandwidth drops.
- What users will expect in 2027: AI agents that participate actively in calls — answering questions, surfacing relevant data mid-conversation. Emotion analytics that flag when participants are confused or disengaged. Persistent AI memory that carries context across dozens of previous calls.
Each of these represents a significant engineering investment. Noise cancellation alone requires training deep neural networks on massive audio datasets. Background blur requires running ML segmentation models on the client side. Real-time transcription requires GPU infrastructure or a third-party API. And each one has to work across every browser and device combination you support.
The teams building Zoom, Meet, and Teams are adding these features with dedicated ML teams. Your team has to match their output while also maintaining the infrastructure you already built.
3. Scale Introduces Problems That Didn't Exist at Launch
A video feature that works for 100 concurrent users can fail in non-obvious ways at 10,000. Each order-of-magnitude jump surfaces a new class of problem.
| Scale | What works fine | What breaks |
|---|---|---|
| 100 concurrent users | Single SFU server, simple routing | — |
| 1,000 concurrent users | Basic load balancing | Session stickiness failures, TURN server capacity, recording storage fills up |
| 10,000 concurrent users | Regional deployment | SFU cascading, distributed state sync, inter-region latency, auto-scaling race conditions |
| 100,000 concurrent users | Multi-region mesh | Non-linear bandwidth cost spikes, overwhelmed monitoring infrastructure, undersized on-call team |
Each row in this table is a separate infrastructure project. And you can't skip ahead: the problems at 10,000 users aren't visible at 1,000, so you discover them in production under load when the cost of failure is highest.
The Compounding Effect
These three pressures multiply. A Chrome update breaks your stats collection (pressure 1) at the same time users are asking why you don't have background blur yet (pressure 2), and your concurrent sessions just crossed a threshold that requires a new SFU region (pressure 3). Your video team, which you thought would shrink after launch, needs to grow.
Industry research consistently shows that maintenance and enhancement consume anywhere from 50-80% of total software lifecycle costs. Browser compatibility updates, codec transitions, device testing, scaling, and on-call coverage require a permanent team.
The Cost of Build vs. Buy
Let’s get into the numbers.
Below are two scenarios for a product team adding in-app video. These are approximate but should give a good ballpark estimate of the costs of building vs. buying video in 2026.
Scenario 1: Building In-House (With AI Assistance)
Let’s say we’re talking about a moderate-scale product (B2B SaaS or consumer app). The engineering team you need depends on what you're building.
We'll use two examples: a telehealth platform (simple 1:1 calls) and a live shopping app (one-to-many streaming). These days, in-house engineers will be using AI-assisted development tools such as Claude, Codex, or similar.
Initial engineering and tooling costs might look something like this:
| Role | Telehealth (1:1 video) | Live shopping (1-to-many) |
|---|---|---|
| Backend engineers | 2-3 ($380K-$570K) | 3-4 ($570K-$760K) |
| Frontend engineers | 1-2 ($190K-$380K) | 2 ($380K) |
| DevOps / infrastructure | 1 ($190K) | 1 ($190K) |
| Product / QA (partial allocation) | 0.5-1 ($95K-$190K) | 1 ($190K) |
| AI tooling (Claude Max + API usage) | $8K-$20K | $8K-$20K |
| Year 1 total | $863K-$1.35M | $1.34M-$1.54M |
Note: Salaries here are based on the median US software engineer total compensation of $190K. This is a lower-bound estimate as WebRTC specialists command premium compensation ($250K+) precisely because of the difficulty and scarcity of excellent video engineers. For AI tooling, we’ll assume Claude Max subscriptions across the board at $100-$200/month per seat (which includes Claude Code), plus additional API usage for CI/CD integration and automated testing.
Live shopping needs more backend engineers because you're building an RTMP ingest pipeline, a transcoding layer for adaptive bitrate delivery, CDN integration, and a real-time interaction layer (chat, reactions, product overlays) on top of the video stream.
But that’s just the people. You’ll also need infrastructure to run. That infrastructure depends on your architecture. 1:1 calls (telehealth) can use peer-to-peer WebRTC. Media flows directly between participants, so you don't need an SFU or pay for server-side egress on most calls.
Note: One thing teams consistently underestimate is that infrastructure costs sting even before you reach meaningful scale. Hosting providers like AWS and Cloudflare reserve their best pricing tiers for high-volume usage — so in the early months, you're paying retail rates on relatively modest workloads.
You need:
- Signaling server. Lightweight, handles session negotiation. A small EC2 instance at ~$50-$100/month.
- TURN relay. ~20% of sessions can't establish a direct connection and need a relay server. Self-host coturn on AWS c5.xlarge at ~$124/month, with 5-10 instances for global coverage. Or use managed Cloudflare TURN at $0.05/GB.
- Recording. P2P has no server in the media path to capture the stream, so recorded sessions need to be routed through a media server. At 20% of calls recorded, this is a small compute + S3 storage cost.
One-to-many streaming (live shopping) needs a full media pipeline. The host's stream is ingested via RTMP, transcoded into multiple quality tiers (adaptive bitrate), packaged as HLS/DASH, and served through a CDN to hundreds or thousands of viewers.
The dominant costs are:
- Transcoding compute. Encoding a 2K stream into an ABR ladder is CPU-intensive. You need dedicated instances during live events.
- CDN egress. This is the big number. 1,000 viewers each pulling a 2K stream (~8 Mbps) for 60 minutes generates ~3.6 TB per session. At 30 sessions/month, that's ~108 TB of CDN delivery. CloudFront pricing tiers range from $0.085/GB (for the first 10 TB) to $0.060/GB (for the next 100 TB).
- Monitoring and real-time interaction. Chat, reactions, and product overlays run alongside the video stream and need their own infrastructure.
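The CDN math above can be checked in a few lines. The tier prices come from the CloudFront figures just quoted; bitrate and audience size are the scenario's assumptions:

```python
viewers = 1_000
bitrate_mbps = 8                    # ~2K stream
minutes = 60
sessions_per_month = 30

# Per-viewer data for one session: Mbps -> MB/s, then MB -> GB (decimal units)
gb_per_viewer = bitrate_mbps / 8 * 60 * minutes / 1000
tb_per_month = viewers * gb_per_viewer * sessions_per_month / 1000

first_tier = min(tb_per_month, 10) * 1000 * 0.085   # first 10 TB at $0.085/GB
rest = max(tb_per_month - 10, 0) * 1000 * 0.060     # next tier at $0.060/GB
monthly_cost = first_tier + rest
print(round(tb_per_month), round(monthly_cost))
```

That lands at roughly $6.7K/month of base egress; request fees and regional price differences push real bills somewhat higher.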
Let’s work through a couple of examples. With telehealth, let’s say you have 50K MAU for 1:1 consultations of 30 min at 720p. 20% of consultations need to be recorded for compliance.
| Component | Monthly cost |
|---|---|
| Signaling server | ~$50-$100 |
| TURN relay (~20% of sessions) | ~$250-$750 |
| Recording (20% of calls routed through media server + S3) | ~$200-$500 |
| Total | ~$500-$1,350/month |
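The TURN line item can be sanity-checked with a rough calculation. The 720p bitrate and per-session relay volume below are assumptions; the $0.05/GB managed TURN rate comes from the component list above:

```python
sessions_per_month = 50_000
relay_fraction = 0.20               # ~20% of sessions need a TURN relay
minutes_per_session = 30
bitrate_mbps = 1.5                  # assumed average 720p WebRTC stream
streams_relayed = 2                 # both participants' media pass through

# Mbps -> MB/s -> MB per session, then MB -> GiB
gb_per_stream = bitrate_mbps / 8 * 60 * minutes_per_session / 1024
relayed_gb = sessions_per_month * relay_fraction * streams_relayed * gb_per_stream
monthly_cost = relayed_gb * 0.05    # managed TURN at $0.05/GB
print(round(relayed_gb), round(monthly_cost))
```

That comes out to roughly $330/month, comfortably inside the ~$250-$750 range in the table.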
The infrastructure cost alone is ~$6K-$16K/year. But live shopping infrastructure costs are much higher. In this scenario, we'll have one 1-hour livestream per day with 1,000 viewers at 2K (so they can see what they're buying).
| Component | Monthly cost |
|---|---|
| Transcoding compute (ABR encoding during events) | ~$200-$500 |
| CDN egress (~108 TB/month via CloudFront) | ~$7,000-$8,000 |
| Signaling + interaction infrastructure | ~$300-$500 |
| Total | ~$7,500-$9,000/month |
Here, the infrastructure cost is ~$90K-$108K per year.
For ongoing maintenance, expect to keep 3-5 engineers on the telehealth platform and 4-6 on live shopping, at $190K each. Add infrastructure costs that scale with usage.
So, the 3-year totals for the in-house build are:
| Period | Telehealth | Live shopping |
|---|---|---|
| Year 1 (build + infra) | $869K-$1.37M | $1.43M-$1.65M |
| Year 2 (maintenance + infra) | $576K-$966K | $850K-$1.25M |
| Year 3 (maintenance + infra) | $576K-$966K | $850K-$1.25M |
| 3-year total | $2.0M-$3.3M | $3.1M-$4.15M |
Let’s see how that compares to the “buy” option.
Scenario 2: Using a Third-Party Video API
Instead of building infrastructure, you pay per minute of usage. Here's what both examples cost using Stream’s Video pricing.
Telehealth on Stream
Stream prices video calls on participant minutes. At 720p with 2 participants, each user receives one video track (921,600 pixels), which falls in the HD tier at $1.50 per 1,000 participant minutes. Recording is billed separately at $6 per 1,000 call minutes.
50K monthly sessions × 2 participants × 30 minutes = 3M participant minutes/month.
| What You Pay For | Annual Cost |
|---|---|
| Video calling (3M participant min/month × $1.50/1,000) | $54K |
| Recording (300K call min/month × $6/1,000) | $21.6K |
| Integration engineers (1-2 × $190K) | $190K-$380K |
| Annual total (before potential volume discount) | $266K-$456K |
HIPAA compliance is included in Stream's enterprise plans at no additional cost.
Live Shopping on Stream
Stream prices live streaming on participant minutes at the viewer's resolution tier. At 2K quality, that's $4.00 per 1,000 participant minutes.
30 monthly events × 1,000 viewers × 60 minutes = 1.8M participant minutes/month.
| What You Pay For | Annual Cost |
|---|---|
| 2K (1440p) live streaming (1.8M participant min/month × $4.00/1,000) | $86.4K |
| Integration engineers (1-2 × $190K) | $190K-$380K |
| Annual total | $276K-$466K |
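Both API bills reduce to the same participant-minute arithmetic. The rates are the ones quoted above; session counts and durations are the scenario assumptions:

```python
def annual_cost(participant_min_per_month, rate_per_1k):
    # Monthly participant minutes at a per-1,000-minute rate, annualized
    return participant_min_per_month * rate_per_1k / 1000 * 12

# Telehealth: 50K sessions x 2 participants x 30 min, HD tier at $1.50/1,000
video = annual_cost(50_000 * 2 * 30, 1.50)
# Recording: 20% of sessions x 30 min, at $6 per 1,000 call minutes
recording = annual_cost(50_000 * 0.20 * 30, 6.00)
# Live shopping: 30 events x 1,000 viewers x 60 min, 2K tier at $4.00/1,000
streaming = annual_cost(30 * 1_000 * 60, 4.00)

print(int(video), int(recording), int(streaming))
```

Note what's missing from the formula: no egress, no transcoding, no TURN fleet. Usage is the only variable.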
With the API approach, costs hold steady year over year:
| Period | Telehealth | Live shopping |
|---|---|---|
| Year 1 | $266K-$456K | $276K-$466K |
| Year 2 | $266K-$456K | $276K-$466K |
| Year 3 | $266K-$456K | $276K-$466K |
| 3-year total | $798K-$1.37M | $828K-$1.4M |
Let’s look at the side-by-side comparisons. The infrastructure costs for building in-house can look manageable, especially for simple architectures like P2P. But infrastructure isn't what drives the total. The team is. Even with AI-assisted development, you need 4-8 engineers to build, and most of them will be needed permanently for maintenance, on-call support, browser updates, and new features.
When you buy, that team is baked into the per-minute price. You get ongoing maintenance, scaling, cross-platform SDKs, and new capabilities like noise cancellation without hiring for them.
| Dimension | In-House | Stream Video |
|---|---|---|
| 3-year cost (telehealth) | $2.0M-$3.3M | $798K-$1.37M |
| 3-year cost (live shopping) | $3.1M-$4.15M | $828K-$1.4M |
| Time to first call/stream | 3-6 months | Days to weeks |
| Dedicated engineers required | 4.5-8 | 1-2 |
| Recording | Separate project | $6/1,000 call minutes |
| Noise cancellation, background blur | Separate project | Included |
| Cross-platform SDKs | Must build per platform | React, React Native, iOS, Android, Flutter |
| HIPAA compliance | Your responsibility | Included (enterprise) |
| On-call burden | Your team | Stream |
The gap is driven by engineering, not infrastructure. The time saved can go toward the features that set your product apart.
The Opportunity Cost: What You're Not Building
The cost tables above capture dollars, but the bigger loss is time. Every engineer maintaining video infrastructure is an engineer not shipping the features that differentiate your product.
Consider the telehealth example. Building in-house requires 4.5-7 engineers in year 1 and 3-5 permanently after that. Those aren't junior hires. Video infrastructure demands engineers who understand low-level networking, codec internals, and browser behavior—your most senior, most expensive people. With Stream, one engineer handles the integration in days. The rest of the team ships product from week one.
Those 3-6 freed-up engineers could be building:
- EHR integration. Pull patient history, medications, and lab results directly into the consultation view so providers don't switch between systems mid-call.
- Clinical decision support. Surface relevant care guidelines and drug interaction warnings in real time based on the patient's record and the provider's notes.
- Automated patient intake. Let patients complete insurance verification, symptom questionnaires, and consent forms before the visit, cutting 10-15 minutes of admin time per session.
The same goes for the live shopping example:
- Checkout-in-stream purchasing. Let viewers buy featured products without leaving the stream, reducing the drop-off between "I want that" and "order confirmed."
- Post-event replay with shoppable timestamps. Tag products to specific moments in the stream so viewers who missed the live event can still browse and buy.
- Audience segmentation and retargeting. Track which products each viewer engaged with during the stream and feed that data into your CRM for follow-up campaigns.
What starts as "just video calling" also expands. Users expect screen sharing, then recording, then transcription, then live streaming to larger audiences. Each addition is another project on top of the infrastructure you already maintain. With a vendor, those features ship as API updates.
How Real-World Teams Buy Great Video
It's Complicated: Telehealth-Grade Video for Therapy Sessions
It's Complicated is a Berlin-based therapy platform connecting therapists with clients. They started on Twilio's programmable video but ran into reliability problems that were especially damaging in their context: a dropped call during a therapy session disrupts the therapeutic relationship, not just the meeting.
They needed reliable 1:1 and multi-participant calls (for family and group therapy), HIPAA compliance for US market expansion, and session tools like background blur and whiteboarding. Building all of that in-house would have consumed their small engineering team.
After integrating Stream's Video API, they got improved call stability, integrated chat, multi-participant support, and HIPAA compliance out of the box. As Head of Product, Robbie Hollis put it:
"HIPAA compliance is crucial for us... Stream helped us achieve this compliance and offered competitive pricing and flexible packages that fit our startup budget."
The team could focus on therapy-specific features rather than on video infrastructure.
Campus Buddy: 1:1 Video Calls in 72 Hours
Campus Buddy is a university-focused social platform where students discover clubs, events, and connections on their campus. Founder James Mtendamema wanted to add real-time video and audio calling, but had no interest in building it from scratch.
"Adding video and audio from scratch is extremely hard in my experience. I knew we needed something that would just plug in and work."
One developer integrated Stream's video and audio SDKs across Android and iOS in 72 hours. Android was fully functional within a few hours. iOS took slightly longer due to Apple-specific configuration, not Stream's SDK. The team used AI tools alongside Stream's documentation to stay in flow and ship fast.
The result: 1:1 video and audio calls available to all students as core functionality, not a premium feature. Campus Buddy could then plan group calling for student clubs and organizations, built on the same Stream integration.
How Stream Video Solves the Build vs. Buy Dilemma
Stream Video is purpose-built for teams that need production-grade video without the infrastructure burden. Here's what's included:
- Integrated platform. Stream provides a unified platform for video, chat, and activity feeds. If your product already uses Stream Chat, adding video requires minimal additional integration. Shared user models, permissions, and moderation work across both.
- Pre-built UI components. Native SDKs for React, React Native, iOS (SwiftUI), Android (Jetpack Compose), and Flutter. Ship a complete video calling experience without building UI from scratch.
- Production-grade quality. Noise cancellation, background blur, adaptive bitrate streaming, and support for thousands of concurrent participants. SFU infrastructure is managed globally with automatic region selection.
- Recording included. Server-side recording with composite and individual options, without building a separate recording pipeline.
- AI-ready. Built-in support for real-time transcription, meeting summaries, and extensibility for custom AI features.
- Compliance. HIPAA-eligible plans with BAA support, SOC 2 certification, GDPR compliance, and encryption at rest and in transit.
- Predictable pricing. Usage-based pricing at $0.0015 per participant-minute with no hidden infrastructure costs. Your video bill scales linearly with usage rather than with engineering headcount.
The Build vs. Buy Decision Framework
After working through the analysis above, here's a framework to guide your final decision.
Consider building in-house if:
- Video quality and customization are your primary competitive advantage
- You have 3+ WebRTC specialists on staff today
- You're prepared to staff a permanent 4-5 person video infrastructure team
- You need capabilities that no existing API provides
But buy if:
- Video enables your product, but it isn't the core value proposition
- Your engineering team is better spent on product differentiation
- You need cross-platform support (iOS, Android, web) without separate builds
- You want recording, AI features, and compliance included
- You want to ship in days or weeks, not months or years
- Predictable, usage-based cost is more attractive than open-ended engineering investment
For most products, the advantages in cost, speed, and feature completeness make a video API the clear choice. The teams that benefit most from building in-house are rare cases where video is the entire product, and deep architectural control directly translates into competitive advantage.
The build vs. buy decision for video comes down to one question: Is video infrastructure the best use of your best engineers' time?
AI-assisted development has made it easier to get a video demo working. But the demo is the only easy part. The hard part is the 18 months after launch: browser updates, device fragmentation, codec transitions, the recording pipeline, echo cancellation edge cases, on-call rotations, compliance requirements, and AI features your users now expect as defaults.
For the vast majority of products, the answer is clear. Let someone else handle the infrastructure. Ship video in days. And spend your engineering time on the features that actually differentiate your product.
