Building A Conferencing App With WebRTC: P2P, SFU, or MCU


Explore the possibilities involved in building a conferencing application in WebRTC vis-a-vis the communication between the video call participants.

Deven J.
Published October 11, 2022

When dealing with relatively simple real-time data, WebSockets and gRPC have become the go-to mechanisms for most developers. They are relatively simple to work with, and the existing documentation makes building larger applications fairly straightforward. With WebRTC, however, things are more complicated and nuanced, especially for video and audio. Video conferencing in particular is a difficult use case to build for, since many participants are involved and there are several architectures developers can use to connect them.

This article explores the possibilities involved in building a conferencing application in WebRTC vis-a-vis the communication between the video call participants.

To get started, let’s explore the simplest possibility - all participants share their data directly with each other - better known as Peer-To-Peer (P2P). In the following section, we explore ways to achieve P2P communication between devices, look at its limitations, and investigate better approaches for larger conference calls.

Understanding Peer-to-Peer Networks

In a video conference call, every member has data, such as audio and video, that needs to reach every other member. In a Peer-to-Peer network, no single entity takes precedence over the others, and there is no centralized server through which the peers transmit data.

It is easy to see the limitations of this kind of connection between devices. Each device needs to connect to every other device and maintain those connections. At any time, devices can be added to or removed from the network, which all other devices must acknowledge and handle. If you picture these devices as a mesh network, we can work out the number of connections involved. The maximum possible number of connections between n devices is n(n - 1) / 2.

We can infer this chart from the formula:

Devices    Connections
2          1
3          3
5          10
10         45
50         1,225
100        4,950

Note: This is a simplistic overview. You can also count outgoing and incoming streams separately, which gives even higher numbers than the connection counts shown here.
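To make the arithmetic concrete, here is a minimal sketch of the formula in TypeScript. The function names are our own, and the stream count assumes each connection carries exactly one media stream in each direction, which is a simplification.

```typescript
// Number of distinct peer-to-peer connections in a full mesh: n(n - 1) / 2.
function meshConnections(deviceCount: number): number {
  if (!(deviceCount >= 0) || Math.floor(deviceCount) !== deviceCount) {
    throw new RangeError("deviceCount must be a non-negative integer");
  }
  return (deviceCount * (deviceCount - 1)) / 2;
}

// Assumption: one stream per direction per connection, so streams are
// simply twice the number of connections.
function meshStreams(deviceCount: number): number {
  return 2 * meshConnections(deviceCount);
}

const meshSizes = [2, 3, 5, 10, 50, 100];
for (const size of meshSizes) {
  console.log(
    size + " devices: " + meshConnections(size) + " connections, " +
      meshStreams(size) + " one-way streams"
  );
}
```

The quadratic term is what hurts: doubling the call from 50 to 100 devices roughly quadruples the number of connections.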

The equivalent of a large company Zoom call would take an enormous number of connections. This is not to say P2P connections are all bad - just that they have strict limits that should be understood. Outside of a conference call, when only a few participants are involved, P2P will likely be the optimal solution to use.

An added advantage is that these connections consume very few resources on the server side. Since the data is sent directly between peers, there is no need for costly servers in between.

While it may seem like there is no need for servers at all, this is not true. Peers still need a signaling server to exchange connection details, and STUN servers help mediate the connection between peers. STUN servers do not transfer any media. Rather, they let a device discover its public IP address so that two devices can find each other over a network and establish a connection.

Additionally, there are TURN servers, which relay media between two peers but are only used when a direct P2P connection cannot be established.

For testing, you can use STUN servers that Google hosts:

stun.l.google.com:19302
stun1.l.google.com:19302
stun2.l.google.com:19302
stun3.l.google.com:19302
stun4.l.google.com:19302

For deploying your own STUN and TURN server, a good option is Coturn, a widely used open-source implementation of both.
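As a rough sketch of how these servers are handed to WebRTC, the snippet below builds the configuration object that a browser's RTCPeerConnection constructor accepts. The helper name, the local interfaces, and the TURN host and credentials are all placeholders of our own. The interfaces only mirror the shape of the browser's built-in types so the sketch also runs outside a browser.

```typescript
// Local types mirroring the shape of the browser's RTCIceServer and
// RTCConfiguration, declared here so the sketch compiles outside a browser.
interface IceServerEntry {
  urls: string[];
  username?: string;
  credential?: string;
}

interface IceConfig {
  iceServers: IceServerEntry[];
}

const GOOGLE_STUN_HOSTS = [
  "stun.l.google.com:19302",
  "stun1.l.google.com:19302",
];

// Hypothetical helper: STUN hosts get the "stun:" scheme, and the optional
// TURN server gets the "turn:" scheme plus its credentials.
function buildIceConfig(
  stunHosts: string[],
  turn?: { host: string; username: string; credential: string }
): IceConfig {
  const iceServers: IceServerEntry[] = [
    { urls: stunHosts.map((host) => "stun:" + host) },
  ];
  if (turn) {
    iceServers.push({
      urls: ["turn:" + turn.host],
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers };
}

const iceConfig = buildIceConfig(GOOGLE_STUN_HOSTS, {
  host: "turn.example.com:3478", // placeholder, e.g. a self-hosted Coturn instance
  username: "demo-user", // placeholder
  credential: "demo-secret", // placeholder
});

// In a browser you would then write: new RTCPeerConnection(iceConfig)
console.log(JSON.stringify(iceConfig, null, 2));
```

Listing a TURN server does not force media through it. The relay only comes into play when a direct path between the peers cannot be established.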

Peer-to-Peer connections are good for small calls and for cases where server resource use must be kept low. However, for larger calls, you need to reduce the number of connections between devices in order to scale the network. This brings us to the next type of architecture, the MCU.

What is the Multipoint Control Unit (MCU)?

The P2P connection mechanism described in the last section has peers in the network transfer data directly to each other. A better way to scale the network and reduce the number of connections involved is to have a central server that mediates the connection and data transfer for all devices. This reduces the total number of connections to one per device on the network. This kind of linear growth makes it much easier to scale the network to a large number of devices. This is how the Multipoint Control Unit, or MCU, architecture solves the scaling issue of P2P networks.

MCUs have been the preferred solution for larger conference calling providers for a long time. This has been important because both devices and internet connections were relatively low-powered in the past. The MCU receives encrypted streams from all devices connected to it. These streams are then decoded and combined into a single stream that is sent to all the devices on the network, so each device only needs to decode and display one stream.

This is not a trivial workload, however. Handling and compositing the output streams takes heavy computational resources, and the entire burden is on the server. Cost-wise, this is more expensive than the other approaches. The more devices that join, the harder it is to create a single stream to send to all devices.

The advantages, some could argue, outweigh the costs. This process requires low bandwidth on the device since it has one outgoing stream and one incoming stream. One critical advantage of this is that supporting older systems is much easier, since this approach is the least computationally intensive for the device and supplies the stream in a single format.

Medooze hosts a media-server repository from which you can use the MCU wrapper to create your own MCU solution.

With P2P connections, devices can communicate directly. With MCUs, a larger number of devices can connect and share information, with better compatibility between them, albeit at a higher cost. Is there a way to decrease the burden on the server while still keeping most of the advantages?

Enter: The Selective Forwarding Unit (SFU)

The Selective Forwarding Unit, or SFU, is currently the most popular way to connect devices on a conference call. In some ways, it is halfway between the MCU and a P2P network. While each device has a single outgoing stream - meaning it does not send its own stream to every other device - it receives multiple incoming streams, one from each of the other devices. The main difference from P2P is that these incoming streams are forwarded by the server rather than sent directly by the other devices.

In some ways, SFUs are a step back from the MCU approach. Multiple streams come in to each device, and the device needs to decode each one individually. This requires more compute power on the device than the MCU approach, since the server no longer processes the streams, but less than P2P, since each WebRTC Agent only has to encode and upload its video once.

This also limits scalability to some extent since the more devices, the more streams that each device needs to decode. Computationally, this is far better than a P2P mesh but worse than an MCU. However, this greatly reduces the central server load since there is now no processing happening - the streams are simply being forwarded to each endpoint.
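The trade-offs between the three architectures can be summarized as a small sketch. It counts, per device, how many streams are uploaded and downloaded, following the descriptions in this article. The type and function names are our own, and the model assumes every participant publishes exactly one stream.

```typescript
type Topology = "p2p" | "mcu" | "sfu";

interface DeviceLoad {
  uploads: number; // streams each device encodes and sends
  downloads: number; // streams each device receives and decodes
  serverProcessesMedia: boolean; // true when a server decodes and composites
}

// Assumption: every participant publishes exactly one stream.
function deviceLoad(topology: Topology, participants: number): DeviceLoad {
  const others = Math.max(participants - 1, 0);
  if (topology === "p2p") {
    // Mesh: send to and receive from every other device directly.
    return { uploads: others, downloads: others, serverProcessesMedia: false };
  }
  if (topology === "mcu") {
    // MCU: one stream up, one composited stream down.
    return { uploads: 1, downloads: 1, serverProcessesMedia: true };
  }
  // SFU: one stream up, one forwarded stream down per other device.
  return { uploads: 1, downloads: others, serverProcessesMedia: false };
}

const topologies: Topology[] = ["p2p", "mcu", "sfu"];
for (const topology of topologies) {
  const load = deviceLoad(topology, 10);
  console.log(
    topology + ": " + load.uploads + " up, " + load.downloads + " down, " +
      "server processes media: " + load.serverProcessesMedia
  );
}
```

For a ten-person call, this gives 9 up and 9 down for P2P, 1 up and 1 down for the MCU, and 1 up and 9 down for the SFU, which is why the SFU sits between the other two.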

LiveKit has a great open-source SFU which you can find here.

Additionally, OSSRS has an SFU you can use. There are additional docs on how to set up the SFU here.

A problem with SFUs is that if one of the devices in the network has a lower quality connection, all others would be degraded to the same quality. Simulcast SFUs - a newer kind of SFU - solve this by having each device send multiple streams of different qualities. This allows the higher bandwidth devices to enjoy the better streams while the lower bandwidth devices load only the lower quality streams. While this sounds relatively straightforward in theory, a significant amount of effort is needed to create a good SFU that is performant, deals adequately with any errors that pop up, and handles all the different potential video codecs.
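The core decision a simulcast SFU makes for each receiver can be sketched as follows. This is only an illustration of the idea, not how any particular SFU implements it. The layer names, the bitrates, and the selectLayer helper are all hypothetical, and a real SFU would also have to estimate bandwidth continuously and switch layers cleanly.

```typescript
interface SimulcastLayer {
  rid: string; // layer label (illustrative names)
  bitrateKbps: number; // approximate bitrate of the layer (illustrative values)
}

// Pick the highest-bitrate layer that fits the receiver's estimated downlink.
// Falls back to the lowest layer when even that one does not fit.
function selectLayer(
  layers: SimulcastLayer[],
  availableKbps: number
): SimulcastLayer {
  if (layers.length === 0) {
    throw new Error("at least one layer is required");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = layers.slice().sort((a, b) => a.bitrateKbps - b.bitrateKbps);
  let chosen = sorted[0];
  for (const layer of sorted) {
    if (layer.bitrateKbps <= availableKbps) {
      chosen = layer;
    }
  }
  return chosen;
}

const exampleLayers: SimulcastLayer[] = [
  { rid: "low", bitrateKbps: 150 },
  { rid: "medium", bitrateKbps: 500 },
  { rid: "high", bitrateKbps: 1500 },
];

console.log(selectLayer(exampleLayers, 2000).rid); // receiver on a fast link
console.log(selectLayer(exampleLayers, 600).rid); // receiver on a slower link
```

Because the choice is made per receiver, one participant on a slow link no longer drags the quality down for everyone else.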

Ready to Start Streaming?

While building your video conferencing application, all of these approaches are worth considering. Each one is valid depending on the number of people, the type of connections, the bandwidth available, and so on. But here’s a TL;DR: if you have a small number of devices and the application will never scale beyond that, P2P gives you the best value. For a moderate number of users and a slightly higher cost, the SFU is the best fit. For a very large number of users, or when dealing with legacy systems, an MCU may still be the best approach, but the cost should be estimated accurately before you deploy your final system.