Introduction to WebRTC for Unity Developers

What is WebRTC?

Web Real-Time Communication (WebRTC) is a powerful technology enabling real-time video, voice, and data exchange between peers over the network. WebRTC is an open-source technology freely available for both personal and commercial use. While most modern web browsers natively support WebRTC, its core is a C++ library that integrates with many programming languages, including C# for Unity.

Initially designed for P2P (Peer-to-Peer) connections, WebRTC is often used in conjunction with specialized media servers such as an SFU (Selective Forwarding Unit) or an MCU (Multipoint Control Unit) in commercial settings, where the limitations of P2P (such as scalability and reliability concerns) are not acceptable.

Peer-to-Peer vs Client-Server

In a P2P (Peer-to-Peer) architecture, two or more computers on a network share data directly without needing a central server to facilitate the exchange. In contrast, most popular multiplayer games use a client-server architecture where an authoritative server handles all data transmitted by players.

P2P architectures scale organically compared to client-server ones because participants contribute their own resources, such as bandwidth, storage, and processing power. As peers join the network, more resources become available. In a client-server architecture, each server can handle only a limited number of participants, so more servers are needed as the number of peers grows. This difference drastically affects running costs and makes the P2P approach much cheaper to operate.

When it comes to WebRTC and media transport, P2P connections encounter scalability issues due to the high bandwidth requirements of video streams. In scenarios with multiple participants streaming video, bandwidth can quickly become a bottleneck, leading to poor video and audio quality. Media servers, used within a client-server architecture, can dynamically adjust video resolution based on network conditions and individual participant needs, ensuring a smoother experience for all.
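To put rough numbers on this, here is a minimal sketch that compares the per-peer load of a full P2P mesh with that of a forwarding media server. The function names and the 1500 kbps bitrate are illustrative assumptions, and the model ignores audio, protocol overhead, and any resolution adaptation the server might apply.

```python
def mesh_load(participants: int, stream_kbps: int) -> dict:
    """Per-peer load when every participant sends video directly to every other one."""
    others = participants - 1
    return {
        "connections_total": participants * others // 2,
        "upload_kbps_per_peer": others * stream_kbps,
        "download_kbps_per_peer": others * stream_kbps,
    }


def sfu_load(participants: int, stream_kbps: int) -> dict:
    """Per-peer load when each participant uploads once to a forwarding server."""
    others = participants - 1
    return {
        "connections_total": participants,
        "upload_kbps_per_peer": stream_kbps,
        "download_kbps_per_peer": others * stream_kbps,
    }


if __name__ == "__main__":
    # Assumed bitrate of 1500 kbps per video stream, purely for illustration.
    for n in (2, 4, 8):
        print(n, "mesh:", mesh_load(n, 1500), "sfu:", sfu_load(n, 1500))
```

In this simplified model, the upload each peer needs grows with the number of participants in a mesh, while it stays constant when a server forwards the streams.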

The NAT/Firewall Problem

Sending data packets directly to a computer on the internet can be tricky or sometimes even impossible. To understand why, let's look at how the Internet functions. Devices online are identified by a unique public IP (Internet Protocol) address. This address helps us find a path through the network to send data to the right machine. However, it's common for an entire network of devices, like those in homes or offices, to share a single public IP. Such a network is known as a private network, and each device in it has a private IP address that is only meaningful inside that network.

Take the simple action of visiting a website: The URL of the website you visit is translated into a public IP address of the server. Your web browser sends a request, and the server with that IP sends back the website you requested. But when many devices use one public IP, we need a way to ensure responses from the server get back to the device that asked for them. That's where NAT comes in.

NAT (Network Address Translation) is a router-based technology that rewrites packets passing through it. When a device in a private network sends packets to a device outside that network, the NAT maps the sender's private IP address and port to a specific outbound port on the router and rewrites the packets so that their source is the router's public IP and that port. The mapping is stored in the NAT Table. Reply packets arrive addressed to the router's public IP and the mapped port, and the NAT rewrites their destination back to the private IP and port of the original sender by looking up that port in the table.
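The translation described above can be sketched as a small simulation. The `Nat` class, its method names, and the starting port 40000 are assumptions made for illustration; a real NAT also tracks protocols, timeouts, and remote endpoints.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


@dataclass
class Nat:
    """Toy NAT that keeps one public port per private (ip, port) pair."""
    public_ip: str
    next_port: int = 40000  # hypothetical start of the outbound port range
    table: Dict[int, Endpoint] = field(default_factory=dict)    # public port -> private endpoint
    reverse: Dict[Endpoint, int] = field(default_factory=dict)  # private endpoint -> public port

    def outbound(self, src: Endpoint) -> Endpoint:
        """Rewrite the source of an outgoing packet, creating a mapping if needed."""
        if src not in self.reverse:
            self.reverse[src] = self.next_port
            self.table[self.next_port] = src
            self.next_port += 1
        return (self.public_ip, self.reverse[src])

    def inbound(self, public_port: int) -> Optional[Endpoint]:
        """Rewrite the destination of an incoming packet, or return None to drop it."""
        return self.table.get(public_port)


if __name__ == "__main__":
    nat = Nat("203.0.113.7")
    print(nat.outbound(("192.168.1.10", 5000)))  # source address as seen from outside
    print(nat.inbound(40000))                    # reply is routed back to the sender
    print(nat.inbound(40999))                    # no mapping, so the packet is dropped
```

Note how an incoming packet for a port without a table entry has nowhere to go, which is the root of the P2P difficulty.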

The diagram below visualizes this concept:

Private networks are crucial for conserving the limited number of IPv4 addresses available and for providing an added layer of security by isolating the internal network from the public internet, but they pose a challenge for direct peer-to-peer (P2P) connections.

First, devices with private IP addresses don't know their own public IP and can only receive packets if a record in the NAT Table maps their private IP to an outbound port of the router.

Second, some NAT/Firewall setups are very strict, blocking incoming packets unless a device inside the network has initiated the contact. This is common with symmetric NATs, which can prevent P2P connections entirely, especially if both devices are behind this type of NAT.

In the upcoming sections, we will delve into how STUN and TURN protocols help navigate these hurdles.

Signaling

Before peers can stream media to one another, they must exchange information necessary to establish the connection and agree on the format for transmitting multimedia data. This preparatory phase is known as signaling. Even with a peer-to-peer model, a connection will only be possible with the use of a signaling server that will facilitate the exchange of connection information between the peers.

The WebRTC protocol does not define how the signaling data should be transferred between the peers, and it is left up to the developer to pick a suitable solution. However, WebSockets are frequently used for their efficiency and reliability in real-time communication.

Signaling typically involves sharing network details such as public IP addresses and ports, and session descriptions that outline how the media should be encoded and transferred. The signaling process includes the exchange of 'SDP Offer' and 'SDP Answer' messages, as well as 'ICE Candidates' that inform the peers about how they can connect to each other.
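Since WebRTC leaves the signaling format open, any message layout here is an assumption. The sketch below shows one possible JSON envelope for the three kinds of messages mentioned above; the field names `type`, `from`, and `payload` are made up for illustration and are not part of any standard.

```python
import json

SIGNAL_KINDS = ("offer", "answer", "candidate")


def make_signal(kind: str, sender: str, payload: dict) -> str:
    """Serialize one signaling message to send over, for example, a WebSocket."""
    if kind not in SIGNAL_KINDS:
        raise ValueError(f"unknown signaling message kind: {kind}")
    return json.dumps({"type": kind, "from": sender, "payload": payload})


def read_signal(raw: str) -> dict:
    """Parse and validate a message produced by make_signal."""
    message = json.loads(raw)
    if message.get("type") not in SIGNAL_KINDS:
        raise ValueError("not a signaling message")
    return message


if __name__ == "__main__":
    raw = make_signal("offer", "peer-a", {"sdp": "v=0 ..."})
    print(read_signal(raw)["type"])
```

The signaling server only needs to forward these strings to the other peer; it does not have to understand the SDP or candidate contents.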

SDP

SDP (Session Description Protocol) is a standard format for describing the parameters of multimedia communication sessions. SDPs are exchanged during the signaling phase and contain information about the media type, transport protocol, IP address, port number, and other details that describe a network endpoint for transferring media.

The SDP exchange follows the Offer-Answer model. One peer initiates the session by sending an "offer," and the other responds with an "answer."

To better understand this, let's examine video encoding, one aspect that WebRTC peers negotiate in detail.

Like all internet-transferred data, video streams are encoded into bytes and packaged into network packets. This encoding is managed by video codecs such as VP8 or H264. Video codecs define how video frames are compressed into bytes for efficient transmission and, subsequently, how these bytes are decompressed back into video frames at the destination. Peers must agree on a codec to ensure the video session is successful.

During signaling, Peer A will generate an SDP Offer listing the video codecs it supports. Peer B will then select codecs supported by both and reply with an SDP Answer. This agreement on video codecs is crucial for video communication.
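The codec agreement can be thought of as an ordered intersection. The following is a minimal sketch of that idea, assuming codecs are compared by name only; real SDP negotiation also matches clock rates, profiles, and payload types, and `negotiate_codecs` is a hypothetical helper, not a WebRTC API.

```python
from typing import List


def negotiate_codecs(offered: List[str], supported: List[str]) -> List[str]:
    """Return codecs present in both lists, keeping the offerer's preference order."""
    supported_lower = {codec.lower() for codec in supported}
    return [codec for codec in offered if codec.lower() in supported_lower]


if __name__ == "__main__":
    peer_a_offer = ["VP8", "H264", "AV1"]
    peer_b_supported = ["H264", "VP8"]
    print(negotiate_codecs(peer_a_offer, peer_b_supported))
```

If the result is empty, the peers share no video codec and the video part of the session cannot be established.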

SDP not only addresses media codecs but also conveys additional session-related details that are critical for establishing and maintaining efficient communication. This includes the setup of media attributes that dictate whether the stream is send-only, receive-only, or bidirectional. It encapsulates transport addresses, indicating where the media should be sent, and specifies the type of media (audio, video, generic data).

Moreover, SDP outlines network-level considerations, such as the use of multiplexing for different media types over a single transport link. Each piece of information in an SDP is essential for the peers to agree upon the parameters of their interaction, ensuring that the media exchange is not only possible but optimized for the conditions of their network environments.
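To make these fields concrete, here is a small sketch that pulls the media types, stream directions, codecs, and multiplexing group out of an SDP text. The sample SDP is a shortened, hand-written example, and the parser handles only the few line types discussed above.

```python
from typing import Any, Dict

DIRECTIONS = {"sendrecv", "sendonly", "recvonly", "inactive"}

SAMPLE_SDP = """
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=mid:0
a=sendrecv
a=rtpmap:111 opus/48000/2
m=video 9 UDP/TLS/RTP/SAVPF 96 102
a=mid:1
a=sendonly
a=rtpmap:96 VP8/90000
a=rtpmap:102 H264/90000
"""


def parse_sdp(sdp: str) -> Dict[str, Any]:
    """Extract media sections, directions, codec names and the BUNDLE group."""
    session: Dict[str, Any] = {"bundle": [], "media": []}
    current = None  # media section currently being filled in
    for line in sdp.strip().splitlines():
        line = line.strip()
        if len(line) < 2 or line[1] != "=":
            continue
        key, value = line[0], line[2:]
        if key == "m":
            kind, port, protocol, *_formats = value.split()
            current = {"kind": kind, "port": int(port), "protocol": protocol,
                       "direction": "sendrecv", "codecs": []}
            session["media"].append(current)
        elif key == "a":
            if value.startswith("group:BUNDLE"):
                session["bundle"] = value.split()[1:]
            elif current is not None and value in DIRECTIONS:
                current["direction"] = value
            elif current is not None and value.startswith("rtpmap:"):
                # "rtpmap:96 VP8/90000" -> "VP8"
                current["codecs"].append(value.split(" ", 1)[1].split("/")[0])
    return session


if __name__ == "__main__":
    for media in parse_sdp(SAMPLE_SDP)["media"]:
        print(media["kind"], media["direction"], media["codecs"])
```

The `a=group:BUNDLE` line is how SDP expresses that the listed media sections share a single transport.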

Here are the steps for the SDP offer-answer exchange:

Peer A creates the SDP Offer as the session initiator.
Peer A sets its own SDP Offer in its "PeerConnection" object.
Peer A sends the offer to Peer B via the signaling server.
Peer B receives the SDP Offer.
Peer B sets Peer A's offer in its "PeerConnection" object.
Peer B creates an SDP Answer and sets it in its own "PeerConnection" object.
Peer B sends the Answer back to Peer A.
Peer A sets the SDP Answer in its "PeerConnection" object.
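The exchange can be sketched as a tiny state machine. The `Peer` class below is a stand-in written for this article, not a real WebRTC API; the state names follow the signaling states used by WebRTC ("stable", "have-local-offer", "have-remote-offer"), and the SDP strings are placeholders. The sketch also has Peer B apply its own answer locally, which real implementations require.

```python
class Peer:
    """Simplified stand-in for a PeerConnection that only tracks signaling state."""

    _TRANSITIONS = {
        # (current state, description type, is_local) -> next state
        ("stable", "offer", True): "have-local-offer",
        ("stable", "offer", False): "have-remote-offer",
        ("have-local-offer", "answer", False): "stable",
        ("have-remote-offer", "answer", True): "stable",
    }

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = "stable"
        self.local_description = None
        self.remote_description = None

    def create_offer(self) -> dict:
        return {"type": "offer", "sdp": f"placeholder-offer-from-{self.name}"}

    def create_answer(self) -> dict:
        if self.state != "have-remote-offer":
            raise RuntimeError("cannot answer before a remote offer is set")
        return {"type": "answer", "sdp": f"placeholder-answer-from-{self.name}"}

    def set_local_description(self, description: dict) -> None:
        self._apply(description, is_local=True)

    def set_remote_description(self, description: dict) -> None:
        self._apply(description, is_local=False)

    def _apply(self, description: dict, is_local: bool) -> None:
        key = (self.state, description["type"], is_local)
        if key not in self._TRANSITIONS:
            raise RuntimeError(f"{self.name}: invalid step {key}")
        self.state = self._TRANSITIONS[key]
        if is_local:
            self.local_description = description
        else:
            self.remote_description = description


def run_exchange():
    """Walk both peers through the offer-answer steps in order."""
    peer_a, peer_b = Peer("A"), Peer("B")
    offer = peer_a.create_offer()
    peer_a.set_local_description(offer)
    peer_b.set_remote_description(offer)   # delivered via the signaling server
    answer = peer_b.create_answer()
    peer_b.set_local_description(answer)
    peer_a.set_remote_description(answer)  # delivered via the signaling server
    return peer_a, peer_b


if __name__ == "__main__":
    a, b = run_exchange()
    print(a.state, b.state)
```

Performing the steps out of order, such as creating an answer before a remote offer is set, raises an error in this sketch, which mirrors why the order of the steps matters.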

The diagram below visualizes the steps:

ICE (Interactive Connectivity Establishment)

Interactive Connectivity Establishment (ICE) is a protocol framework used in WebRTC to facilitate discovering and negotiating the most efficient network paths for multimedia traffic. During the setup of a WebRTC connection, both sides generate several ICE candidates, each representing a potential method for establishing the connection.

During the signaling phase, peers exchange their ICE candidates and, through a series of connectivity checks, determine which candidate pairs actually work. The ICE framework assigns each candidate a priority, mainly based on its type, so that direct paths are preferred over relayed ones, and the highest-priority pair that passes the checks is used to establish the connection for the media stream.
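For reference, the ICE specification (RFC 8445) recommends a formula for candidate priority that combines a type preference, a local preference, and the component ID. The sketch below implements that formula with the type preference values the RFC recommends; the function names are our own, and actual implementations are free to choose different preference values.

```python
# Recommended type preferences from RFC 8445 (higher means more preferred).
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(candidate_type: str,
                       local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)."""
    type_preference = TYPE_PREFERENCE[candidate_type]
    return (type_preference << 24) + (local_preference << 8) + (256 - component_id)


def pair_priority(controlling: int, controlled: int) -> int:
    """Priority of a candidate pair, given both sides' candidate priorities."""
    low, high = min(controlling, controlled), max(controlling, controlled)
    return (low << 32) + 2 * high + (1 if controlling > controlled else 0)


if __name__ == "__main__":
    for kind in ("host", "srflx", "relay"):
        print(kind, candidate_priority(kind))
```

With these values, a host candidate always outranks a server reflexive one, which in turn outranks a relayed one.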

Moreover, ICE is dynamic; it continues to listen for and test the viability of alternate network paths. If a better option arises, even after the initial connection has been made, ICE can switch to the new path to improve the quality of the communication.

This continuous monitoring and adaptation make WebRTC robust in handling real-time media over the unpredictable terrain of the internet.

ICE Candidate

An ICE candidate represents a potential network endpoint that WebRTC can use to establish a connection.

There are three main types of ICE candidates: host candidates, which are taken directly from the device's local network interfaces; server reflexive candidates, which are public IP addresses and ports discovered by querying a STUN server to allow connections across NATs; and relayed candidates, which are provided by TURN servers when firewalls or NATs block direct connection paths.

Each candidate includes the IP address and port number for a particular connection pathway and information about the transport protocol, such as UDP or TCP.
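In practice, a candidate travels through signaling as a single line of text. The sketch below parses the common fields of such a line; the sample uses documentation-range IP addresses, and `parse_candidate` is a hypothetical helper that ignores optional extensions beyond `raddr` and `rport`.

```python
def parse_candidate(line: str) -> dict:
    """Parse 'candidate:<foundation> <component> <transport> <priority> <ip> <port> typ <type> ...'."""
    text = line.strip()
    if text.startswith("a="):
        text = text[2:]
    if not text.startswith("candidate:"):
        raise ValueError("not an ICE candidate line")
    fields = text[len("candidate:"):].split()
    if len(fields) < 8 or fields[6] != "typ":
        raise ValueError("malformed ICE candidate line")
    parsed = {
        "foundation": fields[0],
        "component": int(fields[1]),
        "transport": fields[2].lower(),
        "priority": int(fields[3]),
        "address": fields[4],
        "port": int(fields[5]),
        "type": fields[7],
    }
    # Remaining fields come as key/value pairs, e.g. "raddr 192.168.1.10 rport 50000".
    extras = fields[8:]
    for key, value in zip(extras[0::2], extras[1::2]):
        if key == "raddr":
            parsed["related_address"] = value
        elif key == "rport":
            parsed["related_port"] = int(value)
    return parsed


if __name__ == "__main__":
    sample = ("candidate:842163049 1 udp 1694498815 203.0.113.5 54321 "
              "typ srflx raddr 192.168.1.10 rport 50000")
    print(parse_candidate(sample))
```

For a server reflexive candidate like the one in the sample, the related address and port are the private endpoint behind the NAT.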

ICE Trickle

Gathering all ICE candidates for a connection can be time-consuming, taking seconds or even minutes. A technique called ICE Trickling (also known as Trickle ICE) is employed to speed up connection setup. This method involves incrementally sending ICE Candidates to the remote peer as soon as they are discovered, rather than waiting for the entire gathering process to complete.

This approach allows the WebRTC connection to be initiated with the earliest viable path, enhancing user experience by significantly reducing the time required to connect. If a more optimal path becomes available, WebRTC will seamlessly switch to this better connection, ensuring the most efficient communication throughout the session.
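The difference in timing can be illustrated with a small model. The discovery times below are invented for illustration, and the two functions are hypothetical helpers that only compute when each signaling message would be sent.

```python
from typing import List, Tuple

Discovered = Tuple[int, str]      # (milliseconds since gathering started, candidate label)
Message = Tuple[int, List[str]]   # (send time in milliseconds, candidates in the message)


def send_batched(discovered: List[Discovered]) -> List[Message]:
    """Without trickling: one message, sent once the last candidate is found."""
    ordered = sorted(discovered)
    if not ordered:
        return []
    finished_at = ordered[-1][0]
    return [(finished_at, [label for _, label in ordered])]


def send_trickled(discovered: List[Discovered]) -> List[Message]:
    """With trickling: one message per candidate, sent as soon as it is found."""
    return [(found_at, [label]) for found_at, label in sorted(discovered)]


if __name__ == "__main__":
    # Invented timings: host is instant, STUN takes a round trip, TURN takes longer.
    gathered = [(5, "host"), (120, "srflx"), (900, "relay")]
    print("batched: ", send_batched(gathered))
    print("trickled:", send_trickled(gathered))
```

In this model, the remote peer can start connectivity checks as soon as the first trickled candidate arrives, instead of waiting for the slowest one.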

STUN

STUN (Session Traversal Utilities for NAT) is a protocol that assists peers within private networks in discovering their public IP addresses and ports. When behind a NAT, a peer knows only its private IP. To establish connections with devices outside its network, it must learn its public IP and the port assigned by the NAT. This is where STUN comes into play: peers send a request to a STUN server, which then replies with the peer's public IP address and port, as seen from the wider Internet.
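To show how small this exchange is on the wire, here is a sketch that builds a STUN Binding request and decodes the XOR-MAPPED-ADDRESS attribute of a response, following the message layout in the STUN specification (RFC 5389). The function names are our own, only IPv4 is handled, and the code does no networking itself.

```python
import os
import socket
import struct
from typing import Tuple

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes = b"") -> bytes:
    """20-byte header: type, length (0, no attributes), magic cookie, 12-byte transaction ID."""
    transaction_id = transaction_id or os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_binding_response(data: bytes) -> Tuple[str, int]:
    """Return the (public ip, public port) reported in a Binding success response."""
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding success response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[offset:offset + 4])
        value = data[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and value[1] == 0x01:  # family 0x01 = IPv4
            # Port and address are XOR-ed with the magic cookie.
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    raise ValueError("no IPv4 XOR-MAPPED-ADDRESS attribute found")
```

To try this against a real server, the request bytes would be sent over a UDP socket to the STUN server's host and port, and the reply passed to `parse_binding_response`.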

STUN is a relatively simple protocol, so many organizations provide public STUN servers for free. Google, for instance, offers a range of STUN servers:

Hostname	Port
stun4.l.google.com	19302
stun3.l.google.com	19302
stun2.l.google.com	19302
stun1.l.google.com	19302
stun.l.google.com	19302

It's important to note that while STUN works for many NAT environments, it may not be effective for all, particularly with symmetric NAT, where each outbound connection has a different public endpoint. Another approach, such as TURN, may be necessary in such cases.
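The reason STUN falls short with symmetric NAT can be seen in a small simulation. Both classes below are simplified models written for this article: the first reuses one public port per internal endpoint, while the second allocates a new port for every destination. All addresses are from documentation ranges.

```python
from typing import Dict, Hashable, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class ConeNat:
    """Reuses one public port per internal endpoint, whatever the destination."""

    def __init__(self, public_ip: str, first_port: int = 40000) -> None:
        self.public_ip = public_ip
        self._next_port = first_port
        self._mappings: Dict[Hashable, int] = {}

    def _key(self, internal: Endpoint, destination: Endpoint) -> Hashable:
        return internal

    def send(self, internal: Endpoint, destination: Endpoint) -> Endpoint:
        """Return the public endpoint the destination sees for this packet."""
        key = self._key(internal, destination)
        if key not in self._mappings:
            self._mappings[key] = self._next_port
            self._next_port += 1
        return (self.public_ip, self._mappings[key])


class SymmetricNat(ConeNat):
    """Allocates a separate public port for every (internal, destination) pair."""

    def _key(self, internal: Endpoint, destination: Endpoint) -> Hashable:
        return (internal, destination)


if __name__ == "__main__":
    device = ("192.168.1.10", 5000)
    stun_server = ("198.51.100.1", 19302)
    remote_peer = ("198.51.100.77", 6000)
    for nat in (ConeNat("203.0.113.7"), SymmetricNat("203.0.113.7")):
        seen_by_stun = nat.send(device, stun_server)
        seen_by_peer = nat.send(device, remote_peer)
        print(type(nat).__name__, seen_by_stun, seen_by_peer, seen_by_stun == seen_by_peer)
```

With the symmetric model, the endpoint learned from the STUN server differs from the one the remote peer would actually see, so advertising it as a candidate does not help.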

TURN

TURN (Traversal Using Relays around NAT) is a protocol designed to facilitate connection in scenarios where peer-to-peer (P2P) communication is blocked by NAT or firewall restrictions. When peers cannot communicate directly, the TURN server acts as an intermediary, relaying traffic between them. This process requires more bandwidth and server processing power than STUN, resulting in TURN being used primarily as a fallback option.

Due to its resource-intensive nature, TURN can introduce additional latency and cost. Moreover, unlike public STUN servers, TURN services typically require authentication, ensuring authorized access to the relay service.
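When configuring a connection, STUN and TURN servers are usually passed in together as a list of ICE servers. The sketch below builds such a list using the field names of the browser API's ICE server dictionary (`urls`, `username`, `credential`); the helper function, the TURN hostname, and the credentials are placeholders, and the exact configuration type in a given Unity package may differ.

```python
from typing import Dict, List, Optional


def build_ice_servers(stun_urls: List[str],
                      turn_url: Optional[str] = None,
                      username: Optional[str] = None,
                      credential: Optional[str] = None) -> List[Dict[str, object]]:
    """Build an ICE server list; TURN entries must carry credentials."""
    servers: List[Dict[str, object]] = []
    if stun_urls:
        servers.append({"urls": list(stun_urls)})
    if turn_url is not None:
        if not username or not credential:
            raise ValueError("TURN entries need a username and a credential")
        servers.append({"urls": [turn_url], "username": username, "credential": credential})
    return servers


if __name__ == "__main__":
    config = build_ice_servers(
        ["stun:stun.l.google.com:19302"],
        turn_url="turn:turn.example.com:3478",  # placeholder hostname
        username="demo-user",                   # placeholder credentials
        credential="demo-secret",
    )
    print(config)
```

Listing both kinds of servers lets ICE try the direct and STUN-derived paths first and fall back to the relay only when needed.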

WebRTC solutions for Unity Engine

WebRTC is not natively supported by the Unity Engine, unlike in web browsers, where it is part of the browser's API. The WebRTC library, written in C++, requires a C# wrapper/bridge to be used in Unity. This bridging is made possible through native interoperability (such as P/Invoke in .NET), which allows C# code to call functions exported by C++ libraries.

Creating a C# wrapper for the C++ WebRTC library can be complex, especially when aiming for cross-platform support. For efficiency and reliability, it's often recommended to use established libraries. However, the range of available packages for this purpose is limited.

Unity WebRTC: Developed and maintained by Unity Technologies, this official package simplifies the incorporation of WebRTC functionalities into Unity applications. It supports video and audio communication, as well as data channels, offering a comprehensive solution for developers seeking to add real-time communication capabilities to their projects. This is the most recommended approach, offering broad platform support, including Windows, macOS, Linux, iOS, and Android.

Mixed Reality Toolkit-WebRTC (MRTK-WebRTC) from Microsoft: Although it has been marked as deprecated, MRTK-WebRTC continues to receive positive feedback from users for its functionality. It is designed to support real-time communication in mixed-reality applications and provides audio and video streaming capabilities. Its deprecated status warrants caution concerning future maintenance and updates, but MRTK-WebRTC could still be a suitable choice for projects where its features meet the developers' needs and where the absence of ongoing official support is not an issue.

While other attempts to port WebRTC to C# exist, along with some paid assets, Unity WebRTC and MRTK-WebRTC stand out as the primary solutions endorsed and utilized by the broader community.

Coding WebRTC in Unity

For developers looking to integrate WebRTC within Unity applications using Unity's official WebRTC package, we've prepared a step-by-step guide on creating a basic video streaming application using a peer-to-peer model. We highly recommend working through it hands-on to apply this knowledge in practice and deepen your understanding of real-time communication development.

Conclusion

This article has covered the key components of the WebRTC framework, including its use in various architectures like peer-to-peer and client-server models, the challenges posed by NATs and firewalls, and the crucial role of signaling, SDP, and ICE in establishing connections. By leveraging technologies such as STUN and TURN, WebRTC navigates the complexities of modern networks, ensuring seamless, high-quality communication across the web.