Introduction to MCU Architecture
Multipoint Control Units (MCUs) were long the preferred solution for larger conference calling providers. They date from a time when both devices and internet connections were significantly weaker than they are today.
Because internet speeds were slower, individual devices could not pull as much data from the network. Available compute power was also low, so heavy processing on the device itself was rarely an option. How do you connect devices that cannot download or process many video streams? The MCU was designed to solve this problem.
The P2P connection mechanism described in the last lesson has peers transfer data directly to one another. However, scaling is a major issue for P2P network architectures, because every device added to the call needs a connection to every other device. A better way to scale the network and reduce the number of connections is to have a central server to which all devices connect. This reduces the total number of connections to the number of devices on the network, since every device has only one incoming and one outgoing stream to manage, both to and from the server. This linear growth makes it much easier to scale the network to many devices. This is how the Multipoint Control Unit (MCU) architecture addresses the scaling issue of P2P networks.
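To make the difference in growth concrete, here is a minimal sketch that counts connections in both setups. It assumes the P2P call is a full mesh (every pair of devices is linked directly) and counts each device-to-server link in the MCU call as one connection. The function names are illustrative only.

```python
def mesh_connections(devices: int) -> int:
    """Links in a full P2P mesh: one link per pair of devices (assumed topology)."""
    return devices * (devices - 1) // 2


def mcu_connections(devices: int) -> int:
    """Links in an MCU call: each device holds a single link to the server."""
    return devices


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        print(f"{n:>3} devices: mesh={mesh_connections(n):>5}  mcu={mcu_connections(n):>3}")
```

Under these assumptions, a 10-device call needs 45 links as a mesh but only 10 with an MCU, and the gap widens quadratically as the call grows.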
The MCU receives encoded streams from all devices connected to it. These streams are decoded and mixed, and a single stream is provided to all the devices on the network, so each device only needs to decode and display a single stream. The server runs several processes:
- Decoding incoming streams: Decodes the stream from every device so that the raw media can be processed.
- Media mixing: Mixes all incoming media (video, audio, etc) to composite a single stream.
- Layout and composition: Handles how the mixed media streams are arranged on the screen. It can control aspects like speaker detection, active speaker highlighting, and selecting the appropriate layout based on the number of participants.
- Encoding streams: Encodes the composited stream that is sent out to every device.
This means that every device has an outgoing stream comprising its own data and an incoming stream comprising all devices connected in the call. This is not a trivial workload, however. Handling and compositing the output streams takes heavy computational resources, and the entire burden is on the server.
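The mixing and layout steps can be illustrated with a toy sketch. It is not how any particular MCU is implemented: it assumes audio arrives as already-decoded, equal-length frames of 16-bit PCM samples, and that video tiles are placed on a simple near-square grid. The names `mix_audio` and `grid_layout`, and the 1280x720 canvas, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

INT16_MIN, INT16_MAX = -32768, 32767


def mix_audio(frames: Sequence[Sequence[int]]) -> List[int]:
    """Mix equal-length 16-bit PCM frames by summing samples and clamping the result."""
    return [max(INT16_MIN, min(INT16_MAX, sum(samples))) for samples in zip(*frames)]


def grid_layout(participants: int, width: int = 1280,
                height: int = 720) -> List[Tuple[int, int, int, int]]:
    """Return one (x, y, w, h) tile per participant on a near-square grid."""
    if participants < 1:
        return []
    cols = math.isqrt(participants - 1) + 1    # ceil(sqrt(n)) without floating point
    rows = (participants + cols - 1) // cols   # ceil(n / cols)
    tile_w, tile_h = width // cols, height // rows
    return [((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
            for i in range(participants)]


if __name__ == "__main__":
    print(mix_audio([[1000, -2000], [30000, -32000]]))
    for tile in grid_layout(4):
        print(tile)
```

Even in this simplified form, the work grows with every participant: each one contributes another frame to sum and another tile to place, and all of that happens on the server before the output is encoded.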
Scalability
With the MCU architecture, almost all stream processing is left to the server rather than to the individual devices. In principle, this allows any number of devices to join, since each device still handles only one incoming and one outgoing stream. However, the bottleneck now becomes the MCU, since each additional device adds processing work on the server.
Note: Strictly speaking, the MCU architecture has the highest scalability in terms of connected devices, since it places little to no extra load on the individual devices. However, cost and server compute power can quickly become limiting factors.
Performance
In theory, the MCU architecture can support the largest number of devices, since adding a user to the call creates no additional overhead for the users already in it. Little computing power is needed on the device. The number of streams is also the lowest, since the MCU combines all streams into one and sends it to every device: each device decodes a single stream rather than one stream per connected device. However, the MCU has to carry the entire computational burden, so its computing power requirements are quite high.
Connected devices: A fixed number of streams is sent to the server, and one stream is returned. Additional users do not require more computing power on the device side.
MCU: Network connections increase linearly with the number of users on a call. For each user, transcoding (decoding and re-encoding) takes considerable computing power, while simply receiving and sending the streams takes far less. Unlike other architectures, the MCU also has to handle compositing and layout: every stream may need to be added to the resulting outgoing stream, which adds computational work for every new user.
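The split described above can be summarised as a rough operation count. This is a back-of-the-envelope model, not a benchmark: it counts operations rather than measuring cost, and the `shared_output` flag is an assumption that distinguishes an MCU encoding one composite for everyone from one encoding a separate output per device. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Load:
    connections: int        # network links handled
    decodes: int            # streams decoded per frame interval
    composited_inputs: int  # streams placed into the mixed output
    encodes: int            # streams encoded per frame interval


def device_load() -> Load:
    """A device encodes its own stream and decodes the single mixed stream."""
    return Load(connections=1, decodes=1, composited_inputs=0, encodes=1)


def mcu_load(users: int, shared_output: bool = True) -> Load:
    """Server-side work for a call with `users` devices (simplified model)."""
    if users < 1:
        return Load(0, 0, 0, 0)
    encodes = 1 if shared_output else users  # assumption: one composite vs. one per device
    return Load(connections=users, decodes=users, composited_inputs=users, encodes=encodes)


if __name__ == "__main__":
    print("device:", device_load())
    for n in (2, 10, 50):
        print(f"mcu with {n} users:", mcu_load(n))
```

In this model the device's load never changes, while the server's decode and compositing counts rise by one for every user who joins.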
Cost
The MCU architecture is the most expensive option for the developer, since a backend with significant computing power needs to be set up.
Infrastructure: MCUs perform many computation tasks on the incoming and outgoing streams, leading to costs that are often far higher than with other options. These costs also depend on the number of users; larger enterprise calls may require significantly more powerful machines.
Bandwidth: The bandwidth cost is the lowest for the MCU architecture (apart from P2P) since there are fewer connections than other architectures. All devices have a single set of outgoing streams and a single set of incoming streams. All video streams are mixed into a single stream emitted to all connected devices, leading to far smaller bandwidth requirements.
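As a rough illustration of the bandwidth point, the sketch below compares the server's traffic in an MCU call with a simplified forwarding server (SFU) that relays every incoming stream to every other device. It assumes that every stream, including the MCU's mixed output, uses the same bitrate, and the 1500 kbps default is an arbitrary placeholder rather than a recommended value.

```python
from typing import Tuple


def mcu_server_kbps(users: int, bitrate_kbps: int = 1500) -> Tuple[int, int]:
    """(ingress, egress): one stream in from each device, one mixed stream out to each."""
    return users * bitrate_kbps, users * bitrate_kbps


def sfu_server_kbps(users: int, bitrate_kbps: int = 1500) -> Tuple[int, int]:
    """(ingress, egress) for a simplified SFU that forwards every stream to every other device."""
    return users * bitrate_kbps, users * (users - 1) * bitrate_kbps


if __name__ == "__main__":
    for n in (3, 10, 25):
        print(f"{n:>2} users  mcu={mcu_server_kbps(n)}  sfu={sfu_server_kbps(n)}")
```

Under these assumptions the MCU's outgoing traffic grows linearly with the number of users, while the forwarding server's grows quadratically. Real deployments narrow the gap with techniques this model ignores.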
Advantages
Here are the advantages associated with using MCU architecture:
- It scales much better than P2P in terms of the number of devices and the local computing power members need (but worse than an SFU).
- Supports legacy devices.
- Each device needs only a single uplink and a single downlink, and network connections increase linearly.
Disadvantages
Here are the disadvantages of using MCU architecture:
- The cost of hosting an MCU server can be significantly higher than for any other architecture. This is the main reason why the MCU architecture is no longer popular.
- Since all media streams are decoded on the server, there may be additional overhead in ensuring the server is secure.
- The local device does not have much (if any) flexibility in controlling the call view presented to the user.
- Processing of video and audio adds latency.
When to use
MCU architecture is no longer popular due to the costs and restrictions it places on the user experience. Here are the cases in which using MCU architecture is favorable:
- Integration with other platforms requires you to do video or audio processing.
- The audio and video processing you need is too heavy to run on the local device.
- Users don’t need control over the call UI.
- Typically, MCU architecture is used when running calls on older devices under poor network conditions.