```mermaid
graph LR
    user((User))
    sdk[Python AI SDK]
    stt[STT]
    llm[LLM]
    backend[Backend]
    user -->|audio| sdk
    sdk -->|PCM| stt
    stt -->|transcript| sdk
    sdk -->|transcript| llm
    llm -->|function_call| backend
    backend -->|function_response| llm
```
# Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an emerging open standard that defines how language-model prompts, tool calls, and context travel together through an AI pipeline.
MCP is a transport-agnostic contract that keeps providers, tools, and data in sync.
## Why Does MCP Matter?
- Modern AI workflows involve multiple components: transcription, function calling, embeddings, vector stores, and more. Without a shared protocol, every component needs custom glue code.
- It standardizes communication between components, so with MCP you can swap providers or insert new tools without rewriting everything.
- Because it’s just JSON, you can inspect, replay, and debug any step. This is critical for safety and observability.
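To illustrate the “just JSON” point, here is a minimal sketch of what inspecting and replaying a message could look like. The field names (`type`, `context`, and so on) are assumptions for illustration, not the official wire format:

```python
import json

# An illustrative MCP-style message; the exact field names are
# assumptions for this sketch, not the official wire format.
message = {
    "type": "transcript",
    "text": "what's the weather like in London?",
    "context": {
        "speaker_id": "user-42",
        "timestamp": "2024-01-01T12:00:00Z",
        "language": "en",
        "confidence": 0.94,
    },
}

# Because it's plain JSON, any step can be logged...
line = json.dumps(message)
print(line)

# ...and replayed later for debugging, byte for byte.
replayed = json.loads(line)
assert replayed == message
```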
## Key Concepts
- Message: the atomic unit that flows through the pipeline. A message can contain user text, tool responses, audio, or structured data.
- Capabilities: declarations such as `function_calling`, `speech_to_text`, or `embeddings`. A component advertises what it can do, and downstream agents decide how to route messages.
- Context: metadata (timestamps, speaker IDs, language codes, etc.) that travels with every message so nothing gets lost between hops.
- Tools: domain-specific functions that an LLM can invoke, described with JSON Schema. Example: `get_weather` with a single `location` parameter.
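To make the Tools concept concrete, here is a hypothetical JSON Schema description of the `get_weather` tool. The envelope fields (`name`, `description`, `parameters`) follow the common function-calling convention and are illustrative rather than a fixed MCP spec:

```python
# A hypothetical JSON Schema description of the get_weather tool.
# Treat the envelope fields as illustrative, not a fixed spec.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. 'London'",
            }
        },
        "required": ["location"],
    },
}
```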
## How Does MCP Work in a Stream Video Call?

- A user speaks in a Stream video call, e.g. “what’s the weather like in London?”
- The Python AI SDK turns raw PCM audio into an `audio` message (MCP type `audio/pcm`).
- A Speech-to-Text component converts that into a `transcript` message.
- The `transcript` is routed to an LLM that supports the `function_calling` capability. The LLM sees the user intent and returns a `function_call` message, e.g. `{ name: "get_weather", args: { location: "London" } }`.
- Your backend receives the call, queries a weather API, and sends back a `function_response` message containing the result.
- Finally, a TTS component may turn the response into an `audio` message so everyone in the call hears the answer, e.g. “the weather in London is sunny with a temperature of 23 degrees”.
Throughout this journey, every message retains its context (who spoke, when, confidence scores, etc.), making downstream processing deterministic and debuggable.
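To make the backend step concrete, here is a minimal sketch assuming the message shapes from the walkthrough above; `fetch_weather` and `handle_message` are hypothetical helpers, not part of the Stream SDK:

```python
# A minimal sketch of the backend step: receive a function_call
# message, run the matching business logic, and answer with a
# function_response message. The message shapes follow the
# walkthrough above and are illustrative, not a fixed spec.

def fetch_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather API client."""
    return f"The weather in {location} is sunny with a temperature of 23 degrees."

def handle_message(message: dict) -> dict | None:
    if message.get("type") != "function_call":
        return None  # not ours to handle
    if message["name"] == "get_weather":
        result = fetch_weather(message["args"]["location"])
        return {
            "type": "function_response",
            "name": "get_weather",
            "result": result,
            # Carry the context forward so downstream hops keep it.
            "context": message.get("context", {}),
        }
    raise ValueError(f"Unknown tool: {message['name']}")

call = {
    "type": "function_call",
    "name": "get_weather",
    "args": {"location": "London"},
    "context": {"speaker_id": "user-42"},
}
print(handle_message(call)["result"])
```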
## When Should You Use MCP?
Here are a few reasons you might consider using MCP.
- You want interchangeable AI providers (e.g. switch from OpenAI to Anthropic without rewriting prompts).
- You need a clean boundary between the app layer and your tool plugins, e.g. when you have existing business logic that you want to expose inside a video call.
- You care about auditability. MCP gives you a log of every function call the LLM attempted.
- You plan to orchestrate multi-step workflows (“transcribe → translate → summarize → store”) and need a single contract.
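The single-contract idea from the last bullet can be sketched as a chain of steps that all accept and return the same message shape. Every function below is a hypothetical placeholder, not part of the SDK:

```python
# An illustrative pipeline where every step speaks the same message
# contract, so stages can be reordered or swapped independently.

def transcribe(msg: dict) -> dict:
    return {**msg, "type": "transcript", "text": "hola mundo"}

def translate(msg: dict) -> dict:
    context = {**msg.get("context", {}), "language": "en"}
    return {**msg, "text": "hello world", "context": context}

def summarize(msg: dict) -> dict:
    return {**msg, "type": "summary", "text": msg["text"][:50]}

def store(msg: dict) -> dict:
    print("persisting:", msg)  # stand-in for a real datastore write
    return msg

pipeline = [transcribe, translate, summarize, store]

message = {"type": "audio", "context": {"language": "es"}}
for step in pipeline:
    message = step(message)
```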
## Relationship to Other AI Technologies
- Speech to Text (STT): STT often produces the first MCP message (`transcript`).
- Text to Speech (TTS): TTS can consume MCP `function_response` messages to speak results.
- Voice Activity Detection (VAD): VAD can generate `activity_start`/`activity_end` messages that bracket other MCP traffic (see the sketch after this list).
- Speech to Speech (STS): Some STS providers can call functions as part of their workflows.
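As a sketch of the VAD bracketing mentioned above, a VAD component might emit messages like these around a burst of speech; all field names here are assumptions for illustration:

```python
# Illustrative VAD bracket messages: activity_start and activity_end
# wrap the transcript traffic for one burst of speech.
events = [
    {"type": "activity_start",
     "context": {"speaker_id": "user-42", "timestamp": "2024-01-01T12:00:00.0Z"}},
    {"type": "transcript", "text": "what's the weather like in London?",
     "context": {"speaker_id": "user-42"}},
    {"type": "activity_end",
     "context": {"speaker_id": "user-42", "timestamp": "2024-01-01T12:00:03.2Z"}},
]

# Downstream consumers can group everything between the brackets
# as a single utterance.
utterance = [e for e in events if e["type"] == "transcript"]
print(len(utterance), "transcript message(s) in this utterance")
```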
## I’m Sold, Let’s Build with MCP
Great! Take a look at our how-to guide, which walks you through an example.