```mermaid
graph LR
    user((User))
    sdk[Python AI SDK]
    stt[STT]
    llm[LLM]
    backend[Backend]
    user -->|audio| sdk
    sdk -->|PCM| stt
    stt -->|transcript| sdk
    sdk -->|transcript| llm
    llm -->|function_call| backend
    backend -->|function_response| llm
```
# Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an emerging open standard that defines how language-model prompts, tool calls, and context travel together through an AI pipeline.
MCP is a transport-agnostic contract that keeps providers, tools, and data in sync.
## Why Does MCP Matter?
- Modern AI workflows involve multiple components: transcription, function calling, embeddings, vector stores, and more. Without a shared protocol, every component needs custom glue code.
- It standardizes communication between components, so with MCP you can swap providers or insert new tools without rewriting everything.
- Because it’s just JSON, you can inspect, replay, and debug any step. This is critical for safety and observability.
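To illustrate the “just JSON” point, here is a minimal sketch of what inspecting and replaying a message could look like. The field names (`type`, `context`, and so on) are assumptions for illustration, not the official wire format:

```python
import json

# An illustrative MCP-style message; the exact field names are
# assumptions for this sketch, not the official wire format.
message = {
    "type": "transcript",
    "text": "what's the weather like in London?",
    "context": {
        "speaker_id": "user-42",
        "timestamp": "2024-01-01T12:00:00Z",
        "language": "en",
        "confidence": 0.94,
    },
}

# Because it's plain JSON, any step can be logged...
line = json.dumps(message)
print(line)

# ...and replayed later for debugging, byte for byte.
replayed = json.loads(line)
assert replayed == message
```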
## Key Concepts
- Message: the atomic unit that flows through the pipeline. A message can contain user text, tool responses, audio, or structured data.
- Capabilities: declarations such as `function_calling`, `speech_to_text`, or `embeddings`. A component advertises what it can do, and downstream agents decide how to route messages.
- Context: metadata (timestamps, speaker IDs, language codes, etc.) that travels with every message so nothing gets lost between hops.
- Tools: domain-specific functions that an LLM can invoke, described with JSON Schema. Example: `get_weather` with a single `location` parameter.
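To make the Tools concept concrete, here is a hypothetical JSON Schema description of the `get_weather` tool. The envelope fields (`name`, `description`, `parameters`) follow the common function-calling convention and are illustrative rather than a fixed MCP spec:

```python
# A hypothetical JSON Schema description of the get_weather tool.
# Treat the envelope fields as illustrative, not a fixed spec.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. 'London'",
            }
        },
        "required": ["location"],
    },
}
```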
## How Does MCP Work in a Stream Video Call?

- A user speaks in a Stream video call, e.g. “what’s the weather like in London?”
- The Python AI SDK turns raw PCM audio into an `audio` message (MCP type `audio/pcm`).
- A Speech-to-Text component converts that into a `transcript` message.
- The `transcript` is routed to an LLM that supports the `function_calling` capability. The LLM sees the user intent and returns a `function_call` message, e.g. `{ name: "get_weather", args: { location: "London" } }`.
- Your backend receives the call, queries a weather API, and sends back a `function_response` message containing the result.
- Finally, a TTS component may turn the response into an `audio` message so everyone in the call hears the answer, e.g. “the weather in London is sunny with a temperature of 23 degrees”.
Throughout this journey, every message retains its context (who spoke, when, confidence scores, etc.), making downstream processing deterministic and debuggable.
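To make the backend step concrete, here is a minimal sketch assuming the message shapes from the walkthrough above; `fetch_weather` and `handle_message` are hypothetical helpers, not part of the Stream SDK:

```python
# A minimal sketch of the backend step: receive a function_call
# message, run the matching business logic, and answer with a
# function_response message. The message shapes follow the
# walkthrough above and are illustrative, not a fixed spec.

def fetch_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather API client."""
    return f"The weather in {location} is sunny with a temperature of 23 degrees."

def handle_message(message: dict) -> dict | None:
    if message.get("type") != "function_call":
        return None  # not ours to handle
    if message["name"] == "get_weather":
        result = fetch_weather(message["args"]["location"])
        return {
            "type": "function_response",
            "name": "get_weather",
            "result": result,
            # Carry the context forward so downstream hops keep it.
            "context": message.get("context", {}),
        }
    raise ValueError(f"Unknown tool: {message['name']}")

call = {
    "type": "function_call",
    "name": "get_weather",
    "args": {"location": "London"},
    "context": {"speaker_id": "user-42"},
}
print(handle_message(call)["result"])
```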
## When Should You Use MCP?
Here are a few reasons you might consider using MCP.
- You want interchangeable AI providers (e.g. switch from OpenAI to Anthropic without rewriting prompts).
- You need a clean boundary between the app layer and your tool plugins, e.g. when you have existing business logic that you want to expose inside a video call.
- You care about auditability. MCP gives you a log of every function call the LLM attempted.
- You plan to orchestrate multi-step workflows (“transcribe → translate → summarize → store”) and need a single contract.
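The single-contract idea from the last bullet can be sketched as a chain of steps that all accept and return the same message shape. Every function below is a hypothetical placeholder, not part of the SDK:

```python
# An illustrative pipeline where every step speaks the same message
# contract, so stages can be reordered or swapped independently.

def transcribe(msg: dict) -> dict:
    return {**msg, "type": "transcript", "text": "hola mundo"}

def translate(msg: dict) -> dict:
    context = {**msg.get("context", {}), "language": "en"}
    return {**msg, "text": "hello world", "context": context}

def summarize(msg: dict) -> dict:
    return {**msg, "type": "summary", "text": msg["text"][:50]}

def store(msg: dict) -> dict:
    print("persisting:", msg)  # stand-in for a real datastore write
    return msg

pipeline = [transcribe, translate, summarize, store]

message = {"type": "audio", "context": {"language": "es"}}
for step in pipeline:
    message = step(message)
```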
## Relationship to Other AI Technologies
- Speech to Text (STT): STT often produces the first MCP message (`transcript`).
- Text to Speech (TTS): TTS can consume MCP `function_response` messages to speak results.
- Voice Activity Detection (VAD): VAD can generate `activity_start`/`activity_end` messages that bracket other MCP traffic (see the sketch after this list).
- Speech to Speech (STS): Some STS providers can call functions as part of their workflows.
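As a sketch of the VAD bracketing mentioned above, a VAD component might emit messages like these around a burst of speech; all field names here are assumptions for illustration:

```python
# Illustrative VAD bracket messages: activity_start and activity_end
# wrap the transcript traffic for one burst of speech.
events = [
    {"type": "activity_start",
     "context": {"speaker_id": "user-42", "timestamp": "2024-01-01T12:00:00.0Z"}},
    {"type": "transcript", "text": "what's the weather like in London?",
     "context": {"speaker_id": "user-42"}},
    {"type": "activity_end",
     "context": {"speaker_id": "user-42", "timestamp": "2024-01-01T12:00:03.2Z"}},
]

# Downstream consumers can group everything between the brackets
# as a single utterance.
utterance = [e for e in events if e["type"] == "transcript"]
print(len(utterance), "transcript message(s) in this utterance")
```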
## I’m Sold, Let’s Build with MCP
Great! Take a look at our how-to guide, which walks you through an example.