Model Context Protocol (MCP)

The Model Context Protocol (MCP) is an emerging open standard that defines how language-model prompts, tool calls, and context travel together through an AI pipeline.

MCP is a transport-agnostic contract that keeps providers, tools, and data in sync.

Why Does MCP Matter?

  • Modern AI workflows involve multiple components: transcription, function calling, embeddings, vector stores, and more. Without a shared protocol, every component needs custom glue code.
  • It standardizes communication between components, so you can swap providers or insert new tools without rewriting everything.
  • Because every message is plain JSON, you can inspect, replay, and debug any step. This is critical for safety and observability.

Key Concepts

  • Message: the atomic unit that flows through the pipeline. A message can contain user text, tool responses, audio, or structured data.
  • Capabilities: declarations such as function_calling, speech_to_text, or embeddings. A component advertises what it can do, and downstream agents decide how to route messages.
  • Context: metadata (timestamps, speaker IDs, language codes, etc.) that travels with every message so nothing gets lost between hops.
  • Tools: domain-specific functions that an LLM can invoke, described with JSON Schema. Example: get_weather with a single location parameter, sketched below.
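
To make the get_weather example concrete, here is one way such a tool description might look as a Python dict. The parameters block is standard JSON Schema; the surrounding field names (name, description, parameters) are an illustrative assumption, not a fixed wire format.

```python
# A hypothetical description of the get_weather tool. The "parameters"
# block is standard JSON Schema; the top-level field names are
# illustrative, not a fixed MCP wire format.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return current weather conditions for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. 'London'",
            },
        },
        "required": ["location"],
    },
}
```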

How Does MCP Work in a Stream Video Call?

  1. A user speaks in a Stream video call, e.g. “what’s the weather like in London?”
  2. The Python AI SDK turns raw PCM audio into an audio message (MCP type audio/pcm).
  3. A Speech-to-Text component converts that into a transcript message.
  4. The transcript is routed to an LLM that supports the function_calling capability. The LLM recognizes the user’s intent and returns a function_call message, e.g. { "name": "get_weather", "args": { "location": "London" } }.
  5. Your backend receives the function_call, queries a weather API, and sends back a function_response message containing the result (sketched after this list).
  6. Finally, a TTS component may turn the response into an audio message so everyone in the call hears the answer, e.g. “the weather in London is sunny with a temperature of 23 degrees”.
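
Here is a minimal sketch of step 5, assuming hypothetical message shapes; fetch_weather stands in for a real weather API call, and none of these field names are a fixed SDK API.

```python
# Turn a function_call message into a function_response message.
# Message shapes and the fetch_weather helper are illustrative
# assumptions, not part of any fixed MCP or SDK API.

def fetch_weather(location: str) -> dict:
    # Stand-in for a real weather API call.
    return {"location": location, "conditions": "sunny", "temperature_c": 23}

def handle_function_call(message: dict) -> dict:
    call = message["function_call"]  # e.g. {"name": "get_weather", "args": {...}}
    if call["name"] == "get_weather":
        result = fetch_weather(call["args"]["location"])
    else:
        result = {"error": f"unknown tool: {call['name']}"}
    return {
        "type": "function_response",
        "name": call["name"],
        "result": result,
        "context": message.get("context", {}),  # carry the context forward
    }
```

Note how the response carries the original context forward; that is what keeps the next hop (the TTS step) traceable.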

Throughout this journey, every message retains its context (who spoke, when, confidence scores, etc.), making downstream processing traceable and debuggable.
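
To make that concrete, a transcript message and its context might serialize like this (all field names are illustrative assumptions chosen to mirror the metadata described above):

```python
import json

# An illustrative transcript message with its context attached.
transcript_message = {
    "type": "transcript",
    "text": "what's the weather like in London?",
    "context": {
        "speaker_id": "user-42",
        "timestamp": "2024-05-01T10:15:03Z",
        "language": "en",
        "confidence": 0.94,
    },
}

# Because the message is plain JSON, any hop can log, inspect,
# or replay it verbatim.
print(json.dumps(transcript_message, indent=2))
```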

```mermaid
graph LR
    user((User))
    sdk[Python AI SDK]
    stt[STT]
    llm[LLM]
    backend[Backend]

    user    -->|audio| sdk
    sdk     -->|PCM| stt
    stt     -->|transcript| sdk
    sdk     -->|transcript| llm
    llm     -->|function_call| backend
    backend -->|function_response| llm
```
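
The diagram above is essentially capability-based routing: each hop inspects the message type and forwards it to a component that advertises the matching capability. A toy version of that routing logic, with component and capability names that are purely illustrative, might look like:

```python
# Toy capability-based router for the pipeline above. Component and
# capability names are illustrative assumptions.
components = {
    "stt": {"speech_to_text"},
    "llm": {"function_calling"},
    "backend": {"get_weather"},
}

# Which capability each message type needs next.
routes = {
    "audio/pcm": "speech_to_text",
    "transcript": "function_calling",
    "function_call": "get_weather",
}

def next_hop(message_type: str) -> str:
    needed = routes[message_type]
    for name, capabilities in components.items():
        if needed in capabilities:
            return name
    raise LookupError(f"no component advertises {needed!r}")

assert next_hop("transcript") == "llm"
```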

When Should You Use MCP?

Here are a few situations where MCP is a good fit.

  • You want interchangeable AI providers (e.g. switch from OpenAI to Anthropic without rewriting prompts).
  • You need a clean boundary between the app layer and your tool plugins, e.g. when you have existing business logic you want to expose inside a video call without rewriting it.
  • You care about auditability. MCP gives you a log of every function call the LLM attempted (see the sketch after this list).
  • You plan to orchestrate multi-step workflows (“transcribe → translate → summarize → store”) and need a single contract.
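
As a sketch of that audit trail, one simple approach is to append every message that crosses the pipeline to a JSONL file and replay it later. The file name and message shape here are assumptions for illustration.

```python
import json

def log_message(message: dict, path: str = "mcp_audit.jsonl") -> None:
    # Append one message per line so the log can be tailed and diffed.
    with open(path, "a") as f:
        f.write(json.dumps(message) + "\n")

def replay(path: str = "mcp_audit.jsonl"):
    # Yield every logged message, in order, exactly as it happened.
    with open(path) as f:
        for line in f:
            yield json.loads(line)
```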

Relationship to Other AI Technologies

I’m Sold, Let’s Build with MCP

Great! Take a look at our how-to guide, which walks you through a complete example.
