Voice AI technologies have become central to communication between customers, small businesses, and enterprises. To extend what these systems can do, the Model Context Protocol (MCP) is quickly becoming a must-have: it lets a voice agent reach beyond its model's training data so it can give users accurate, satisfying responses.
Continue reading to discover the APIs, open-source frameworks, SDKs, and tools you can use to build voice AI apps with out-of-the-box MCP support.
Overview of MCP Tools for Voice Agents
The Model Context Protocol (MCP) is an open standard created by Anthropic that lets LLMs and AI agents access external toolkits and fulfill users' prompts with accurate, non-hallucinated answers. In the context of conversational voice AI, MCP can, for example, give an agent the ability to see your desktop or web browser and accomplish tasks like booking a flight on your behalf.
In the next section, we will examine how MCP works with voice and video calls.
Refer to our Top 7 MCP-Supported AI Frameworks to learn more about building AI apps with built-in MCP support.
How MCP Works in Voice AI Pipelines
Let's reference a user-AI video call to illustrate how MCP works with voice agents.
When a user initiates a voice or video call and asks, for example, "What's the current temperature in Amsterdam?", the agentic voice system captures the user's raw audio and sends it to its speech-to-text (STT) component for transcription. The transcript is then passed to the underlying agent (an LLM with tool access), which issues a function call to fetch real-time weather information for the requested location.
Once the backend infrastructure receives the request, it calls an external weather MCP tool for accurate information and returns the result to the agent. Finally, the voice system's text-to-speech (TTS) component converts the agent's response into audio and sends it back to the user.
Note: The above diagram shows a real-time, speech-to-speech agentic voice system, such as one built with the OpenAI Realtime API or Gemini Live API. In a voice system built on a traditional STT and TTS pipeline, similar interactions occur between the voice components.
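To make the flow concrete, here is a minimal sketch of a traditional STT → LLM → TTS turn in Python. Every helper in it is a hypothetical stub standing in for your real STT, LLM, MCP, and TTS providers; the point is simply where the MCP call sits in the pipeline.

```python
# A minimal, self-contained sketch of the pipeline described above. Every helper
# is a hypothetical stub standing in for your real STT, LLM, MCP, and TTS providers.

def transcribe(audio: bytes) -> str:
    # Placeholder for an STT provider such as Deepgram.
    return "What's the current temperature in Amsterdam?"

def call_mcp_weather_tool(arguments: dict) -> dict:
    # Placeholder for an MCP client call to an external weather MCP server.
    return {"city": arguments["city"], "temperature_c": 14}

def generate_answer(transcript: str, tool_result: dict | None) -> str:
    # Placeholder for the LLM. In a real agent, the model itself decides (via
    # function calling) whether to request the MCP tool before answering.
    if tool_result:
        return f"It's currently {tool_result['temperature_c']}°C in {tool_result['city']}."
    return "Sorry, I don't have access to live weather data."

def synthesize(text: str) -> bytes:
    # Placeholder for a TTS provider such as ElevenLabs or Cartesia.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    transcript = transcribe(audio_in)                           # 1. speech-to-text
    tool_result = call_mcp_weather_tool({"city": "Amsterdam"})  # 2. MCP tool call
    answer = generate_answer(transcript, tool_result)           # 3. grounded LLM response
    return synthesize(answer)                                   # 4. text-to-speech

if __name__ == "__main__":
    print(handle_turn(b""))
```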
Why Use MCP Tools for Voice Agents?
The videos below illustrate voice interactions between an AI assistant and a user in two scenarios. The first shows an agent without MCP tools: when the user asks for weather information about a specific location, the agent cannot provide a satisfactory response because it has no way to access and retrieve external, real-time data related to the query.
Using the same agentic workflow, you can integrate a weather MCP server so the voice assistant can access and provide accurate weather information for any location, as demonstrated in the video below.
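If you want to reproduce the weather scenario yourself, the official MCP Python SDK makes it straightforward to stand up a small server. The sketch below uses FastMCP from the `mcp` package; the `get_weather` tool and its hard-coded response are illustrative only, and in practice you would call a real weather API inside the tool.

```python
# weather_server.py — a minimal weather MCP server sketch using the official
# MCP Python SDK (pip install "mcp[cli]"). The get_weather tool is illustrative;
# replace its body with a call to a real weather API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
def get_weather(city: str) -> dict:
    """Return the current weather for a city."""
    # Hard-coded demo data; swap in a real weather API call here.
    return {"city": city, "temperature_c": 14, "condition": "cloudy"}

if __name__ == "__main__":
    # Serve over stdio so any MCP-capable agent can connect to it.
    mcp.run()
```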
You can equip voice agents with MCP tools for several reasons beyond providing weather forecasts for different places. The following highlights some of the key benefits.
- Automatic External Tool Integration: MCP eliminates the need to wire up integrations with external applications manually.
 - Easily Extend LLMs' Capabilities: Once you connect an MCP server to an agentic voice system, the underlying LLM can reach out to external sources whenever a user's query falls outside the data it was trained on.
 - Model Provider Agnostic: Switch easily from one LLM provider to another without rewriting your code.
 - Agent Monitoring: Get logs and traces of the LLM's function calls, web searches, and code execution.
 - Standardized Data Exchange: MCP provides a standard method for exchanging data between third-party APIs and voice agent systems.
 
The Top 4 MCP-Supported Voice Agent APIs & Frameworks
There are several APIs, libraries, and frameworks you can use to build speech AI applications and integrate MCP servers for scalability and extended agent-based use cases. Let's look at Python- and TypeScript-based voice AI platforms that treat MCP as a first-class citizen.
1. Use Vision Agents With Built-In MCP Support
Vision Agents is an open-source Python framework for building voice and video AI experiences.
While many speech and audio AI platforms focus solely on voice, Vision Agents' approach extends voice AI with video calling. It enables users to engage in real-time audio conversations and live video chats with AI agents from any model provider. Developers can utilize text and audio generation models from OpenAI, Google Gemini, Anthropic, xAI, Mistral, Qwen, and speech-focused providers like ElevenLabs, Deepgram, Cartesia, and Assembly AI.
Advantages of Using Vision Agents Over Other Platforms
Although all the AI platforms mentioned in this article can be used to build voice AI services, Vision Agents makes it seamless to integrate and work with whichever AI provider you prefer.
- Video-First Approach: Interacting with an AI system doesn't have to be limited to voice communication, as seen in ChatGPT's Voice Mode. Vision Agents enables developers to integrate voice and real-time video communication, creating multimodal AI experiences that enhance user engagement.
 - Excellent for Physical and Robotic Projects: The platform is ideal for developing vision AI services that require real-time video and image processing. You can install ready-made plugins, such as Roboflow, Moondream, and Ultralytics, for robotic and physical AI applications.
 - Low Latency Responses: Communicating with your voice agent feels like having human-to-human conversations because the agent responds with minimal delay. Calling an external MCP service within the platform is also instant.
 - AI Platform Agnostic: It features built-in integration with speech-to-speech APIs, including Gemini Live, OpenAI Realtime, and Amazon Nova Sonic. While you can bring your preferred LLMs from any of the above providers, you can also implement a custom-made traditional voice pipeline for speech-to-text and text-to-speech.
 - Flexibility: Build a voice-only AI app or combine video, speech, and vision to create multi-modal, real-time conversational experiences powered by Stream’s global edge network of servers.
 - Create a Custom Plugin: The framework has out-of-the-box plugin support for STT, TTS, STS, and voice activity detection (VAD). It also allows developers to build custom plugins that extend the capabilities provided by the SDK. The custom plugins help developers connect the Python AI SDK to third-party AI services.
 
MCP in Vision Agents: Quick Start in Python
The demo below shows how to integrate the official GitHub MCP server with Vision Agents using your GitHub personal access token and an API key from any LLM provider. The GitHub MCP server exposes GitHub's API via MCP.
The integration allows voice/video AI apps powered by Vision Agents to interact with repositories, create issues, and open pull requests.
Getting started with MCP in Vision Agents requires the following steps.
- Configure Your Python Environment
 
Begin by creating your GitHub personal access token, getting an API key from your preferred AI provider, and initializing a Python project with uv. This example uses OpenAI as the model provider, but you can use LLMs from other vendors. You will also need a Stream API key and secret to work with Vision Agents.
```bash
# Initialize a Python project with uv
uv init

# Install the core Vision Agents SDK
uv add vision-agents

# Install Vision Agents Plugins
uv add "vision-agents[getstream, openai, elevenlabs, deepgram, silero]"

# Add your API credentials in a .env file
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

# OpenAI
OPENAI_API_KEY=...

# GitHub Access Token
GITHUB_PAT=...
```
Alternatively, you can run export YOUR_API_KEY=... in the Terminal to set any of the required API keys, or save them in your shell profile. On macOS, they can be stored permanently in your ~/.zprofile or ~/.zshrc file. These files are hidden in Finder; press shift + cmd + h to open your home folder, then shift + cmd + . to reveal hidden files.
- Integrate the GitHub MCP Server With Vision Agents
 
At the root of your Python project, rename main.py to, for example, github_mcp_demo.py, and replace its content with this sample code. 
```python
import asyncio
import logging
import os
from uuid import uuid4

from dotenv import load_dotenv

from vision_agents.core.agents import Agent
from vision_agents.core.mcp import MCPServerRemote
from vision_agents.plugins.openai.openai_llm import OpenAILLM
from vision_agents.plugins import elevenlabs, deepgram, silero, getstream
from vision_agents.core.events import CallSessionParticipantJoinedEvent
from vision_agents.core.edge.types import User

# Load environment variables from .env file
load_dotenv()

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


async def start_agent():
    """Demonstrate GitHub MCP server integration."""
    # Get GitHub PAT from environment
    github_pat = os.getenv("GITHUB_PAT")
    if not github_pat:
        logger.error("GITHUB_PAT environment variable not found!")
        logger.error("Please set GITHUB_PAT in your .env file or environment")
        return

    # Create GitHub MCP server
    github_server = MCPServerRemote(
        url="https://api.githubcopilot.com/mcp/",
        headers={"Authorization": f"Bearer {github_pat}"},
        timeout=10.0,  # Shorter connection timeout
        session_timeout=300.0,
    )

    # Get OpenAI API key from environment
    openai_api_key = os.getenv("OPENAI_API_KEY")
    if not openai_api_key:
        logger.error("OPENAI_API_KEY environment variable not found!")
        logger.error("Please set OPENAI_API_KEY in your .env file or environment")
        return

    # Create OpenAI LLM
    llm = OpenAILLM(model="gpt-4o", api_key=openai_api_key)

    # Create real edge transport and agent user
    edge = getstream.Edge()
    agent_user = User(name="GitHub AI Assistant", id="github-agent")

    # Create agent with GitHub MCP server and OpenAI LLM
    agent = Agent(
        edge=edge,
        llm=llm,
        agent_user=agent_user,
        instructions=(
            "You are a helpful AI assistant with access to GitHub via MCP server. "
            "You can help with GitHub operations like creating issues, managing pull "
            "requests, searching repositories, and more. Keep responses conversational "
            "and helpful."
        ),
        processors=[],
        mcp_servers=[github_server],
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
        vad=silero.VAD(),
    )

    logger.info("Agent created with GitHub MCP server")
    logger.info(f"GitHub server: {github_server}")

    try:
        # Connect to GitHub MCP server with timeout
        logger.info("Connecting to GitHub MCP server...")

        # Check if MCP tools were registered with the function registry
        logger.info("Checking function registry for MCP tools...")
        available_functions = agent.llm.get_available_functions()
        mcp_functions = [f for f in available_functions if f["name"].startswith("mcp_")]
        logger.info(
            f"✅ Found {len(mcp_functions)} MCP tools registered in function registry"
        )
        logger.info("MCP tools are now available to the LLM for function calling!")

        # Set up event handler for when participants join
        @agent.subscribe
        async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
            await agent.say(
                f"Hello {event.participant.user.name}! I'm your GitHub AI assistant "
                f"with access to {len(mcp_functions)} GitHub tools. I can help you "
                f"with repositories, issues, pull requests, and more!"
            )

        # Create a call
        call = agent.edge.client.video.call("default", str(uuid4()))

        # Have the agent join the call/room
        logger.info("🎤 Agent joining call...")
        with await agent.join(call):
            # Open the demo UI
            logger.info("🌐 Opening browser with demo UI...")
            await agent.edge.open_demo(call)
            logger.info("✅ Agent is now live! You can talk to it in the browser.")
            logger.info(
                "Try asking: 'What repositories do I have?' or 'Create a new issue'"
            )
            # Run until the call ends
            await agent.finish()

    except Exception as e:
        logger.error(f"Error with GitHub MCP server: {e}")
        logger.error("Make sure your GITHUB_PAT and OPENAI_API_KEY are valid")
        import traceback
        traceback.print_exc()

    # Clean up
    await agent.close()
    logger.info("Demo completed!")


if __name__ == "__main__":
    asyncio.run(start_agent())
```
In summary, we connect the voice agent to the GitHub MCP server using the personal access token and integrate OpenAI's gpt-4o as the underlying LLM. The ElevenLabs plugin handles text-to-speech, while Deepgram handles speech-to-text. The Silero plugin manages voice activity detection. 
When you execute the Python script with uv run github_mcp_demo.py, it will launch a Stream Video web UI ready for you to interact with the GitHub MCP agent through a real-time video call, as demonstrated in the preview below. 
From the sample code used for this demo, we can easily swap the model from OpenAI to, for example, Grok, without changing the rest of the implementation.
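To illustrate, only the LLM object passed to Agent(llm=...) needs to change. The sketch below keeps the OpenAI construction from the demo and shows, in comments, where an alternative provider would plug in; the xAI plugin name and class are assumptions, so check the Vision Agents plugin docs for the exact import paths and model identifiers.

```python
import os

from vision_agents.plugins.openai.openai_llm import OpenAILLM

def build_llm(provider: str):
    """Return the LLM object passed to Agent(llm=...) in github_mcp_demo.py."""
    if provider == "openai":
        # Taken from the demo above.
        return OpenAILLM(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
    # elif provider == "xai":
    #     # Hypothetical plugin name and class — verify against the Vision Agents docs.
    #     from vision_agents.plugins import xai
    #     return xai.LLM(model="grok-3", api_key=os.getenv("XAI_API_KEY"))
    raise ValueError(f"Unsupported provider: {provider}")
```

The MCP server, TTS, STT, and VAD plugins in the demo stay exactly as they are.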
2. Integrate OpenAI Realtime API Into Your Voice AI Service
The OpenAI Realtime API allows developers to build enterprise-grade voice applications and services. The August 2025 update to the API provides built-in support for remote MCP servers.
To enable MCP support for a voice agent session, specify your remote MCP server URL in the session configuration. Once an agent's session points to a particular MCP server, switching to another server to extend the agent's capabilities is straightforward.
Get Started With MCP Using OpenAI Realtime API
The following code snippet defines a configuration that points a session to Stripe's MCP server.
```json
// POST /v1/realtime/client_secrets
{
  "session": {
    "type": "realtime",
    "tools": [
      {
        "type": "mcp",
        "server_label": "stripe",
        "server_url": "https://mcp.stripe.com",
        "authorization": "{access_token}",
        "require_approval": "never"
      }
    ]
  }
}
```
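If you prefer to create the session from Python rather than raw HTTP, a minimal sketch with the requests library is shown below. It simply posts the same JSON body to the endpoint, assuming a standard bearer-token Authorization header with your OpenAI API key; the Stripe access token placeholder comes from the config above.

```python
# Minimal sketch: POST the session config above to the Realtime client_secrets
# endpoint with requests. Assumes a standard OpenAI bearer-token header.
import os

import requests

payload = {
    "session": {
        "type": "realtime",
        "tools": [
            {
                "type": "mcp",
                "server_label": "stripe",
                "server_url": "https://mcp.stripe.com",
                "authorization": "{access_token}",  # your Stripe MCP access token
                "require_approval": "never",
            }
        ],
    }
}

response = requests.post(
    "https://api.openai.com/v1/realtime/client_secrets",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # contains the ephemeral client secret for your front end
```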
The OpenAI Realtime API is an excellent choice for building speech-to-speech AI integrations that require low-latency responses. However, working with the Realtime API alone locks you into OpenAI as a vendor. If you want its ergonomics but with an open-source, local, or closed model other than what OpenAI provides, start with the OpenAI Agents SDK for TypeScript.
```ts
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent);

// Automatically connects your microphone and audio output
await session.connect({
  apiKey: "<client-api-key>",
});
```
3. Build Voice Agents Using the Gemini Live API With MCP Tools
Similar to the OpenAI Realtime API, the Gemini Live API gives developers the toolkit to create video and audio AI applications with Gemini. The Live API provides real-time, speech-to-speech infrastructure for building low-latency voice experiences that go beyond traditional speech-to-text and text-to-speech pipelines.
Aside from low latency, the API also offers:
- Voice Activity Detection: A built-in feature that detects when a user starts or stops speaking while interacting with a voice agent.
 - Two Natural Audio Generation Architectures: Native Audio lets developers create realistic-sounding speech experiences for multilingual use cases, while a Half-Cascade Audio approach targets audio generation for production use cases.
 - A Playground: For prototyping and talking to Gemini Live.
 - Function Calling and Tool Use: For accessing external APIs and agent toolkits.
 
To go beyond the built-in tool use and function calling for extending a voice agent's capabilities, you can use MCP tools with the Gemini Live API via this open-source MCP client.
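Whichever client you use, the core job is the same: connect to an MCP server, list its tools, and surface them to the model as callable functions. The stripped-down sketch below uses the official MCP Python SDK; the weather_server.py command refers to the illustrative server sketched earlier in this article, and mapping the discovered tools into Gemini Live function declarations is left to the client library or your own glue code.

```python
# Sketch: discover tools on an MCP server with the official MCP Python SDK and
# call one of them. The server command and tool name here are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch a local MCP server over stdio (e.g., the weather server sketched earlier).
    server = StdioServerParameters(command="python", args=["weather_server.py"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List the tools the server exposes; these can be mapped to the
            # function/tool declarations your LLM or Live API session expects.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

            # Call a tool when the model asks for it.
            result = await session.call_tool("get_weather", {"city": "Amsterdam"})
            print(result.content)

asyncio.run(main())
```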
4. Use Amazon Nova Sonic With MCP Tools
Nova Sonic is a speech-to-speech foundation model for building multimodal and multilingual audio and video applications. With its MCP support, you can create speech-to-speech conversational AI apps that can easily access external in-depth information whenever needed. To experiment with the MCP feature of Nova Sonic, clone the sample-nova-sonic-mcp app from the AWS GitHub repository.
Like the OpenAI Realtime and Gemini Live APIs, Nova Sonic integrates seamlessly with Vision Agents. For the GitHub MCP demo we built earlier in this article, we can switch to Nova Sonic by swapping out the OpenAI model in the sample code and adding AWS API credentials.
Use MCP Servers With Traditional Voice Pipelines
OpenAI Realtime, Gemini Live, and Amazon Nova Sonic are all speech-to-speech AI solutions that provide developers with audio and text generation models, as well as voice interruption management and voice activity detection, to create speech-enabled AI services with MCP support.
In some voice AI use cases, you may want a platform that integrates MCP but uses custom, traditional speech-to-text and text-to-speech approaches. In those situations, you can use ElevenLabs MCP to build your voice AI services. Vapi also features an excellent MCP client that enables developers to create agents that integrate seamlessly with existing MCP servers, gaining context and access to external tools.
Creating Conversational Agents With MCP Tools: Security Considerations
When using MCP servers with voice-enabled AI services, safety and security must be considered carefully. Ensure that your API credentials and other sensitive information are never exposed or embedded directly in the source code. For production deployments, your team should serve the application over HTTPS. Additionally, code execution and tool calling through MCP servers must always be subject to robust security restrictions.
Final Thoughts on MCP for Agentic Voice AI
Enabling MCP for speech-enabled AI applications improves conversational user-AI interactions and experiences.
Depending on the use case, development teams can use platforms such as ElevenLabs, Deepgram, Kokoro, Assembly AI, and others to create speech-to-text, text-to-speech, and conversational speech-to-speech interfaces. For out-of-the-box MCP support, you can use the OpenAI Realtime and Gemini Live APIs and integrate them with your voice AI service. However, this approach locks you into specific models and AI providers.
For a seamless, integrated, and AI-provider-agnostic experience, a framework like Vision Agents is an excellent choice for your team's voice AI use cases. It lets developers plug in whichever state-of-the-art models and AI vendors they prefer. The framework has built-in MCP support and handles both traditional voice pipelines (STT/TTS) and real-time speech-to-speech (STS) approaches, giving developers more control over the speech architecture they implement, robust agent tracing, and more customization options.
 