ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there.
Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human.
With the ElevenLabs plugin for Vision Agents, integration is practically a one-liner: import the plugin, initialize it (with optional voice/model tweaks), and pass it to your agent. No extra glue code required.
In this demo, you’ll see how to create an agent that welcomes new participants the moment they join ("Hello, thank you so much for the welcome. I'm glad to be here. What's on the agenda today?"), cracks programming jokes when asked ("Why do Java developers wear glasses? Because they don't see sharp"), and keeps the back-and-forth feeling easy.
Follow along to add ElevenLabs TTS to any Vision Agents app.
What You’ll Build
- An agent that speaks with ultra-realistic, emotionally nuanced voices from ElevenLabs
- Instant customization of voice, model, or stability/style settings
- Event-triggered speech (e.g., auto-greet when someone joins a call)
- Text-to-speech that fits into your existing agent flow without breaking anything
The Stack
- TTS → ElevenLabs (plugin: vision-agents-plugins-elevenlabs)
- LLM → Any (OpenAI, Gemini, Grok, etc.)
- STT → Any (Deepgram, ElevenLabs Scribe, etc.)
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements
- ElevenLabs API key
- Stream API key & secret
- LLM/STT keys (as needed for your agent)
Step 1: Install the Plugin
```shell
uv add vision-agents
uv add "vision-agents[getstream,elevenlabs,smart-turn]"

# Add your LLM/STT plugins, e.g.:
uv add "vision-agents[gemini,deepgram]"
```
Step 2: Add ElevenLabs TTS
```python
import os

# Import and initialize (defaults are fine, or customize)
from vision_agents.plugins import elevenlabs

tts = elevenlabs.TTS(
    # Optional: loaded automatically from the ELEVENLABS_API_KEY env var if omitted
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    # Optional: pick a specific voice and model
    voice_id="EXAVITQu4vr4xnSDxMaL",   # e.g., Rachel
    model_id="eleven_multilingual_v2", # or "eleven_turbo_v2.5" for lower latency
)

# Pass it to your Agent; that's it
agent = Agent(
    # ... your other config ...
    tts=tts,  # one-line addition
    # ...
)
```
Step 3: React to Events (e.g., Greet on Join)
```python
@agent.subscribe("participant_joined")
async def greet_new_participant(event):
    await agent.say(
        "Hello! Thanks for joining the call. "
        "I'm glad you're here—what's on your mind today?"
    )
```
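Under the hood, `@agent.subscribe` follows the standard publish/subscribe pattern. Here is a toy stdlib sketch of the idea; this is not the actual Vision Agents implementation, and `EventBus` and its methods are illustrative only:

```python
import asyncio
from collections import defaultdict


class EventBus:
    """Toy pub/sub bus illustrating the @subscribe decorator pattern."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name):
        # Returns a decorator that registers the handler for this event
        def register(handler):
            self._handlers[event_name].append(handler)
            return handler
        return register

    async def emit(self, event_name, payload):
        # Await every handler registered for this event, in order
        for handler in self._handlers[event_name]:
            await handler(payload)


bus = EventBus()
greetings = []

@bus.subscribe("participant_joined")
async def greet(event):
    greetings.append(f"Hello, {event['name']}!")

asyncio.run(bus.emit("participant_joined", {"name": "Ada"}))
print(greetings)  # -> ['Hello, Ada!']
```

The real framework adds streaming and interruption handling on top, but the registration mechanics are the same: the decorator stores your coroutine, and the agent awaits it when the matching event fires.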
Step 4: Minimal Full Example to Run
```python
import os

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    llm = gemini.LLM("gemini-2.5-flash-lite")
    tts = elevenlabs.TTS(api_key=os.getenv("ELEVENLABS_API_KEY"))
    stt = deepgram.STT()

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),
        instructions="You are a friendly, witty assistant. Tell jokes when asked and greet warmly.",
        llm=llm,
        tts=tts,
        stt=stt,
    )

    @agent.subscribe("participant_joined")
    async def greet_new_participant(event):
        await agent.say(
            "Hello! Thanks for joining the call. "
            "I'm glad you're here—what's on your mind today?"
        )

    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    # Have the agent join the call/room
    async with agent.join(call):
        # Use agent.simple_response or...
        await agent.simple_response("tell me something interesting in a short sentence")
        # Run until the call ends
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(
        create_agent=create_agent,
        join_call=join_call,
    )).cli()
```
```shell
export ELEVENLABS_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export DEEPGRAM_API_KEY=...
export OPENAI_API_KEY=...  # set only the keys your chosen LLM/STT providers need

uv run main.py run
```
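If the agent fails to start, a missing key is the usual culprit. Here is a small stdlib preflight check you could run first; it is not part of Vision Agents, and the key list simply mirrors the exports above (trim it to match your providers):

```python
import os

# Keys used by this tutorial's stack; adjust for your LLM/STT choices.
REQUIRED_KEYS = [
    "ELEVENLABS_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "DEEPGRAM_API_KEY",
]


def missing_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]


# Example with a partial environment:
print(missing_keys({"ELEVENLABS_API_KEY": "sk-demo"}))
# -> ['STREAM_API_KEY', 'STREAM_API_SECRET', 'DEEPGRAM_API_KEY']
```

Call `missing_keys()` with no argument to check the real environment before launching the agent.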
Join the call in your browser, and listen as the agent speaks and responds to you in an ElevenLabs-created voice.
Why We Love This Integration
The ElevenLabs plugin is true plug-and-play. One import, one object, drop it in.
You get immediate access to dozens of voices, multilingual models, turbo variants for speed, and even voice cloning, all with almost zero additional code.
Vision Agents handles the streaming, interruption logic, and event system, so TTS just works naturally in any agent flow.
Open-source framework + proprietary TTS magic = best of both worlds.
Swap out voices, tweak models, and add your own events. Happy integrating! 🗣️