
ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code


Import once, tweak the voice or model, and watch your agent speak like a real person.

Stefan B.
Published February 24, 2026

ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there.

Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human.

And with the ElevenLabs plugin for Vision Agents, integration takes only a few lines: import the plugin, initialize it (optionally tweaking the voice or model), and pass it to your agent. No messing around with extra code.

In this demo, you’ll see how to create an agent that welcomes new participants the moment they join ("Hello, thank you so much for the welcome. I'm glad to be here. What's on the agenda today?"), cracks programming jokes when asked ("Why do Java developers wear glasses? Because they don't see sharp"), and keeps the back-and-forth feeling easy.

Follow along to add ElevenLabs TTS to any Vision Agents app.

What You’ll Build

  • An agent that speaks with ultra-realistic, emotionally nuanced voices from ElevenLabs
  • Instant customization of voice, model, or stability/style settings
  • Event-triggered speech (e.g., auto-greet when someone joins a call)
  • Text-to-speech that fits into your existing agent flow without breaking anything

The Stack

  • TTS → ElevenLabs (plugin: vision-agents-plugins-elevenlabs)
  • LLM → Any (OpenAI, Gemini, Grok, etc.)
  • STT → Any (Deepgram, ElevenLabs Scribe, etc.)
  • Turn Detection → Smart-Turn
  • Transport → Stream WebRTC
  • Framework → Vision Agents (open-source)

Requirements

  • ElevenLabs API key
  • Stream API key & secret
  • LLM/STT keys (as needed for your agent)
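
If you prefer not to export keys in your shell, a `.env` file works too; the full example later in this post imports `python-dotenv` for exactly this. A minimal sketch (variable names match the exports used later; values are placeholders):

```
# .env (placeholder values; loaded via python-dotenv)
ELEVENLABS_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
DEEPGRAM_API_KEY=...
```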

Step 1: Install the Plugin

```bash
uv add vision-agents
uv add "vision-agents[getstream,elevenlabs,smart-turn]"
# Add your LLM/STT plugins, e.g.:
uv add "vision-agents[gemini,deepgram]"
```

Step 2: Add ElevenLabs TTS

```python
# Import and initialize (defaults are fine, or customize)
import os

from vision_agents.plugins import elevenlabs

tts = elevenlabs.TTS(
    # Optional: loaded automatically from ELEVENLABS_API_KEY if omitted
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    # Optional: pick a specific voice and model
    voice_id="EXAVITQu4vr4xnSDxMaL",  # e.g., Rachel
    model_id="eleven_multilingual_v2",  # or "eleven_turbo_v2.5" for lower latency
)

# Pass it to your Agent, and that's it
agent = Agent(
    # ... your other config ...
    tts=tts,  # ← the one-line addition
    # ...
)
```

Step 3: React to Events (e.g., Greet on Join)

```python
@agent.subscribe("participant_joined")
async def greet_new_participant(event):
    await agent.say(
        "Hello! Thanks for joining the call. "
        "I'm glad you're here—what's on your mind today?"
    )
```
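
If you're curious how a subscribe-style API like this typically works under the hood, here's a minimal, self-contained sketch. This is an illustration of the general pattern, not the Vision Agents internals: handlers register under an event name, and each one is awaited when that event fires.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """Toy event bus illustrating the decorator-based subscribe pattern."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str) -> Callable[[Handler], Handler]:
        def decorator(handler: Handler) -> Handler:
            self._handlers[event_name].append(handler)
            return handler  # return unchanged so the function stays usable
        return decorator

    async def emit(self, event_name: str, payload: Any = None) -> None:
        # Await each registered handler in registration order
        for handler in self._handlers[event_name]:
            await handler(payload)
```

Registering a greeting handler then looks just like the snippet above: decorate an async function with `bus.subscribe("participant_joined")` and call `emit` when a participant arrives.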

Step 4: Minimal Full Example to Run

```python
import os

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import (
    deepgram,
    elevenlabs,
    gemini,
    getstream,
)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    llm = gemini.LLM("gemini-2.5-flash-lite")
    tts = elevenlabs.TTS(api_key=os.getenv("ELEVENLABS_API_KEY"))
    stt = deepgram.STT()

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),
        instructions="You are a friendly, witty assistant. Tell jokes when asked and greet warmly.",
        llm=llm,
        tts=tts,
        stt=stt,
    )

    @agent.subscribe("participant_joined")
    async def greet_new_participant(event):
        await agent.say(
            "Hello! Thanks for joining the call. "
            "I'm glad you're here—what's on your mind today?"
        )

    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)

    # Have the agent join the call/room
    async with agent.join(call):
        # Use agent.simple_response or wait for participants to speak
        await agent.simple_response("tell me something interesting in a short sentence")
        # Run until the call ends
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(
        create_agent=create_agent,
        join_call=join_call,
    )).cli()
```
Set your API keys and run the app:

```bash
export ELEVENLABS_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export DEEPGRAM_API_KEY=...
export OPENAI_API_KEY=...

uv run main.py run
```

Join the call in your browser, and listen as the agent speaks and responds to you in an ElevenLabs-created voice.

Why We Love This Integration

The ElevenLabs plugin is true plug-and-play. One import, one object, drop it in.

You get immediate access to dozens of voices, multilingual models, turbo variants for speed, and even voice cloning, all with almost zero additional code.

Vision Agents handles the streaming, interruption logic, and event system, so TTS just works naturally in any agent flow.

Open-source framework + proprietary TTS magic = best of both worlds.

Swap out voices, tweak models, and add your own events. Happy integrating! 🗣️
