ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there.
Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human.
With the ElevenLabs plugin for Vision Agents, integration is practically a one-liner: import the plugin, initialize it (with optional voice/model tweaks), and pass it to your agent. No extra glue code required.
In this demo, you’ll see how to create an agent that welcomes new participants the moment they join ("Hello, thank you so much for the welcome. I'm glad to be here. What's on the agenda today?"), cracks programming jokes when asked ("Why do Java developers wear glasses? Because they don't see sharp"), and keeps the back-and-forth feeling easy.
Follow along to add ElevenLabs TTS to any Vision Agents app.
What You’ll Build
- An agent that speaks with ultra-realistic, emotionally nuanced voices from ElevenLabs
- Instant customization of voice, model, or stability/style settings
- Event-triggered speech (e.g., auto-greet when someone joins a call)
- Text-to-speech that fits into your existing agent flow without breaking anything
The Stack
- TTS → ElevenLabs (plugin: vision-agents-plugins-elevenlabs)
- LLM → Any (OpenAI, Gemini, Grok, etc.)
- STT → Any (Deepgram, ElevenLabs Scribe, etc.)
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements
- ElevenLabs API key
- Stream API key & secret
- LLM/STT keys (as needed for your agent)
Step 1: Install the Plugin
```shell
uv add vision-agents
uv add "vision-agents[getstream,elevenlabs,smart-turn]"

# Add your LLM/STT plugins, e.g.:
uv add "vision-agents[gemini,deepgram]"
```
Step 2: Add ElevenLabs TTS
```python
import os

# Import and initialize (defaults are fine, or customize)
from vision_agents.plugins import elevenlabs

tts = elevenlabs.TTS(
    # Optional: loaded automatically from the ELEVENLABS_API_KEY env var if omitted
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    # Optional: pick a specific voice and model
    voice_id="EXAVITQu4vr4xnSDxMaL",   # e.g., Rachel
    model_id="eleven_multilingual_v2", # or "eleven_turbo_v2.5" for lower latency
)

# Pass it to your Agent; that's it
agent = Agent(
    # ... your other config ...
    tts=tts,  # one-line addition
    # ...
)
```
Step 3: React to Events (e.g., Greet on Join)
```python
@agent.subscribe("participant_joined")
async def greet_new_participant(event):
    await agent.say(
        "Hello! Thanks for joining the call. "
        "I'm glad you're here—what's on your mind today?"
    )
```
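Under the hood, `@agent.subscribe` follows the standard publish/subscribe pattern. Here is a toy stdlib sketch of the idea; this is not the actual Vision Agents implementation, and `EventBus` and its methods are illustrative only:

```python
import asyncio
from collections import defaultdict


class EventBus:
    """Toy pub/sub bus illustrating the @subscribe decorator pattern."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name):
        # Returns a decorator that registers the handler for this event
        def register(handler):
            self._handlers[event_name].append(handler)
            return handler
        return register

    async def emit(self, event_name, payload):
        # Await every handler registered for this event, in order
        for handler in self._handlers[event_name]:
            await handler(payload)


bus = EventBus()
greetings = []

@bus.subscribe("participant_joined")
async def greet(event):
    greetings.append(f"Hello, {event['name']}!")

asyncio.run(bus.emit("participant_joined", {"name": "Ada"}))
print(greetings)  # -> ['Hello, Ada!']
```

The real framework adds streaming and interruption handling on top, but the registration mechanics are the same: the decorator stores your coroutine, and the agent awaits it when the matching event fires.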
Step 4: Minimal Full Example to Run
```python
import os

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    llm = gemini.LLM("gemini-2.5-flash-lite")
    tts = elevenlabs.TTS(api_key=os.getenv("ELEVENLABS_API_KEY"))
    stt = deepgram.STT()

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),
        instructions="You are a friendly, witty assistant. Tell jokes when asked and greet warmly.",
        llm=llm,
        tts=tts,
        stt=stt,
    )

    @agent.subscribe("participant_joined")
    async def greet_new_participant(event):
        await agent.say(
            "Hello! Thanks for joining the call. "
            "I'm glad you're here—what's on your mind today?"
        )

    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    # Have the agent join the call/room
    async with agent.join(call):
        # Use agent.simple_response or...
        await agent.simple_response("tell me something interesting in a short sentence")
        # Run until the call ends
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(
        create_agent=create_agent,
        join_call=join_call,
    )).cli()
```
```shell
export ELEVENLABS_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export DEEPGRAM_API_KEY=...
export OPENAI_API_KEY=...  # set only the keys your chosen LLM/STT providers need

uv run main.py run
```
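If the agent fails to start, a missing key is the usual culprit. Here is a small stdlib preflight check you could run first; it is not part of Vision Agents, and the key list simply mirrors the exports above (trim it to match your providers):

```python
import os

# Keys used by this tutorial's stack; adjust for your LLM/STT choices.
REQUIRED_KEYS = [
    "ELEVENLABS_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "DEEPGRAM_API_KEY",
]


def missing_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]


# Example with a partial environment:
print(missing_keys({"ELEVENLABS_API_KEY": "sk-demo"}))
# -> ['STREAM_API_KEY', 'STREAM_API_SECRET', 'DEEPGRAM_API_KEY']
```

Call `missing_keys()` with no argument to check the real environment before launching the agent.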
Join the call in your browser, and listen as the agent speaks and responds to you in an ElevenLabs-created voice.
Why We Love This Integration
The ElevenLabs plugin is true plug-and-play. One import, one object, drop it in.
You get immediate access to dozens of voices, multilingual models, turbo variants for speed, and even voice cloning, all with almost zero additional code.
Vision Agents handles the streaming, interruption logic, and event system, so TTS just works naturally in any agent flow.
Open-source framework + proprietary TTS magic = best of both worlds.
Swap out voices, tweak models, and add your own events. Happy integrating! 🗣️