Realistic text-to-speech was one of the hardest parts of building voice agents.
Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping.
Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from short audio samples.
The Cartesia plugin for Vision Agents makes the switch even easier: a couple of imports, one object instantiation, and your agent speaks with noticeably more human intonation and timing.
In this demo, we’ll walk you through how to build an agent that responds conversationally using Sonic 3's emotionally nuanced voice.
Follow along below.
What You’ll Build
- A voice agent that speaks with realistic, low-latency, emotionally rich voices powered by Cartesia Sonic 3
- An agent that handles questions about its own features ("If you have a question about Cartesia Sonic 3, you can ask me specific questions about its features, capabilities, limitations, settings, or troubleshooting.")
- Instant customization of model, voice, sample rate, or even cloned voices
- A text-to-speech integration that works in any agent pipeline for natural-sounding conversations
The Stack
- TTS → Cartesia Sonic 3 (plugin: vision-agents-plugins-cartesia)
- LLM → Any (example uses Gemini)
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements
- Cartesia API key
- Stream API key & secret
- Gemini API key (or your LLM)
- Deepgram API key
Step 1: Install the Plugin
```
# Installation
uv add vision-agents
uv add "vision-agents[getstream,cartesia,deepgram,smart-turn,gemini]"
```
Step 2: Initialize the Plugin
```python
from vision_agents.plugins import cartesia

tts = cartesia.TTS()
```
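The constructor also accepts overrides for the options this post mentions (model, voice, sample rate). A sketch, assuming keyword names like `voice_id` and `sample_rate` — check the plugin docs for the exact signature:

```python
from vision_agents.plugins import cartesia

# model_id="sonic-3" is used in the full example in Step 3; voice_id and
# sample_rate are assumed parameter names for voice/audio customization.
tts = cartesia.TTS(
    model_id="sonic-3",
    voice_id="your-voice-or-clone-id",  # hypothetical placeholder
    sample_rate=44100,
)
```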
Step 3: Run a Minimal Full Example
```python
import logging

from dotenv import load_dotenv

from vision_agents.core import Runner
from vision_agents.core.agents import Agent, AgentLauncher
from vision_agents.core.edge.types import User
from vision_agents.plugins import cartesia, getstream, gemini, deepgram

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    # Create agent with TTS
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="TTS Bot", id="agent"),
        instructions="I'm a TTS bot that greets users when they join.",
        stt=deepgram.STT(),
        llm=gemini.LLM("gemini-2.0-flash"),
        tts=cartesia.TTS(model_id="sonic-3"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    # ensure the agent user is created
    await agent.create_user()
    # Create a call
    call = await agent.create_call(call_type, call_id)
    # Join call and wait
    async with agent.join(call):
        await agent.simple_response("tell me something interesting in a short sentence")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
```
# In terminal:
export CARTESIA_API_KEY=...
export GEMINI_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai python main.py
```
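A missing key is the most common reason the agent fails to start, so a quick pre-flight check can save a debugging round trip. A minimal sketch (the helper and its name are illustrative, not part of the plugin):

```python
# Pre-flight check for the environment variables listed above.
REQUIRED = [
    "CARTESIA_API_KEY",
    "GEMINI_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]


def missing_keys(env):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


# Example: everything set except the Stream secret.
example_env = {name: "dummy" for name in REQUIRED if name != "STREAM_API_SECRET"}
print(missing_keys(example_env))  # → ['STREAM_API_SECRET']
```

In practice you would call `missing_keys(os.environ)` at the top of `main.py` and exit with a clear message if the list is non-empty.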
When you join the call in your browser, you can speak to the agent and hear Sonic 3 respond with realism and speed.
Why We Love This Integration
Cartesia Sonic 3 gives your agents noticeably faster speech onset and more human-like rhythm than many alternatives, which matters a lot when users are waiting for replies.
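To put the sub-200 ms figure in context, here is a rough per-turn latency budget; every number below is an illustrative assumption for the sketch, not a benchmark:

```python
# Illustrative time-to-first-audio budget for one voice-agent turn.
# All figures are assumptions, not measurements.
budget_ms = {
    "turn_detection": 150,   # Smart-Turn deciding the user stopped speaking
    "stt_finalize": 100,     # Deepgram emitting the final transcript
    "llm_first_token": 350,  # Gemini streaming the first reply tokens
    "tts_first_chunk": 200,  # Sonic 3's first audio chunk (sub-200 ms claim)
}

total = sum(budget_ms.values())
print(f"time to first audio: {total} ms")
for stage, ms in budget_ms.items():
    print(f"  {stage:>15}: {ms} ms ({ms / total:.0%})")
```

Under these assumptions TTS is a quarter of the total wait, so cutting first-chunk latency from, say, 500 ms to under 200 ms trims a noticeable slice off the pause before the agent's voice starts.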
The plugin keeps the integration lightweight so you can focus on prompt engineering and agent logic instead of audio buffering or chunk handling.
We use it when we want low-latency TTS that still sounds expressive and consistent across long turns.
Links & Resources
Try it out! Clone a voice, tweak the model, and see for yourself how lifelike it sounds. 🎧