
Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents


With Cartesia Sonic 3, you get expressive, low-latency TTS that handles multilingual text and voice cloning.

Amos G.
Published February 26, 2026

Realistic text-to-speech has long been one of the hardest parts of building voice agents.

Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping.

Cartesia Sonic 3 changes that equation. Released late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from short audio samples.
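If you want to verify first-chunk latency in your own pipeline, the measurement pattern is simple: time from request to the first audio chunk, then drain the rest of the stream. The sketch below uses a fake streaming generator as a stand-in for a real TTS API (nothing here is Cartesia-specific), so you can drop a real client in later and keep the same timing logic.

```python
import time
from typing import Iterator


def fake_tts_stream(text: str) -> Iterator[bytes]:
    """Stand-in for a streaming TTS API: yields audio chunks after a delay."""
    for _ in range(3):
        time.sleep(0.05)  # simulated network/synthesis delay per chunk
        yield b"\x00" * 320  # placeholder PCM chunk


def time_to_first_chunk(stream: Iterator[bytes]) -> tuple[float, list[bytes]]:
    """Measure first-chunk latency, then drain the remaining chunks."""
    start = time.monotonic()
    first = next(stream)  # blocks until the first chunk arrives
    latency = time.monotonic() - start
    chunks = [first, *stream]
    return latency, chunks


latency, chunks = time_to_first_chunk(fake_tts_stream("hello"))
print(f"first chunk after {latency * 1000:.0f} ms, {len(chunks)} chunks total")
```

With a real provider, the `latency` value is the number to watch: it is what the user perceives as the agent "starting to speak."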

The Cartesia plugin for Vision Agents makes the switch even easier: a couple of imports, one object instantiation, and your agent speaks with noticeably more human intonation and timing.

In this demo, we’ll walk you through how to build an agent that responds conversationally using Sonic 3's emotionally nuanced voice.

Follow along below.

What You’ll Build

  • A voice agent that speaks with realistic, low-latency, emotionally rich voices powered by Cartesia Sonic 3
  • An agent that handles questions about its own features ("If you have a question about Cartesia Sonic 3, you can ask me specific questions about its features, capabilities, limitations, settings, or troubleshooting.")
  • Instant customization of model, voice, sample rate, or even cloned voices
  • A text-to-speech integration that works in any agent pipeline for natural-sounding conversations

The Stack

Requirements

  • Cartesia API key
  • Stream API key & secret
  • Gemini API key (or your LLM)
  • Deepgram API key

Step 1: Install the Plugin

```shell
# Installation
uv add vision-agents
uv add "vision-agents[getstream,cartesia,deepgram,smart-turn,gemini]"
```

Step 2: Initialize the Plugin

```python
from vision_agents.plugins import cartesia

tts = cartesia.TTS()
```

Step 3: Run a Minimal Full Example

```python
import logging

from dotenv import load_dotenv

from vision_agents.core import Runner
from vision_agents.core.agents import Agent, AgentLauncher
from vision_agents.core.edge.types import User
from vision_agents.plugins import cartesia, getstream, gemini, deepgram

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    # Create agent with TTS
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="TTS Bot", id="agent"),
        instructions="I'm a TTS bot that greets users when they join.",
        stt=deepgram.STT(),
        llm=gemini.LLM("gemini-2.0-flash"),
        tts=cartesia.TTS(model_id="sonic-3"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    # Ensure the agent user is created
    await agent.create_user()

    # Create a call
    call = await agent.create_call(call_type, call_id)

    # Join the call and wait
    async with agent.join(call):
        await agent.simple_response("tell me something interesting in a short sentence")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
```bash
# In terminal:
export CARTESIA_API_KEY=...
export GEMINI_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai python main.py
```
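Since the example calls `load_dotenv()`, you can also keep these keys in a `.env` file next to `main.py` instead of exporting them in every shell session. A minimal layout, mirroring the variables above, would be:

```shell
# .env — loaded by load_dotenv() in main.py
CARTESIA_API_KEY=...
GEMINI_API_KEY=...
DEEPGRAM_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
```

Remember to add `.env` to `.gitignore` so the keys never land in version control.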

When you join the call in your browser, you can speak to the agent and hear Sonic 3 respond with realism and speed.

Why We Love This Integration

Cartesia Sonic 3 gives your agents noticeably faster speech onset and more human-like rhythm than many alternatives, which matters a lot when users are waiting for replies.

The plugin keeps the integration lightweight so you can focus on prompt engineering and agent logic instead of audio buffering or chunk handling.

We use it when we want low-latency TTS that still sounds expressive and consistent across long turns.

Try it out! Clone a voice, tweak the model, and see for yourself how lifelike it sounds. 🎧
