Realistic text-to-speech was one of the hardest parts of building voice agents.
Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping.
Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from short audio samples.
The Cartesia plugin for Vision Agents makes the switch even easier: a couple of imports, one object instantiation, and your agent speaks with noticeably more human intonation and timing.
In this demo, we’ll walk you through how to build an agent that responds conversationally using Sonic 3's emotionally nuanced voice.
Follow along below.
What You’ll Build
- A voice agent that speaks with realistic, low-latency, emotionally rich voices powered by Cartesia Sonic 3
- An agent that handles questions about its own features ("If you have a question about Cartesia Sonic 3, you can ask me specific questions about its features, capabilities, limitations, settings, or troubleshooting.")
- Instant customization of model, voice, sample rate, or even cloned voices
- A text-to-speech integration that works in any agent pipeline for natural-sounding conversations
The Stack
- TTS → Cartesia Sonic 3 (plugin: vision-agents-plugins-cartesia)
- LLM → Any (example uses Gemini)
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements
- Cartesia API key
- Stream API key & secret
- Gemini API key (or your LLM)
- Deepgram API key
Step 1: Install the Plugin
```
# Installation
uv add vision-agents
uv add "vision-agents[getstream,cartesia,deepgram,smart-turn,gemini]"
```
Step 2: Initialize the Plugin
```python
from vision_agents.plugins import cartesia

tts = cartesia.TTS()
```
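The constructor also accepts overrides for the options this post mentions (model, voice, sample rate). A sketch, assuming keyword names like `voice_id` and `sample_rate` — check the plugin docs for the exact signature:

```python
from vision_agents.plugins import cartesia

# model_id="sonic-3" is used in the full example in Step 3; voice_id and
# sample_rate are assumed parameter names for voice/audio customization.
tts = cartesia.TTS(
    model_id="sonic-3",
    voice_id="your-voice-or-clone-id",  # hypothetical placeholder
    sample_rate=44100,
)
```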
Step 3: Run a Minimal Full Example
```python
import logging

from dotenv import load_dotenv

from vision_agents.core import Runner
from vision_agents.core.agents import Agent, AgentLauncher
from vision_agents.core.edge.types import User
from vision_agents.plugins import cartesia, getstream, gemini, deepgram

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    # Create agent with TTS
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="TTS Bot", id="agent"),
        instructions="I'm a TTS bot that greets users when they join.",
        stt=deepgram.STT(),
        llm=gemini.LLM("gemini-2.0-flash"),
        tts=cartesia.TTS(model_id="sonic-3"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    # ensure the agent user is created
    await agent.create_user()
    # Create a call
    call = await agent.create_call(call_type, call_id)
    # Join call and wait
    async with agent.join(call):
        await agent.simple_response("tell me something interesting in a short sentence")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
```
# In terminal:
export CARTESIA_API_KEY=...
export GEMINI_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai python main.py
```
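A missing key is the most common reason the agent fails to start, so a quick pre-flight check can save a debugging round trip. A minimal sketch (the helper and its name are illustrative, not part of the plugin):

```python
# Pre-flight check for the environment variables listed above.
REQUIRED = [
    "CARTESIA_API_KEY",
    "GEMINI_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]


def missing_keys(env):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


# Example: everything set except the Stream secret.
example_env = {name: "dummy" for name in REQUIRED if name != "STREAM_API_SECRET"}
print(missing_keys(example_env))  # → ['STREAM_API_SECRET']
```

In practice you would call `missing_keys(os.environ)` at the top of `main.py` and exit with a clear message if the list is non-empty.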
When you join the call in your browser, you can speak to the agent and hear Sonic 3 respond with realism and speed.
Why We Love This Integration
Cartesia Sonic 3 gives your agents noticeably faster speech onset and more human-like rhythm than many alternatives, which matters a lot when users are waiting for replies.
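To put the sub-200 ms figure in context, here is a rough per-turn latency budget; every number below is an illustrative assumption for the sketch, not a benchmark:

```python
# Illustrative time-to-first-audio budget for one voice-agent turn.
# All figures are assumptions, not measurements.
budget_ms = {
    "turn_detection": 150,   # Smart-Turn deciding the user stopped speaking
    "stt_finalize": 100,     # Deepgram emitting the final transcript
    "llm_first_token": 350,  # Gemini streaming the first reply tokens
    "tts_first_chunk": 200,  # Sonic 3's first audio chunk (sub-200 ms claim)
}

total = sum(budget_ms.values())
print(f"time to first audio: {total} ms")
for stage, ms in budget_ms.items():
    print(f"  {stage:>15}: {ms} ms ({ms / total:.0%})")
```

Under these assumptions TTS is a quarter of the total wait, so cutting first-chunk latency from, say, 500 ms to under 200 ms trims a noticeable slice off the pause before the agent's voice starts.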
The plugin keeps the integration lightweight so you can focus on prompt engineering and agent logic instead of audio buffering or chunk handling.
We use it when we want low-latency TTS that still sounds expressive and consistent across long turns.
Links & Resources
Try it out! Clone a voice, tweak the model, and see for yourself how lifelike it sounds. 🎧