
The 6 Best On-Device TTS Models for Voice AI


There are several text-to-speech models for building voice agents, but which ones can you run privately, locally, and on-device? Let’s find out.

Amos G.
Published April 13, 2026
On-device TTS models

When building voice AI applications, you have industry-leading cloud options for text-to-speech, such as Cartesia Sonic 3 and Grok TTS. For privacy and to avoid sharing your business’s data with these commercial text-to-speech (TTS) providers, your team may want to use free, open-source solutions that run locally on mobile and desktop devices.

Continue reading to discover six lesser-known but capable TTS models: high-quality, secure, small, and fast options for building human-like speech experiences.

What is a Speech Synthesis Model?

Speech Synthesis Model

Any AI model capable of converting input text into audible spoken audio is called a speech synthesis, or TTS, model. Speech synthesis is the component of a voice AI system that lets agents talk back to users in realtime, without relying on pre-recorded audio. In most speech-enabled applications, it is used with its companion speech-to-text component and an LLM to create a unified voice pipeline, as shown in the diagram above.
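That unified voice pipeline can be sketched as three stages chained per conversational turn. The component functions below are illustrative stand-ins, not real provider APIs:

```python
import asyncio

# Hypothetical component interfaces -- frameworks such as Vision Agents
# wire these up for you; this only illustrates the data flow.
async def transcribe(audio_chunk: bytes) -> str:
    return "what is the weather?"          # STT: audio -> text

async def generate_reply(prompt: str) -> str:
    return f"You asked: {prompt}"          # LLM: text -> text

async def synthesize(text: str) -> bytes:
    return text.encode("utf-8")            # TTS: text -> audio bytes

async def voice_pipeline(audio_chunk: bytes) -> bytes:
    """One turn of the STT -> LLM -> TTS loop."""
    text = await transcribe(audio_chunk)
    reply = await generate_reply(text)
    return await synthesize(reply)

audio_out = asyncio.run(voice_pipeline(b"\x00\x01"))
# audio_out == b"You asked: what is the weather?"
```

In a real agent, each stage streams incrementally rather than returning whole values, which is what keeps end-to-end latency low.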

Choosing a Text-to-Speech Model: What To Consider

There are many reasons developers prefer testing and prototyping with TTS models that do not require renting a GPU or running on a cloud AI inference provider. Here are the key considerations when choosing a TTS model for on-device use.

  • Naturalness: Can the model produce realistic, human-sounding speech? None of the options covered in this article produces awkward or robotic voice responses.
  • Private and Secure: Is the data for the voice service, or any of its components, shared with the underlying model providers for training or other purposes?
  • Lightweight and Fast: Response latency is critical when choosing a model for speech synthesis, and using a small model does not always guarantee lower latency. If this information cannot be found on the model card, the characteristics can be tested and verified at runtime. For smaller models, try Kitten TTS, which comprises models ranging from 15M to 80M parameters (25 - 80 MB on disk). Although it is tiny, it delivers high-quality voice synthesis on the CPU without requiring a GPU.
  • Multi-Language Support: When creating a voice service that supports English, a model like Pocket TTS is a great choice. However, it supports only one language. If you are building an app to be used in different locales and support multiple languages, you should consider using VibeVoice.
  • Customization: TTS models often come with built-in voices that can be used out of the box. These are useful for quick demos and testing purposes. However, not all available solutions, for example, allow customization of voice design or cloning. To ensure a custom and expressive speech-generation experience, you should choose a model that supports voice cloning and design. An excellent pick in this category is Qwen3-TTS.
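Since latency claims are worth verifying on your own hardware, here is a minimal sketch for measuring time-to-first-audio from any streaming TTS callable. `fake_stream` is a stand-in that simulates a model with roughly 50 ms first-chunk latency:

```python
import time

def first_chunk_latency(stream_fn, text: str) -> float:
    """Return seconds until the first audio chunk arrives from a
    streaming TTS callable that yields audio chunks."""
    start = time.perf_counter()
    for _chunk in stream_fn(text):
        return time.perf_counter() - start  # stop at the first chunk
    raise RuntimeError("stream produced no audio")

# Stand-in generator simulating a model with ~50 ms first-chunk latency.
def fake_stream(text):
    time.sleep(0.05)
    yield b"\x00" * 320

latency = first_chunk_latency(fake_stream, "Hello there")
print(f"time to first audio: {latency * 1000:.0f} ms")
```

Swap `fake_stream` for your model's streaming entry point and run the measurement a few times, since the first call often includes one-off model-loading cost.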

The Top 6 On-Device and Open Source Text-to-Speech Models for Voice AI

Top 6 On-Device and Open Source Text-to-Speech Models

Cartesia, Deepgram, and ElevenLabs offer some of the best commercial speech-generation models for building AI services. However, developers can only use these APIs at a cost or with a subscription. Let’s look at free, local, and open-source alternatives to these models to create any voice experience.

A good place to experiment with different TTS models is the text-to-speech filter under the Hugging Face models category. After testing several of these models, here are the six most suitable options you can use for your projects and run locally on consumer laptops, Raspberry Pi, mobile, and wearable devices. They were picked based on quality, parameter count, ease of use, language support, customization options, and more.

1. VibeVoice: Build Long-Form and Multi-Speaker Conversational Audio Apps

VibeVoice

VibeVoice is a family of open-source TTS models designed specifically for multi-speaker and long-form conversational audio generation. One of the best use cases for VibeVoice TTS is an AI podcast, with support for one to four speakers and up to 90 minutes of generated speech.

VibeVoice also includes an automatic speech recognition model that can be seamlessly used with the TTS counterpart. Here is a demo of VibeVoice TTS integration with Vision Agents using the VibeVoice-1.5B model.

This model has a 64K context length and ~90 min generation length. Refer to its technical report to read more.

Characteristics of VibeVoice

Since the beginning of 2026, new TTS model releases and updates (open-source and commercial) have appeared on X every week. However, choosing the right audio-generation model for an agent can be daunting due to factors such as quality, latency, pricing, and more.

  • Quality and Performance: VibeVoice delivers some of the highest quality among the open-source on-device TTS models highlighted in this article. Based on our testing, it performs on par with Gemini 2.5 Pro TTS and Cartesia Sonic 3.
  • Latency: It has ~300 ms of first-audible latency (VibeVoice-Realtime-0.5B)
  • Available Voices and Styles: You can use it to build agents with natural emotional nuances and spontaneous reactions. It also has singing capabilities.
  • Price: VibeVoice is free of charge to use. Refer to its licence statement on GitHub to learn more.
  • Available Languages: It supports English as the primary language and nine additional languages (experimental).

VibeVoice: Quick Start in Vision Agents

All the demos in this article demonstrate the integration of the on-device TTS models with Vision Agents. It is a free and open-source project that provides developers with all the building blocks for voice, video, and vision AI. With the exception of Pocket TTS, which integrates with Vision Agents by default, the other TTS models (VibeVoice, Qwen3-TTS, Neu TTS, TADA TTS, and Kitten TTS) can be added to the framework as custom Python plugins.

Let’s go through the getting-started guide for VibeVoice to work with Vision Agents. We will then show only Vision Agents-powered demos for the other models since their installations and configurations are similar to VibeVoice.

Initial Setup

Begin by initializing a new Python project with uv, installing Vision Agents and the required AI services, and setting up your environment variables.

```bash
uv init
uv venv .venv && source .venv/bin/activate
uv add vision-agents
uv add "vision-agents[getstream, gemini, deepgram, smart-turn]"

# .env
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
OPENAI_API_KEY=...
DEEPGRAM_API_KEY=...
```

In the VibeVoice example, we use:

  • STREAM_API_KEY and STREAM_API_SECRET: To access Stream edge for low latency audio and video transport.
  • OPENAI_API_KEY: To access an OpenAI model.
  • DEEPGRAM_API_KEY: For speech-to-text.

Note: You can swap any of the AI providers with the one you prefer.

What is missing from the above is the TTS service. For that, you can use any of the on-device TTS models discussed in this article. The VibeVoice plugin, example demos, and step-by-step integration with Vision Agents can be explored on GitHub. To bring other AI services to Vision Agents, refer to the custom plugins integration guide.

Podcast-Style VibeVoice Demo With Vision Agents

The following code snippet creates a podcast-style conversational audio app using the VibeVoice plugin and the API credentials for the AI services highlighted in the previous section.

```python
"""Podcast with Background Music — VibeVoice TTS + Vision Agents

Demonstrates long-form, multi-turn podcast-style dialogue where the LLM
plays the role of a podcast host. VibeVoice synthesizes expressive,
conversational speech. Deepgram STT captures the user's spoken input so
the conversation flows naturally.

Prerequisites
─────────────
1. Start the VibeVoice server:
   python demo/vibevoice_realtime_demo.py \
       --model_path microsoft/VibeVoice-Realtime-0.5B --port 3000
2. Set environment variables (or use a .env file):
   STREAM_API_KEY, STREAM_API_SECRET
   OPENAI_API_KEY
   DEEPGRAM_API_KEY
   VIBEVOICE_BASE_URL (defaults to http://localhost:3000)

Usage:
    uv run --extra examples python examples/podcast_with_background_music.py run
"""

from dotenv import load_dotenv

load_dotenv()

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, openai, deepgram
from vision_agents.plugins import vibevoice

INSTRUCTIONS = """\
You are "The Deep Dive", a charismatic and curious podcast host known for
making complex topics feel like a fascinating conversation over coffee.

Podcast format guidelines:
- Open each session with a warm, energetic greeting and a teaser of what
  you'll explore: "Hey everyone, welcome back to The Deep Dive! Today
  we're diving into something truly mind-bending…"
- Ask the guest (the user) thoughtful follow-up questions that reveal depth.
- Use narrative bridges: "That's a great point, and it reminds me of…"
- Summarize key insights periodically: "So what I'm hearing is…"
- Wrap segments with a hook: "Coming up next, we'll tackle…"
- Keep individual responses to 4–6 sentences — enough for a natural
  podcast rhythm without monologuing.
- Maintain a conversational, warm tone throughout. Occasionally express
  genuine excitement: "Oh, I love that!"

You are speaking to ONE guest at a time. Make them feel like the most
interesting person in the room.
"""


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="The Deep Dive", id="agent"),
        instructions=INSTRUCTIONS,
        stt=deepgram.STT(),
        tts=vibevoice.TTS(
            voice="en-Carter_man",
            cfg_scale=1.5,
        ),
        llm=openai.ChatCompletionsLLM(model="gpt-4o"),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response(
            "Hey everyone, welcome back to The Deep Dive! "
            "I'm so excited about today's conversation."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

Running this sample code starts the agent, which joins the call and opens with the podcast greeting.

Limitations of VibeVoice

VibeVoice is well-suited for building a production voice-generation application, but it also has practical limitations.

  • Server Lock (Single Request): The VibeVoice server processes one WebSocket synthesis request at a time. If you talk while the agent is speaking, the ongoing request must be canceled before a new one begins. These back-and-forth exchanges may result in a service busy WebSocket close (code 1013).
  • Inaccurate Singing Feature: The spontaneous singing feature uses markers and lyrical text to suggest singing, but it doesn't actually produce it. The singing output is stylized speech rather than karaoke-grade vocals or a pitch-perfect melody.
  • Additional Configuration for Inference Server: Running any VibeVoice model requires additional local server setup. Follow the README.md of the VibeVoice Vision Agents plugin on GitHub to get the server up and running.
  • VibeVoice Realtime-0.5B is Single-Speaker Only: It supports only one voice at a time. Multi-speaker conversations (up to 4 speakers) require VibeVoice-1.5B. However, the inference code of this model has been removed from the public GitHub repository at the time of writing this article.
  • First-Chunk Latency Depends on Hardware: The ~300 ms first-chunk latency cited in the VibeVoice paper assumes an NVIDIA T4 GPU. On Apple Silicon, you should expect 200 – 500 ms latency. On older CPUs, latency can exceed several seconds.
  • Multilingual Support is Experimental: The non-English voices (de-, fr-, jp-, etc.) work but may show reduced expressiveness compared to the English voices.
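The single-request limitation can be softened on the client side with a short retry-and-backoff loop. This is a sketch under assumptions: `ServiceBusy` is a stand-in for whatever exception your client raises on a 1013 close, and `flaky_synth` simulates a server that is busy exactly once:

```python
import time

class ServiceBusy(Exception):
    """Stand-in for a WebSocket close with code 1013 (try again later)."""

def synthesize_with_retry(synth, text, retries=3, backoff=0.05):
    """Call a TTS endpoint, retrying with exponential backoff when busy."""
    for attempt in range(retries):
        try:
            return synth(text)
        except ServiceBusy:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)

# Demo: a server that is busy once, then succeeds.
calls = {"n": 0}
def flaky_synth(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ServiceBusy()
    return b"audio:" + text.encode()

result = synthesize_with_retry(flaky_synth, "hello")
# result == b"audio:hello" after one retry
```

In practice you would also cancel the in-flight request before retrying, as the server processes only one synthesis at a time.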

2. Qwen3-TTS: Clone, Design, and Generate AI Voices

Qwen3-TTS

Qwen3-TTS consists of a family of speech models for voice generation, cloning, and design.

  • Design AI Voices: Create AI voices from user-provided descriptions for business and enterprise applications with Qwen3-TTS-12Hz-1.7B-VoiceDesign. This model supports ten languages.
  • Generate Custom Voices: Use Qwen3-TTS-12Hz-1.7B-CustomVoice styles and parameters (gender, dialect, timbres, age, language, etc.) to create custom AI voices in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
  • Clone Any Voice: Generate replicas of any 3-second input voice in the above languages using Qwen3-TTS-12Hz-1.7B-Base.

Compared with VibeVoice, Qwen3-TTS offers more customization and voice design options for enterprise use cases.

Features of Qwen3-TTS

With the three speech-generation modes of Qwen3-TTS (custom voice, cloning, and design), developers can build agentic experiences across the ten languages listed earlier.

  • Built-In Speakers: It ships with nine default voices: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, and Sohee.
  • Instruction Control: Fine-tune and control voice tone, emotion, speaking rate, and prosody. This allows you to adjust the expressiveness of a voice using natural language instructions.
  • Streaming Support: Generate low-latency speech (realtime audio and playback) using a dual-track hybrid architecture.
  • Voice Customization Methods: It supports two main customization methods, cloning and design.

The following Qwen3-TTS open-source models are available to use on Hugging Face.

| Model | HuggingFace ID | Mode | Parameters | Features |
|---|---|---|---|---|
| CustomVoice 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | custom_voice | 1.7B | 9 speakers + instruction control |
| CustomVoice 0.6B | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | custom_voice | 0.6B | 9 speakers (no instruction control) |
| VoiceDesign 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | voice_design | 1.7B | Text-described voice design |
| Base 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-Base | voice_clone | 1.7B | Zero-shot voice cloning |
| Base 0.6B | Qwen/Qwen3-TTS-12Hz-0.6B-Base | voice_clone | 0.6B | Lightweight voice cloning |

Qwen3-TTS: Quick Start in Vision Agents

Vision Agents does not have built-in support for Qwen3-TTS. However, you can integrate it as a custom plugin in a similar way to VibeVoice, or use any of its family of models via the Vision Agents Hugging Face integration.

The following shows basic Qwen3-TTS usage in Vision Agents. Check out the other use cases, such as custom voice design and cloning, on GitHub.

```python
"""
Qwen3-TTS + Vision Agents — Quick Start

Demonstrates the Qwen3-TTS plugin with the 1.7B CustomVoice model,
using the "Vivian" speaker with instruction-controlled prosody.

Required env vars:
    HF_TOKEN, DEEPGRAM_API_KEY, GOOGLE_API_KEY,
    STREAM_API_KEY, STREAM_API_SECRET
"""

import asyncio
import logging
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, smart_turn

from plugins.qwen3tts.vision_agents.plugins.qwen3tts import TTS as Qwen3TTS

logger = logging.getLogger(__name__)
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create an agent with Qwen3-TTS CustomVoice."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Qwen3 TTS AI", id="agent"),
        instructions=(
            "You are a helpful, friendly voice assistant powered by "
            "Qwen3-TTS. Keep responses brief and conversational."
        ),
        tts=Qwen3TTS(
            model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
            mode="custom_voice",
            speaker="Vivian",
            language="Auto",
            instruct="Speak in a warm, friendly tone.",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-2.5-flash"),
        turn_detection=smart_turn.TurnDetection(
            silence_duration_ms=2000,
            speech_probability_threshold=0.5,
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    call = await agent.create_call(call_type, call_id)
    logger.info("Starting Qwen3-TTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm powered by Qwen3-TTS from Alibaba Cloud."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

To run this demo successfully, ensure the environment variables listed in the script's docstring are set.

Limitations of Qwen3-TTS

Here are a couple of things to note when using the open-source Qwen3-TTS family of models. You may encounter device and model constraints, as well as language and voice-generation limitations.

Device Requirements

  • CUDA GPU Recommended: Although Qwen3-TTS runs on Apple Silicon Macs, it is designed for NVIDIA GPUs. The 1.7B models require ~4 GB VRAM, and the 0.6B models require ~1.5 GB VRAM.
  • Apple Silicon Macs: Expect 5 - 10 seconds per utterance for the 0.6B model and 15 - 40+ seconds for the 1.7B model on CPU.

Model Constraints

| Qwen3-TTS Model | Limitation |
|---|---|
| CustomVoice 0.6B | No instruction-based style control. |
| CustomVoice 1.7B | Instruction control works best in Chinese and English; other languages may have reduced expressiveness. |
| VoiceDesign 1.7B | Only available in the 1.7B size, not the 0.6B variant. |
| Base 0.6B / 1.7B | Voice cloning quality depends on your reference audio. Use 3 - 25 seconds of clean, single-speaker audio with no background noise. |

Voice Generation

First-Call Latency: The first stream_audio() call fetches the model weights from HuggingFace (~1.2 GB for 0.6B, ~3.4 GB for 1.7B) and loads them into memory. This may take a while to complete. Subsequent calls will reuse the loaded model.
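To avoid paying that download-and-load cost in the middle of a live call, a common pattern is to load the model once at startup and reuse it. A minimal, framework-agnostic sketch of the lazy-load-once pattern, where `load_weights` is a stand-in for the real loader:

```python
import threading

class LazyModel:
    """Load an expensive model once, on first use, thread-safely."""
    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        if self._model is None:
            with self._lock:
                if self._model is None:  # double-checked locking
                    self._model = self._loader()
        return self._model

load_count = {"n": 0}
def load_weights():
    load_count["n"] += 1
    return "qwen3-tts-weights"  # stand-in for the real model object

model = LazyModel(load_weights)
model.get()  # warm up at startup (pays the load cost here)
model.get()  # subsequent calls reuse the already-loaded model
```

Warming the model before the agent joins a call moves the multi-gigabyte load out of the conversational path.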

Language and Voice Support

  • Cross-Lingual Performance Varies: Using a speaker outside its native language (e.g. Vivian for English) works but may result in higher word error rates.
  • Voice Clone Fidelity: The voice cloning support produces lower-quality clones than full prompt-based cloning.

3. Neu TTS: Build Voice Apps that Live On-Device

Neuphonic, the company behind NeuTTS, offers developers small speech-generation models to build secure, private, and on-device conversational AI apps. You can run its speech language models locally to build AI services for tech support, recruitment, service delivery, sales, healthcare, and more. With the text-to-speech model, developers can extend their apps with audio generation using 50+ realistic, human-like AI voices. It is built for scale, enterprise service integration, and customization.

Characteristics of Neu TTS

The rich features of Neu TTS make it suitable for several application areas in conversational AI.

  • Hardware Support: Due to the model's size, it can run on a wide range of hardware, including CPUs, GPUs, and Apple's MPS backend.
  • Voice Cloning: Generate 3 - 15 seconds of speech from a 3-second reference audio (16 - 44 kHz sample rate WAV file).
  • Multilingual and Multivoice Support: Although the TTS model supports multiple voices, at the time of writing, it was available only in English, which makes it unsuitable for non-English use cases.
  • Customization Options: Compared to Qwen3-TTS, it has fewer customization options. It supports voice cloning, but you cannot design custom voices with it.
  • Perth Watermarker: All Neu TTS-generated audio carries a perceptual threshold watermark for identification and tracing. This feature ensures responsible use of the voice AI technology.
  • Open-Source: The TTS model is free for both commercial and personal use, as it is released under the Apache 2.0 license.
  • On-Device Deployment: Since it supports offline deployment, you can build apps with the model that preserve data privacy without API costs.

Neu TTS: Quick Start in Vision Agents

Similar to the integration of VibeVoice and Qwen3-TTS with Vision Agents, you can use Neu TTS with a custom Python plugin integration. The model does not integrate with Vision Agents by default. So, you can clone the ready-made plugin used in this section's demo from GitHub and test the Neu TTS use case examples for healthcare, customer service, outbound sales, inventory management, and recruitment.

Limitations of Neu TTS

  • English Only (Smaller Model): NeuTTS Air is available only in English. The Nano multilingual collection is available in French, German, and Spanish. However, each language requires its own variant of the models.
  • Small Context Window: Neu TTS has a 2048-token context window. This limits voice generation to ~30 seconds (including the reference prompt).
  • Reference Audio Quality: To get the best results, you should provide a clean, mono, 16 – 44 kHz WAV file with reduced background noise. A poor input audio may degrade the output.
  • CPU Inference Latency: A modern laptop or desktop will produce output with low latency. With older devices and low-power CPUs, speech generation may be slower than using cloud-based TTS solutions.
  • Watermarking: The Perth watermarker does not install correctly on all devices when using, for example, uv sync. The model can still generate audio, but without an embedded watermark.
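One way to work around the ~30-second context cap noted above is to split long text at sentence boundaries and synthesize each chunk separately. A rough sketch, using a character budget as an illustrative stand-in for the real token limit:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into chunks at sentence boundaries, each under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("First sentence. " * 30, max_chars=100)
# Each chunk stays under 100 characters and ends on a sentence boundary.
```

Each chunk can then be fed to the model in turn, with the resulting audio segments concatenated or streamed back to back.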

4. Pocket TTS: Run Text-to-Speech AI With Voice Cloning on CPU

Pocket TTS from Kyutai Labs offers a voice cloning and generation solution with ~200 ms time-to-first-audio for building conversational AI apps. It has seamless built-in integration with Vision Agents as a Python plugin on pypi.org. It ships with default voices such as alba, marius, javert, jean, fantine, cosette, eponine, and azelma. You can use the model for voice cloning by either providing a local reference audio file or hosting one on Hugging Face.

```python
# Use a local wav file
tts = pocket.TTS(voice="path/to/your/voice.wav")

# Or a HuggingFace-hosted voice
tts = pocket.TTS(voice="hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```

Features of Pocket TTS

Pocket TTS is small and completely open source, so it can run on virtually any device’s CPU.

  • 100M Parameters: With only 100M parameters, it can run locally on laptops, IoT devices, wearables, and mobile phones.
  • Voice Generation: It can generate voices with emotion, cadence, and accent.
  • Open Source: Pocket TTS is free to use and released under the MIT licence on GitHub.
  • Voice Reproduction: Create a voice clone from a 5-second audio sample. To start, check out the Kyutai TTS voice library on Hugging Face or use your own sample voice as input for replication.

Pocket TTS: Quick Start in Vision Agents

Since Pocket TTS integrates directly into Vision Agents, it can be installed and used this way.

Install Pocket TTS

Run this command to install the plugin into any Vision Agents project initialized with uv.

```bash
uv add "vision-agents[pocket]"
```

Pocket TTS-Powered Voice Demo

The following sample code creates a voice agent using Pocket TTS for voice synthesis.

```python
"""
Pocket TTS Example

This example demonstrates Pocket TTS integration with Vision Agents.

This example creates an agent that uses:
- Pocket TTS for text-to-speech (runs locally on CPU)
- Deepgram for speech-to-text
- Gemini for LLM
- GetStream for edge/real-time communication

Requirements:
- DEEPGRAM_API_KEY environment variable
- GOOGLE_API_KEY environment variable
- STREAM_API_KEY and STREAM_API_SECRET environment variables
"""

import asyncio
import logging

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, pocket

logger = logging.getLogger(__name__)
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with Pocket TTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Pocket AI", id="agent"),
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        tts=pocket.TTS(voice="alba"),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    call = await agent.create_call(call_type, call_id)
    logger.info("🤖 Starting Pocket TTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(text="Hello! I'm running Pocket TTS locally.")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

When you run the demo, you should be able to interact with the Pocket TTS agent locally in realtime.

Limitations of Pocket TTS

Although Pocket TTS is great for prototyping and testing local conversational voice AI apps, it has some unsupported features and lacks multilingual support.

  • Language Support: Only English is supported at the time of writing this article.
  • No Pause or Silence Control: It is not possible to insert pauses within the text input to produce breaks in a generated speech.
  • Expressiveness: Although it has low latency and runs well on CPU, the model’s 100M parameters limit its expressiveness and naturalness compared to larger TTS models.
  • Noise in Cloned Speech: Noise in the reference audio can carry over into the generated version. In our experiments, the voice cloning feature of Voxtral TTS did an excellent job of suppressing background noise when reproducing speech from a reference audio.

Note: Voxtral TTS by Mistral AI is an open-weight model but not open-source.

5. TADA By Hume AI: Generate Speech Via Text-Acoustic Synchronization

Hume AI has open-sourced TADA TTS for natural, reliable, and expressive voice-generation integration in any application. You can easily run TADA TTS on edge devices and mobile phones without GPUs. Compared to other on-device TTS models like Higgs Audio v2 and FireRedTTS-2, TADA TTS has a lower hallucination rate and sounds more natural. It is an excellent choice when your agent's response time is a priority. Similar to VibeVoice, you can use this model for long-form conversational speech applications, such as podcasts.

Here are the available models for TADA TTS.

| Model | Parameters | Languages | HuggingFace |
|-------|------------|-----------|-------------|
| `HumeAI/tada-1b` | 1B | English | https://huggingface.co/HumeAI/tada-1b |
| `HumeAI/tada-3b-ml` | 3B | en, ar, ch, de, es, fr, it, ja, pl, pt | https://huggingface.co/HumeAI/tada-3b-ml |

Characteristics of TADA TTS

The open-source text-to-speech model has the following key highlights.

  • Quick Response: It has ~0.09 real-time factor (about 2x faster than other TTS models on Hugging Face).
  • Expressiveness: It uses dynamic duration and prosody per token.
  • Voice Cloning: Clone any voice from a short sample audio clip (WAV file).
  • Multi-Language Support: Its 3B-parameter model is available in 10 languages.
  • Local: It runs entirely on your own machine; no API keys are needed for the TTS functionality.
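Real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A tiny helper for benchmarking any model against the ~0.09 figure cited above:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the model generates audio faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# e.g. 0.9 s of compute for a 10 s clip gives an RTF of 0.09
rtf = real_time_factor(0.9, 10.0)
```

Measure `synthesis_seconds` with a wall-clock timer around the model call and `audio_seconds` from the generated waveform's sample count and sample rate.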

TADA TTS: Quick Start in Vision Agents

You can create a custom Vision Agents TTS plugin that integrates TADA for high-quality, locally synthesized text-to-speech. For a quick test, clone and experiment with the plugin on GitHub. Once you clone the repo and navigate to the TADA TTS directory (stream-tutorial-projects/AI/VisionAgents/VisionAgentsPythonPlugins/TADA_TTS), run uv sync to install dependencies.

Next, set the following API credentials.

```bash
export DEEPGRAM_API_KEY="your_deepgram_key"
export GOOGLE_API_KEY="your_google_key"
export STREAM_API_KEY="your_stream_key"
export STREAM_API_SECRET="your_stream_secret"
```

With the sample code below, you can initialize the TADA TTS model and configure it for voice cloning and multilingual support.

```python
from vision_agents.plugins import tada

# Default (3B multilingual model, built-in voice)
tts = tada.TTS()

# English-only 1B model
tts = tada.TTS(model="HumeAI/tada-1b")

# Voice cloning
tts = tada.TTS(
    voice="path/to/reference.wav",
    voice_transcript="Transcript of the reference audio.",
)

# Multilingual (German)
tts = tada.TTS(
    model="HumeAI/tada-3b-ml",
    language="de",
)
```

For a complete demo, you can run this Python script to create a voice agent in Vision Agents powered by TADA TTS for speech synthesis.

Limitations of TADA TTS

  • Long-Form Generation: During extended generations (10+ minutes of context), speaker drift can occur. A recommended workaround is to reset your context periodically.
  • Language Coverage: Currently, TADA TTS supports English and nine other languages (ar, ch, de, es, fr, it, ja, pl, pt). Broader multilingual support may be required for enterprise use cases.

6. Kitten TTS: Create Voice-Enabled Apps With Tiny Open Source Model

Kitten TTS by KittenML is an ultra-lightweight speech synthesis model that runs efficiently on CPU with no GPU required. Among all the other TTS models discussed in this article, Kitten TTS is the smallest, with a model size under 25MB (int8). Key features of the open-source model include the following.

  • It runs on CPU, so no GPU is required to experiment with it offline.
  • Ultra-Lightweight: The model sizes range from 25MB (int8) to 80MB (mini).
| Model | Params | Size | HuggingFace |
|---|---|---|---|
| kitten-tts-mini | 80M | 80MB | KittenML/kitten-tts-mini-0.8 |
| kitten-tts-micro | 40M | 41MB | KittenML/kitten-tts-micro-0.8 |
| kitten-tts-nano | 15M | 56MB | KittenML/kitten-tts-nano-0.8 |
| kitten-tts-nano-int8 | 15M | 25MB | KittenML/kitten-tts-nano-0.8-int8 |
  • Voice Options: There are multiple male and female voices ready to use for any use case.

    Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo

  • Customization: Adjust configurable parameters like speed.

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "KittenML/kitten-tts-mini-0.8" | HuggingFace model ID |
| voice | str | "Bella" | Voice name for synthesis |
| speed | float | 1.0 | Speech speed multiplier |
| client | KittenTTS \| None | None | Pre-initialized KittenTTS instance |

Kitten TTS: Quick Start in Vision Agents

Like the four TTS models (VibeVoice, Qwen3-TTS, Neu TTS, TADA TTS) in the previous sections, Vision Agents does not provide built-in support for Kitten TTS. You can easily add it as a custom plugin, install it, and use it in Vision Agents as follows.

Install the Kitten TTS Plugin

```bash
uv add vision-agents-plugins-kittentts
```

Use Kitten TTS in Vision Agents

```python
from vision_agents.plugins import kittentts

# Create TTS with default settings (mini model, Bella voice)
tts = kittentts.TTS()

# Or specify model and voice
tts = kittentts.TTS(
    model="KittenML/kitten-tts-mini-0.8",
    voice="Jasper",
    speed=1.0,
)

# Or use the nano int8 model for minimal footprint
tts = kittentts.TTS(
    model="KittenML/kitten-tts-nano-0.8-int8",
    voice="Luna",
)
```

For a working Kitten TTS demo in Vision Agents, clone this repo and follow the instructions to run the example on the plugin’s page.

Limitations of Kitten TTS

  • Expressiveness and Naturalness: As the smallest TTS model covered here, it produces less realistic speech than VibeVoice or Qwen3-TTS.
  • English Focused: Kitten TTS currently supports English only. Multilingual TTS appears on the GitHub roadmap but is not yet implemented.
  • Developer Preview: Its API is in developer preview, so it may be suitable only for experimentation and prototypes.
  • No Voice Cloning or Fine-Tuning: Only eight built-in voices are available and there is no custom voice cloning or fine-tuning support at this time. This may change in future updates.
  • Nano-int8 Model Issues: Some users have reported problems with kitten-tts-nano-0.8-int8 on GitHub. It is recommended to use the mini or micro model if you encounter quality or stability issues.
  • No Prosody Control: It does not support Speech Synthesis Markup Language (SSML). The only configurable/tunable parameter is speed.

Further Reading

We have covered many on-device TTS alternatives, including VibeVoice, Qwen3-TTS, Neu TTS, Pocket TTS, TADA TTS, and Kitten TTS. With the realtime voice agent demos in Vision Agents, plus the speech-generation features and limitations highlighted in this article, you now have a clear overview of which of these speech synthesis models are ready for production voice AI apps, prototypes, and enterprise use cases.

Aside from the six on-device TTS models highlighted in this article, there are plenty of other models in the category to try on Hugging Face. Another small multilingual TTS model you can check out is Fish Audio S2, which claims to be open source but has a complicated license. Refer to this Reddit post to learn more about Fish Audio S2.
