
The 6 Best On-Device TTS Models for Voice AI


There are several text-to-speech models for building voice agents, but which ones can you run privately, locally, and on-device? Let’s find out.

Amos G.
Published April 13, 2026
On-device TTS models

When building voice AI applications, you have industry-leading cloud options for text-to-speech, such as Cartesia Sonic 3 and Grok TTS. For privacy and to avoid sharing your business’s data with these commercial text-to-speech (TTS) providers, your team may want to use free, open-source solutions that run locally on mobile and desktop devices.

Continue reading to discover six lesser-known but capable TTS models: high-quality, secure, small, and fast options for building human-like speech experiences.

What is a Speech Synthesis Model?

Speech Synthesis Model

Any AI model capable of converting input text into audible spoken audio is called a speech synthesis, or TTS, model. Speech synthesis is the component of a voice AI system that lets agents talk back to users in realtime, without relying on pre-recorded audio. In most speech-enabled applications, it is used with its companion speech-to-text component and an LLM to create a unified voice pipeline, as shown in the diagram above.
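That unified voice pipeline can be sketched as three stages chained per conversational turn. The component functions below are illustrative stand-ins, not real provider APIs:

```python
import asyncio

# Hypothetical component interfaces -- frameworks such as Vision Agents
# wire these up for you; this only illustrates the data flow.
async def transcribe(audio_chunk: bytes) -> str:
    return "what is the weather?"          # STT: audio -> text

async def generate_reply(prompt: str) -> str:
    return f"You asked: {prompt}"          # LLM: text -> text

async def synthesize(text: str) -> bytes:
    return text.encode("utf-8")            # TTS: text -> audio bytes

async def voice_pipeline(audio_chunk: bytes) -> bytes:
    """One turn of the STT -> LLM -> TTS loop."""
    text = await transcribe(audio_chunk)
    reply = await generate_reply(text)
    return await synthesize(reply)

audio_out = asyncio.run(voice_pipeline(b"\x00\x01"))
# audio_out == b"You asked: what is the weather?"
```

In a real agent, each stage streams incrementally rather than returning whole values, which is what keeps end-to-end latency low.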

Choosing a Text-to-Speech Model: What To Consider

There are many reasons developers prefer testing and prototyping with TTS models that do not require renting a GPU or running on a cloud AI inference provider. Here are the key considerations when choosing a TTS model for on-device use.

  • Naturalness: Can the model produce realistic, human-sounding speech? None of the options covered in this article produces awkward or robotic voice responses.
  • Private and Secure: Is the data for the voice service, or any of its components, shared with the underlying model providers for training or other purposes?
  • Lightweight and Fast: Response latency is critical when choosing a model for speech synthesis, and using a small model does not always guarantee lower latency. If this information cannot be found on the model card, the characteristics can be tested and verified at runtime. For smaller models, try Kitten TTS, which comprises models ranging from 15M to 80M parameters (25 - 80 MB on disk). Although it is tiny, it delivers high-quality voice synthesis on the CPU without requiring a GPU.
  • Multi-Language Support: When creating a voice service that supports English, a model like Pocket TTS is a great choice. However, it supports only one language. If you are building an app to be used in different locales and support multiple languages, you should consider using VibeVoice.
  • Customization: TTS models often come with built-in voices that can be used out of the box. These are useful for quick demos and testing purposes. However, not all available solutions, for example, allow customization of voice design or cloning. To ensure a custom and expressive speech-generation experience, you should choose a model that supports voice cloning and design. An excellent pick in this category is Qwen3-TTS.
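Since latency claims are worth verifying on your own hardware, here is a minimal sketch for measuring time-to-first-audio from any streaming TTS callable. `fake_stream` is a stand-in that simulates a model with roughly 50 ms first-chunk latency:

```python
import time

def first_chunk_latency(stream_fn, text: str) -> float:
    """Return seconds until the first audio chunk arrives from a
    streaming TTS callable that yields audio chunks."""
    start = time.perf_counter()
    for _chunk in stream_fn(text):
        return time.perf_counter() - start  # stop at the first chunk
    raise RuntimeError("stream produced no audio")

# Stand-in generator simulating a model with ~50 ms first-chunk latency.
def fake_stream(text):
    time.sleep(0.05)
    yield b"\x00" * 320

latency = first_chunk_latency(fake_stream, "Hello there")
print(f"time to first audio: {latency * 1000:.0f} ms")
```

Swap `fake_stream` for your model's streaming entry point and run the measurement a few times, since the first call often includes one-off model-loading cost.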

The Top 6 On-Device and Open Source Text-to-Speech Models for Voice AI

Top 6 On-Device and Open Source Text-to-Speech Models

Cartesia, Deepgram, and ElevenLabs offer some of the best commercial speech-generation models for building AI services. However, developers can only use these APIs at a cost or with a subscription. Let’s look at free, local, and open-source alternatives to these models to create any voice experience.

A good place to experiment with different TTS models is the text-to-speech filter under the Hugging Face models category. After testing several of these models, here are the six most suitable options you can use for your projects and run locally on consumer laptops, Raspberry Pi, mobile, and wearable devices. They were picked based on quality, parameter count, ease of use, language support, customization options, and more.

1. VibeVoice: Build Long-Form and Multi-Speaker Conversational Audio Apps

VibeVoice

VibeVoice is a family of open-source TTS models designed specifically for multi-speaker and long-form conversational audio generation. One of the best use cases for VibeVoice TTS is an AI podcast, with support for one to four speakers and up to 90 minutes of generated speech.

VibeVoice also includes an automatic speech recognition model that can be seamlessly used with the TTS counterpart. Here is a demo of VibeVoice TTS integration with Vision Agents using the VibeVoice-1.5B model.

This model has a 64K context length and ~90 min generation length. Refer to its technical report to read more.

Characteristics of VibeVoice

Since the beginning of 2026, new TTS model releases and updates (open-source and commercial) have appeared on X every week. However, choosing the right audio-generation model for an agent can be daunting due to factors such as quality, latency, pricing, and more.

  • Quality and Performance: VibeVoice delivers some of the highest quality among the open-source on-device TTS models highlighted in this article. Based on our testing, it performs on par with Gemini 2.5 Pro TTS and Cartesia Sonic 3.
  • Latency: It has ~300 ms of first-audible latency (VibeVoice-Realtime-0.5B)
  • Available Voices and Styles: You can use it to build agents with natural emotional nuances and spontaneous reactions. It also has singing capabilities.
  • Price: VibeVoice is free of charge to use. Refer to its licence statement on GitHub to learn more.
  • Available Languages: It supports English as the primary language and nine additional languages (experimental).

VibeVoice: Quick Start in Vision Agents

All the demos in this article demonstrate the integration of the on-device TTS models with Vision Agents. It is a free and open-source project that provides developers with all the building blocks for voice, video, and vision AI. With the exception of Pocket TTS, which integrates with Vision Agents by default, the other TTS models (VibeVoice, Qwen3-TTS, Neu TTS, TADA TTS, and Kitten TTS) can be added to the framework as custom Python plugins.

Let’s go through the getting-started guide for VibeVoice to work with Vision Agents. We will then show only Vision Agents-powered demos for the other models since their installations and configurations are similar to VibeVoice.

Initial Setup

Begin by initializing a new Python project with uv, installing Vision Agents and the required AI services, and setting up your environment variables.

```bash
uv init
uv venv .venv && source .venv/bin/activate
uv add vision-agents
uv add "vision-agents[getstream, gemini, deepgram, smart-turn]"

# .env
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
OPENAI_API_KEY=...
DEEPGRAM_API_KEY=...
```

In the VibeVoice example, we use:

  • STREAM_API_KEY and STREAM_API_SECRET: To access Stream edge for low latency audio and video transport.
  • OPENAI_API_KEY: To access an OpenAI model.
  • DEEPGRAM_API_KEY: For speech-to-text.

Note: You can swap any of the AI providers with the one you prefer.

What is missing from the above is the TTS service. For that, you can use any of the on-device TTS models discussed in this article. The VibeVoice plugin, example demos, and step-by-step integration with Vision Agents can be explored on GitHub. To bring other AI services to Vision Agents, refer to the custom plugins integration guide.

Podcast-Style VibeVoice Demo With Vision Agents

The following code snippet creates a podcast-style conversational audio app using the VibeVoice plugin and the API credentials for the AI services highlighted in the previous section.

```python
"""Podcast with Background Music — VibeVoice TTS + Vision Agents

Demonstrates long-form, multi-turn podcast-style dialogue where the LLM
plays the role of a podcast host. VibeVoice synthesizes expressive,
conversational speech. Deepgram STT captures the user's spoken input so
the conversation flows naturally.

Prerequisites
─────────────
1. Start the VibeVoice server:
   python demo/vibevoice_realtime_demo.py \
       --model_path microsoft/VibeVoice-Realtime-0.5B --port 3000
2. Set environment variables (or use a .env file):
   STREAM_API_KEY, STREAM_API_SECRET
   OPENAI_API_KEY
   DEEPGRAM_API_KEY
   VIBEVOICE_BASE_URL (defaults to http://localhost:3000)

Usage:
    uv run --extra examples python examples/podcast_with_background_music.py run
"""

from dotenv import load_dotenv

load_dotenv()

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, openai, deepgram
from vision_agents.plugins import vibevoice

INSTRUCTIONS = """\
You are "The Deep Dive", a charismatic and curious podcast host known for
making complex topics feel like a fascinating conversation over coffee.

Podcast format guidelines:
- Open each session with a warm, energetic greeting and a teaser of what
  you'll explore: "Hey everyone, welcome back to The Deep Dive! Today
  we're diving into something truly mind-bending…"
- Ask the guest (the user) thoughtful follow-up questions that reveal depth.
- Use narrative bridges: "That's a great point, and it reminds me of…"
- Summarize key insights periodically: "So what I'm hearing is…"
- Wrap segments with a hook: "Coming up next, we'll tackle…"
- Keep individual responses to 4–6 sentences — enough for a natural
  podcast rhythm without monologuing.
- Maintain a conversational, warm tone throughout. Occasionally express
  genuine excitement: "Oh, I love that!"

You are speaking to ONE guest at a time. Make them feel like the most
interesting person in the room.
"""


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="The Deep Dive", id="agent"),
        instructions=INSTRUCTIONS,
        stt=deepgram.STT(),
        tts=vibevoice.TTS(
            voice="en-Carter_man",
            cfg_scale=1.5,
        ),
        llm=openai.ChatCompletionsLLM(model="gpt-4o"),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response(
            "Hey everyone, welcome back to The Deep Dive! "
            "I'm so excited about today's conversation."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

Running this sample code starts the agent, which joins the call and opens with the podcast greeting.

Limitations of VibeVoice

VibeVoice is well-suited for building a production voice-generation application, but it also has practical limitations.

  • Server Lock (Single Request): The VibeVoice server processes one WebSocket synthesis request at a time. If you talk while the agent is speaking, the ongoing request must be canceled before a new one begins. These back-and-forth exchanges may result in a service busy WebSocket close (code 1013).
  • Inaccurate Singing Feature: The spontaneous singing feature uses markers and lyrical text to suggest singing, but it doesn't actually produce it. The singing output is stylized speech rather than karaoke-grade vocals or a pitch-perfect melody.
  • Additional Configuration for Inference Server: Running any VibeVoice model requires additional local server setup. Follow the README.md of the VibeVoice Vision Agents plugin on GitHub to get the server up and running.
  • VibeVoice Realtime-0.5B is Single-Speaker Only: It supports only one voice at a time. Multi-speaker conversations (up to 4 speakers) require VibeVoice-1.5B. However, the inference code of this model has been removed from the public GitHub repository at the time of writing this article.
  • First-Chunk Latency Depends on Hardware: The ~300 ms first-chunk latency cited in the VibeVoice paper assumes an NVIDIA T4 GPU. On Apple Silicon, you should expect 200 – 500 ms latency. On older CPUs, latency can exceed several seconds.
  • Multilingual Support is Experimental: The non-English voices (de-, fr-, jp-, etc.) work but may show reduced expressiveness compared to the English voices.
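The single-request limitation can be softened on the client side with a short retry-and-backoff loop. This is a sketch under assumptions: `ServiceBusy` is a stand-in for whatever exception your client raises on a 1013 close, and `flaky_synth` simulates a server that is busy exactly once:

```python
import time

class ServiceBusy(Exception):
    """Stand-in for a WebSocket close with code 1013 (try again later)."""

def synthesize_with_retry(synth, text, retries=3, backoff=0.05):
    """Call a TTS endpoint, retrying with exponential backoff when busy."""
    for attempt in range(retries):
        try:
            return synth(text)
        except ServiceBusy:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)

# Demo: a server that is busy once, then succeeds.
calls = {"n": 0}
def flaky_synth(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ServiceBusy()
    return b"audio:" + text.encode()

result = synthesize_with_retry(flaky_synth, "hello")
# result == b"audio:hello" after one retry
```

In practice you would also cancel the in-flight request before retrying, as the server processes only one synthesis at a time.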

2. Qwen3-TTS: Clone, Design, and Generate AI Voices

Qwen3-TTS

Qwen3-TTS consists of a family of speech models for voice generation, cloning, and design.

  • Design AI Voices: Create AI voices from user-provided descriptions for business and enterprise applications with Qwen3-TTS-12Hz-1.7B-VoiceDesign. This model supports ten languages.
  • Generate Custom Voices: Use Qwen3-TTS-12Hz-1.7B-CustomVoice styles and parameters (gender, dialect, timbres, age, language, etc.) to create custom AI voices in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
  • Clone Any Voice: Generate replicas of any 3-second input voice in the above languages using Qwen3-TTS-12Hz-1.7B-Base.

Compared with VibeVoice, Qwen3-TTS offers more customization and voice design options for enterprise use cases.

Features of Qwen3-TTS

With the three speech-generation modes of Qwen3-TTS (custom voice, cloning, and design), developers can build agentic experiences across the ten languages listed earlier.

  • Built-In Speakers: It ships with nine default voices: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, and Sohee.
  • Instruction Control: Fine-tune and control voice tone, emotion, speaking rate, and prosody. This allows you to adjust the expressiveness of a voice using natural language instructions.
  • Streaming Support: Generate low-latency speech (realtime audio and playback) using a dual-track hybrid architecture.
  • Voice Customization Methods: It supports two main customization methods, cloning and design.

The following Qwen3-TTS open-source models are available to use on Hugging Face.

| Model | HuggingFace ID | Mode | Parameters | Features |
|---|---|---|---|---|
| CustomVoice 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | custom_voice | 1.7B | 9 speakers + instruction control |
| CustomVoice 0.6B | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | custom_voice | 0.6B | 9 speakers (no instruction control) |
| VoiceDesign 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | voice_design | 1.7B | Text-described voice design |
| Base 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-Base | voice_clone | 1.7B | Zero-shot voice cloning |
| Base 0.6B | Qwen/Qwen3-TTS-12Hz-0.6B-Base | voice_clone | 0.6B | Lightweight voice cloning |

Qwen3-TTS: Quick Start in Vision Agents

Vision Agents does not have built-in support for Qwen3-TTS. However, you can integrate it as a custom plugin in a similar way to VibeVoice, or use any of its family of models via the Vision Agents Hugging Face integration.

The following shows basic Qwen3-TTS usage in Vision Agents. Check out the other use cases, such as custom voice design and cloning, on GitHub.

```python
"""
Qwen3-TTS + Vision Agents — Quick Start

Demonstrates the Qwen3-TTS plugin with the 1.7B CustomVoice model,
using the "Vivian" speaker with instruction-controlled prosody.

Required env vars:
    HF_TOKEN, DEEPGRAM_API_KEY, GOOGLE_API_KEY,
    STREAM_API_KEY, STREAM_API_SECRET
"""

import asyncio
import logging
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, smart_turn

from plugins.qwen3tts.vision_agents.plugins.qwen3tts import TTS as Qwen3TTS

logger = logging.getLogger(__name__)
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create an agent with Qwen3-TTS CustomVoice."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Qwen3 TTS AI", id="agent"),
        instructions=(
            "You are a helpful, friendly voice assistant powered by "
            "Qwen3-TTS. Keep responses brief and conversational."
        ),
        tts=Qwen3TTS(
            model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
            mode="custom_voice",
            speaker="Vivian",
            language="Auto",
            instruct="Speak in a warm, friendly tone.",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-2.5-flash"),
        turn_detection=smart_turn.TurnDetection(
            silence_duration_ms=2000,
            speech_probability_threshold=0.5,
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    call = await agent.create_call(call_type, call_id)
    logger.info("Starting Qwen3-TTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm powered by Qwen3-TTS from Alibaba Cloud."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

To run this demo successfully, ensure the environment variables listed in the script's docstring are set.

Limitations of Qwen3-TTS

Here are a couple of things to note when using the open-source Qwen3-TTS family of models. You may encounter device and model constraints, as well as language and voice-generation limitations.

Device Requirements

  • CUDA GPU Recommended: Although Qwen3-TTS runs on Apple Silicon Macs, it is designed for NVIDIA GPUs. The 1.7B models require ~4 GB VRAM, and the 0.6B models require ~1.5 GB VRAM.
  • Apple Silicon Macs: Expect 5 - 10 seconds per utterance for the 0.6B model and 15 - 40+ seconds for the 1.7B model on CPU.

Model Constraints

| Qwen3-TTS Model | Limitation |
|---|---|
| CustomVoice 0.6B | No instruction-based style control. |
| CustomVoice 1.7B | Instruction control works best in Chinese and English; other languages may have reduced expressiveness. |
| VoiceDesign 1.7B | Only available in the 1.7B size, not the 0.6B variant. |
| Base 0.6B / 1.7B | Voice cloning quality depends on your reference audio. Use 3 - 25 seconds of clean, single-speaker audio with no background noise. |

Voice Generation

First-Call Latency: The first stream_audio() call fetches the model weights from HuggingFace (~1.2 GB for 0.6B, ~3.4 GB for 1.7B) and loads them into memory. This may take a while to complete. Subsequent calls will reuse the loaded model.
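To avoid paying that download-and-load cost in the middle of a live call, a common pattern is to load the model once at startup and reuse it. A minimal, framework-agnostic sketch of the lazy-load-once pattern, where `load_weights` is a stand-in for the real loader:

```python
import threading

class LazyModel:
    """Load an expensive model once, on first use, thread-safely."""
    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        if self._model is None:
            with self._lock:
                if self._model is None:  # double-checked locking
                    self._model = self._loader()
        return self._model

load_count = {"n": 0}
def load_weights():
    load_count["n"] += 1
    return "qwen3-tts-weights"  # stand-in for the real model object

model = LazyModel(load_weights)
model.get()  # warm up at startup (pays the load cost here)
model.get()  # subsequent calls reuse the already-loaded model
```

Warming the model before the agent joins a call moves the multi-gigabyte load out of the conversational path.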

Language and Voice Support

  • Cross-Lingual Performance Varies: Using a speaker outside its native language (e.g. Vivian for English) works but may result in higher word error rates.
  • Voice Clone Fidelity: The voice cloning support produces lower-quality clones than full prompt-based cloning.

3. Neu TTS: Build Voice Apps that Live On-Device

Neuphonic, the company behind NeuTTS, offers developers small speech-generation models to build secure, private, and on-device conversational AI apps. You can run its speech language models locally to build AI services for tech support, recruitment, service delivery, sales, healthcare, and more. With the text-to-speech model, developers can extend their apps with audio generation using 50+ realistic, human-like AI voices. It is built for scale, enterprise service integration, and customization.

Characteristics of Neu TTS

The rich features of Neu TTS make it suitable for several application areas in conversational AI.

  • Hardware Support: Due to the model's size, it can run on a wide range of hardware, including CPUs, GPUs, and Apple's MPS backend.
  • Voice Cloning: Generate 3 - 15 seconds of speech from a 3-second reference audio (16 - 44 kHz sample rate WAV file).
  • Multilingual and Multivoice Support: Although the TTS model supports multiple voices, at the time of writing, it was available only in English, which makes it unsuitable for non-English use cases.
  • Customization Options: Compared to Qwen3-TTS, it has fewer customization options. It supports voice cloning, but you cannot design custom voices with it.
  • Perth Watermarker: All Neu TTS-generated audio carries a perceptual threshold watermark for identification and tracing. This feature ensures responsible use of the voice AI technology.
  • Open-Source: The TTS model is free for both commercial and personal use, as it is released under the Apache 2.0 license.
  • On-Device Deployment: Since it supports offline deployment, you can build apps with the model that preserve data privacy without API costs.

Neu TTS: Quick Start in Vision Agents

Similar to the integration of VibeVoice and Qwen3-TTS with Vision Agents, you can use Neu TTS with a custom Python plugin integration. The model does not integrate with Vision Agents by default. So, you can clone the ready-made plugin used in this section's demo from GitHub and test the Neu TTS use case examples for healthcare, customer service, outbound sales, inventory management, and recruitment.

Limitations of Neu TTS

  • English Only (Smaller Model): NeuTTS Air is available only in English. The Nano multilingual collection is available in French, German, and Spanish. However, each language requires its own variant of the models.
  • Small Context Window: Neu TTS has a 2048-token context window. This limits voice generation to ~30 seconds (including the reference prompt).
  • Reference Audio Quality: To get the best results, you should provide a clean, mono, 16 – 44 kHz WAV file with reduced background noise. A poor input audio may degrade the output.
  • CPU Inference Latency: A modern laptop or desktop will produce output with low latency. With older devices and low-power CPUs, speech generation may be slower than using cloud-based TTS solutions.
  • Watermarking: The Perth watermarker does not install correctly on all devices when using, for example, uv sync. The model can still generate audio, but without an embedded watermark.
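One way to work around the ~30-second context cap noted above is to split long text at sentence boundaries and synthesize each chunk separately. A rough sketch, using a character budget as an illustrative stand-in for the real token limit:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into chunks at sentence boundaries, each under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("First sentence. " * 30, max_chars=100)
# Each chunk stays under 100 characters and ends on a sentence boundary.
```

Each chunk can then be fed to the model in turn, with the resulting audio segments concatenated or streamed back to back.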

4. Pocket TTS: Run Text-to-Speech AI With Voice Cloning on CPU

Pocket TTS from Kyutai Labs offers a voice cloning and generation solution with ~200 ms time-to-first-audio for building conversational AI apps. It has seamless built-in integration with Vision Agents as a Python plugin on pypi.org. It ships with default voices such as alba, marius, javert, jean, fantine, cosette, eponine, and azelma. You can use the model for voice cloning by either providing a local reference audio file or hosting one on Hugging Face.

```python
# Use a local wav file
tts = pocket.TTS(voice="path/to/your/voice.wav")

# Or a HuggingFace-hosted voice
tts = pocket.TTS(voice="hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```

Features of Pocket TTS

Pocket TTS is small and completely open source, so it can run on virtually any device’s CPU.

  • 100M Parameters: With only 100M parameters, it can run locally on laptops, IoT devices, wearables, and mobile phones.
  • Voice Generation: It can generate voices with emotion, cadence, and accent.
  • Open Source: Pocket TTS is free to use and released under the MIT licence on GitHub.
  • Voice Reproduction: Create a voice clone from a 5-second audio sample. To start, check out the Kyutai TTS voice library on Hugging Face or use your own sample voice as input for replication.

Pocket TTS: Quick Start in Vision Agents

Since Pocket TTS integrates directly into Vision Agents, it can be installed and used this way.

Install Pocket TTS

Run this command to install the plugin into any Vision Agents project initialized with uv.

```bash
uv add "vision-agents[pocket]"
```

Pocket TTS-Powered Voice Demo

The following sample code creates a voice agent using Pocket TTS for voice synthesis.

```python
"""
Pocket TTS Example

This example demonstrates Pocket TTS integration with Vision Agents.

This example creates an agent that uses:
- Pocket TTS for text-to-speech (runs locally on CPU)
- Deepgram for speech-to-text
- Gemini for LLM
- GetStream for edge/real-time communication

Requirements:
- DEEPGRAM_API_KEY environment variable
- GOOGLE_API_KEY environment variable
- STREAM_API_KEY and STREAM_API_SECRET environment variables
"""

import asyncio
import logging

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, pocket

logger = logging.getLogger(__name__)
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with Pocket TTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Pocket AI", id="agent"),
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        tts=pocket.TTS(voice="alba"),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    call = await agent.create_call(call_type, call_id)
    logger.info("🤖 Starting Pocket TTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(text="Hello! I'm running Pocket TTS locally.")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

When you run the demo, you should be able to interact with the Pocket TTS agent locally in realtime.

Limitations of Pocket TTS

Although Pocket TTS is great for prototyping and testing local conversational voice AI apps, it has some unsupported features and lacks multilingual support.

  • Language Support: Only English is supported at the time of writing this article.
  • No Pause or Silence Control: It is not possible to insert pauses within the text input to produce breaks in a generated speech.
  • Expressiveness: Although it has low latency and runs well on CPU, the model’s 100M parameters limit its expressiveness and naturalness compared to larger TTS models.
  • Noise in Cloned Speech: Noise in the reference audio can carry over into the generated version. In our experiments, the voice cloning feature of Voxtral TTS did an excellent job of suppressing background noise when reproducing speech from a reference audio.

Note: Voxtral TTS by Mistral AI is an open-weight model but not open-source.

5. TADA By Hume AI: Generate Speech Via Text-Acoustic Synchronization

Hume AI has open-sourced TADA TTS for natural, reliable, and expressive voice-generation integration in any application. You can easily run TADA TTS on edge devices and mobile phones without GPUs. Compared to other on-device TTS models like Higgs Audio v2 and FireRedTTS-2, TADA TTS has a lower hallucination rate and sounds more natural. It is an excellent choice when your agent's response time is a priority. Similar to VibeVoice, you can use this model for long-form conversational speech applications, such as podcasts.

Here are the available models for TADA TTS.

| Model | Parameters | Languages | HuggingFace |
|-------|------------|-----------|-------------|
| `HumeAI/tada-1b` | 1B | English | https://huggingface.co/HumeAI/tada-1b |
| `HumeAI/tada-3b-ml` | 3B | en, ar, ch, de, es, fr, it, ja, pl, pt | https://huggingface.co/HumeAI/tada-3b-ml |

Characteristics of TADA TTS

The open-source text-to-speech model has the following key highlights.

  • Quick Response: It has ~0.09 real-time factor (about 2x faster than other TTS models on Hugging Face).
  • Expressiveness: It uses dynamic duration and prosody per token.
  • Voice Cloning: Clone any voice from a short sample audio clip (WAV file).
  • Multi-Language Support: Its 3B-parameter model is available in 10 languages.
  • Local: It runs entirely on your own machine; no API keys are needed for the TTS functionality.
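Real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A tiny helper for benchmarking any model against the ~0.09 figure cited above:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the model generates audio faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# e.g. 0.9 s of compute for a 10 s clip gives an RTF of 0.09
rtf = real_time_factor(0.9, 10.0)
```

Measure `synthesis_seconds` with a wall-clock timer around the model call and `audio_seconds` from the generated waveform's sample count and sample rate.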

TADA TTS: Quick Start in Vision Agents

You can create a custom Vision Agents TTS plugin that integrates TADA for high-quality, locally synthesized text-to-speech. For a quick test, clone and experiment with the plugin on GitHub. Once you clone the repo and navigate to the TADA TTS directory (stream-tutorial-projects/AI/VisionAgents/VisionAgentsPythonPlugins/TADA_TTS), run uv sync to install dependencies.

Next, set the following API credentials.

```bash
export DEEPGRAM_API_KEY="your_deepgram_key"
export GOOGLE_API_KEY="your_google_key"
export STREAM_API_KEY="your_stream_key"
export STREAM_API_SECRET="your_stream_secret"
```

With the sample code below, you can initialize the TADA TTS model and configure it for voice cloning and multilingual support.

```python
from vision_agents.plugins import tada

# Default (3B multilingual model, built-in voice)
tts = tada.TTS()

# English-only 1B model
tts = tada.TTS(model="HumeAI/tada-1b")

# Voice cloning
tts = tada.TTS(
    voice="path/to/reference.wav",
    voice_transcript="Transcript of the reference audio.",
)

# Multilingual (German)
tts = tada.TTS(
    model="HumeAI/tada-3b-ml",
    language="de",
)
```

For a complete demo, you can run this Python script to create a voice agent in Vision Agents powered by TADA TTS for speech synthesis.

Limitations of TADA TTS

  • Long-Form Generation: During extended generations (10+ minutes of context), speaker drift can occur. A recommended workaround is to reset your context periodically.
  • Language Coverage: Currently, TADA TTS supports English and nine other languages (ar, ch, de, es, fr, it, ja, pl, pt). Broader multilingual support may be required for enterprise use cases.

6. Kitten TTS: Create Voice-Enabled Apps With Tiny Open Source Model

Kitten TTS by KittenML is an ultra-lightweight speech synthesis model that runs efficiently on CPU with no GPU required. Among all the other TTS models discussed in this article, Kitten TTS is the smallest, with a model size under 25MB (int8). Key features of the open-source model include the following.

  • It runs on CPU, so no GPU is required to experiment with it offline.
  • Ultra-Lightweight: The model sizes range from 25MB (int8) to 80MB (mini).
| Model | Params | Size | HuggingFace |
|---|---|---|---|
| kitten-tts-mini | 80M | 80MB | KittenML/kitten-tts-mini-0.8 |
| kitten-tts-micro | 40M | 41MB | KittenML/kitten-tts-micro-0.8 |
| kitten-tts-nano | 15M | 56MB | KittenML/kitten-tts-nano-0.8 |
| kitten-tts-nano-int8 | 15M | 25MB | KittenML/kitten-tts-nano-0.8-int8 |
  • Voice Options: There are multiple male and female voices ready to use for any use case.

    Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo

  • Customization: Adjust configurable parameters like speed.

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "KittenML/kitten-tts-mini-0.8" | HuggingFace model ID |
| voice | str | "Bella" | Voice name for synthesis |
| speed | float | 1.0 | Speech speed multiplier |
| client | KittenTTS \| None | None | Pre-initialized KittenTTS instance |

Kitten TTS: Quick Start in Vision Agents

Like the four TTS models (VibeVoice, Qwen3-TTS, Neu TTS, TADA TTS) in the previous sections, Vision Agents does not provide built-in support for Kitten TTS. You can easily add it as a custom plugin, install it, and use it in Vision Agents as follows.

Install the Kitten TTS Plugin

```bash
uv add vision-agents-plugins-kittentts
```

Use Kitten TTS in Vision Agents

```python
from vision_agents.plugins import kittentts

# Create TTS with default settings (mini model, Bella voice)
tts = kittentts.TTS()

# Or specify model and voice
tts = kittentts.TTS(
    model="KittenML/kitten-tts-mini-0.8",
    voice="Jasper",
    speed=1.0,
)

# Or use the nano int8 model for minimal footprint
tts = kittentts.TTS(
    model="KittenML/kitten-tts-nano-0.8-int8",
    voice="Luna",
)
```

For a working Kitten TTS demo in Vision Agents, clone this repo and follow the instructions to run the example on the plugin’s page.

Limitations of Kitten TTS

  • Expressiveness and Naturalness: As the smallest TTS model covered here, it produces less realistic speech than VibeVoice or Qwen3-TTS.
  • English Focused: Kitten TTS currently supports English only. Multilingual TTS appears on the GitHub roadmap but is not yet implemented.
  • Developer Preview: Its API is in developer preview, so it may be suitable only for experimentation and prototypes.
  • No Voice Cloning or Fine-Tuning: Only eight built-in voices are available and there is no custom voice cloning or fine-tuning support at this time. This may change in future updates.
  • Nano-int8 Model Issues: Some users have reported problems with kitten-tts-nano-0.8-int8 on GitHub. It is recommended to use the mini or micro model if you encounter quality or stability issues.
  • No Prosody Control: It does not support Speech Synthesis Markup Language (SSML). The only configurable/tunable parameter is speed.

Further Reading

We have covered many on-device TTS alternatives, including VibeVoice, Qwen3-TTS, Neu TTS, Pocket TTS, TADA TTS, and Kitten TTS. With the realtime voice agent demos in Vision Agents, plus the speech-generation features and limitations highlighted in this article, you now have a clear overview of which of these speech synthesis models are ready for production voice AI apps, prototypes, and enterprise use cases.

Aside from the six on-device TTS models highlighted in this article, there are plenty of other models in the category to try on Hugging Face. Another small multilingual TTS model you can check out is Fish Audio S2, which claims to be open source but has a complicated license. Refer to this Reddit post to learn more about Fish Audio S2.
