Before You Start
To begin, ensure you meet the following requirements and have these credentials:
- Python 3.13 or a later version.
- An Apple Silicon Mac (recommended) or any modern laptop.
- Stream API credentials (for realtime audio and video communication).
- A Hugging Face account and access token (HF_TOKEN).
- A Deepgram API key (for speech-to-text).
- A Google API key (for Gemini LLM).
What Is AI Voice Design?
Voice design is the process of describing the voice you want to a supported AI model, which then generates it as human-sounding output. When designing a voice for an app, you simply describe how it should sound and add voice styles and characteristics to enhance its expressiveness and naturalness. The underlying AI model analyzes the description and generates synthesized speech that matches the prompt.
Most open-source and commercial state-of-the-art text-to-speech (TTS) models support voice design. For example, you can use ElevenLabs voice design tools and models to create and add custom speech generation to any conversational app. An excellent alternative to ElevenLabs' voice design is the open-source Qwen3-TTS family of models. In the following sections, you will use the voice design capabilities of Qwen3-TTS to create a wide range of voices for several use cases.
What is Lacking in Leading TTS Models
Nearly all leading AI providers offer TTS models for audio generation, but not all of these speech synthesis services can be used to design voices. For example, Gemini TTS ships with about thirty built-in AI voices and supports multiple languages. Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS can create unique voices from descriptions of style, accent, pace, tone, and even emotional expression. However, none of these Gemini models offers the level of flexibility you get with Qwen3-TTS, which lets you control several aspects of voice design and generation through prompting.
Voice Design Use Cases
Besides making AI voices sound distinct, well-crafted custom voices can be applied across many business areas and industries.
- Films: Customize tone, accent, and age to create distinct characters for movies. Example Voice Prompt: “A grizzled male detective in his 50s with a gravelly, world-weary baritone. He speaks in short, clipped sentences with a dry, sardonic undertone. Slight New York accent, low and steady, as if every word costs him effort.”
- Video Games: Create computer-controlled characters with unique voices. Example Voice Prompt: “A young, eager female elf with a bright, crystalline mid-range voice. She speaks quickly, with wide-eyed enthusiasm, her pitch rising at the end of each phrase. Light and airy, with a slight musical lilt, as if every sentence is an invitation to adventure.”
- Multi-Speaker Speech: Generate conversations, panel discussions, or dialogue scenes by designing multiple distinct voices. Example Voice Prompts for a Two-Host Podcast:
  - Host A: “A cheerful male voice in his early 30s with a warm, mid-range tone and upbeat, energetic pacing. Friendly and conversational, with a natural American accent and a slight laugh ready behind every sentence.”
  - Host B: “A thoughtful female voice in her 40s with a calm, lower register and measured, deliberate pacing. She speaks with a dry wit and a smooth British accent, pausing slightly before key points for emphasis.”
- Customer Support: Adjust emotions, delivery, and accent to create realistic voices for automated customer service systems. Example Voice Prompt: “A friendly, patient female voice in her late 20s with a clear mid-range tone. She speaks at a calm, moderate pace with a warm, reassuring quality. Her diction is crisp and professional but never robotic, with gentle emphasis on key information and a natural, approachable American accent.”
- Audiobooks: Design narrator voices tailored to a book’s genre and atmosphere. Example Voice Prompt: “A warm, expressive female storyteller in her 30s with a smooth, mid-range voice. She reads with gentle theatrical flair — slowing down for dramatic moments, lightening for humor, and dropping to a near-whisper for suspense. Her pacing is unhurried and intimate, as if reading aloud to a small audience by a fireplace.”
How To Prompt To Control Your Voice Design
When prompting to design a voice with Qwen3-TTS, you have full control over style and several aspects of the speech, including timbre and the following speech attributes with their examples.
- Accent: Prompt the model to adopt a specific accent, such as American English, British English, or African-American English.
- Age: Describe the speaker's age, such as a young adult in their 30s or an elderly speaker.
- Gender: Male, female, or gender-neutral.
- Emotion: Instruct the model to perform specific emotions. For example, sad when hearing tragic news or excited when narrating a story.
- Clarity: Distinct pronunciation.
- Fluency: A clear tone with no hesitation.
- Pitch: A low, high, or normal pitch, such as a female voice that rises with excitement or drops with sadness.
- Timbre/Tone: Prompt the model to produce a certain range of tones and expressions like deep, warm, smooth, authoritative, upbeat, and playful.
- Speed/Pacing: A quicker cadence, or slow-paced delivery with pauses that turns rapid during laughter.
- Personality: Introverted, extroverted, confident, engaging, shy, and expressive.
- Texture: A lady in her 20s with a bright and clear vocal texture.
- Volume: A projecting voice that escalates quickly to loud.
You can mix and match any of the above speech attributes to create unique voice experiences, with audio output in MP3, Opus, AAC, FLAC, PCM, and WAV formats.
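To illustrate mixing attributes, here is a minimal Python sketch; the build_instruct helper is hypothetical, not part of any Qwen3-TTS API. It simply joins attribute phrases into a single instruct string you could pass to a voice design model.

```python
# Illustrative helper (not a Qwen3-TTS API): joins attribute
# phrases into a single voice-design instruction string.
def build_instruct(attributes):
    """Combine attribute phrases into one prompt, each ending in a period."""
    return " ".join(phrase.rstrip(".") + "." for phrase in attributes)

prompt = build_instruct([
    "A confident female voice in her 30s with a warm mid-range timbre",  # age, gender, timbre
    "She speaks with a light British accent at a calm, measured pace",   # accent, pacing
    "Her tone brightens with excitement when delivering good news",      # emotion
])
```

The resulting string reads as one fluent description, which matches the prompt style used throughout this article.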
Aside from specifying the above voice attributes in your prompt, the Qwen3-TTS-12Hz-1.7B-VoiceDesign model can be steered with background information to design a unique voice character using the following.
- Character Name: Anna Marie.
- Voice Profile: An agile female voice with a natural upward lift, seamless flow, energetic pace, and a clear projection volume to convey excitement.
- Background: A news anchor for national television focusing on delivering news about recent AI technologies.
- Presence: Late 30s, broadcasting from a location with bright studio lighting.
- Personality: Engaging, enthusiastic, and energetic.
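The background fields above can be flattened into a single instruct prompt. Below is a minimal sketch; the dictionary layout and field names mirror the list above and are not part of any Qwen3-TTS API.

```python
# Hypothetical character card flattened into one instruct string;
# the field names follow the background attributes described above.
character = {
    "Character Name": "Anna Marie",
    "Voice Profile": ("An agile female voice with a natural upward lift, "
                      "seamless flow, energetic pace, and clear projection."),
    "Background": "A news anchor covering recent AI technologies.",
    "Presence": "Late 30s, broadcasting from a brightly lit studio.",
    "Personality": "Engaging, enthusiastic, and energetic.",
}

# Join "Field: value" pairs into a single prompt string.
instruct = " ".join(f"{key}: {value}" for key, value in character.items())
```

You could then pass the resulting string to the model's instruct parameter, as shown in the quick-start section that follows.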
Design AI Voices With Qwen3-TTS: Quick Start in Vision Agents
There are various ways to try the voice design functionality of Qwen3-TTS. You can experiment with it in a Hugging Face playground to generate audio in a specific direction for your projects. However, the best way to see how it works is to build a voice agent with it using an open-source Python framework like Vision Agents, a platform for building voice, video, and vision AI apps in Python.
To integrate Qwen3-TTS with Vision Agents, you create a custom voice AI pipeline consisting of speech-to-text (STT) → LLM → text-to-speech (TTS) and use the Qwen models for speech synthesis. For the STT and LLM components of the pipeline, you can use any AI service provider you prefer. Vision Agents provides built-in support for Qwen models as LLMs, but not for Qwen audio-generation models. To use a Qwen model for TTS in Vision Agents, it must be integrated as a custom Python plugin. Read the docs for a step-by-step guide.
Configure Your Environment and Credentials
You may start with a new uv-based Python project and install Vision Agents along with all required AI plugins.
```shell
uv init
uv add vision-agents
uv add "vision-agents[getstream, qwen3tts, gemini, deepgram, smart-turn]"
```
Note: uv add "vision-agents[qwen3tts]" only works after you have created the custom plugin for Vision Agents.
Alternatively, you can clone the ready-made plugin from GitHub, navigate to the plugin’s directory at http://stream-tutorial-projects/AI/VisionAgents/VisionAgentsPythonPlugins/Qwen3-TTS-HF/plugins/qwen3tts/, and follow its instructions to get started.
Install Project Dependencies
After cloning the Qwen3-TTS plugin from the above repo, you should run uv sync to install all dependencies.
Next, create a .env file for your API keys.
```shell
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
HF_TOKEN=...
DEEPGRAM_API_KEY=...
GOOGLE_API_KEY=...
```
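Before launching the agent, it can help to fail fast on missing credentials rather than hitting an authentication error mid-call. This is a minimal sketch assuming the variable names from the .env file above; missing_keys is an illustrative helper, not part of Vision Agents.

```python
import os

# Credential names used by this tutorial's pipeline.
REQUIRED_KEYS = [
    "STREAM_API_KEY", "STREAM_API_SECRET",
    "HF_TOKEN", "DEEPGRAM_API_KEY", "GOOGLE_API_KEY",
]

def missing_keys(required=REQUIRED_KEYS, env=os.environ):
    """Return the credential names that are unset or empty."""
    return [key for key in required if not env.get(key)]
```

Calling missing_keys() at startup and aborting if it returns anything gives a clear error message before the agent tries to join a call.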
Design a Voice From Descriptions
After setting up your environment, installing Vision Agents, and configuring all required AI services for your voice pipeline, you can now create a new Qwen3-TTS instance in Vision Agents with parameters such as model, mode, language, and instruct. Since we are generating a custom voice, be sure to set mode="voice_design".
```python
tts = Qwen3TTS(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    mode="voice_design",
    language="English",
    instruct="A warm, confident female narrator in her 30s with a clear mid-range voice.",
)
```
Note: We are using the following voice design model from Hugging Face.
| Model | HuggingFace ID | Mode | Parameters | Features |
|---|---|---|---|---|
| VoiceDesign 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | voice_design | 1.7B | Text-described voice design |
A Complete Voice Design Example
Let’s create a custom AI voice to simulate an old African-American grandma.
```python
"""
Qwen3-TTS VoiceDesign — African-American Grandma

A very old, cranky, and croaky African-American grandma. 80 years old.
Very hoarse, grumpy, shrill, and frustrated.

Combines: Age Control (80 years old), Acoustic Attribute Control
(hoarse, shrill, croaky), Human-Likeness (natural elderly speech),
Gradual Control (grumpy pacing with frustrated emphasis).

Required env vars: HF_TOKEN, DEEPGRAM_API_KEY, GOOGLE_API_KEY,
STREAM_API_KEY, STREAM_API_SECRET
"""

import asyncio
import logging
import sys
from pathlib import Path

# Make the local plugin package importable from this script's location.
PROJECT_ROOT = Path(__file__).resolve().parents[3]
sys.path.insert(0, str(PROJECT_ROOT))

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, smart_turn

from plugins.qwen3tts.vision_agents.plugins.qwen3tts import TTS as Qwen3TTS

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create a voice agent with a cranky grandma persona."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Grandma Lucille", id="agent"),
        instructions=(
            "You are Grandma Lucille, an 80-year-old African-American grandmother "
            "who has seen it all and has zero patience left. You are cranky, blunt, "
            "and always complaining, but deep down you care. "
            "IMPORTANT: Keep every response to ONE short sentence, under 15 words."
        ),
        tts=Qwen3TTS(
            model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
            mode="voice_design",
            language="English",
            instruct=(
                "An elderly female grandmother, 80 years old, with a high-pitched, "
                "thin, croaky old woman's voice. She sounds cranky and shrill, "
                "with a scratchy, nasal, feminine tone that wavers with age. Her "
                "speech is slow with sharp, irritable emphasis. The voice is "
                "distinctly an old lady's — reedy, quavering, and breathless, "
                "with a warm Southern African-American cadence."
            ),
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-2.5-flash"),
        turn_detection=smart_turn.TurnDetection(
            silence_duration_ms=2000,
            speech_probability_threshold=0.5,
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    logger.info("Starting African-American Grandma VoiceDesign Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Mmhmm. What do you want now, child?"
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
In this example, we created a new voice agent in Vision Agents, defined a custom voice prompt, and passed it to the instruct parameter of the Qwen3-TTS model’s definition.
```python
tts=Qwen3TTS(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    mode="voice_design",
    language="English",
    instruct=(
        "An elderly female grandmother, 80 years old, with a high-pitched, "
        "thin, croaky old woman's voice. She sounds cranky and shrill, "
        "with a scratchy, nasal, feminine tone that wavers with age. Her "
        "speech is slow with sharp, irritable emphasis. The voice is "
        "distinctly an old lady's — reedy, quavering, and breathless, "
        "with a warm Southern African-American cadence."
    ),
),
```
Running the complete Python script above should produce a voice similar to this demo.
Steer Your Voice Prompt To Another Use Case
In the previous voice design demo, we generated speech that sounds like an old African-American grandma. To modify it for another use case, all you need to do is change the prompt to generate, for example, “A friendly mythical God, Zeus, with a huge, deep, powerful voice, charming, proud, strong, and theatrical”.
```python
tts=Qwen3TTS(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    mode="voice_design",
    language="English",
    instruct=(
        "A powerful male god with an immensely deep, booming, resonant "
        "bass voice that reverberates as if echoing through a vast marble "
        "temple. The tone is charming, proud, and strong with a theatrical, "
        "grandiose delivery. He speaks slowly and deliberately, savoring "
        "each word with regal authority. The voice carries warmth beneath "
        "its overwhelming power, with rich, velvety low tones and a "
        "commanding, larger-than-life presence."
    ),
),
```
Inserting this code snippet into the complete Python script should generate a voice based on the modified description.
The Qwen3-TTS plugin on GitHub has several other voice design demos ready for you to try.
Voice Design Prompting Guide and Best Practices
Creating effective voice design prompts involves writing the descriptions in dimensions/layers as shown below.
| Attribute | Description | Sample |
|---|---|---|
| Identity | Gender, age, character archetype | "An elderly African-American female grandmother, 80 years old" |
| Pitch & register | High, mid-range, deep, bass | "A deep, booming, and resonant bass voice" |
| Texture & timbre | Breathy, raspy, smooth, nasal, husky | "Thin, raspy, and cracked with age" |
| Emotion & personality | Warm, angry, menacing, cheerful, proud | "Cool, composed, and subtly seductive" |
| Pacing & cadence | Slow, fast, deliberate, erratic | "Speaks slowly and deliberately, savoring each word" |
| Accent & dialect | Regional or cultural influence | "A thick French accent that softens consonants" |
| Distinguishing details | Unique mannerisms, metaphors, imagery | "As if echoing through a vast marble temple" |
Be Clear and Specific
The Qwen3-TTS voice design model uses text comprehension and an internal thinking process to parse complex descriptions. To get better results, use concrete descriptions rather than abstract labels. Here are some weak and strong examples.
- Weak: "An old man's voice"
- Strong: "An elderly man in his 80s with a reedy, quavering voice that wavers with age, slow and breathless, with a warm scratchy quality"
Sentence Length
The demos in this article use 2-4 sentences (about 40-80 words). This gives the audio model enough detail without overwhelming it. A short prompt ("deep male voice") lacks specificity, while a long one risks conflicting attributes.
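If you want to enforce that guideline programmatically, a rough word-count guard is easy to sketch. The 40-80 word window is this article's rule of thumb, not a Qwen3-TTS requirement, and the helper name is illustrative.

```python
# Illustrative guard for the suggested 40-80 word prompt window.
def prompt_length_ok(instruct, low=40, high=80):
    """True when the prompt's word count falls inside [low, high]."""
    return low <= len(instruct.split()) <= high
```

A prompt like "deep male voice" would fail this check, nudging you to add the acoustic detail the model needs.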
Acoustic Qualities Description
Ensure you describe acoustic qualities, not just personality.
The model controls timbre, pitch, and prosody, not semantic content. Focus on how the voice sounds, not what the agent would say.
- Weak: “A wise philosopher who quotes Aristotle.”
- Strong: “A calm, measured male voice with a deep, resonant tone and slow, contemplative pacing.”
Language Context for Accents
When creating a non-native accent, specify both the accent's origin and the target spoken language. This ensures the accent characteristics blend correctly. Here are some examples.
- “A calm, husky male voice with a thick Japanese accent speaking English”
- “A whispery female voice with a thick French accent speaking English”
Limitations of Qwen3-TTS Voice Design
Although the voice design functionality of Qwen3-TTS performs on par with a model like Gemini 3.1 Flash TTS, it still has some limitations.
- The voice design feature is only available with the 1.7B variant of the Qwen3-TTS family of models. The 0.6B does not support voice design.
- There is no way to mix voice design and cloning.
- It is not possible to set a speaker ID or embedding to anchor the voice. This can hinder speech consistency and slightly alter voice characteristics across several generations using the same prompt.
- Voice design support excels in Chinese and English. Descriptions for other supported languages may produce less accurate or expressive results.
- The model does not follow instructions correctly when dealing with conflicting attributes, such as "high-pitched deep bass". It may output unpredictable results by favoring one attribute over the other.
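Because conflicting attributes produce unpredictable output, you may want to screen prompts before synthesis. The following is a minimal heuristic sketch; the conflict pairs are illustrative, not an exhaustive or official list.

```python
# Illustrative heuristic only: flags a few obviously contradictory
# descriptor pairs before sending the prompt to the model.
CONFLICTS = [
    ("high-pitched", "deep"),
    ("high-pitched", "bass"),
    ("whispery", "booming"),
    ("slow", "rapid"),
]

def find_conflicts(instruct):
    """Return the descriptor pairs that both appear in the prompt."""
    text = instruct.lower()
    return [pair for pair in CONFLICTS if all(word in text for word in pair)]
```

Running this on the "high-pitched deep bass" example from the limitation above would flag two contradictory pairs, letting you rewrite the prompt before generating audio.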
Further Reading
This article introduced you to AI voice design, tools, models, and frameworks to help you get started quickly. You now know how to design a custom AI voice experience using Qwen3-TTS in Vision Agents.
Voice design is one of the three main characteristics that make AI speech feel human-like, natural-sounding, and expressive. Not all TTS models support voice design, but if you would like to try alternatives with Vision Agents, there are excellent options, such as ElevenLabs TTS or Gemini 3.1 Flash TTS. They all integrate seamlessly with Vision Agents.
