Before You Start
To begin, ensure you meet the following requirements and have these credentials:
- Python 3.13 or a later version.
- An Apple Silicon Mac (recommended) or any modern laptop.
- Stream API credentials (for realtime audio and video communication).
- A Hugging Face account and access token (HF_TOKEN).
- A Deepgram API key (for speech-to-text).
- A Google API key (for Gemini LLM).
What Is AI Voice Design?
Voice design is the process of describing the voice you want to a supported AI model, which then generates it as human-sounding output. When designing a voice for an app, you simply describe how it should sound and add voice styles and characteristics to enhance its expressiveness and naturalness. The underlying AI model analyzes the description and generates synthesized speech that matches the prompt.
Most open-source and commercial state-of-the-art text-to-speech (TTS) models support voice design. For example, you can use ElevenLabs voice design tools and models to create and add custom speech generation to any conversational app. An excellent alternative to ElevenLabs' voice design is the open-source Qwen3-TTS family of models. In the following sections, you will use the voice design capabilities of Qwen3-TTS to create a wide range of voices for several use cases.
What is Lacking in Leading TTS Models
Nearly all leading AI providers offer TTS models for audio generation, but not all of these speech synthesis services can be used to design voices. For example, Gemini TTS ships with about thirty built-in AI voices and supports multiple languages. Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS can create unique voices from descriptions of style, accent, pace, tone, and even emotional expression. However, none of these Gemini models offers the level of flexibility you get with Qwen3-TTS, which lets you control several aspects of voice design and generation through prompting.
Voice Design Use Cases
Besides making AI voices sound distinct, well-crafted custom voices can be applied across many business areas and industries.
- Films: Customize tone, accent, and age to create distinct characters for movies. Example Voice Prompt: “A grizzled male detective in his 50s with a gravelly, world-weary baritone. He speaks in short, clipped sentences with a dry, sardonic undertone. Slight New York accent, low and steady, as if every word costs him effort.”
- Video Games: Create computer-controlled characters with unique voices. Example Voice Prompt: “A young, eager female elf with a bright, crystalline mid-range voice. She speaks quickly, with wide-eyed enthusiasm, her pitch rising at the end of each phrase. Light and airy, with a slight musical lilt, as if every sentence is an invitation to adventure.”
- Multi-Speaker Speech: Generate conversations, panel discussions, or dialogue scenes by designing multiple distinct voices. Example Voice Prompts for a Two-Host Podcast:
  - Host A: “A cheerful male voice in his early 30s with a warm, mid-range tone and upbeat, energetic pacing. Friendly and conversational, with a natural American accent and a slight laugh ready behind every sentence.”
  - Host B: “A thoughtful female voice in her 40s with a calm, lower register and measured, deliberate pacing. She speaks with a dry wit and a smooth British accent, pausing slightly before key points for emphasis.”
- Customer Support: Adjust emotions, delivery, and accent to create realistic voices for automated customer service systems. Example Voice Prompt: “A friendly, patient female voice in her late 20s with a clear mid-range tone. She speaks at a calm, moderate pace with a warm, reassuring quality. Her diction is crisp and professional but never robotic, with gentle emphasis on key information and a natural, approachable American accent.”
- Audiobooks: Design narrator voices tailored to a book’s genre and atmosphere. Example Voice Prompt: “A warm, expressive female storyteller in her 30s with a smooth, mid-range voice. She reads with gentle theatrical flair — slowing down for dramatic moments, lightening for humor, and dropping to a near-whisper for suspense. Her pacing is unhurried and intimate, as if reading aloud to a small audience by a fireplace.”
How To Prompt To Control Your Voice Design
When prompting to design a voice with Qwen3-TTS, you have full control over style and several aspects of the speech, including timbre and the following speech attributes with their examples.
- Accent: Prompt the model to adopt a specific accent, such as American English, British English, or African-American English.
- Age: Describe the speaker's age, such as a young adult in their 30s or an elderly speaker.
- Gender: Male, female, or gender-neutral.
- Emotion: Instruct the model to perform specific emotions. For example, sad when hearing tragic news or excited when narrating a story.
- Clarity: Distinct pronunciation.
- Fluency: A clear tone with no hesitation.
- Pitch: A low, high, or normal pitch, such as a female voice that rises with excitement or drops with sadness.
- Timbre/Tone: Prompt the model to produce a certain range of tones and expressions like deep, warm, smooth, authoritative, upbeat, and playful.
- Speed/Pacing: A quicker cadence, or slow-paced delivery with pauses that turns rapid during laughter.
- Personality: Introverted, extroverted, confident, engaging, shy, and expressive.
- Texture: A lady in her 20s with a bright and clear vocal texture.
- Volume: A projecting voice that escalates quickly to loud.
You can mix and match any of the above speech attributes to create unique voice experiences, with audio output in MP3, Opus, AAC, FLAC, PCM, and WAV formats.
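To illustrate mixing attributes, here is a minimal Python sketch; the build_instruct helper is hypothetical, not part of any Qwen3-TTS API. It simply joins attribute phrases into a single instruct string you could pass to a voice design model.

```python
# Illustrative helper (not a Qwen3-TTS API): joins attribute
# phrases into a single voice-design instruction string.
def build_instruct(attributes):
    """Combine attribute phrases into one prompt, each ending in a period."""
    return " ".join(phrase.rstrip(".") + "." for phrase in attributes)

prompt = build_instruct([
    "A confident female voice in her 30s with a warm mid-range timbre",  # age, gender, timbre
    "She speaks with a light British accent at a calm, measured pace",   # accent, pacing
    "Her tone brightens with excitement when delivering good news",      # emotion
])
```

The resulting string reads as one fluent description, which matches the prompt style used throughout this article.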
Aside from specifying the above voice attributes in your prompt, the Qwen3-TTS-12Hz-1.7B-VoiceDesign model can be steered with background information to design a unique voice character using the following.
- Character Name: Anna Marie.
- Voice Profile: An agile female voice with a natural upward lift, seamless flow, energetic pace, and a clear projection volume to convey excitement.
- Background: A news anchor for national television focusing on delivering news about recent AI technologies.
- Presence: Late 30s, broadcasting from a location with bright studio lighting.
- Personality: Engaging, enthusiastic, and energetic.
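The background fields above can be flattened into a single instruct prompt. Below is a minimal sketch; the dictionary layout and field names mirror the list above and are not part of any Qwen3-TTS API.

```python
# Hypothetical character card flattened into one instruct string;
# the field names follow the background attributes described above.
character = {
    "Character Name": "Anna Marie",
    "Voice Profile": ("An agile female voice with a natural upward lift, "
                      "seamless flow, energetic pace, and clear projection."),
    "Background": "A news anchor covering recent AI technologies.",
    "Presence": "Late 30s, broadcasting from a brightly lit studio.",
    "Personality": "Engaging, enthusiastic, and energetic.",
}

# Join "Field: value" pairs into a single prompt string.
instruct = " ".join(f"{key}: {value}" for key, value in character.items())
```

You could then pass the resulting string to the model's instruct parameter, as shown in the quick-start section that follows.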
Design AI Voices With Qwen3-TTS: Quick Start in Vision Agents
There are various ways to try the voice design functionality of Qwen3-TTS. You can experiment with it in a Hugging Face playground to generate audio in a specific direction for your projects. However, the best way to see how it works is to build a voice agent with it using an open-source Python framework like Vision Agents, a platform for building voice, video, and vision AI apps in Python.
To integrate Qwen3-TTS with Vision Agents, you create a custom voice AI pipeline consisting of speech-to-text (STT) → LLM → text-to-speech (TTS) and use the Qwen models for speech synthesis. For the STT and LLM components of the pipeline, you can use any AI service provider you prefer. Vision Agents provides built-in support for Qwen models as LLMs, but not for Qwen audio-generation models. To use a Qwen model for TTS in Vision Agents, it must be integrated as a custom Python plugin. Read the docs for a step-by-step guide.
Configure Your Environment and Credentials
You may start with a new uv-based Python project and install Vision Agents along with all required AI plugins.
```shell
uv init
uv add vision-agents
uv add "vision-agents[getstream, qwen3tts, gemini, deepgram, smart-turn]"
```
Note: uv add "vision-agents[qwen3tts]" only works after you have created the custom plugin for Vision Agents.
Alternatively, you can clone the ready-made plugin from GitHub, navigate to the plugin’s directory at http://stream-tutorial-projects/AI/VisionAgents/VisionAgentsPythonPlugins/Qwen3-TTS-HF/plugins/qwen3tts/, and follow its instructions to get started.
Install Project Dependencies
After cloning the Qwen3-TTS plugin from the above repo, you should run uv sync to install all dependencies.
Next, create a .env file for your API keys.
```shell
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
HF_TOKEN=...
DEEPGRAM_API_KEY=...
GOOGLE_API_KEY=...
```
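Before launching the agent, it can help to fail fast on missing credentials rather than hitting an authentication error mid-call. This is a minimal sketch assuming the variable names from the .env file above; missing_keys is an illustrative helper, not part of Vision Agents.

```python
import os

# Credential names used by this tutorial's pipeline.
REQUIRED_KEYS = [
    "STREAM_API_KEY", "STREAM_API_SECRET",
    "HF_TOKEN", "DEEPGRAM_API_KEY", "GOOGLE_API_KEY",
]

def missing_keys(required=REQUIRED_KEYS, env=os.environ):
    """Return the credential names that are unset or empty."""
    return [key for key in required if not env.get(key)]
```

Calling missing_keys() at startup and aborting if it returns anything gives a clear error message before the agent tries to join a call.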
Design a Voice From Descriptions
After setting up your environment, installing Vision Agents, and configuring all required AI services for your voice pipeline, you can now create a new Qwen3-TTS instance in Vision Agents with parameters such as model, mode, language, and instruct. Since we are generating a custom voice, be sure to set mode="voice_design".
```python
tts = Qwen3TTS(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    mode="voice_design",
    language="English",
    instruct="A warm, confident female narrator in her 30s with a clear mid-range voice.",
)
```
Note: We are using the following voice design model from Hugging Face.
| Model | HuggingFace ID | Mode | Parameters | Features |
|---|---|---|---|---|
| VoiceDesign 1.7B | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | voice_design | 1.7B | Text-described voice design |
A Complete Voice Design Example
Let’s create a custom AI voice to simulate an old African-American grandma.
```python
"""
Qwen3-TTS VoiceDesign — African-American Grandma

A very old, cranky, and croaky African-American grandma. 80 years old.
Very hoarse, grumpy, shrill, and frustrated.

Combines: Age Control (80 years old), Acoustic Attribute Control
(hoarse, shrill, croaky), Human-Likeness (natural elderly speech),
Gradual Control (grumpy pacing with frustrated emphasis).

Required env vars: HF_TOKEN, DEEPGRAM_API_KEY, GOOGLE_API_KEY,
STREAM_API_KEY, STREAM_API_SECRET
"""

import asyncio
import logging
import sys
from pathlib import Path

# Make the local plugin package importable from this script's location.
PROJECT_ROOT = Path(__file__).resolve().parents[3]
sys.path.insert(0, str(PROJECT_ROOT))

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, smart_turn

from plugins.qwen3tts.vision_agents.plugins.qwen3tts import TTS as Qwen3TTS

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create a voice agent with a cranky grandma persona."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Grandma Lucille", id="agent"),
        instructions=(
            "You are Grandma Lucille, an 80-year-old African-American grandmother "
            "who has seen it all and has zero patience left. You are cranky, blunt, "
            "and always complaining, but deep down you care. "
            "IMPORTANT: Keep every response to ONE short sentence, under 15 words."
        ),
        tts=Qwen3TTS(
            model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
            mode="voice_design",
            language="English",
            instruct=(
                "An elderly female grandmother, 80 years old, with a high-pitched, "
                "thin, croaky old woman's voice. She sounds cranky and shrill, "
                "with a scratchy, nasal, feminine tone that wavers with age. Her "
                "speech is slow with sharp, irritable emphasis. The voice is "
                "distinctly an old lady's — reedy, quavering, and breathless, "
                "with a warm Southern African-American cadence."
            ),
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-2.5-flash"),
        turn_detection=smart_turn.TurnDetection(
            silence_duration_ms=2000,
            speech_probability_threshold=0.5,
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    logger.info("Starting African-American Grandma VoiceDesign Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Mmhmm. What do you want now, child?"
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
In this example, we created a new voice agent in Vision Agents, defined a custom voice prompt, and passed it to the instruct parameter of the Qwen3-TTS model’s definition.
```python
tts=Qwen3TTS(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    mode="voice_design",
    language="English",
    instruct=(
        "An elderly female grandmother, 80 years old, with a high-pitched, "
        "thin, croaky old woman's voice. She sounds cranky and shrill, "
        "with a scratchy, nasal, feminine tone that wavers with age. Her "
        "speech is slow with sharp, irritable emphasis. The voice is "
        "distinctly an old lady's — reedy, quavering, and breathless, "
        "with a warm Southern African-American cadence."
    ),
),
```
Running the complete Python script above should produce a voice similar to this demo.
Steer Your Voice Prompt To Another Use Case
In the previous voice design demo, we generated speech that sounds like an old African-American grandma. To modify it for another use case, all you need to do is change the prompt to generate, for example, “A friendly mythical God, Zeus, with a huge, deep, powerful voice, charming, proud, strong, and theatrical”.
```python
tts=Qwen3TTS(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    mode="voice_design",
    language="English",
    instruct=(
        "A powerful male god with an immensely deep, booming, resonant "
        "bass voice that reverberates as if echoing through a vast marble "
        "temple. The tone is charming, proud, and strong with a theatrical, "
        "grandiose delivery. He speaks slowly and deliberately, savoring "
        "each word with regal authority. The voice carries warmth beneath "
        "its overwhelming power, with rich, velvety low tones and a "
        "commanding, larger-than-life presence."
    ),
),
```
Inserting this code snippet into the complete Python script should generate a voice based on the modified description.
The Qwen3-TTS plugin on GitHub has several other voice design demos ready for you to try.
Voice Design Prompting Guide and Best Practices
Creating effective voice design prompts involves writing the descriptions in dimensions/layers as shown below.
| Attribute | Description | Sample |
|---|---|---|
| Identity | Gender, age, character archetype | "An elderly African-American female grandmother, 80 years old" |
| Pitch & register | High, mid-range, deep, bass | "A deep, booming, and resonant bass voice" |
| Texture & timbre | Breathy, raspy, smooth, nasal, husky | "Thin, raspy, and cracked with age" |
| Emotion & personality | Warm, angry, menacing, cheerful, proud | "Cool, composed, and subtly seductive" |
| Pacing & cadence | Slow, fast, deliberate, erratic | "Speaks slowly and deliberately, savoring each word" |
| Accent & dialect | Regional or cultural influence | "A thick French accent that softens consonants" |
| Distinguishing details | Unique mannerisms, metaphors, imagery | "As if echoing through a vast marble temple" |
Be Clear and Specific
The Qwen3-TTS voice design model uses text comprehension and an internal thinking process to parse complex descriptions. To get better results, use concrete descriptions rather than abstract labels. Here are some weak and strong examples.
- Weak: "An old man's voice"
- Strong: "An elderly man in his 80s with a reedy, quavering voice that wavers with age, slow and breathless, with a warm scratchy quality"
Sentence Length
The demos in this article use 2-4 sentences (about 40-80 words). This gives the audio model enough detail without overwhelming it. A short prompt ("deep male voice") lacks specificity, while a long one risks conflicting attributes.
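If you want to enforce that guideline programmatically, a rough word-count guard is easy to sketch. The 40-80 word window is this article's rule of thumb, not a Qwen3-TTS requirement, and the helper name is illustrative.

```python
# Illustrative guard for the suggested 40-80 word prompt window.
def prompt_length_ok(instruct, low=40, high=80):
    """True when the prompt's word count falls inside [low, high]."""
    return low <= len(instruct.split()) <= high
```

A prompt like "deep male voice" would fail this check, nudging you to add the acoustic detail the model needs.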
Acoustic Qualities Description
Ensure you describe acoustic qualities, not just personality.
The model controls timbre, pitch, and prosody, not semantic content. Focus on how the voice sounds, not what the agent would say.
- Weak: “A wise philosopher who quotes Aristotle.”
- Strong: “A calm, measured male voice with a deep, resonant tone and slow, contemplative pacing.”
Language Context for Accents
When creating a non-native accent, specify both the accent's origin and the target spoken language. This ensures the accent characteristics blend correctly. Here are some examples.
- “A calm, husky male voice with a thick Japanese accent speaking English”
- “A whispery female voice with a thick French accent speaking English”
Limitations of Qwen3-TTS Voice Design
Although the voice design functionality of Qwen3-TTS performs on par with a model like Gemini 3.1 Flash TTS, it still has some limitations.
- The voice design feature is only available with the 1.7B variant of the Qwen3-TTS family of models. The 0.6B does not support voice design.
- There is no way to mix voice design and cloning.
- It is not possible to set a speaker ID or embedding to anchor the voice. This can hinder speech consistency and slightly alter voice characteristics across several generations using the same prompt.
- Voice design support excels in Chinese and English. Descriptions for other supported languages may produce less accurate or expressive results.
- The model does not follow instructions correctly when dealing with conflicting attributes, such as "high-pitched deep bass". It may output unpredictable results by favoring one attribute over the other.
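Because conflicting attributes produce unpredictable output, you may want to screen prompts before synthesis. The following is a minimal heuristic sketch; the conflict pairs are illustrative, not an exhaustive or official list.

```python
# Illustrative heuristic only: flags a few obviously contradictory
# descriptor pairs before sending the prompt to the model.
CONFLICTS = [
    ("high-pitched", "deep"),
    ("high-pitched", "bass"),
    ("whispery", "booming"),
    ("slow", "rapid"),
]

def find_conflicts(instruct):
    """Return the descriptor pairs that both appear in the prompt."""
    text = instruct.lower()
    return [pair for pair in CONFLICTS if all(word in text for word in pair)]
```

Running this on the "high-pitched deep bass" example from the limitation above would flag two contradictory pairs, letting you rewrite the prompt before generating audio.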
Further Reading
This article introduced you to AI voice design, tools, models, and frameworks to help you get started quickly. You now know how to design a custom AI voice experience using Qwen3-TTS in Vision Agents.
Voice design is one of the three main characteristics that make AI speech feel human-like, natural-sounding, and expressive. Not all TTS models support voice design, but if you would like to try alternatives with Vision Agents, there are excellent options, such as ElevenLabs TTS or Gemini 3.1 Flash TTS. They all integrate seamlessly with Vision Agents.
