
How to Clone Any Voice in Minutes Using Voxtral TTS

9 min read

Most state-of-the-art text-to-speech models now offer voice cloning, letting you quickly generate a replica of any voice for speech-enabled solutions. In this tutorial, we use one of the best models in this category to deliver high-quality voice clones.

Amos G.
Published April 29, 2026
AI voice cloning with Voxtral

What You Will Build

This tutorial demonstrates how to build an AI speech app with in-app voice cloning support.
You can clone your favorite voice by supplying a reference audio of about 3 seconds. Here is a demo.

Voice cloning example demonstrating reference and agent's output voices

Requirements

To build an agent with voice cloning support, we need a few AI services. To keep things simple, we'll create the agent with Vision Agents, a Python framework for building multimodal AI apps that support speech, video, and vision. Install Python 3.12 or later on your machine, then begin by obtaining the following credentials.

  • MISTRAL_API_KEY to access Voxtral TTS.
  • DEEPGRAM_API_KEY for speech-to-text model access.
  • GOOGLE_API_KEY for access to a Gemini model.
  • STREAM_API_KEY and STREAM_API_SECRET from your dashboard for realtime audio and video transport.
  • A 3-25 second reference audio file for voice cloning (WAV recommended).
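A minimal sketch of the environment setup, assuming you keep the credentials above in a `.env` file (the sample code in this tutorial calls `load_dotenv()`); the values below are placeholders:

```shell
# .env -- loaded by load_dotenv() in the agent scripts; replace with real keys
MISTRAL_API_KEY=your-mistral-api-key
DEEPGRAM_API_KEY=your-deepgram-api-key
GOOGLE_API_KEY=your-google-api-key
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-api-secret
```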

Quick Start in Vision Agents

Visit the Vision Agents quickstart guide to install the core Python framework and the required plugins (Gemini, Deepgram, and Smart Turn) to run the agent. To support voice cloning, we also need a text-to-speech (TTS) model. Feel free to use any TTS model from providers such as Google (Gemini 3.1 Flash TTS), OpenAI, Grok, or Qwen (Qwen3-TTS). Voxtral TTS, a new text-to-speech model from Mistral AI, is the one we will use to provide cloning.

Vision Agents supports Mistral models when you want to use them for LLM processing. However, to use a Mistral model for TTS, we need to integrate it as a custom plugin. Check out the Vision Agents step-by-step guide to learn more. The Voxtral TTS plugin used in this tutorial is hosted on GitHub. Follow its README.md to learn how to use it, or clone the repo and run the two voice cloning examples included with the plugin.

Clone a Voice and Test It in Multiple Languages

Start by providing the voice you want to clone in WAV format. You can place the WAV file anywhere in your Python project. After you install Vision Agents and the plugins needed for voice cloning, create a new Python file and fill it with this code.

```python
"""
Voxtral TTS Example

This example demonstrates Voxtral TTS (Mistral) integration with Vision Agents.

This example creates an agent that uses:
- Voxtral TTS for text-to-speech (voxtral-mini-tts-2603)
- Deepgram for speech-to-text
- Gemini for LLM
- GetStream for edge/real-time communication
- Smart Turn for turn detection

Requirements:
- MISTRAL_API_KEY environment variable
- DEEPGRAM_API_KEY environment variable
- GOOGLE_API_KEY environment variable
- STREAM_API_KEY and STREAM_API_SECRET environment variables
"""

import asyncio
import base64
import logging
import sys
from pathlib import Path

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, smart_turn

# Make the locally checked-out Voxtral plugin importable.
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from vision_agents.plugins.voxtral import TTS as VoxtralTTS

logger = logging.getLogger(__name__)
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with Voxtral TTS."""
    ref_audio_path = Path(__file__).resolve().parents[3] / "david.wav"
    ref_audio_b64 = base64.b64encode(ref_audio_path.read_bytes()).decode()

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Voxtral AI", id="agent"),
        instructions=(
            "You are a friendly multilingual voice assistant powered by "
            "Voxtral TTS. Keep your responses short and conversational. "
            "You can speak in English, French, Spanish, Portuguese, Italian, "
            "Dutch, German, Hindi, and Arabic."
        ),
        tts=VoxtralTTS(
            model="voxtral-mini-tts-2603",
            ref_audio=ref_audio_b64,
            response_format="pcm",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
        turn_detection=smart_turn.TurnDetection(
            silence_duration_ms=2000,
            speech_probability_threshold=0.5,
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    call = await agent.create_call(call_type, call_id)
    logger.info("Starting Voxtral TTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm powered by Voxtral TTS from Mistral."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

The sample code above creates an agent in Vision Agents that uses the following Voxtral TTS configuration to clone the voice supplied via ref_audio.

```python
tts=VoxtralTTS(
    model="voxtral-mini-tts-2603",
    ref_audio=ref_audio_b64,
    response_format="pcm",
),
```

The input audio's path must be specified and the file read and base64-encoded like this.

```python
ref_audio_path = Path(__file__).resolve().parents[3] / "david.wav"
ref_audio_b64 = base64.b64encode(ref_audio_path.read_bytes()).decode()
```
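If you want to sanity-check this encoding step without a real recording, you can round-trip a synthetic in-memory WAV through base64 using only the standard library:

```python
# Verify the bytes -> base64 -> bytes round trip on a tiny generated WAV.
import base64
import io
import wave

# Build a 0.5 s silent mono WAV in memory (16-bit samples at 16 kHz).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 8000)

wav_bytes = buf.getvalue()

# Same pattern as in the agent: raw bytes -> base64 string for the API payload.
ref_audio_b64 = base64.b64encode(wav_bytes).decode()

# Decoding must reproduce the original bytes exactly.
assert base64.b64decode(ref_audio_b64) == wav_bytes
print(len(ref_audio_b64) % 4)  # base64 output length is a multiple of 4 -> 0
```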

When you run the complete sample code above, you can test and interact with the voice agent as shown below. You will notice that the agent's real-time responses sound very similar to your reference audio.

Voice cloning example demonstrating multiple languages

Zero-Shot Voice Cloning With Voxtral TTS Example

Using the same reference audio or a different 3-5 second clip, let's capture its emotion, speaking style, and accent with this sample code.

```python
"""
Voxtral TTS Voice Cloning Example

Requirements:
- MISTRAL_API_KEY environment variable
- DEEPGRAM_API_KEY environment variable
- GOOGLE_API_KEY environment variable
- STREAM_API_KEY and STREAM_API_SECRET environment variables
- A reference audio file (david.wav) for voice cloning
"""

import asyncio
import base64
import logging
import os
import sys
from pathlib import Path

from dotenv import load_dotenv
from mistralai.client import Mistral

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream

# Make the locally checked-out Voxtral plugin importable.
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from vision_agents.plugins.voxtral import TTS as VoxtralTTS

logger = logging.getLogger(__name__)
load_dotenv()

PROJECT_ROOT = Path(__file__).resolve().parents[3]
REF_AUDIO_PATH = PROJECT_ROOT / "david.wav"


def create_saved_voice(audio_path: Path = REF_AUDIO_PATH) -> str:
    """
    Create a saved voice via the Mistral Voices API.

    Once created, the voice can be reused across requests by its ID,
    avoiding the need to pass ref_audio every time.

    Args:
        audio_path: Path to audio file (3-25s, single speaker, clean WAV/MP3).

    Returns:
        The voice ID string.
    """
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    sample_audio_b64 = base64.b64encode(audio_path.read_bytes()).decode()
    voice = client.audio.voices.create(
        name="david-cloned-voice",
        sample_audio=sample_audio_b64,
        sample_filename=audio_path.name,
        languages=["en"],
    )
    logger.info("Created voice: %s (id=%s)", voice.name, voice.id)
    return voice.id


async def create_agent_with_ref_audio(**kwargs) -> Agent:
    """
    Create an agent using on-the-fly voice cloning via ref_audio.

    Pass a base64-encoded audio clip directly to clone a voice without
    creating a saved voice first. Best for one-off use or experimentation.
    """
    ref_audio_b64 = base64.b64encode(REF_AUDIO_PATH.read_bytes()).decode()

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Cloned Voice AI", id="agent"),
        instructions=(
            "You are a voice assistant that sounds like the person in the "
            "reference audio. Keep responses natural and conversational."
        ),
        tts=VoxtralTTS(
            model="voxtral-mini-tts-2603",
            ref_audio=ref_audio_b64,
            response_format="pcm",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def create_agent_with_voice_id(**kwargs) -> Agent:
    """
    Create an agent using a previously saved voice ID.

    Saved voices provide consistent results and avoid sending the
    reference audio with every request. Set VOXTRAL_VOICE_ID in .env
    or pass voice_id in kwargs.
    """
    voice_id = kwargs.get("voice_id") or os.environ.get("VOXTRAL_VOICE_ID")
    if not voice_id:
        logger.info(
            "No voice_id provided, creating a saved voice from %s...",
            REF_AUDIO_PATH.name,
        )
        voice_id = create_saved_voice(REF_AUDIO_PATH)

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Cloned Voice AI", id="agent"),
        instructions=(
            "You are a voice assistant with a custom cloned voice. "
            "Keep responses short, friendly, and natural."
        ),
        tts=VoxtralTTS(
            model="voxtral-mini-tts-2603",
            voice_id=voice_id,
            response_format="pcm",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    call = await agent.create_call(call_type, call_id)
    logger.info("Starting Voxtral Voice Cloning Agent...")
    async with agent.join(call):
        logger.info("Agent joined call with cloned voice")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm speaking with a cloned voice powered by Voxtral TTS."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(
        AgentLauncher(
            create_agent=create_agent_with_ref_audio,
            join_call=join_call,
        )
    ).cli()
```

This Python script is similar to the one in the previous section. Here, we provide a path to the reference WAV file and create a Voxtral TTS-powered agent in Vision Agents that captures voice characteristics, including pitch, accent, gender, emotion, timbre, volume, and pacing.
Run the script and interact with the agent to hear voice cloning in real time.

Voice cloning example demonstrating a news anchor's voice

We have now built two voice cloning demos in Vision Agents using Voxtral TTS from Mistral AI. Let's also look at voice cloning in general, its use cases, and its limitations in current TTS models.

What is Voice Cloning?

Voice cloning is a technique that creates a synthetic replica of a person's voice from a sample of their speech. The feature is already supported by many current open-source and commercial text-to-speech models, such as Qwen3-TTS, Gemini 3.1 Flash TTS, Inworld, and AWS Polly. To choose a model with voice cloning support for your project, check out the text-to-speech category on Hugging Face. For latency, quality, and price comparisons, refer to the best text-to-speech models on Artificial Analysis.

AI Voice Cloning Use Cases


When building AI speech apps for prototyping, production, or enterprise use cases, voice cloning can help create a specific tone and personality for marketing, sales, healthcare, customer service, and other domains.

  • Voiceovers: Voice clones are suitable for voiceovers in animated movies, storytelling, etc.
  • Long-Form Audio Generation: Clone and use replicas of real human voices for multi-speaker podcasts.
  • Sales: Use a well-crafted cloned voice for an agent that sells to customers.

Voice Cloning With Voxtral TTS

Although the Voxtral TTS model is not open source, we chose it for the demos in this article for its quality and price. It is a small, multilingual, cost-efficient text-to-speech model for speech-enabled projects.

  • Multilingual Voice Cloning: As demonstrated in one of our previous demos, cloned voices can automatically speak in nine different languages.
  • Adjust Delivery: You have full control over voice characteristics, including emotion, clarity, pacing, and more.
  • Instant Cloning: Generate AI voices instantly using a 3-second WAV audio file.

Limitations of Voice Cloning in Voxtral TTS

The Voxtral TTS model performs zero-shot voice cloning very well. However, there are some constraints to consider when integrating it into your AI speech projects.

  • Limited Language Support: The voice generation and cloning support is limited to only nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The quality of cloned voices in unsupported languages may degrade.
  • Reference Voice: The input voice must be about 3-25 seconds long; the model is trained to work best on 5-25s reference audio for instant cloning. Clips that are too short may lack speaker characteristics, while clips that are too long may be truncated or yield reduced quality.
  • Single Speaker Clip: The reference clip must contain only one speaker. Overlapping voices or even a little background noise can result in poor output quality.
  • Licensing: The open-weight model on Hugging Face is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. To use it for commercial voice agents, you must use the hosted Mistral API (priced at $0.016 / 1k characters at launch) or obtain a commercial license.

Reference Audio Tips (for Voice Cloning)

Regardless of the TTS model you use for voice cloning, achieving better, more satisfying output depends on many factors. Let’s highlight some of these key factors.

  • Similar Output: To get speech output that resembles the reference voice, supply 3-25 seconds of clean audio.
    Prefer a single speaker with minimal background noise.
    Use input audio with neutral intonation, stress, and rhythm; excessive pausing and disfluencies hurt the quality of the output voice.
    Natural conversational speech in WAV format works best.
  • Reference Text: In addition to the input audio, you can provide a reference text transcript to improve quality.
  • Expressive Pitch: Use audio with expressive pitch to achieve better results, as flat voices may yield flat output.

Further Reading

A cloned voice can make synthetic speech sound and feel natural and human-like. In this tutorial, you were introduced to Voxtral TTS's voice cloning support by building two examples in Vision Agents. Since the model supports only nine languages, consider alternative text-to-speech models from leading AI service providers such as ElevenLabs, Deepgram, Cartesia, Grok, and OpenAI if your app needs broader language coverage for voice cloning.

Aside from cloning human voices with AI for in-app integrations, you can check out how to design AI voices to implement realistic experiences in any service.
