Voice agents are getting better, but most text-to-speech pipelines still assume you’re okay with cloud APIs, large models, and unpredictable latency.
If you want fast, natural-sounding speech that runs entirely on your own hardware (no GPU, no network calls), you need a different approach.
In this tutorial, you’ll build a real-time AI voice agent that runs locally on your laptop using Pocket TTS, a lightweight 100M-parameter text-to-speech model. You’ll wire it into a full voice pipeline with Vision Agents, handling speech-to-text, LLM responses, and real-time audio delivery over Stream Video.
By the end, you’ll have a low-latency, offline-friendly voice agent you can run on CPU and extend for real production use cases.
What You Will Build
This guide walks you through building a demo (similar to the one above) using Kyutai's efficient 100M-parameter model. The agent listens, transcribes your speech, generates a response with an LLM, and speaks back using Pocket TTS, all with low latency and no GPU.
Problems with Current Text-to-Speech AI Models
When creating a custom voice AI pipeline for speech-enabled projects, there are effective, reliable solutions such as ElevenLabs, Deepgram, Inworld, and Cartesia for the text-to-speech (TTS) component. However, these services are built around large models served from the cloud and cannot run locally on consumer hardware like your laptop.
An open-source alternative to Pocket TTS is VibeVoice by Microsoft, a family of 1.5B- and 7B-parameter models. They can run on a laptop CPU, but they are far less lightweight than Pocket TTS, which has only 100M parameters.
This creates a few practical constraints:
- Mostly 1B+ Parameters: Most TTS models small enough to run on consumer hardware still have 1B to roughly 8B parameters, so they can be noticeably slow on your device.
- Limited Voice Cloning Support: Voice cloning, the ability of a TTS model to reproduce a speaker's voice from a given audio sample, is crucial for building speech-enabled apps. However, not all commercial, production-ready models in this category support it.
- Latency: The larger a TTS model's parameter count, the more compute it requires. Choosing a larger voice model for your project can translate into high latency in the agent's or assistant's responses.
Pocket TTS sidesteps these tradeoffs by keeping the model small while remaining fast enough to run locally on CPU, making it well-suited for low-latency, speech-enabled applications.
Quick Start in Python
To quickly set up and run our local Pocket TTS-powered agent in Python, let's follow these steps.
Set Up Your Python Environment and Install Vision Agents
Running the demo requires Python 3.13 or later installed on your machine. To integrate Pocket TTS into a voice AI pipeline, we will use Vision Agents to orchestrate the LLM (gemini-3-flash-preview), speech-to-text (Deepgram), and TTS components into a cohesive speech-enabled workflow.
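If you are unsure which interpreter your environment will pick up, a quick check like the one below confirms the version requirement before you continue. This snippet is just a convenience, not part of the demo:

```python
# Confirm the interpreter meets the Python 3.13+ requirement.
import sys

assert sys.version_info >= (3, 13), f"Python 3.13+ required, found {sys.version.split()[0]}"
```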
Vision Agents enable developers and enterprise teams to build video, voice, and vision AI applications using an open-source framework, third-party model providers, and telephony services.
Run the commands below in the order shown to configure your Python working environment, install Vision Agents, and get the companion plug-ins up and running.
```bash
# 1. Install Python 3.13 or later

# 2. Create a new Python project: Using uv is recommended
uv init pocket-tts-vision-agents-demo
cd pocket-tts-vision-agents-demo

# 3. Create and activate your environment
uv venv .venv
source .venv/bin/activate

# Store the required API credentials
touch .env

# Inside .env
# Stream API credentials
STREAM_API_KEY=...        # https://beta.dashboard.getstream.io/signup/
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

# Vision Agents' plugins API credentials
DEEPGRAM_API_KEY=...      # STT: https://console.deepgram.com/signup
GOOGLE_API_KEY=...        # LLM: https://aistudio.google.com/api-keys

# 4. Install Vision Agents and required plugins
uv add vision-agents
uv add "vision-agents[pocket, getstream, deepgram, gemini]"
```
The Complete Sample Code With Step-by-Step Configuration
In the root of your generated uv project, rename main.py to pocket_tts_example.py and replace its content with this Python script.
```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Any, AsyncIterator, Iterator, Literal

import numpy as np
from scipy import signal

from getstream.video.rtc.track_util import AudioFormat, PcmData
from vision_agents.core import User, Agent, cli, tts
from vision_agents.core.agents import AgentLauncher
from vision_agents.core.warmup import Warmable
from vision_agents.plugins import getstream, pocket, deepgram, gemini
from pocket_tts import TTSModel

logger = logging.getLogger(__name__)

Voice = Literal["alba", "marius", "javert", "jean", "fantine", "cosette", "eponine", "azelma"]


class PocketTTS(tts.TTS, Warmable[tuple[TTSModel, Any]]):
    """
    Fixed Pocket TTS that correctly uses predefined voices without voice cloning.
    """

    def __init__(self, voice: Voice = "alba") -> None:
        super().__init__(provider_name="pocket")
        self.voice = voice
        self._model: TTSModel | None = None
        self._voice_state = None
        self._executor = ThreadPoolExecutor(max_workers=4)

    async def on_warmup(self) -> tuple[TTSModel, Any]:
        if self._model is not None and self._voice_state is not None:
            return (self._model, self._voice_state)

        loop = asyncio.get_running_loop()

        logger.info("Loading Pocket TTS model...")
        model = await loop.run_in_executor(self._executor, TTSModel.load_model)
        logger.info("Pocket TTS model loaded")

        # Pass voice NAME directly (not a path) to use predefined embeddings
        logger.info(f"Loading voice state for: {self.voice}")
        voice_state = await loop.run_in_executor(
            self._executor,
            lambda: model.get_state_for_audio_prompt(self.voice),
        )
        logger.info("Voice state loaded")

        return (model, voice_state)

    def on_warmed_up(self, resource: tuple[TTSModel, Any]) -> None:
        self._model, self._voice_state = resource

    async def _ensure_loaded(self) -> None:
        if self._model is None or self._voice_state is None:
            resource = await self.on_warmup()
            self.on_warmed_up(resource)

    async def stream_audio(
        self, text: str, *_, **__
    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:
        logger.info(f"🎤 TTS generating audio for: '{text}'")

        await self._ensure_loaded()
        assert self._model is not None
        assert self._voice_state is not None

        model = self._model
        voice_state = self._voice_state

        # Target sample rate for WebRTC compatibility
        target_sample_rate = 48000

        def _generate():
            print(f"[PocketTTS] Generating audio for: '{text}'")
            audio_tensor = model.generate_audio(voice_state, text)
            audio_np = audio_tensor.numpy()
            original_sample_rate = model.sample_rate

            print(f"[PocketTTS] Audio shape: {audio_np.shape}, dtype: {audio_np.dtype}")
            print(f"[PocketTTS] Audio range: min={audio_np.min():.4f}, max={audio_np.max():.4f}")
            print(f"[PocketTTS] Original sample rate: {original_sample_rate}Hz")

            # Resample to 48kHz for WebRTC compatibility
            if original_sample_rate != target_sample_rate:
                num_samples = int(len(audio_np) * target_sample_rate / original_sample_rate)
                audio_resampled = signal.resample(audio_np, num_samples)
                print(f"[PocketTTS] Resampled: {len(audio_np)} -> {len(audio_resampled)} samples ({original_sample_rate}Hz -> {target_sample_rate}Hz)")
            else:
                audio_resampled = audio_np

            # Ensure audio is 1D
            if len(audio_resampled.shape) > 1:
                audio_resampled = audio_resampled.flatten()
                print(f"[PocketTTS] Flattened audio to 1D: {audio_resampled.shape}")

            # Normalize and amplify the audio (in case it's too quiet)
            max_val = np.abs(audio_resampled).max()
            if max_val > 0:
                # Normalize to use 80% of the dynamic range
                audio_normalized = audio_resampled / max_val * 0.8
                print(f"[PocketTTS] Normalized audio: max was {max_val:.4f}, now {np.abs(audio_normalized).max():.4f}")
            else:
                audio_normalized = audio_resampled
                print("[PocketTTS] WARNING: Audio is silent (max=0)")

            pcm16 = (np.clip(audio_normalized, -1.0, 1.0) * 32767.0).astype(np.int16)
            print(f"[PocketTTS] PCM16 samples: {len(pcm16)}, min={pcm16.min()}, max={pcm16.max()}")

            return pcm16, target_sample_rate

        loop = asyncio.get_running_loop()
        samples, sample_rate = await loop.run_in_executor(self._executor, _generate)

        print(f"[PocketTTS] ✅ Audio ready: {len(samples)} samples at {sample_rate}Hz, duration: {len(samples)/sample_rate:.2f}s")

        return PcmData.from_numpy(
            samples, sample_rate=sample_rate, channels=1, format=AudioFormat.S16
        )

    async def stop_audio(self) -> None:
        pass

    async def close(self) -> None:
        await super().close()
        self._executor.shutdown(wait=False)


async def create_agent(**kwargs) -> Agent:
    """Create the agent with Pocket TTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Pocket AI", id="agent"),
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        tts=PocketTTS(voice="jean"),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting Pocket TTS Agent...")

    async with agent.join(call):
        logger.info("Agent joined call, waiting for participant...")

        # Wait for a human participant to join the call
        await agent.wait_for_participant()
        logger.info("Participant joined")

        # Greet the user
        await agent.say("Hello! I'm running Pocket TTS locally. How can I help you?")

        # Keep the agent running to listen and respond
        # The agent automatically handles: STT -> LLM -> TTS via the edge
        await agent.finish()


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
```
In summary, you import getstream from the Vision Agents framework, along with the installed Python plugins pocket, deepgram, and gemini. You then initialize a new multimodal agent with voice, video, vision, and text capabilities using the Agent class in Vision Agents.
With your agent defined, the next step is to configure it with parameters for the various voice components. Finally, you create and join a new video call using Stream Video to enable real-time audio and video communication.
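If the full script feels long, note that most of it is the PocketTTS wrapper (model loading, resampling to 48 kHz, and logging). The core wiring reduces to the two functions below, condensed from the script above:

```python
# Condensed recap of pocket_tts_example.py: create_agent() wires the plugins
# together, and join_call() joins a Stream call so Vision Agents can route
# audio between STT, the LLM, and TTS.
async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),                        # Stream Video transport
        agent_user=User(name="Pocket AI", id="agent"),
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        tts=PocketTTS(voice="jean"),                  # local Pocket TTS wrapper defined above
        stt=deepgram.STT(eager_turn_detection=True),  # Deepgram speech-to-text
        llm=gemini.LLM("gemini-3-flash-preview"),     # Gemini 3 Flash
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.wait_for_participant()   # wait for a human to join
        await agent.say("Hello! I'm running Pocket TTS locally. How can I help you?")
        await agent.finish()                 # run the STT -> LLM -> TTS loop
```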
At the root of your project, run uv run pocket_tts_example.py. You can now have voice conversations with the agent, which will respond to you in real time using Pocket TTS.
Note: The above demo integrates an AI avatar with the Pocket TTS voice output. Check out the Vision Agents docs to get started with AI avatars.
How It Works
Although this tutorial focuses on the Pocket TTS integration with Vision Agents, the multimodal AI pipeline relies on four main components:
- Pocket TTS for text-to-speech, running the small 100M-parameter model locally on your CPU.
- Deepgram for speech-to-text.
- Gemini as the LLM for processing the user's input and generating responses.
- Stream for edge/real-time audio and video communication.
When you run the Python script, the user speaks through the built-in Stream Video integration with Vision Agents. The raw input audio is routed to Deepgram, which transcribes it into text. The transcript is then sent to Gemini 3 Flash for processing, and Pocket TTS converts the model's text response back into speech, which the user hears as the voice agent's reply.
Benefits of Pocket TTS
Pocket TTS is a fast, lightweight TTS plugin for Vision Agents and runs efficiently on CPU with low latency (~200ms).
- Lightweight and Fast Response: With only 100M parameters, it requires no GPU to run.
- Mobile Use Cases: Because the model is so small, it can easily be deployed for on-device mobile use cases.
- Voice Selection: Ships with a set of built-in voices to choose from.
- Voice Cloning: Can reproduce a human voice from a short audio sample.
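If you want to sanity-check the ~200ms latency figure on your own hardware, a rough timing sketch using the same pocket_tts calls as the script above looks like this (absolute numbers will vary with your CPU and the length of the text):

```python
# Rough latency check for Pocket TTS generation on the local CPU.
import time

from pocket_tts import TTSModel

model = TTSModel.load_model()
state = model.get_state_for_audio_prompt("alba")  # built-in voice

start = time.perf_counter()
audio = model.generate_audio(state, "Hello, how can I help you today?")
elapsed = time.perf_counter() - start

duration = audio.numpy().size / model.sample_rate
print(f"Generated {duration:.2f}s of audio in {elapsed * 1000:.0f}ms")
```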
Pocket TTS Usage in Vision Agents
You can use the model to build voice experiences across a broader range of use cases, such as voice-enabled home assistants. The code snippet below shows a simple usage of the plugin in Vision Agents.
```python
from vision_agents.plugins import pocket

# Create TTS with default voice
tts = pocket.TTS()

# Or specify a built-in voice
tts = pocket.TTS(voice="marius")

# Or use a custom voice for cloning
tts = pocket.TTS(voice="path/to/your/voice.wav")
```
When working with the model, you can choose from a handful of male and female AI voices: alba (the default), marius, javert, jean, fantine, cosette, eponine, and azelma.
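To preview these voices offline before wiring them into an agent, you can generate short WAV clips directly with the pocket_tts calls shown earlier. Writing the files via scipy is an assumption added here for illustration:

```python
# Write a short preview clip for a few of the built-in voices.
import numpy as np
from scipy.io import wavfile

from pocket_tts import TTSModel

model = TTSModel.load_model()

for voice in ["alba", "marius", "fantine"]:
    state = model.get_state_for_audio_prompt(voice)
    audio = model.generate_audio(state, "This is a quick voice preview.").numpy().flatten()
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)  # float -> 16-bit PCM
    wavfile.write(f"preview_{voice}.wav", int(model.sample_rate), pcm16)
```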
Pocket TTS Limitations
Although Pocket TTS runs fast and reliably on CPUs, it lacks multilingual support. At the time of writing, the model supports only English.
If you want an open-source alternative to this model for multiple languages, you can use VibeVoice from Microsoft.
Where To Go Next
This article demonstrated one of the many use cases of Pocket TTS in building voice agents. The model is lightweight and suitable for running on both older and newer laptops and mobile devices.
The seamless integration of the Pocket TTS plugin in the latest release of Vision Agents turns your voice, vision, and video projects from prototypes into scalable, production-ready deployments with FastAPI.
Check out the following resources to learn more:
- 👷 Build functional AI apps with Vision Agents in minutes.
- ⭐ Star the repo on GitHub.
- 📖 Read the docs.
- 💬 Join our Discord community for support.
- 🔧 Contribute a plugin.
