Vision Agents has out-of-the-box support for the LLM services and providers developers need to build voice, vision, and video AI applications. The framework also makes it easy to integrate custom AI services — either by following a step-by-step guide or by vibe coding them using SoTA models.
Let’s use Claude Opus 4.6 to create a custom text-to-speech (TTS) plugin with Kitten TTS and hook it into Vision Agents as a TTS component for voice applications.
Here’s a quick look at the plugin in action:
Project and System Requirements for Testing the Plugin
To vibe code the plugin, we will use Claude Opus 4.6 in Cursor. You can also use the model via any agentic coding platform you prefer. The sample demo creates an agent that uses:
- Kitten TTS for text-to-speech, running locally on a CPU at under 25MB.
- Deepgram for speech-to-text.
- Gemini 3 Flash for LLM processing.
- Stream Video SDK for edge/realtime communication.
To run and test it, you should install Python 3.12 or a later version using Conda or uv. You also need API credentials for Stream, Google, and Deepgram. Create accounts and generate API keys from the providers using these links.
- Stream Account (for Vision Agents).
- Google AI Studio Account (for Gemini models).
- Deepgram (for a speech-to-text model).
What is Kitten TTS?
Kitten TTS is an open-source, local text-to-speech AI built from a series of tiny models that can run on laptops, smartphones, and wearables. The models are small enough to run in the browser and on any edge device, with no GPU required and no privacy trade-offs, since inference stays local.
The following Kitten TTS models are available on Hugging Face. Download whichever one you prefer to test.
| Model | Size | Parameters | Download |
|---|---|---|---|
| kitten-tts-mini | 80MB | 80M | Hugging Face |
| kitten-tts-micro | 41MB | 40M | Hugging Face |
| kitten-tts-nano | 56MB | 15M | Hugging Face |
| kitten-tts-nano-0.8-int8 | 25MB | 15M | Hugging Face |
All of these models are released under the Apache-2.0 license, so you can download any of them and use them however you like.
The best way to see how Kitten TTS works is to try the interactive speech generation playground on Hugging Face. For a basic audio generation in Python, run the following sample script.
```bash
uv pip install https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl
```
Basic Script
```python
from kittentts import KittenTTS

m = KittenTTS("KittenML/kitten-tts-mini-0.8")

audio = m.generate(
    "This high quality TTS model works without a GPU.",
    voice="Jasper",
)
# available_voices : ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']

# Save the audio
import soundfile as sf

sf.write("output.wav", audio, 24000)
```
Vision Agents and AI Plugins
The sample code in the previous section demonstrates basic Kitten TTS usage in Python for generating speech. To use it in Vision Agents, it must be integrated as a plugin.
In Vision Agents, you can implement your own plugins to serve different purposes and capabilities, such as:
- Object Detection and Tracking: Detect and track image and video objects in real-time.
- Connecting to Local AI Models: Connect a local LLM service like Ollama to access free and open-source models.
- LLM Processing: Add LLM support for your favorite provider. Vision Agents ships out-of-the-box with OpenAI, Anthropic, Qwen, xAI, Google Gemini, and more.
- Speech Recognition: Build a plugin for transcribing speech.
- Speech Synthesis: Create a text-to-speech component from your AI provider of choice.
- Turn-Detection: Implement your own turn-detection for the voice AI pipeline, or use an existing open-source project.
- Image, Video, Vision Processing and Generation: Provide a plugin to handle media generation and processing. You can, for example, build a Lyria 3 integration for AI music generation in Vision Agents.
Create a custom-made AI plugin to extend Vision Agents by following this step-by-step guide.
Vibe Code Kitten TTS Integration With Vision Agents
The recommended way to vibe code a custom AI feature for Vision Agents is to use Agent Skills in your favorite IDE. However, SoTA models like Opus 4.6, Sonnet 4.6, GPT-5.3 Codex, and GPT-5.4 typically produce satisfying results, so an Agent Skill isn’t necessary in our use case.
Before you start, clone and test the completed vibe-coded project from GitHub.
Step 1: Initialize a New Python Project and Install Vision Agents
For better results with Opus, start with a fresh Python project and a clean Vision Agents installation so the model can familiarize itself with the codebase.
```bash
# Create a Python project
uv init

# Activate your environment
uv venv
source .venv/bin/activate

# Install Vision Agents
uv add vision-agents
uv add "vision-agents[getstream]"
```
Step 2: Add a Prompt
In your favorite IDE, select Opus 4.6 from the model selector (Cursor), and send the following prompt.
```
Use this codebase to create a custom Python text-to-speech (TTS) plugin for KittenTTS to connect Vision Agents so that it can be used with any AI provider.

Steps

Follow the Vision Agents Python plugin creation docs to do the implementation and generate all the required plugin directories and files:
https://visionagents.ai/integrations/create-your-own-plugin

Kitten TTS on GitHub:
https://github.com/KittenML/KittenTTS?tab=readme-ov-file

Example Vision Agents TTS plugins for reference:
https://github.com/GetStream/Vision-Agents/tree/main/plugins/pocket
https://github.com/GetStream/Vision-Agents/tree/main/plugins/fish
```
With the codebase already set up and the necessary links added to the prompt, Opus will plan and generate a project structure similar to the image.
Vision Agents plugins wrap AI provider APIs in a consistent interface, so they integrate seamlessly with the open-source framework to perform specific functions for voice, video, and vision AI.
Building a Vision Agents plugin involves a few steps:
- Create a workspace in Python.
- Add a plugin directory under the appropriate type folder.
- Add a `pyproject.toml` with `getstream[webrtc]` as a dependency.
- Run tests from the project root.
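To illustrate, a minimal plugin `pyproject.toml` might look like the sketch below. The package name, version, and dependency list are illustrative assumptions, not the actual generated file:

```toml
# Hypothetical pyproject.toml for a TTS plugin directory — names are illustrative.
[project]
name = "vision-agents-plugins-kittentts"
version = "0.1.0"
description = "KittenTTS plugin for Vision Agents"
requires-python = ">=3.12"
dependencies = [
    "vision-agents",
    "getstream[webrtc]",   # required by Vision Agents plugins
    "kittentts",
]
```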
Check out the complete list of built-in Vision Agents plugins on GitHub to learn more.
As shown in the image above, the generated project files include `tts.py`, which contains the actual TTS plugin implementation, including the supported models and voices. The content of your `tts.py` will look like this.
```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import AsyncIterator, Iterator, Literal

import numpy as np
from getstream.video.rtc.track_util import AudioFormat, PcmData

from vision_agents.core import tts
from vision_agents.core.warmup import Warmable

from kittentts import KittenTTS

logger = logging.getLogger(__name__)

SAMPLE_RATE = 24000

Voice = Literal[
    "Bella",
    "Jasper",
    "Luna",
    "Bruno",
    "Rosie",
    "Hugo",
    "Kiki",
    "Leo",
]

Model = Literal[
    "KittenML/kitten-tts-mini-0.8",
    "KittenML/kitten-tts-micro-0.8",
    "KittenML/kitten-tts-nano-0.8",
    "KittenML/kitten-tts-nano-0.8-int8",
]


class TTS(tts.TTS, Warmable[KittenTTS]):
    """
    KittenTTS Text-to-Speech implementation for Vision Agents.

    An ultra-lightweight CPU-based TTS model from KittenML with
    high-quality voice synthesis. The model is under 25MB (int8)
    and runs without a GPU.
    """

    def __init__(
        self,
        model: Model | str = "KittenML/kitten-tts-mini-0.8",
        voice: Voice | str = "Bella",
        speed: float = 1.0,
        client: KittenTTS | None = None,
    ) -> None:
        """
        Initialize KittenTTS.

        Args:
            model: HuggingFace model ID or name. Defaults to kitten-tts-mini-0.8.
            voice: Voice name to use for synthesis.
            speed: Speech speed multiplier (1.0 = normal).
            client: Optional pre-initialized KittenTTS instance.
        """
        super().__init__(provider_name="kittentts")
        self.model_name = model
        self.voice = voice
        self.speed = speed
        self._model: KittenTTS | None = client
        self._executor = ThreadPoolExecutor(max_workers=4)

    async def on_warmup(self) -> KittenTTS:
        if self._model is not None:
            return self._model
        loop = asyncio.get_running_loop()
        logger.info("Loading KittenTTS model: %s ...", self.model_name)
        model = await loop.run_in_executor(
            self._executor,
            lambda: KittenTTS(self.model_name),
        )
        logger.info("KittenTTS model loaded successfully")
        return model

    def on_warmed_up(self, resource: KittenTTS) -> None:
        self._model = resource

    async def _ensure_loaded(self) -> None:
        """Ensure model is loaded."""
        if self._model is None:
            resource = await self.on_warmup()
            self.on_warmed_up(resource)

    async def stream_audio(
        self, text: str, *_, **__
    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:
        """
        Convert text to speech using KittenTTS.

        Args:
            text: The text to convert to speech.

        Returns:
            PcmData containing the synthesized audio at 24kHz.
        """
        await self._ensure_loaded()
        assert self._model is not None
        model = self._model
        voice = self.voice
        speed = self.speed

        def _generate():
            audio_np = model.generate(text, voice=voice, speed=speed)
            audio_np = np.asarray(audio_np, dtype=np.float32)
            pcm16 = (np.clip(audio_np, -1.0, 1.0) * 32767.0).astype(np.int16)
            return pcm16

        loop = asyncio.get_running_loop()
        samples = await loop.run_in_executor(self._executor, _generate)

        return PcmData.from_numpy(
            samples,
            sample_rate=SAMPLE_RATE,
            channels=1,
            format=AudioFormat.S16,
        )

    async def stop_audio(self) -> None:
        """Stop audio playback (no-op for KittenTTS as it generates synchronously)."""
        logger.info("KittenTTS stop requested (no-op)")

    async def close(self) -> None:
        """Close the TTS and cleanup resources."""
        await super().close()
        self._executor.shutdown(wait=False)
```
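The core of `stream_audio` is the float-to-PCM conversion inside `_generate`: Kitten TTS returns float32 samples in [-1.0, 1.0], which the plugin clips and scales into signed 16-bit integers. Here is a standalone sketch of that conversion, using toy samples in place of real model output:

```python
import numpy as np

# Toy float32 samples standing in for KittenTTS output; note the out-of-range -1.2.
audio = np.array([0.0, 0.5, -1.2, 1.0], dtype=np.float32)

# Clip to [-1.0, 1.0], scale to the int16 range, and truncate to 16-bit PCM,
# mirroring the plugin's _generate helper.
pcm16 = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)

print(pcm16.tolist())  # → [0, 16383, -32767, 32767]
```

The clip step matters: without it, an out-of-range sample like -1.2 would wrap around after the int16 cast and produce an audible click.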
Step 3: Test the Kitten TTS Plugin
The generated Kitten TTS project code in Cursor includes examples for running and testing. Before testing, you must install the remaining components of the Vision Agents voice pipeline.
If your prompt doesn't specify which speech-to-text (STT), LLM, and turn-detection services to pair with the plugin, Opus will pick supported AI services from the Vision Agents docs. For this project, it chose Deepgram for STT, Gemini 3 Flash for the LLM, and Smart-Turn for turn-detection. Although Opus wires these services up in code, they still must be installed manually.
```bash
uv add "vision-agents[deepgram, gemini, smart-turn]"
```
Next, set the following API credentials in your .env.
```
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
GOOGLE_API_KEY=...
DEEPGRAM_API_KEY=...
```
Basic Kitten TTS Usage in Vision Agents
Opus also modified the project's `main.py` to create a simple Kitten TTS demo that synthesizes speech locally and saves it to a WAV file.
```python
"""
KittenTTS Vision Agents Plugin

Quick demo: synthesize speech locally with KittenTTS and save to a WAV file.
"""

import asyncio

from vision_agents.plugins.kittentts import TTS


async def main():
    tts = TTS(
        model="KittenML/kitten-tts-mini-0.8",
        voice="Bella",
    )
    await tts.warmup()

    pcm = await tts.stream_audio(
        "Hello from KittenTTS! This is an ultra-lightweight text-to-speech model."
    )

    wav_bytes = pcm.to_wav_bytes()
    with open("output.wav", "wb") as f:
        f.write(wav_bytes)

    print(f"Audio saved to output.wav ({len(pcm.samples)} samples at {pcm.sample_rate}Hz)")

    await tts.close()


if __name__ == "__main__":
    asyncio.run(main())
```
Running the Python script above will output a WAV audio file for playback.
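If you want to sanity-check a generated file's format, the stdlib `wave` module can read the header back; Kitten TTS output should be mono, 16-bit, 24 kHz. The sketch below writes a one-second silent clip as a stand-in, so it runs without the model:

```python
import wave

import numpy as np

# One second of silence as a stand-in for real KittenTTS output.
samples = np.zeros(24000, dtype=np.int16)

with wave.open("check.wav", "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(24000)    # KittenTTS sample rate
    wf.writeframes(samples.tobytes())

with wave.open("check.wav", "rb") as wf:
    print(wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
    # → 1 2 24000 24000
```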
Interactive Local Kitten TTS Demo in Vision Agents
The Kitten TTS plugin also contains a fully working voice agent example in `/example/kittentts_example.py` for real-time speech generation and user-agent interaction.
```python
"""
KittenTTS Example

This example demonstrates KittenTTS integration with Vision Agents.

This example creates an agent that uses:
- KittenTTS for text-to-speech (runs locally on CPU, under 25MB)
- Deepgram for speech-to-text
- Gemini for LLM
- GetStream for edge/real-time communication

Requirements:
- DEEPGRAM_API_KEY environment variable
- GOOGLE_API_KEY environment variable
- STREAM_API_KEY and STREAM_API_SECRET environment variables
"""

import asyncio
import logging

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream
from vision_agents.plugins import kittentts

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with KittenTTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Kitten AI", id="agent"),
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        tts=kittentts.TTS(
            model="KittenML/kitten-tts-mini-0.8",
            # available_voices : ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']
            voice="Bella",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    logger.info("Starting KittenTTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm running KittenTTS, an ultra-lightweight text-to-speech model."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
After fulfilling the API requirements for Stream, Google AI, and Deepgram, you should be able to run the voice agent and test the Kitten TTS speech generation in action.
Troubleshoot and Fix Errors
When you vibe code a plugin for Vision Agents, its implementation may have issues. When errors come up, pass them back to your agentic coding tool (for example, with Cursor's Add to Chat feature) and ask the same model, or another SoTA model, to fix them.
Get inspiration from the built-in Vision Agents plugins, and instruct your coding model to check their implementations when fixing any errors you encounter.
Best Practices for Vibe Coding Your Plugin
To get the best outcome from your preferred coding agent when vibe coding a plugin for Vision Agents, start with a new Python project and a fresh Vision Agents installation, and point the coding assistant at existing, similar Vision Agents plugins for reference and inspiration.
It also often helps to ask the agent, in your prompt, to test its work and include runnable examples so you can see the plugin in action.
Whenever you get stuck, consult the custom integration guide and the implementations of the default AI providers that ship with Vision Agents.
