Vision Agents has out-of-the-box support for the LLM services and providers developers need to build voice, vision, and video AI applications. The framework also makes it easy to integrate custom AI services — either by following a step-by-step guide or by vibe coding them using SoTA models.
Let’s use Claude Opus 4.6 to create a custom text-to-speech (TTS) plugin with Kitten TTS and hook it into Vision Agents as a TTS component for voice applications.
Here’s a quick look at the plugin in action:
Project and System Requirements for Testing the Plugin
To vibe code the plugin, we will use Claude Opus 4.6 in Cursor. You can also use the model via any agentic coding platform you prefer. The sample demo creates an agent that uses:
- Kitten TTS for text-to-speech, running locally on a CPU at under 25MB.
- Deepgram for speech-to-text.
- Gemini 3 Flash for LLM processing.
- Stream Video SDK for edge/realtime communication.
To run and test it, you should install Python 3.12 or a later version using Conda or uv. You also need API credentials for Stream, Google, and Deepgram. Create accounts and generate API keys from the providers using these links.
- Stream Account (for Vision Agents).
- Google AI Studio Account (for Gemini models).
- Deepgram (for a speech-to-text model).
What is Kitten TTS?
Kitten TTS is an open-source, local text-to-speech AI built from a series of tiny models that can run on laptops, smartphones, and wearables. The models are small enough to run in the browser and on any edge device, with no GPU required and no privacy trade-offs, since inference stays local.
The following Kitten TTS models are available on Hugging Face. Download whichever one you prefer to test.
| Model | Size | Parameters | Download |
|---|---|---|---|
| kitten-tts-mini | 80MB | 80M | Hugging Face |
| kitten-tts-micro | 41MB | 40M | Hugging Face |
| kitten-tts-nano | 56MB | 15M | Hugging Face |
| kitten-tts-nano-0.8-int8 | 25MB | 15M | Hugging Face |
All of these models are released under the Apache-2.0 license, so you can download any of them and use them however you like.
The best way to see how Kitten TTS works is to try the interactive speech generation playground on Hugging Face. For a basic audio generation in Python, run the following sample script.
```bash
uv pip install https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl
```
Basic Script
```python
from kittentts import KittenTTS

m = KittenTTS("KittenML/kitten-tts-mini-0.8")

audio = m.generate(
    "This high quality TTS model works without a GPU.",
    voice="Jasper",
)
# available_voices : ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']

# Save the audio
import soundfile as sf

sf.write("output.wav", audio, 24000)
```
Vision Agents and AI Plugins
The sample code in the previous section demonstrates basic Kitten TTS usage in Python for generating speech. To use it in Vision Agents, it must be integrated as a plugin.
In Vision Agents, you can implement your own plugins to serve different purposes and capabilities, such as:
- Object Detection and Tracking: Detect and track image and video objects in real-time.
- Connecting to Local AI Models: Connect a local LLM service like Ollama to access free and open-source models.
- LLM Processing: Add LLM support for your favorite provider. Vision Agents ships out-of-the-box with OpenAI, Anthropic, Qwen, xAI, Google Gemini, and more.
- Speech Recognition: Build a plugin for transcribing speech.
- Speech Synthesis: Create a text-to-speech component from your AI provider of choice.
- Turn-Detection: Implement your own turn-detection for the voice AI pipeline, or use an existing open-source project.
- Image, Video, Vision Processing and Generation: Provide a plugin to handle media generation and processing. You can, for example, build a Lyria 3 integration for AI music generation in Vision Agents.
Create a custom-made AI plugin to extend Vision Agents by following this step-by-step guide.
Vibe Code Kitten TTS Integration With Vision Agents
The recommended way to vibe code a custom AI feature for Vision Agents is to use Agent Skills in your favorite IDE. However, SoTA models like Opus 4.6, Sonnet 4.6, GPT-5.3 Codex, and GPT-5.4 typically produce satisfying results, so an Agent Skill isn’t necessary in our use case.
Before you start, clone and test the completed vibe-coded project from GitHub.
Step 1: Initialize a New Python Project and Install Vision Agents
For better results with Opus, start with a fresh Python project and a clean Vision Agents installation so the model can familiarize itself with the codebase.
```bash
# Create a Python project
uv init

# Activate your environment
uv venv
source .venv/bin/activate

# Install Vision Agents
uv add vision-agents
uv add "vision-agents[getstream]"
```
Step 2: Add a Prompt
In your favorite IDE, select Opus 4.6 from the model selector (Cursor), and send the following prompt.
```
Use this codebase to create a custom Python text-to-speech (TTS) plugin for KittenTTS to connect Vision Agents so that it can be used with any AI provider.

Steps

Follow the Vision Agents Python plugin creation docs to do the implementation and generate all the required plugin directories and files:
https://visionagents.ai/integrations/create-your-own-plugin

Kitten TTS on GitHub:
https://github.com/KittenML/KittenTTS?tab=readme-ov-file

Example Vision Agents TTS plugins for reference:
https://github.com/GetStream/Vision-Agents/tree/main/plugins/pocket
https://github.com/GetStream/Vision-Agents/tree/main/plugins/fish
```
With the codebase already set up and the necessary links added to the prompt, Opus will plan and generate a project structure similar to the image.
Vision Agents plugins wrap AI provider APIs in a consistent interface, so they integrate seamlessly with the open-source framework to perform specific functions for voice, video, and vision AI.
Building a Vision Agents plugin involves a few steps:
- Create a workspace in Python.
- Add a plugin directory under the appropriate type folder.
- Add a `pyproject.toml` with `getstream[webrtc]` as a dependency.
- Run tests from the project root.
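To illustrate, a minimal plugin `pyproject.toml` might look like the sketch below. The package name, version, and dependency list are illustrative assumptions, not the actual generated file:

```toml
# Hypothetical pyproject.toml for a TTS plugin directory — names are illustrative.
[project]
name = "vision-agents-plugins-kittentts"
version = "0.1.0"
description = "KittenTTS plugin for Vision Agents"
requires-python = ">=3.12"
dependencies = [
    "vision-agents",
    "getstream[webrtc]",   # required by Vision Agents plugins
    "kittentts",
]
```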
Check out the complete list of built-in Vision Agents plugins on GitHub to learn more.
As shown in the image above, the generated project files include `tts.py`, which contains the actual TTS plugin implementation, including the supported models and voices. The content of your `tts.py` will look like this.
```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import AsyncIterator, Iterator, Literal

import numpy as np
from getstream.video.rtc.track_util import AudioFormat, PcmData

from vision_agents.core import tts
from vision_agents.core.warmup import Warmable

from kittentts import KittenTTS

logger = logging.getLogger(__name__)

SAMPLE_RATE = 24000

Voice = Literal[
    "Bella",
    "Jasper",
    "Luna",
    "Bruno",
    "Rosie",
    "Hugo",
    "Kiki",
    "Leo",
]

Model = Literal[
    "KittenML/kitten-tts-mini-0.8",
    "KittenML/kitten-tts-micro-0.8",
    "KittenML/kitten-tts-nano-0.8",
    "KittenML/kitten-tts-nano-0.8-int8",
]


class TTS(tts.TTS, Warmable[KittenTTS]):
    """
    KittenTTS Text-to-Speech implementation for Vision Agents.

    An ultra-lightweight CPU-based TTS model from KittenML with
    high-quality voice synthesis. The model is under 25MB (int8)
    and runs without a GPU.
    """

    def __init__(
        self,
        model: Model | str = "KittenML/kitten-tts-mini-0.8",
        voice: Voice | str = "Bella",
        speed: float = 1.0,
        client: KittenTTS | None = None,
    ) -> None:
        """
        Initialize KittenTTS.

        Args:
            model: HuggingFace model ID or name. Defaults to kitten-tts-mini-0.8.
            voice: Voice name to use for synthesis.
            speed: Speech speed multiplier (1.0 = normal).
            client: Optional pre-initialized KittenTTS instance.
        """
        super().__init__(provider_name="kittentts")
        self.model_name = model
        self.voice = voice
        self.speed = speed
        self._model: KittenTTS | None = client
        self._executor = ThreadPoolExecutor(max_workers=4)

    async def on_warmup(self) -> KittenTTS:
        if self._model is not None:
            return self._model
        loop = asyncio.get_running_loop()
        logger.info("Loading KittenTTS model: %s ...", self.model_name)
        model = await loop.run_in_executor(
            self._executor,
            lambda: KittenTTS(self.model_name),
        )
        logger.info("KittenTTS model loaded successfully")
        return model

    def on_warmed_up(self, resource: KittenTTS) -> None:
        self._model = resource

    async def _ensure_loaded(self) -> None:
        """Ensure model is loaded."""
        if self._model is None:
            resource = await self.on_warmup()
            self.on_warmed_up(resource)

    async def stream_audio(
        self, text: str, *_, **__
    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:
        """
        Convert text to speech using KittenTTS.

        Args:
            text: The text to convert to speech.

        Returns:
            PcmData containing the synthesized audio at 24kHz.
        """
        await self._ensure_loaded()
        assert self._model is not None
        model = self._model
        voice = self.voice
        speed = self.speed

        def _generate():
            audio_np = model.generate(text, voice=voice, speed=speed)
            audio_np = np.asarray(audio_np, dtype=np.float32)
            pcm16 = (np.clip(audio_np, -1.0, 1.0) * 32767.0).astype(np.int16)
            return pcm16

        loop = asyncio.get_running_loop()
        samples = await loop.run_in_executor(self._executor, _generate)

        return PcmData.from_numpy(
            samples,
            sample_rate=SAMPLE_RATE,
            channels=1,
            format=AudioFormat.S16,
        )

    async def stop_audio(self) -> None:
        """Stop audio playback (no-op for KittenTTS as it generates synchronously)."""
        logger.info("KittenTTS stop requested (no-op)")

    async def close(self) -> None:
        """Close the TTS and cleanup resources."""
        await super().close()
        self._executor.shutdown(wait=False)
```
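The core of `stream_audio` is the float-to-PCM conversion inside `_generate`: Kitten TTS returns float32 samples in [-1.0, 1.0], which the plugin clips and scales into signed 16-bit integers. Here is a standalone sketch of that conversion, using toy samples in place of real model output:

```python
import numpy as np

# Toy float32 samples standing in for KittenTTS output; note the out-of-range -1.2.
audio = np.array([0.0, 0.5, -1.2, 1.0], dtype=np.float32)

# Clip to [-1.0, 1.0], scale to the int16 range, and truncate to 16-bit PCM,
# mirroring the plugin's _generate helper.
pcm16 = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)

print(pcm16.tolist())  # → [0, 16383, -32767, 32767]
```

The clip step matters: without it, an out-of-range sample like -1.2 would wrap around after the int16 cast and produce an audible click.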
Step 3: Test the Kitten TTS Plugin
The generated Kitten TTS project code in Cursor includes examples for running and testing. Before testing, you must install the remaining components of the Vision Agents voice pipeline.
If your prompt doesn't specify which speech-to-text (STT), LLM, and turn-detection services to pair with the plugin, Opus will pick supported AI services from the Vision Agents docs. For this project, it chose Deepgram for STT, Gemini 3 Flash for the LLM, and Smart-Turn for turn-detection. Although Opus wires these services up in code, they still must be installed manually.
```bash
uv add "vision-agents[deepgram, gemini, smart-turn]"
```
Next, set the following API credentials in your .env.
```
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
GOOGLE_API_KEY=...
DEEPGRAM_API_KEY=...
```
Basic Kitten TTS Usage in Vision Agents
Opus also modified the project's `main.py` to create a simple Kitten TTS demo that synthesizes speech locally and saves it to a WAV file.
```python
"""
KittenTTS Vision Agents Plugin

Quick demo: synthesize speech locally with KittenTTS and save to a WAV file.
"""

import asyncio

from vision_agents.plugins.kittentts import TTS


async def main():
    tts = TTS(
        model="KittenML/kitten-tts-mini-0.8",
        voice="Bella",
    )
    await tts.warmup()

    pcm = await tts.stream_audio(
        "Hello from KittenTTS! This is an ultra-lightweight text-to-speech model."
    )

    wav_bytes = pcm.to_wav_bytes()
    with open("output.wav", "wb") as f:
        f.write(wav_bytes)

    print(f"Audio saved to output.wav ({len(pcm.samples)} samples at {pcm.sample_rate}Hz)")

    await tts.close()


if __name__ == "__main__":
    asyncio.run(main())
```
Running the Python script above will output a WAV audio file for playback.
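If you want to sanity-check a generated file's format, the stdlib `wave` module can read the header back; Kitten TTS output should be mono, 16-bit, 24 kHz. The sketch below writes a one-second silent clip as a stand-in, so it runs without the model:

```python
import wave

import numpy as np

# One second of silence as a stand-in for real KittenTTS output.
samples = np.zeros(24000, dtype=np.int16)

with wave.open("check.wav", "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(24000)    # KittenTTS sample rate
    wf.writeframes(samples.tobytes())

with wave.open("check.wav", "rb") as wf:
    print(wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
    # → 1 2 24000 24000
```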
Interactive Local Kitten TTS Demo in Vision Agents
The Kitten TTS plugin also contains a fully working voice agent example in `/example/kittentts_example.py` for real-time speech generation and user-agent interaction.
```python
"""
KittenTTS Example

This example demonstrates KittenTTS integration with Vision Agents.

This example creates an agent that uses:
- KittenTTS for text-to-speech (runs locally on CPU, under 25MB)
- Deepgram for speech-to-text
- Gemini for LLM
- GetStream for edge/real-time communication

Requirements:
- DEEPGRAM_API_KEY environment variable
- GOOGLE_API_KEY environment variable
- STREAM_API_KEY and STREAM_API_SECRET environment variables
"""

import asyncio
import logging

from dotenv import load_dotenv

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream
from vision_agents.plugins import kittentts

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with KittenTTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Kitten AI", id="agent"),
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        tts=kittentts.TTS(
            model="KittenML/kitten-tts-mini-0.8",
            # available_voices : ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']
            voice="Bella",
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=gemini.LLM("gemini-3-flash-preview"),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    logger.info("Starting KittenTTS Agent...")
    async with agent.join(call):
        logger.info("Agent joined call")
        await asyncio.sleep(3)
        await agent.llm.simple_response(
            text="Hello! I'm running KittenTTS, an ultra-lightweight text-to-speech model."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
After fulfilling the API requirements for Stream, Google AI, and Deepgram, you should be able to run the voice agent and test the Kitten TTS speech generation in action.
Troubleshoot and Fix Errors
When you vibe code a plugin for Vision Agents, its implementation may have issues. When errors come up, pass them back to your agentic coding tool (for example, with Cursor's Add to Chat feature) and ask the same model, or another SoTA model, to fix them.
Get inspiration from the built-in Vision Agents plugins, and instruct your coding model to check their implementations when fixing any errors you encounter.
Best Practices for Vibe Coding Your Plugin
To get the best outcome from your preferred coding agent when vibe coding a plugin for Vision Agents, start with a new Python project and a fresh Vision Agents installation, and point the coding assistant at existing, similar Vision Agents plugins for reference and inspiration.
It also often helps to ask the agent, in your prompt, to test its work and include runnable examples so you can see the plugin in action.
Whenever you get stuck, consult the custom integration guide and the implementations of the default AI providers that ship with Vision Agents.
