Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice.
That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and wired into Vision Agents for seamless video, vision, and voice.
Kimi K2.5 is Moonshot’s latest open-source multimodal powerhouse: a 1T-parameter MoE (32B active), a 256k context window, native vision understanding, agentic tool use, and strong coding and thinking capabilities, trained on massive visual-text data.
In the demo, the agent greets you conversationally:
"Hello, I'm here and ready to help. I can assist you with looking at your camera feed if you have video enabled, helping with coding tasks via screen sharing, answering questions about what I can see or general topics.”
Here’s how to build the same video/vision/voice AI agent yourself in under five minutes.
What You’ll Build
- A real-time voice and vision agent that analyzes your live camera feed, answers questions about what it sees, and helps with coding tasks (e.g., via screen share)
- Kimi K2.5’s native multimodal understanding for accurate visual descriptions and reasoning
- Natural, low-latency conversations with smooth turn-taking and helpful responses
- Simple pipeline using OpenAI-compatible API access to Kimi K2.5 via Vision Agents
The Stack
- LLM & Vision → Kimi K2.5 (kimi-k2.5-preview via Moonshot OpenAI-compatible API)
- TTS → ElevenLabs
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements (API Keys)
You’ll need API keys from the following services; a sample .env layout follows the list:
- Moonshot AI (for Kimi K2.5)
- ElevenLabs (TTS)
- Deepgram (STT)
- Stream (API key & secret)
Step 1: Set Up the Project
```bash
uv init kimi-k25-agent
cd kimi-k25-agent
uv add vision-agents
uv add "vision-agents[getstream, elevenlabs, deepgram, smart-turn, openai]"
```
Step 2: Full Working Code (main.py)
```python
import os

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import openai, getstream, deepgram, elevenlabs, smart_turn

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    llm = openai.ChatCompletionsLLM(
        model="kimi-k2.5",
        base_url="https://api.moonshot.ai/v1",
        api_key=os.getenv("MOONSHOT_API_KEY"),
    )

    # Create an agent with video understanding capabilities
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Video Assistant", id="agent"),
        instructions="You are a voice/video/vision agent powered by Kimi K2.5. You can answer questions about the users' video camera feed and help them perform coding tasks via screen sharing.",
        llm=llm,
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        turn_detection=smart_turn.TurnDetection(),
        processors=[],
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # The agent will automatically process video frames and respond to user input
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
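Before launching the full agent, it can help to sanity-check your Moonshot credentials with a plain chat completion. A minimal sketch, assuming the official openai Python package is available (it typically comes along with the openai extra installed in Step 1) and the same model id used in main.py; adjust the id if your account exposes a different one:

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at Moonshot's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.moonshot.ai/v1",
    api_key=os.getenv("MOONSHOT_API_KEY"),
)

# One-off chat completion to confirm the key and model id work.
resp = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Reply with a one-sentence hello."}],
)
print(resp.choices[0].message.content)
```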
Step 3: Run It
```bash
export MOONSHOT_API_KEY=...
export ELEVENLABS_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://demo.visionagents.ai

uv run main.py run
```
A browser tab opens with the video call UI. Join the call, enable your camera and mic, then ask the agent what it sees or share your screen for coding help.
Kimi K2.5 handles the multimodal reasoning seamlessly.
Why We Love This Stack
Vision Agents makes it simple to plug Kimi K2.5 into a voice/vision pipeline via OpenAI-compatible API, with no extra wrappers needed.
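The “no extra wrappers” point is concrete: the only Kimi-specific code in main.py is the LLM constructor. A minimal sketch of the same idea, reusing the ChatCompletionsLLM parameters from the example above; the second endpoint is purely illustrative:

```python
import os

from vision_agents.plugins import openai

# Kimi K2.5 through Moonshot's OpenAI-compatible endpoint, exactly as in main.py.
kimi = openai.ChatCompletionsLLM(
    model="kimi-k2.5",
    base_url="https://api.moonshot.ai/v1",
    api_key=os.getenv("MOONSHOT_API_KEY"),
)

# Any other OpenAI-compatible endpoint slots in the same way: only the model
# name, base URL, and key change (these values are placeholders).
other_llm = openai.ChatCompletionsLLM(
    model="some-openai-compatible-model",
    base_url="https://api.example.com/v1",
    api_key=os.getenv("OTHER_PROVIDER_API_KEY"),
)
```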
Kimi K2.5 offers strong multimodal performance (vision, coding, agentic thinking) in an open-source-friendly package.
Everything in the stack is open source except the hosted API calls.
Links & Resources
Try it out! Point your camera at something interesting or share code, and see how sharp Kimi K2.5 is. 👀