
Kimi K2.5: Build a Video & Vision Agent in Python

3 min read
Amos G.
Published February 11, 2026

Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice.

That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and wired into Vision Agents for seamless video, vision, and voice.

Kimi K2.5 is Moonshot’s latest open-source multimodal powerhouse: 1T-parameter MoE (32B active), 256k context, native vision understanding, agentic tool use, and strong coding and thinking capabilities trained on massive visual-text data.
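Because the API is OpenAI-compatible, you can sanity-check Kimi K2.5's vision access with the standard openai Python client before wiring up the full agent. This is a minimal sketch, not part of the demo code: it assumes MOONSHOT_API_KEY is set and that the endpoint accepts OpenAI-style image_url content parts; the model name and base URL are the same ones used later in this post, and the image URL is a placeholder.

python
import os

from openai import OpenAI

# Point the standard OpenAI client at Moonshot's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("MOONSHOT_API_KEY"),
    base_url="https://api.moonshot.ai/v1",
)

# Ask Kimi K2.5 to describe an image (OpenAI-style image_url content part)
response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/desk.jpg"}},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)

If that returns a sensible description, the same credentials will work for the agent below.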

In the demo, the agent greets you conversationally:

"Hello, I'm here and ready to help. I can assist you with looking at your camera feed if you have video enabled, helping with coding tasks via screen sharing, answering questions about what I can see or general topics.”

Here’s how to build the same video/vision/voice AI agent yourself in under five minutes.

What You’ll Build

  • A real-time voice and vision agent that analyzes your live camera feed, answers questions about what it sees, and helps with coding tasks (e.g., via screen share)
  • Native multimodal understanding from Kimi K2.5 for accurate visual descriptions and reasoning
  • Natural, low-latency conversations with smooth turn-taking and helpful responses
  • A simple pipeline that reaches Kimi K2.5 through its OpenAI-compatible API via Vision Agents

The Stack

  • Vision Agents: open-source framework that wires together the LLM, video edge, STT, TTS, and turn detection
  • Kimi K2.5 (Moonshot AI): multimodal LLM, accessed through its OpenAI-compatible API
  • Deepgram: speech-to-text
  • ElevenLabs: text-to-speech
  • Stream: low-latency video and audio edge
  • Smart Turn: turn detection

Requirements (API Keys)

You’ll need API keys from the following providers (a sample .env layout is sketched after this list):

  • Moonshot AI (for Kimi K2.5)
  • ElevenLabs (TTS)
  • Deepgram (STT)
  • Stream (API key & secret)
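The full code in Step 2 loads credentials with python-dotenv, so a convenient option is to keep them in a .env file at the project root. This is a sketch: the variable names match the exports shown in Step 3, and the values are placeholders.

shell
# .env (keep this file out of version control)
MOONSHOT_API_KEY=your-moonshot-key
ELEVENLABS_API_KEY=your-elevenlabs-key
DEEPGRAM_API_KEY=your-deepgram-key
STREAM_API_KEY=your-stream-key
STREAM_API_SECRET=your-stream-secret
EXAMPLE_BASE_URL=https://demo.visionagents.ai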

Step 1: Set Up the Project

shell
uv init kimi-k25-agent
cd kimi-k25-agent
uv add vision-agents
uv add "vision-agents[getstream, elevenlabs, deepgram, smart-turn, openai]"
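The extras in the second uv add pull in the plugins imported in the next step: the Stream edge, ElevenLabs TTS, Deepgram STT, Smart Turn detection, and the OpenAI-compatible LLM wrapper. Step 2 assumes the code lives in main.py at the project root; uv init usually creates one for you, so overwrite it (or create it if it's missing):

shell
touch main.py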

Step 2: Full Working Code (main.py)

python
import os

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import openai, getstream, deepgram, elevenlabs, smart_turn

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    llm = openai.ChatCompletionsLLM(
        model="kimi-k2.5",
        base_url="https://api.moonshot.ai/v1",
        api_key=os.getenv("MOONSHOT_API_KEY"),
    )

    # Create an agent with video understanding capabilities
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Video Assistant", id="agent"),
        instructions=(
            "You are a voice/video/vision agent powered by Kimi K2.5. "
            "You can answer questions about the users' video camera feed "
            "and help them perform coding tasks via screen sharing."
        ),
        llm=llm,
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        turn_detection=smart_turn.TurnDetection(),
        processors=[],
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # The agent will automatically process video frames and respond to user input
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Step 3: Run It

shell
export MOONSHOT_API_KEY=...
export ELEVENLABS_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://demo.visionagents.ai

uv run main.py run

A browser tab opens with the video call UI. Join, enable your camera/mic, and ask what it sees or share your screen for coding help.

Kimi K2.5 handles the multimodal reasoning seamlessly.

Why We Love This Stack

Vision Agents makes it simple to plug Kimi K2.5 into a voice/vision pipeline via OpenAI-compatible API, with no extra wrappers needed.
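For example, because the LLM is addressed through a generic OpenAI-compatible chat completions client, pointing the same agent at a different provider (or a self-hosted endpoint serving open weights) is just a constructor change. A sketch under that assumption: the ChatCompletionsLLM class is the one already used in main.py, while the model name, endpoint, and env var below are placeholders.

python
import os

from vision_agents.plugins import openai

# Same wrapper as in main.py: only the model, base URL, and key change
llm = openai.ChatCompletionsLLM(
    model="your-model-name",                      # placeholder model id
    base_url="https://your-endpoint.example/v1",  # any OpenAI-compatible server
    api_key=os.getenv("YOUR_PROVIDER_API_KEY"),   # placeholder env var name
)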

Kimi K2.5 offers strong multimodal performance (vision, coding, agentic thinking) in an open-source-friendly package.

Everything in the pipeline is open source except the hosted API calls.

Try it out! Point your camera at something interesting or share code, and see how sharp Kimi K2.5 is. 👀
