Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice.
That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and wired into Vision Agents for seamless video, vision, and voice.
Kimi K2.5 is Moonshot’s latest open-source multimodal powerhouse: a 1T-parameter MoE (32B active), a 256k context window, native vision understanding, agentic tool use, and strong coding and thinking capabilities, trained on massive visual-text data.
In the demo, the agent greets you conversationally:
"Hello, I'm here and ready to help. I can assist you with looking at your camera feed if you have video enabled, helping with coding tasks via screen sharing, answering questions about what I can see or general topics.”
Here’s how to build the same video/vision/voice AI agent yourself in under five minutes.
What You’ll Build
- A real-time voice and vision agent that analyzes your live camera feed, answers questions about what it sees, and helps with coding tasks (e.g., via screen share)
- Kimi K2.5’s native multimodal understanding for accurate visual descriptions and reasoning
- Natural, low-latency conversations with smooth turn-taking and helpful responses
- Simple pipeline using OpenAI-compatible API access to Kimi K2.5 via Vision Agents
The Stack
- LLM & Vision → Kimi K2.5 (kimi-k2.5-preview via Moonshot OpenAI-compatible API)
- TTS → ElevenLabs
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements (API Keys)
You’ll need API keys from the following services; a sample .env layout follows the list:
- Moonshot AI (for Kimi K2.5)
- ElevenLabs (TTS)
- Deepgram (STT)
- Stream (API key & secret)
Step 1: Set Up the Project
```bash
uv init kimi-k25-agent
cd kimi-k25-agent
uv add vision-agents
uv add "vision-agents[getstream, elevenlabs, deepgram, smart-turn, openai]"
```
Step 2: Full Working Code (main.py)
```python
import os

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import openai, getstream, deepgram, elevenlabs, smart_turn

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    llm = openai.ChatCompletionsLLM(
        model="kimi-k2.5",
        base_url="https://api.moonshot.ai/v1",
        api_key=os.getenv("MOONSHOT_API_KEY"),
    )

    # Create an agent with video understanding capabilities
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Video Assistant", id="agent"),
        instructions="You are a voice/video/vision agent powered by Kimi K2.5. You can answer questions about the users' video camera feed and help them perform coding tasks via screen sharing.",
        llm=llm,
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        turn_detection=smart_turn.TurnDetection(),
        processors=[],
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # The agent will automatically process video frames and respond to user input
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
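Before launching the full agent, it can help to sanity-check your Moonshot credentials with a plain chat completion. A minimal sketch, assuming the official openai Python package is available (it typically comes along with the openai extra installed in Step 1) and the same model id used in main.py; adjust the id if your account exposes a different one:

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at Moonshot's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.moonshot.ai/v1",
    api_key=os.getenv("MOONSHOT_API_KEY"),
)

# One-off chat completion to confirm the key and model id work.
resp = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Reply with a one-sentence hello."}],
)
print(resp.choices[0].message.content)
```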
Step 3: Run It
```bash
export MOONSHOT_API_KEY=...
export ELEVENLABS_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://demo.visionagents.ai

uv run main.py run
```
A browser tab opens with the video call UI. Join the call, enable your camera and mic, then ask the agent what it sees or share your screen for coding help.
Kimi K2.5 handles the multimodal reasoning seamlessly.
Why We Love This Stack
Vision Agents makes it simple to plug Kimi K2.5 into a voice/vision pipeline via OpenAI-compatible API, with no extra wrappers needed.
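The “no extra wrappers” point is concrete: the only Kimi-specific code in main.py is the LLM constructor. A minimal sketch of the same idea, reusing the ChatCompletionsLLM parameters from the example above; the second endpoint is purely illustrative:

```python
import os

from vision_agents.plugins import openai

# Kimi K2.5 through Moonshot's OpenAI-compatible endpoint, exactly as in main.py.
kimi = openai.ChatCompletionsLLM(
    model="kimi-k2.5",
    base_url="https://api.moonshot.ai/v1",
    api_key=os.getenv("MOONSHOT_API_KEY"),
)

# Any other OpenAI-compatible endpoint slots in the same way: only the model
# name, base URL, and key change (these values are placeholders).
other_llm = openai.ChatCompletionsLLM(
    model="some-openai-compatible-model",
    base_url="https://api.example.com/v1",
    api_key=os.getenv("OTHER_PROVIDER_API_KEY"),
)
```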
Kimi K2.5 offers strong multimodal performance (vision, coding, agentic thinking) in an open-source-friendly package.
Everything in the stack is open source except the hosted API calls.
Links & Resources
Try it out! Point your camera at something interesting or share code, and see how sharp Kimi K2.5 is. 👀