Build multi-modal AI applications using our new open-source Vision AI SDK.

Multi-Modal AI Agents. Sub-500ms join latency. Production-ready.

Open-source Python framework for voice and video AI agents. Real-time WebRTC, 25+ integrations, native tool calling and pluggable vision pipelines.

Works with Stream, Tencent RTC, local devices or any WebRTC infrastructure.

agent.py (python)
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    processors=[ultralytics.YOLOPoseProcessor(
        model_path="yolo11n-pose.pt",
        device="cuda"
    )],
)
7.8K GitHub Stars | 624 Forks | 25+ Integrations | <500ms Join Latency
Core Capabilities

Production-grade infrastructure, not a demo wrapper.

Built for teams shipping real AI products. Every component is designed for horizontal scaling, observability, and zero-downtime deployments.

Real-time WebRTC

Stream audio and video frames directly to LLMs. Connects to Stream, Tencent RTC, local devices, or any standards-compliant WebRTC infrastructure.

Pluggable Vision Pipelines

Inject YOLO, Roboflow, or custom PyTorch/ONNX models directly into the real-time processing loop before or after the LLM call.
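
As a rough illustration of what "pluggable" could mean here, the sketch below chains two toy processors over a frame before it would reach the LLM. The `Frame`, `FrameProcessor`, and `run_pipeline` names are hypothetical and are not the SDK's actual processor classes; a real processor would wrap a YOLO, Roboflow, PyTorch, or ONNX model instead of averaging pixels.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Frame:
    """Stand-in for one video frame plus metadata added by processors."""
    pixels: list[list[int]]  # toy grayscale image
    annotations: dict[str, Any] = field(default_factory=dict)


class FrameProcessor(Protocol):
    """Hypothetical interface, not the SDK's real processor base class."""
    def process(self, frame: Frame) -> Frame: ...


class BrightnessProcessor:
    """Toy 'model': records the mean pixel value of the frame."""
    def process(self, frame: Frame) -> Frame:
        values = [p for row in frame.pixels for p in row]
        frame.annotations["mean_brightness"] = sum(values) / len(values)
        return frame


class DarkSceneFlagger:
    """Reads an earlier processor's output, so ordering matters."""
    def __init__(self, threshold: float = 50.0) -> None:
        self.threshold = threshold

    def process(self, frame: Frame) -> Frame:
        brightness = frame.annotations.get("mean_brightness")
        frame.annotations["dark_scene"] = (
            brightness is not None and brightness < self.threshold
        )
        return frame


def run_pipeline(frame: Frame, processors: list[FrameProcessor]) -> Frame:
    """Apply each processor in order; the result is what the LLM would see."""
    for processor in processors:
        frame = processor.process(frame)
    return frame


frame = run_pipeline(
    Frame(pixels=[[10, 20], [30, 40]]),
    [BrightnessProcessor(), DarkSceneFlagger()],
)
print(frame.annotations)
```

Because each stage only reads and writes `annotations`, swapping one model for another should not require touching the rest of the loop.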

Smart Turn Detection

VAD, speaker diarization, and configurable smart turn-taking for natural, interruption-aware conversation flow.
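
To make the idea concrete, here is a minimal sketch of the simplest form of turn detection: an energy-threshold VAD that ends a turn after a run of silent frames. It is an assumption-laden toy, not the SDK's turn detector; the function names, threshold, and frame counts are all illustrative.

```python
import math


def frame_rms(samples: list[int]) -> float:
    """Root-mean-square energy of one audio frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_turn_end(
    frames: list[list[int]],
    energy_threshold: float = 500.0,   # illustrative value, tune per microphone
    silence_frames_to_end: int = 3,    # how much trailing silence ends a turn
) -> "int | None":
    """Return the index of the frame where the speaker's turn ends, or None."""
    speaking = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= energy_threshold:
            speaking = True
            silent_run = 0          # a short pause does not end the turn
        elif speaking:
            silent_run += 1
            if silent_run >= silence_frames_to_end:
                return index
    return None


loud = [1000] * 160
quiet = [10] * 160
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet]
turn_end = detect_turn_end(frames)
```

The single quiet frame in the middle is treated as a pause rather than the end of the turn, which is the behavior that keeps an agent from interrupting too eagerly.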

Tool Calling & MCP

Execute code and APIs mid-conversation. Connect Linear, Twilio, weather APIs, or any standard MCP server.
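
The general shape of mid-conversation tool calling can be sketched as a registry plus a dispatcher: the model emits a structured call, and the runtime looks up and runs the matching function. The `tool` decorator, `dispatch` function, and JSON shape below are hypothetical and do not reflect the SDK's or MCP's actual wire format.

```python
import json
from typing import Callable

# Registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., object]] = {}


def tool(fn: Callable[..., object]) -> Callable[..., object]:
    """Register a function so the dispatcher can find it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> dict:
    """Placeholder: a real tool would call a weather API here."""
    return {"city": city, "forecast": "placeholder"}


def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a model-emitted call and return a JSON reply."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    result = fn(**call.get("arguments", {}))
    return json.dumps({"result": result})


reply = dispatch('{"name": "get_weather", "arguments": {"city": "Boulder"}}')
print(reply)
```

Returning an error payload for unknown tools, rather than raising, lets the model recover within the same conversation.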

Memory & RAG

TurboPuffer vector search integration. Agents recall context across sessions via Stream Chat, with memory that is persistent and queryable.
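
For readers new to vector search, the sketch below shows the underlying idea with an in-memory store and cosine similarity. It does not use the TurboPuffer client or Stream Chat; `MemoryStore` and the two-dimensional embeddings are illustrative assumptions, and real embeddings would come from an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Toy in-memory stand-in for a hosted vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self._items.append((text, embedding))

    def search(self, query_embedding: list[float], top_k: int = 1) -> list[str]:
        """Return the top_k stored texts most similar to the query."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine(item[1], query_embedding),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]


store = MemoryStore()
store.add("User prefers a 7-iron", [1.0, 0.0])
store.add("User is left-handed", [0.0, 1.0])
recalled = store.search([0.9, 0.1])
```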

Phone Integration

Inbound and outbound telephony via Twilio with bidirectional audio streaming. SIP trunking supported.

Kubernetes Ready

Built-in HTTP server with Prometheus metrics endpoints and stateless agent design for horizontal pod scaling.
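
As a sketch of what a Prometheus-scrapable endpoint involves, the example below serves counters in the Prometheus text exposition format using only the Python standard library. It is not the SDK's built-in server; the metric names, the `/metrics` path, and the port are assumptions chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric names; keeping state here (not per-agent) is what
# lets the agents themselves stay stateless.
METRICS: dict[str, float] = {
    "agent_active_calls": 0,
    "agent_frames_processed_total": 0,
}


def render_metrics(metrics: dict[str, float]) -> str:
    """Render metrics as 'name value' lines, one per metric."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9100) -> None:
    """Block forever serving /metrics; call this from your entry point."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

A Kubernetes deployment would typically point a Prometheus scrape config or a pod annotation at this path on each replica.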

Text Back-channel

Push silent instructions or context to a running agent during a live call. Useful for human-in-the-loop orchestration.
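
One way to picture a back-channel is a queue that a supervisor writes to while the agent's loop drains it between turns. The sketch below uses `asyncio.Queue` for this; `agent_loop` and the instruction text are hypothetical and do not represent the SDK's actual API.

```python
import asyncio


async def agent_loop(backchannel: asyncio.Queue, turns: int) -> list[str]:
    """Simulate conversation turns, absorbing silent instructions as they arrive."""
    context: list[str] = []
    log: list[str] = []
    for turn in range(turns):
        # Drain anything a human operator pushed since the last turn.
        while not backchannel.empty():
            context.append(backchannel.get_nowait())
        log.append(f"turn {turn}: context={context}")
        await asyncio.sleep(0)  # yield control, as a real audio loop would
    return log


async def demo() -> list[str]:
    backchannel: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(agent_loop(backchannel, turns=3))
    await asyncio.sleep(0)  # let the call start before the operator steps in
    await backchannel.put("Offer the customer a refund")
    return await task


log = asyncio.run(demo())
for line in log:
    print(line)
```

The instruction never appears as a spoken turn; it only changes the context the agent uses for later turns.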

Edge Agnostic

Bring your own edge: Stream, Tencent RTC, local devices, or any WebRTC-compatible network. Swap transports with a one-line config change.

Examples

See Vision Agents in action

Complex multi-modal use cases solved in dozens of lines, not thousands.

Voice

Build a storytelling agent with expressive speech using Cartesia's Sonic 3 TTS.

eCommerce

Sell a used item with a product page, image, title, description, and price.

Embedded Devices

ESP32-S3 joins a video call, captures input, encodes, and publishes.

Edge Agnostic

Your infrastructure. Your choice.

Vision Agents doesn't lock you into a single transport layer. Bring your own edge network: cloud, on-prem, or local. Swap providers with a one-line config change.
Any WebRTC-compatible network works. The Edge interface is open: implement it for your own infrastructure in <50 lines.
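
To show why a small interface makes transports swappable, here is a sketch of a custom edge written against a hypothetical `Edge` protocol. The method names (`join`, `publish_audio`, `leave`) are assumptions made for illustration; the real Edge interface may define different methods, so consult the repository before implementing one.

```python
from typing import Protocol


class Edge(Protocol):
    """Hypothetical transport interface, not the SDK's actual definition."""
    def join(self, call_id: str) -> None: ...
    def publish_audio(self, pcm: bytes) -> None: ...
    def leave(self) -> None: ...


class InMemoryEdge:
    """Toy transport that records events instead of sending media."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def join(self, call_id: str) -> None:
        self.events.append(f"join:{call_id}")

    def publish_audio(self, pcm: bytes) -> None:
        self.events.append(f"audio:{len(pcm)}")

    def leave(self) -> None:
        self.events.append("leave")


def run_agent(edge: Edge, call_id: str) -> None:
    """Agent logic depends only on the interface, never on a provider."""
    edge.join(call_id)
    edge.publish_audio(b"\x00" * 320)  # one silent 10 ms frame at 16 kHz, 16-bit
    edge.leave()


edge = InMemoryEdge()
run_agent(edge, "demo-call")
print(edge.events)
```

Since `run_agent` only sees the protocol, changing providers comes down to changing which object is passed in.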

Stream

Cloud

Stream's globally distributed WebRTC infrastructure. Sub-500ms join, auto-scaling TURN/STUN, 333k free participant-minutes/month for developers.

>_ edge=getstream.Edge()

Tencent RTC (Coming Soon)

Cloud

Tencent's real-time communication network with strong coverage across Asia-Pacific. Drop-in replacement using the same Agent API.

>_ edge=tencent.Edge()

Local / Custom

On Prem

Run entirely on local devices or private infrastructure. Ideal for air-gapped deployments, on-device inference, or custom WebRTC servers.

>_ edge=local.Edge(host=...)

Integrations
Drop-in across the whole stack

LLM

OpenAI, Gemini, xAI (Grok), OpenRouter, HuggingFace, Kimi AI, AWS Nova

Voice & Audio

ElevenLabs, Deepgram, AWS Polly, Cartesia

Vision & Detection

YOLO (Ultralytics), Roboflow, Custom PyTorch, ONNX Runtime

Infrastructure & Telephony

Stream, Tencent RTC (Coming Soon), Local Devices, Twilio, Kubernetes, Prometheus, TurboPuffer

VISION AGENTS

Why partner with us?

  • Reach developers building next-gen apps with real-time video, audio, and agents
  • Ship your latest model versions instantly to all Vision Agents users
  • Unlock co-marketing, joint demos, and shared customer opportunities
  • Become a plug-and-play capability inside production-grade agent workflows

Ready for production.

Install the framework and connect to Stream, Tencent RTC, local devices, or any WebRTC-compatible network. No vendor lock-in. MIT license.

Community & Open Source

Join the Community

Follow Stream on X, star the Vision Agents GitHub repo, and join the discussion on Discord to try demos, share feedback, and contribute.