Build multi-modal AI applications using our new open-source Vision AI SDK.

Multi-Modal AI Agents. Sub-500ms. Production-ready.

Open-source Python framework for voice and video AI agents. Real-time WebRTC, 25+ integrations, native tool calling, and pluggable vision pipelines — works with Stream, Tencent RTC, local devices, or any WebRTC infrastructure.

7.7K GitHub Stars | 624 Forks | 25+ Integrations | <500ms Join Latency

Core Capabilities

Production-grade infrastructure, not a demo wrapper.

Built for teams shipping real AI products. Every component is designed for horizontal scaling, observability, and zero-downtime deployments.

Real-time WebRTC

Stream audio and video frames directly to LLMs. Connects to Stream, Tencent RTC, local devices, or any standards-compliant WebRTC infrastructure.

Pluggable Vision Pipelines

Inject YOLO, Roboflow, or custom PyTorch/ONNX models directly into the real-time processing loop before or after the LLM call.
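The injection pattern can be sketched in plain Python. This is a hedged, framework-free illustration of the idea (the `VisionPipeline`, `Frame`, and `fake_yolo` names are hypothetical, not the SDK's actual API): detection processors run on each frame before the LLM call, and optional post-processors run after it.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Frame = dict  # stand-in for a decoded video frame plus metadata

@dataclass
class VisionPipeline:
    """Toy pipeline: processors run on each frame before/after the LLM call."""
    pre: List[Callable[[Frame], Frame]] = field(default_factory=list)
    post: List[Callable[[Frame], Frame]] = field(default_factory=list)

    def run(self, frame: Frame, llm: Callable[[Frame], Frame]) -> Frame:
        for proc in self.pre:
            frame = proc(frame)
        frame = llm(frame)
        for proc in self.post:
            frame = proc(frame)
        return frame

def fake_yolo(frame: Frame) -> Frame:
    # A real processor would run YOLO / Roboflow / ONNX inference here.
    frame["detections"] = [{"label": "person", "conf": 0.9}]
    return frame

pipeline = VisionPipeline(pre=[fake_yolo])
out = pipeline.run({"pixels": ...}, llm=lambda f: {**f, "caption": "a person"})
```

Because detections are attached to the frame before the LLM sees it, the model can reason over structured results rather than raw pixels alone.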

Smart Turn Detection

Voice activity detection (VAD), speaker diarization, and configurable turn-taking for natural, interruption-aware conversation flow.

Tool Calling & MCP

Execute code and APIs mid-conversation. Connect Linear, Twilio, weather APIs, or any standard MCP server.
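Mid-conversation tool execution boils down to a registry-and-dispatch loop. The sketch below is a minimal, framework-free illustration of that pattern (the `tool` decorator, `dispatch` helper, and `get_weather` stub are hypothetical, not the SDK's actual API): the model emits a JSON tool call, and the agent looks up and runs the matching function.

```python
import json
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function so the agent can call it mid-conversation."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # A real tool would hit a weather API, Linear, Twilio, or an MCP server.
    return f"Sunny in {city}"

def dispatch(tool_call_json: str) -> str:
    """Execute a model-emitted call like {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch('{"name": "get_weather", "arguments": {"city": "Boulder"}}')
```

An MCP server fits the same shape: instead of a local function, `dispatch` would forward the call over the MCP transport and return the server's response.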

Memory & RAG

TurboPuffer vector search integration. Agents recall context across sessions via Stream Chat — persistent, queryable, real.

Phone Integration

Inbound and outbound telephony via Twilio with bidirectional audio streaming. SIP trunking supported.

Kubernetes Ready

Built-in HTTP server with Prometheus metrics endpoints and stateless agent design for horizontal pod scaling.
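For a sense of what the metrics endpoint serves, here is a hedged sketch of rendering gauges in the Prometheus text exposition format (the `render_metrics` function and the metric names are illustrative, not the SDK's built-in output):

```python
def render_metrics(metrics: dict, help_text: dict) -> str:
    """Render gauge metrics in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        if name in help_text:
            lines.append(f"# HELP {name} {help_text[name]}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics(
    {"agent_active_calls": 3, "agent_join_latency_ms": 412.5},
    {"agent_active_calls": "Number of live agent sessions"},
)
```

Because agents are stateless, any pod can serve this endpoint and a Prometheus scraper can aggregate across replicas for horizontal-scaling decisions.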

Text Back-channel

Push silent instructions or context to a running agent during a live call. Useful for human-in-the-loop orchestration.
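Conceptually, a back-channel is a queue the agent drains between turns. The sketch below is a minimal, framework-free illustration (the `BackChannel` class and its methods are hypothetical, not the SDK's actual API): an operator pushes a silent instruction during a live call, and the agent picks it up without it ever being spoken.

```python
import queue
from typing import List

class BackChannel:
    """Toy back-channel: push silent instructions into a live agent loop."""

    def __init__(self) -> None:
        self._q: "queue.Queue[str]" = queue.Queue()

    def push(self, instruction: str) -> None:
        """Called by a human operator or orchestrator during the call."""
        self._q.put(instruction)

    def drain(self) -> List[str]:
        """Called by the agent between turns; returns pending instructions."""
        items = []
        while not self._q.empty():
            items.append(self._q.get_nowait())
        return items

bc = BackChannel()
bc.push("The caller is a VIP; keep answers brief.")
pending = bc.drain()
```

The drained instructions would typically be appended to the system context before the agent's next LLM call, which is what makes human-in-the-loop steering possible mid-conversation.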

Edge Agnostic

Bring your own edge: Stream, Tencent RTC, local devices, or any WebRTC-compatible network. Swap transports with a one-line config change.

Examples

See Vision Agents in action

Complex multi-modal use cases solved in dozens of lines, not thousands.

Voice

Build a storytelling agent with expressive speech using Cartesia's Sonic 3 TTS.

eCommerce

Sell a used item with a generated product page: image, title, description, and price.

Embedded Devices

An ESP32-S3 board joins a video call, captures input, encodes it, and publishes the stream.

Edge Agnostic

Your infrastructure. Your choice.

Vision Agents doesn't lock you to a single transport layer. Bring your own edge network — cloud, on-prem, or local. Swap providers with a one-line config change.

Any WebRTC-compatible network works. The Edge interface is open — implement it for your own infrastructure in <50 lines.
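As a hedged sketch of what such an implementation could look like (the method names and `LoopbackEdge` class are hypothetical, not the SDK's actual interface), the core contract is just joining a call and moving media:

```python
from abc import ABC, abstractmethod
from typing import List

class Edge(ABC):
    """Minimal stand-in for a transport interface: join a call, move media."""

    @abstractmethod
    def join(self, call_id: str) -> str: ...

    @abstractmethod
    def send_audio(self, pcm: bytes) -> None: ...

class LoopbackEdge(Edge):
    """A trivial in-process transport, e.g. for tests or local devices."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def join(self, call_id: str) -> str:
        # A real implementation would negotiate WebRTC here.
        return f"joined:{call_id}"

    def send_audio(self, pcm: bytes) -> None:
        self.sent.append(pcm)

# Swapping transports is then one line at construction time:
edge: Edge = LoopbackEdge()
status = edge.join("demo-call")
```

Because the rest of the agent only sees the `Edge` abstraction, pointing it at Stream, Tencent RTC, or a custom server is a matter of which concrete class that one line constructs.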

Stream

Cloud

Stream's globally distributed WebRTC infrastructure. Sub-500ms join, auto-scaling TURN/STUN, 333k free participant-minutes/month for developers.

>_ edge=getstream.Edge()

Tencent RTC (Coming Soon)

Cloud

Tencent's real-time communication network with strong coverage across Asia-Pacific. Drop-in replacement using the same Agent API.

>_ edge=tencent.Edge()

Local / Custom

On Prem

Run entirely on local devices or private infrastructure. Ideal for air-gapped deployments, on-device inference, or custom WebRTC servers.

>_ edge=local.Edge(host=...)

Integrations
Drop-in across the whole stack

LLM

OpenAI, Gemini, xAI (Grok), OpenRouter, HuggingFace, Kimi AI, AWS Nova

Voice & Audio

ElevenLabs, Deepgram, AWS Polly, Cartesia

Vision & Detection

YOLO (Ultralytics), Roboflow, Custom PyTorch, ONNX Runtime

Infrastructure & Telephony

Stream, Tencent RTC (Coming Soon), Local Devices, Twilio, Kubernetes, Prometheus, TurboPuffer

VISION AGENTS

Why partner with us?

  • Reach developers building next-gen apps with real-time video, audio, and agents
  • Ship your latest model versions instantly to all Stream Vision Agent users
  • Unlock co-marketing, joint demos, and shared customer opportunities
  • Become a plug-and-play capability inside production-grade agent workflows

Ready for production.

Install the framework and connect to Stream, Tencent RTC, local devices, or any WebRTC-compatible network. No vendor lock-in. MIT license.

Community & Open Source

Join the Community

Follow Stream on X, star the Vision Agents GitHub repo, and join the discussion on Discord to try demos, share feedback, and contribute.