Vision Agents is a new, open-source framework from Stream that helps developers quickly build low-latency vision AI applications. It ships with over ten out-of-the-box integrations, including day-one support for leading real-time voice and video models like OpenAI Realtime and Gemini Live.
Text-to-speech, speech-to-text, and speech-to-speech models are also natively supported and can be mixed and matched with your favorite LLM. Whether you're combining an LLM from Anthropic with a voice from Cartesia and transcriptions from Deepgram, Vision Agents simplifies the integration by providing a single generic Agent class that automatically handles the complexities of managing tracks, video subscriptions, and converting between the different response types.
Getting your first Vision Agent running is simple:
# Imports assumed from the Vision Agents package layout; adjust to your install.
from vision_agents.core import User, agents
from vision_agents.plugins import getstream, gemini


async def start_agent() -> None:
    llm = gemini.Realtime()
    # create an agent that runs on Stream's edge with the Gemini Realtime LLM
    agent = agents.Agent(
        edge=getstream.Edge(),  # low-latency edge; clients for React, iOS, Android, RN, Flutter, etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image, etc.)
        instructions="You're a voice AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful.",
        processors=[],  # processors can fetch extra data, inspect image/audio data, or transform video
        # LLM with TTS & STT; with a realtime (speech-to-speech capable) LLM, separate TTS, STT, and VAD aren't needed
        llm=llm,
    )
    await agent.create_user()
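Prefer a classic pipeline over a speech-to-speech model? The same Agent accepts separate STT, TTS, and VAD components alongside a standard LLM, as the comment above implies. Here is a rough sketch; the plugin modules and class names below (anthropic, deepgram, cartesia, silero) are assumptions, so check the docs for the exact integrations:

# Sketch only: plugin and class names below are assumed, not confirmed API.
from vision_agents.core import User, agents
from vision_agents.plugins import getstream
from vision_agents.plugins import anthropic, cartesia, deepgram, silero  # assumed plugin names


async def start_pipeline_agent() -> None:
    agent = agents.Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),
        instructions="You're a voice AI assistant. Keep responses short and conversational.",
        llm=anthropic.LLM(),   # standard (non-realtime) LLM
        stt=deepgram.STT(),    # live transcription
        tts=cartesia.TTS(),    # voice output
        vad=silero.VAD(),      # voice activity detection
    )
    await agent.create_user()

Because every component sits behind the same Agent class, the idea is that swapping a provider stays a one-line change.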
In addition to real-time models, the framework also lets developers build with non-WebRTC models and custom video processors.
Processors are powerful extensions that let developers tap into the underlying WebRTC track data. They can run any computer vision model on video frames in real time, opening up a wide range of applications.
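To give a feel for the idea, a custom processor is essentially a class that receives frames and returns data or annotated frames. The sketch below is plain Python; the base class and hook name the framework actually expects may differ, so treat the process_frame signature as hypothetical:

# Hypothetical sketch: the process_frame hook shown here is illustrative,
# not the framework's confirmed processor interface.
import numpy as np


class MotionBlurDetector:
    """Flags frames that are too blurry for downstream vision models."""

    def __init__(self, threshold: float = 100.0):
        self.threshold = threshold

    def process_frame(self, frame: np.ndarray) -> dict:
        # Variance of the Laplacian is a cheap sharpness measure.
        gray = frame.mean(axis=2)
        laplacian = (
            -4 * gray[1:-1, 1:-1]
            + gray[:-2, 1:-1] + gray[2:, 1:-1]
            + gray[1:-1, :-2] + gray[1:-1, 2:]
        )
        return {"blurry": float(laplacian.var()) < self.threshold}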
For example, our out-of-the-box integration with Ultralytics YOLO + Gemini Live makes for a fun golf coach:
# Imports assumed from the Vision Agents package layout; adjust to your install.
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, openai, ultralytics


async def start_agent() -> None:
    agent = Agent(
        edge=getstream.Edge(),  # use Stream for edge video transport
        agent_user=User(name="AI golf coach"),
        instructions="Read @golf_coach.md",  # read the golf coach markdown instructions
        llm=gemini.Realtime(fps=10),  # careful: higher FPS gets expensive
        # llm=openai.Realtime(fps=10),  # swap this in to switch to OpenAI
        processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],  # real-time pose detection with YOLO
    )
    await agent.create_user()
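From there, the agent still needs to create and join a call before it can see the player. Roughly, that step looks like the sketch below; the call-creation and join calls here are approximations, so follow the full tutorial for the exact flow:

# Approximate sketch: the call-creation and join APIs may differ slightly.
from uuid import uuid4


async def run_agent(agent: Agent) -> None:
    await agent.create_user()
    # create a Stream Video call for the coaching session
    call = agent.edge.client.video.call("default", str(uuid4()))
    # join the call and keep coaching until the session ends
    with await agent.join(call):
        await agent.finish()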
Read the full tutorial and sample on our GitHub.
What Can I Build?
Sports coaching is just one of the types of applications you can build with the power of low-latency video and AI. Here are a few other use cases to get you started:
- Manufacturing anomaly detection: Runs per-frame object/defect detection with a processor (e.g., YOLO). Auto-flags issues and tool-calls to pause the line or open a ticket.
- Meeting assistant: Joins meetings, sees/hears context, diarizes, summarizes, and files Linear issues via tool calling.
- Gaming/coaching: Provides real-time on-screen guidance (e.g., Dota coach or GeoGuessr "where am I?" helper).
- Avatars & immersive: Brings external video avatars in as tracks and lip-syncs them via speech-to-speech.
- Wearables/on-device vision: Streams WebRTC from cameras/glasses and responds by voice or on-screen prompts.
- Accessibility: Generates live captions + contextual scene descriptions in real time.
- Robotics/IoT: Enables real-time perception with actuation via tools.
Why It's Different: Built Video-First
Most frameworks started with voice and later bolted on video. Vision Agents was built video-first:
- True real-time via WebRTC → Stream directly to model providers that support it for instant visual understanding.
- Interval/processor pipeline → For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls.
- Turn detection & diarization → Keep conversations natural; know when the agent should speak or stay quiet and who's talking.
- Voice activity detection (VAD) → Trigger actions intelligently and use resources efficiently.
- Speech↔Text↔Speech → Enable low-latency loops for smooth, conversational voice UX.
- Tool/function calling → Execute arbitrary code and APIs mid-conversation: create Linear issues, query weather, trigger telephony, or hit internal services (see the sketch after this list).
- Built-in memory via Stream Chat → Agents recall context naturally across turns and sessions.
- Text back-channel → Message the agent silently during a call.
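To make tool calling concrete, here is a minimal sketch of a tool an agent could invoke mid-conversation to file a Linear issue. The function itself is plain Python hitting Linear's public GraphQL API; how it gets registered with the agent is framework-specific, so treat the wiring as an assumption and check the docs for the supported mechanism:

# Hypothetical sketch: the tool is ordinary async Python. Registering it with
# the agent/LLM so the model can call it is framework-specific.
import httpx

LINEAR_API_URL = "https://api.linear.app/graphql"


async def create_linear_issue(title: str, description: str, api_key: str, team_id: str) -> str:
    """Create a Linear issue and return its identifier (e.g. 'ENG-142')."""
    mutation = """
    mutation IssueCreate($input: IssueCreateInput!) {
      issueCreate(input: $input) { issue { identifier } }
    }
    """
    variables = {"input": {"title": title, "description": description, "teamId": team_id}}
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            LINEAR_API_URL,
            json={"query": mutation, "variables": variables},
            headers={"Authorization": api_key, "Content-Type": "application/json"},
        )
        resp.raise_for_status()
        return resp.json()["data"]["issueCreate"]["issue"]["identifier"]

In a real agent, you would expose a function like this through whatever tool-registration mechanism the LLM plugin provides, so the model can decide when to call it during the conversation.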
"Video AI is powerful because it mirrors how people naturally experience the world. We don't just speak and listen—we see and interact. Our framework enables AI to do the same, allowing it to perceive and engage with the world in a natural, human-like way." — Neevash Ramdial, Director of Marketing
What's Next
In the short term, we will be adding support for more providers across all categories of AI: LLMs, computer vision models, speech/text engines, and more. If you're an AI company and would like to partner with us to add first-party support for your models, please reach out to nash@getstream.io. We would be excited to collaborate!
"Vision AI is like ChatGPT in 2022. It's really fun to see how it works and what's possible. Anything from live coaching, to sports, to physical therapy, robotics, drones, etc." — Thierry Schellenbach, CEO & Co-Founder
How To Get Involved
You can support the project by trying the demos, starring the GitHub repo, filing issues, and contributing adapters or processors.
We're not locking you in. The agent core is open and platform-agnostic: you can run your agents on Stream Video for the tightest integration (memory, messaging, moderation, and recordings) or pair them with any WebRTC video SDK.
Either way, you own the stack.
