What Is a Vision Agent? Real-Time AI That Can See and Hear

A vision agent is an AI agent that watches a live video and audio stream and responds in real time. You will also see it called a video AI agent or a multimodal agent. The label moves around, the definition does not: an agent that can see and hear what is happening, and act on it while it is still happening.

That last part is the hard part. Plenty of systems can describe an image you upload. A vision agent has to do it on a live stream, fast enough that the reply still makes sense by the time it arrives. A golf swing is over in about a second. A coaching tip that lands two seconds later is useless.

This post covers what a vision agent is, how it differs from the voice agents and image tools you have probably already seen, and how to build one with Vision Agents, the open-source Python framework from Stream.

A vision agent versus everything it gets confused with

The term is muddy because three different things share the words.

There are batch computer vision pipelines. You hand them a folder of images or a recorded clip, they label objects, and you read the results later. Useful, but offline. Nothing is live.

There are voice agents. They listen and talk, often very well, but they are blind. They cannot tell you your form is off, or that the package on the porch is not the one you ordered, because they never see it.

There is also LandingAI's VisionAgent, which is a separate product that generates vision code from a prompt. Good tool, different job. It writes scripts. It does not join a live call and react inside it.

A vision agent in the sense this post means is the combination the other three leave out. Live video in. Live audio in. A model that understands both. A response that comes back inside the same conversation, in well under a second.

Why this is hard to build from scratch

Wire it together yourself and the stack adds up fast. You need a video transport that works in a browser and on phones. You need speech to text, or a realtime model that takes audio directly. You need an LLM, and expressive text to speech if the agent talks back. If the agent has to see, you need a computer vision model running on the video frames. And all of it has to stay inside the latency budget that makes real-time feel real.

Each piece is a separate SDK with its own quirks. Most of the effort is not the AI. It is the plumbing: keeping codecs, data formats, timing, handoffs, and context in sync across the different models and the rest of your stack.
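
To make the latency budget concrete, here is a minimal sketch that adds up per-stage delays and compares the total with a target. The stage names and millisecond figures are illustrative assumptions, not measurements from any provider or from Vision Agents, and `check_budget` is a hypothetical helper.

```python
# Illustrative latency budget check. Every number here is a made-up placeholder.
BUDGET_MS = 1000  # assumed target: the reply starts within one second

# Hypothetical per-stage latencies for a hand-wired stack, in milliseconds.
stage_latency_ms = {
    "video/audio transport": 80,
    "speech to text": 200,
    "vision model on frames": 60,
    "LLM first token": 350,
    "text to speech first audio": 150,
}


def check_budget(stages: dict[str, int], budget_ms: int) -> tuple[int, bool]:
    """Return the summed latency and whether it fits inside the budget.

    Stages are summed as if they run one after another, which is the
    pessimistic case; stages that overlap would lower the total.
    """
    total = sum(stages.values())
    return total, total <= budget_ms


total_ms, within = check_budget(stage_latency_ms, BUDGET_MS)
print(f"total={total_ms} ms, within budget: {within}")
```

Even with these generous placeholder numbers most of the second is already spent, so one slow stage is enough to break the budget.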

Our framework, Vision Agents, exists to take the plumbing off your plate. You pick the models and it handles the transport, the turn-taking, the frame handling, and the wiring between them.

The smallest vision agent

Here is a complete voice agent. About 18 lines, running on Gemini Realtime over Stream's edge network.

python

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Be concise.",
        llm=gemini.Realtime(),
    )

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Greet the user")
        await agent.finish()

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Install it with uv and run it:

bash

uv init --python 3.12 my-agent && cd my-agent
uv add "vision-agents[getstream,gemini]" python-dotenv
uv run main.py run

The code above is doing a few things:

Edge transport: This is the WebRTC network where our agent runs and communicates. The example uses Stream's Video API as the edge transport, but Vision Agents is built from the ground up to be open to other edge transports, such as Tencent RTC for low latency in Asia and China, and a local transport for robotics, where going out to an external network adds too much overhead.
Instructions: This defines the behaviour of the agent. It can be a one-line instruction, as in our example, or an elaborate guide kept in a dedicated markdown file that you reference inline with @my-file.md.
LLM: Some models have built-in speech and text handling in addition to the "brain" or language model. The framework classifies these as "Realtime" LLMs and they include models such as OpenAI Realtime, Gemini Live, Qwen OMNI, Nova Sonic and so on.

The built-in CLI prints a join link, which we can open in a browser to talk to the agent.

This example does one thing only: it is a simple voice agent that we can give a few instructions to and extend with external functions, registered with the framework's @llm.register_function decorator. It represents the floor of what's possible and is the base from which we can customise and build more elaborate agents. For example, a simple variation of this agent would be to bring custom speech-to-text and text-to-speech models (set with the stt= and tts= parameters on the agent, respectively), giving us full control over the entire agent pipeline.
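
The idea behind decorator-based function registration can be shown without the framework. The sketch below is a toy registry, not Vision Agents' implementation: `ToyLLM`, its `register_function` signature, and `call` are hypothetical names that only mimic the general shape of the pattern.

```python
from typing import Callable


class ToyLLM:
    """Hypothetical stand-in for an LLM object that collects callable tools."""

    def __init__(self) -> None:
        # Maps a function name to its (description, function) pair.
        self.functions: dict[str, tuple[str, Callable[..., object]]] = {}

    def register_function(self, description: str):
        """Decorator that records a function and its description by name."""
        def decorator(fn):
            self.functions[fn.__name__] = (description, fn)
            return fn  # the original function stays usable as normal
        return decorator

    def call(self, name: str, **kwargs):
        """Route a tool call by name, as a model-issued call might be routed."""
        _, fn = self.functions[name]
        return fn(**kwargs)


llm = ToyLLM()


@llm.register_function(description="Look up the status of an order")
def get_order_status(order_id: str) -> str:
    # Placeholder data; a real tool would query your own backend.
    return f"Order {order_id} is out for delivery"


print(llm.call("get_order_status", order_id="A123"))
```

The description matters as much as the function body, since it is what a model would read when deciding whether to call the tool.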


Good vision starts with good voice

A real-time vision agent is a conversation that also has eyes; if the talking part is clumsy, the seeing part does not save it. An agent that talks over you, or takes a beat too long to answer, feels broken no matter how sharp its object detection is.

So the parts that make voice work are not a side feature here, they are the foundation. They include turn detection that knows when you have actually finished speaking, voice activity detection (VAD), and speech to text and text to speech across providers like Deepgram, ElevenLabs, and Cartesia. Above all, latency matters. An agent that leaves the user waiting before it replies is not worth building, so the whole interaction has to feel fast and natural, with low latency across the entire pipeline: the WebRTC layer, the ASR stage, and LLM generation.
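
As a rough illustration of what "knowing when you have finished speaking" involves, here is a toy energy-based sketch. It is an assumption-laden simplification: production turn detection typically relies on trained models rather than a fixed threshold, and the function names, threshold, and frame counts below are all hypothetical.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_turn_end(frames, energy_threshold=0.05, silence_frames_needed=3):
    """Return the index of the frame where the turn is judged over, or None.

    A frame counts as speech when its RMS energy exceeds the threshold. The
    turn ends once speech has been heard and is then followed by enough
    consecutive silent frames. A short pause resets nothing but the counter.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) > energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None


loud = [0.3, -0.3, 0.3, -0.3]
quiet = [0.0, 0.01, 0.0, -0.01]
# Speech, a one-frame pause, more speech, then three silent frames.
stream = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet]
print(detect_turn_end(stream))  # → 7
```

The one-frame pause at index 3 does not end the turn, which is the behaviour that stops an agent from talking over someone who is only catching their breath.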

That foundation is why the smallest thing you can build here is a voice agent, the 18 lines above, no camera involved. If all you need is a fast, natural voice agent that reasons over a knowledge base and calls tools, Vision Agents does that well on its own.

Adding eyes

The thing that turns a voice agent into a vision agent is a processor. A processor runs a model on the video frames as they pass through, and the agent reasons over the result alongside everything it hears.

Here is the shape of the golf coach example: a YOLO pose model reads the swing frame by frame while Gemini Realtime gives feedback, capped at 10 frames per second.

python

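from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, ultralytics

# The display name and id are illustrative; use whatever identifies your agent.
agent_user = User(name="Golf Coach", id="agent")

agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",  # coaching guide kept in a markdown file
    llm=gemini.Realtime(fps=10),  # cap video sent to the model at 10 frames per second
    # The pose model runs on each video frame as it passes through.
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)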

That processors list is all it takes to run any computer vision library in real time with your agent. You can run YOLO, Roboflow, or your own PyTorch or ONNX model on the stream, before or after the LLM call. The agent is not just describing the video. Combine LLMs, processors, and ASR pipelines and you can build agents that read facial expressions and gaze alongside speech, such as emotional support tools or dating coaches that pick up on how someone looks and speak to it. We built a demo exactly like this together with Anam AI and Inworld.
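
To show what a frame-rate cap on a processor amounts to, here is a minimal sketch that drops frames arriving faster than a target rate before handing them to a model. It is not how Vision Agents or the `fps=10` setting is implemented; `throttle_frames`, `fake_pose_model`, and the 20 fps camera are hypothetical.

```python
from typing import Callable, Iterable


def throttle_frames(
    frames: Iterable[tuple[int, str]],
    max_fps: float,
    process: Callable[[str], str],
) -> list[tuple[int, str]]:
    """Run `process` on at most `max_fps` frames per second of stream time.

    `frames` yields (timestamp_ms, frame) pairs. A frame is dropped when it
    arrives sooner than 1000 / max_fps ms after the last processed frame.
    """
    min_interval_ms = 1000 / max_fps
    last_processed_ms = None
    results = []
    for timestamp_ms, frame in frames:
        if last_processed_ms is None or timestamp_ms - last_processed_ms >= min_interval_ms:
            results.append((timestamp_ms, process(frame)))
            last_processed_ms = timestamp_ms
    return results


def fake_pose_model(frame: str) -> str:
    # Stand-in for a real model call such as pose estimation.
    return f"pose({frame})"


# One second of a hypothetical 20 fps camera: a frame every 50 ms.
camera = [(i * 50, f"frame-{i}") for i in range(20)]
results = throttle_frames(camera, max_fps=10, process=fake_pose_model)
print(len(results), results[0], results[-1])
```

Half the frames are skipped here, which is the trade a cap makes: less compute and bandwidth per second in exchange for coarser motion detail.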

Two ways to run the model

There are two ways in which developers can build agent pipelines, and the right one depends on how much control you want.

The first is realtime: you send audio and video straight to a provider's realtime API over WebRTC or WebSocket, and Gemini or OpenAI handles speech in and speech out. Fewer moving parts, lowest latency.

The second is a pipeline you assemble yourself: speech to text, then an LLM, then text to speech, each one a provider you choose. More parts, and more control over each step.

It is the same Agent object either way. You swap plugins, you do not rewrite the program. There are 30+ integrations to pick from, including OpenAI, Gemini, Anthropic, Deepgram, ElevenLabs, YOLO, and Roboflow.
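
The "same object, different plugins" idea can be sketched with stand-in components. Everything below is hypothetical: `ToyAgent` and the `fake_*` functions are placeholders for illustration, not the Vision Agents `Agent` class or any provider's API.

```python
class ToyAgent:
    """Hypothetical agent that runs in realtime mode or pipeline mode."""

    def __init__(self, llm, stt=None, tts=None):
        if (stt is None) != (tts is None):
            raise ValueError("provide both stt and tts, or neither")
        self.llm, self.stt, self.tts = llm, stt, tts

    def respond(self, audio: bytes) -> bytes:
        if self.stt is None:
            # Realtime mode: one model takes audio in and returns audio out.
            return self.llm(audio)
        # Pipeline mode: transcribe, reason over text, then synthesize speech.
        text = self.stt(audio)
        reply = self.llm(text)
        return self.tts(reply)


def fake_realtime(audio: bytes) -> bytes:
    return b"audio:" + audio


def fake_stt(audio: bytes) -> str:
    return audio.decode()


def fake_llm(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return b"audio:" + text.encode()


realtime_agent = ToyAgent(llm=fake_realtime)
pipeline_agent = ToyAgent(llm=fake_llm, stt=fake_stt, tts=fake_tts)

print(realtime_agent.respond(b"hello"))  # → b'audio:hello'
print(pipeline_agent.respond(b"hello"))  # → b'audio:HELLO'
```

The calling code is identical in both cases; only the constructor arguments change, which is the property that lets you switch approaches without rewriting the program.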

What people build with it

The same pattern shows up across very different industries. A few that map to working examples in the repo:

Telehealth: look up patient information securely mid-call and let patients speak naturally instead of filling out intake forms
Support and repair agents: connect existing data sources over RAG, MCP and function calling, then add live video so the agent can see the issue the user is encountering, not just hear about it
Fitness and sports coaching: the in-repo golf coach (YOLO pose + Gemini Live) and a live sports commentator (Roboflow + LLM)
Marketplaces: guide users through listing creation so they get items live faster
Dating and social: interactive avatars and companions, so users aren't stuck talking to a floating orb
Retail: virtual try-on (Decart Lucy) with recommendations paired to what the user is looking at

The security camera example in the repo (YOLOv11 + Gemini, with package-theft detection) is a good one to read if you want to see a processor pipeline doing real work end to end.

Start here

The fastest way to get started is our quickstart. Paired with a free Stream account and a Google AI Studio key, the 18 lines above get you a working voice agent. If you are using a coding tool like Cursor or Claude Code, our MCP server and skill can also speed up your development. If you already have an LLM or AI stack, you can bring those keys across and use them directly with the plugins in Vision Agents.

Quickstart: https://visionagents.ai/introduction/quickstart
GitHub: https://github.com/GetStream/vision-agents

The free Stream tier is 333,000 participant minutes a month, which is enough to build and test something real before cost is a question.
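
For a sense of scale, the arithmetic can be sketched as below. It assumes that every participant in a call, including the agent, consumes one participant minute per minute of session; that billing rule is an assumption for illustration, so check Stream's pricing for how minutes are actually counted.

```python
FREE_PARTICIPANT_MINUTES = 333_000  # monthly free tier stated above


def sessions_per_month(minutes_per_session: int, participants: int = 2) -> int:
    """Whole sessions that fit in the free tier.

    Assumes each participant (by default one user plus the agent) uses one
    participant minute per minute of the session.
    """
    cost_per_session = minutes_per_session * participants
    return FREE_PARTICIPANT_MINUTES // cost_per_session


print(FREE_PARTICIPANT_MINUTES / 60)  # → 5550.0 participant hours
print(sessions_per_month(10))         # → 16650 ten-minute, two-party sessions
```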