
Build a Vision AI Agent with Gemini 3 in < 3 Minutes

2 min read
Amos G.
Published December 3, 2025

We released support for Google's new Gemini 3 models inside Vision Agents — the open-source Python framework for building real-time voice and video AI applications.

In this 3-minute video demo, you'll see how to spin up a fully functional vision-enabled voice agent that can see your screen (or webcam), reason with Gemini 3 Pro Preview, and talk back to you naturally, all in pure Python.

What You'll Learn

  • Install Vision Agents + the new Gemini plugin

  • Use gemini-3-pro-preview as your LLM with a single line

  • Build a live video-call agent that can see and describe anything on your screen in real time

  • Customize reasoning depth (low/high thinking level; see the sketch near the end of this post)

Get Started in 60 Seconds

1. Create a fresh project (we recommend uv).

```bash
# Initialize a new Python project
uv init

# Activate your environment
uv venv && source .venv/bin/activate
```

2. Install Vision Agents and the required plugins.

```bash
# Install Vision Agents
uv add vision-agents

# Install required plugins
uv add "vision-agents[getstream, gemini, elevenlabs, deepgram, smart-turn]"
```

You'll also need API keys for the services the plugins use: Stream, Google AI (Gemini), ElevenLabs, and Deepgram. The sample below loads them from a `.env` file via `python-dotenv`.
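Here's a minimal `.env` sketch. The variable names below are assumptions based on each provider's conventions, so confirm the exact keys each plugin reads in its documentation:

```bash
# .env — assumed variable names; check each plugin's docs for the exact keys
STREAM_API_KEY=...
STREAM_API_SECRET=...
GOOGLE_API_KEY=...       # Gemini 3
ELEVENLABS_API_KEY=...
DEEPGRAM_API_KEY=...
```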

Minimal Working Example

Rename your `main.py` to `gemini_vision_demo.py` and replace its contents with this sample code.

```python
import asyncio
import logging

from dotenv import load_dotenv

from vision_agents.core import User, Agent, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import elevenlabs, getstream, smart_turn, gemini, deepgram

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with Gemini 3, ElevenLabs TTS, and Deepgram STT."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Friendly AI", id="agent"),
        instructions=(
            "You are a friendly AI assistant powered by Gemini 3. "
            "You are able to answer questions and help with tasks. "
            "You carefully observe a user's camera feed and respond "
            "to their questions and tasks."
        ),
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
        # Gemini 3 model
        llm=gemini.LLM("gemini-3-pro-preview"),
        turn_detection=smart_turn.TurnDetection(),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    # Ensure the agent user is created
    await agent.create_user()

    # Create a call
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting Gemini 3 Agent...")

    # Have the agent join the call/room
    with await agent.join(call):
        logger.info("Joining call")
        logger.info("LLM ready")
        await asyncio.sleep(5)
        await agent.llm.simple_response(text="Describe what you currently see")
        await agent.finish()  # Run till the call ends


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
```

Run it:

```bash
uv run gemini_vision_demo.py
```

A browser tab opens with a Stream Video call. Click "Join call", grant camera/mic/screen permissions, and say something like:

"Okay, I'm going to share my screen — tell me what you see!"

Gemini 3 will instantly analyze your screen and respond with surprisingly detailed descriptions, all in a natural spoken voice.
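As for the reasoning-depth bullet from earlier: Gemini 3 exposes a thinking level that trades latency for deeper reasoning. Here's a hedged sketch using the google-genai SDK directly; whether and how the Vision Agents Gemini plugin forwards this setting is an assumption, so check the plugin docs for the supported option:

```python
# Sketch: setting Gemini 3's thinking level via the google-genai SDK directly.
# How the Vision Agents gemini plugin exposes this is an assumption — verify
# against the plugin's documentation.
from google import genai
from google.genai import types

client = genai.Client()  # picks up GOOGLE_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Describe what you see in one sentence.",
    config=types.GenerateContentConfig(
        # "low" favors latency; "high" spends more tokens on reasoning
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
)
print(response.text)
```

On a live call, low thinking keeps responses snappy; high is worth the extra latency for dense inputs like code or spreadsheets on a shared screen.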

Gemini 3 brings better reasoning and multimodal understanding, and Vision Agents makes it simple to turn that power into interactive voice/video experiences. No React, no WebRTC boilerplate, just Python.

Try it today! 🚀