Google dropped Gemini 3 Flash, a fast multimodal model that excels at video understanding, live frame analysis, and object detection. Plus, it’s cost-effective and offers low latency.
In this quick demo, we use it to build a vision AI app in under five minutes: one that watches your camera feed in real time, accurately describes what you're holding, and answers follow-up questions.
The agent instantly recognizes a small black camera ("It appears to be a compact digital camera or a small action-style camera") and later correctly identifies a silver pen.
Here’s exactly how to build the same vision AI app yourself.
What You’ll Build
- A real-time vision AI agent that sees everything in your camera feed and describes it accurately
- Perfect for object detection, scene understanding, activity recognition, or just asking "What am I holding right now?"
- Natural voice interaction with interruption support and smooth back-and-forth
- Powered by Gemini 3 Flash for state-of-the-art video reasoning at minimal cost and latency
The Stack
- LLM & Vision → Gemini 3 Flash
- TTS → Inworld AI (expressive character voices)
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements (API Keys)
You’ll need API keys from:
- Google AI Studio (for Gemini 3 Flash)
- Inworld AI (TTS)
- Deepgram (STT)
- Stream (API key & secret for WebRTC)
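Once you have all five, a quick way to catch a missing or misspelled key is a plain-Python check after they're exported in your shell (or loaded from the `.env` file you'll create in Step 3). This is just a convenience sketch; the variable names match the ones the code in Step 2 reads:

```python
import os

# The credentials the agent expects to find in its environment
REQUIRED_KEYS = [
    "GOOGLE_API_KEY",    # Gemini 3 Flash
    "INWORLD_API_KEY",   # Inworld TTS
    "DEEPGRAM_API_KEY",  # Deepgram STT
    "STREAM_API_KEY",    # Stream WebRTC
    "STREAM_API_SECRET",
]

missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
if missing:
    print(f"Missing: {', '.join(missing)}")
else:
    print("All API keys found.")
```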
Step 1: Set Up the Project
```bash
uv init gemini-vision-agent
cd gemini-vision-agent
uv add vision-agents
uv add "vision-agents[getstream, gemini, inworld, deepgram, smart-turn]"
```
Step 2: Full Working Code (main.py)
```python
import asyncio
import os

from vision_agents import Agent, register
from vision_agents.llm import GeminiLLM
from vision_agents.tts import InworldTTS
from vision_agents.stt import DeepgramSTT
from vision_agents.turn_detection import SmartTurn
from vision_agents.stream import StreamVideoCall


async def main():
    # 1. Gemini 3 Flash as the multimodal brain
    llm = GeminiLLM(
        model="gemini-3-flash-preview",
        api_key=os.getenv("GOOGLE_API_KEY")
    )

    # 2. Voice pipeline
    tts = InworldTTS(api_key=os.getenv("INWORLD_API_KEY"))
    stt = DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY"))
    turn_detector = SmartTurn()

    # 3. Create the vision agent
    agent = Agent(
        llm=llm,
        tts=tts,
        stt=stt,
        turn_detector=turn_detector,
        name="Vision Assistant",
        system_prompt="""
        You are an expert vision AI assistant.
        Analyze the live camera feed in real time.
        Describe objects clearly, notice when they change,
        and answer questions accurately.
        Be concise, helpful, and speak naturally.
        """
    )

    register(agent)

    # 4. Launch real-time video call
    call = StreamVideoCall(
        api_key=os.getenv("STREAM_API_KEY"),
        api_secret=os.getenv("STREAM_API_SECRET"),
        call_type="default",
        call_id="gemini-vision-demo"
    )

    await call.join()
    print("Vision AI app ready! Open this URL in your browser:")
    print(call.url)

    # 5. Run the agent
    await agent.run(call)


if __name__ == "__main__":
    asyncio.run(main())
```
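The heavy lifting here is Gemini 3 Flash, so it can be worth sanity-checking your key and the model id on their own before launching the full pipeline. A minimal standalone request, assuming the google-genai SDK is available in your environment (it isn't used by the code above, so add it separately if needed):

```python
import os

from google import genai  # the google-genai SDK, installed separately if needed

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

# One-off text request against the same model id the agent uses.
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Reply with the single word: ready",
)
print(response.text)
```

If that prints a response, the Gemini side is good and any remaining issues are in the voice or transport layers.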
Step 3: Run It
Create a `.env` file in your project's root, store the following credentials in it, then start the agent:
```bash
# Create a .env file in your project's root and add these API credentials
touch .env

# .env contents
GOOGLE_API_KEY=...
INWORLD_API_KEY=...
DEEPGRAM_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

# Then start the agent
uv run main.py
```
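If your shell or runner doesn't load `.env` automatically, one option is to add python-dotenv (`uv add python-dotenv`) and load the file explicitly. This is a small optional addition, not part of the framework itself:

```python
# Optional: put this at the very top of main.py so the os.getenv()
# calls below find the keys from .env.
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the current working directory
```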
Just hold something up to your camera and ask away. Watch Gemini 3 Flash nail it.
Why We Love This Stack
Vision Agents handles live video streaming, frame sampling, turn detection, and orchestration in less than 100 lines.
Gemini 3 Flash gives you best-in-class video understanding at high speed and a tiny cost.
Inworld + Deepgram + Stream deliver production-ready voice and transport.
Plus, it’s fully open-source except for the API calls.
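That modularity is easy to see in practice: retargeting the agent to a different job is mostly a prompt change. Here's a quick sketch that reuses the exact constructors from Step 2 (llm, tts, stt, and turn_detector built the same way):

```python
# Same pipeline, different job: llm, tts, stt, and turn_detector
# are constructed exactly as in Step 2 above.
inventory_agent = Agent(
    llm=llm,
    tts=tts,
    stt=stt,
    turn_detector=turn_detector,
    name="Inventory Assistant",
    system_prompt="""
    You are an inventory assistant watching a live camera feed.
    Name each item shown to the camera, keep a running count,
    and report the tally whenever you're asked.
    """
)
```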
Links & Resources
- Run the full demo on GitHub
- Vision AI Agents repo
- Vision AI Agents docs
- Gemini plugin
Go try it: hold up something from your desk and see how accurately it describes it. 📷