
Build a Gemini 3 Flash-Powered AI App in Python

3 min read
Amos G.
Published January 20, 2026

Google dropped Gemini 3 Flash, a fast multimodal model that excels at video understanding, live frame analysis, and object detection. Plus, it’s cost-effective and offers low latency.

In this quick demo, we use it to build, in under five minutes, a vision AI app that watches your camera feed in real time, accurately describes what you're holding, and answers follow-up questions.

The agent instantly recognizes a small black camera ("It appears to be a compact digital camera or a small action-style camera") and later correctly identifies a silver pen.

Here’s exactly how to build the same vision AI app yourself.

What You’ll Build

  • A real-time vision AI agent that sees everything in your camera feed and describes it accurately
  • Perfect for object detection, scene understanding, activity recognition, or just asking "What am I holding right now?"
  • Natural voice interaction with interruption support and smooth back-and-forth
  • Powered by Gemini 3 Flash for state-of-the-art video reasoning at minimal cost and latency

The Stack

  • Vision Agents (open-source SDK handling live video streaming, frame sampling, turn detection, and orchestration)
  • Gemini 3 Flash (multimodal reasoning over the live video feed)
  • Inworld (text-to-speech)
  • Deepgram (speech-to-text)
  • Stream (WebRTC video transport)

Requirements (API Keys)

You’ll need API keys from:

  • Google AI Studio (for Gemini 3 Flash)
  • Inworld AI (TTS)
  • Deepgram (STT)
  • Stream (API key & secret for WebRTC)

Step 1: Set Up the Project

bash
uv init gemini-vision-agent
cd gemini-vision-agent
uv add vision-agents
uv add "vision-agents[getstream, gemini, inworld, deepgram, smart-turn]"
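Before writing any code, you can sanity-check the install. A minimal sketch, assuming the SDK is published under the same distribution name used in the uv add command above (check_install.py is just an example filename):

python
# Print the installed version of the vision-agents distribution.
# Run with: uv run python check_install.py
from importlib.metadata import version

print(version("vision-agents"))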

Step 2: Full Working Code (main.py)

python
import asyncio
import os

from vision_agents import Agent, register
from vision_agents.llm import GeminiLLM
from vision_agents.tts import InworldTTS
from vision_agents.stt import DeepgramSTT
from vision_agents.turn_detection import SmartTurn
from vision_agents.stream import StreamVideoCall


async def main():
    # 1. Gemini 3 Flash as the multimodal brain
    llm = GeminiLLM(
        model="gemini-3-flash-preview",
        api_key=os.getenv("GOOGLE_API_KEY")
    )

    # 2. Voice pipeline
    tts = InworldTTS(api_key=os.getenv("INWORLD_API_KEY"))
    stt = DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY"))
    turn_detector = SmartTurn()

    # 3. Create the vision agent
    agent = Agent(
        llm=llm,
        tts=tts,
        stt=stt,
        turn_detector=turn_detector,
        name="Vision Assistant",
        system_prompt="""
        You are an expert vision AI assistant.
        Analyze the live camera feed in real time.
        Describe objects clearly, notice when they change,
        and answer questions accurately.
        Be concise, helpful, and speak naturally.
        """
    )
    register(agent)

    # 4. Launch real-time video call
    call = StreamVideoCall(
        api_key=os.getenv("STREAM_API_KEY"),
        api_secret=os.getenv("STREAM_API_SECRET"),
        call_type="default",
        call_id="gemini-vision-demo"
    )
    await call.join()

    print("Vision AI app ready! Open this URL in your browser:")
    print(call.url)

    # 5. Run the agent
    await agent.run(call)


if __name__ == "__main__":
    asyncio.run(main())
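The system_prompt is the main lever for steering the agent. As a purely illustrative sketch that reuses only the Agent parameters shown above, you could narrow the same app into an object-counting assistant:

python
# Illustrative alternative prompt: same constructor as above, narrower task.
agent = Agent(
    llm=llm,
    tts=tts,
    stt=stt,
    turn_detector=turn_detector,
    name="Object Counter",
    system_prompt="""
    You are a vision assistant that counts and names objects.
    When asked, list each distinct object visible in the camera feed
    and how many of each you can see. Keep answers short.
    """
)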

Step 3: Run It

Store the following credentials in your .env file:

shell
# Create a .env file in your project's root and add these API credentials
touch .env
GOOGLE_API_KEY=...
INWORLD_API_KEY=...
DEEPGRAM_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://pronto-staging.getstream.io
uv run main.py
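Before launching, it helps to fail fast if a key is missing. Here's a minimal preflight sketch; it assumes you add python-dotenv (uv add python-dotenv) to load the .env file, and preflight.py is just an example filename:

python
# Load .env and verify every credential main.py reads via os.getenv() is set.
# Run with: uv run python preflight.py
import os

from dotenv import load_dotenv

REQUIRED_KEYS = [
    "GOOGLE_API_KEY",
    "INWORLD_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]

load_dotenv()
missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing credentials in .env: {', '.join(missing)}")
print("All credentials found. Start the agent with: uv run main.py")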

Just hold something up to your camera and ask away. Watch Gemini 3 Flash nail it.

Why We Love This Stack

Vision Agents handles live video streaming, frame sampling, turn detection, and orchestration in less than 100 lines.

Gemini 3 Flash gives you best-in-class video understanding at high speed and a tiny cost.

Inworld + Deepgram + Stream deliver production-ready voice and transport.

Plus, it’s fully open-source except for the API calls.

Go try it: hold up something from your desk and see how accurately it describes it. 📷
