Google dropped Gemini 3 Flash, a fast multimodal model that excels at video understanding, live frame analysis, and object detection. Plus, it’s cost-effective and offers low latency.
In this quick demo, we use it to build a vision AI app in under five minutes: one that watches your camera feed in real time, accurately describes what you're holding, and answers follow-up questions.
The agent instantly recognizes a small black camera ("It appears to be a compact digital camera or a small action-style camera") and later correctly identifies a silver pen.
Here’s exactly how to build the same vision AI app yourself.
What You’ll Build
- A real-time vision AI agent that sees everything in your camera feed and describes it accurately
- Perfect for object detection, scene understanding, activity recognition, or just asking "What am I holding right now?"
- Natural voice interaction with interruption support and smooth back-and-forth
- Powered by Gemini 3 Flash for state-of-the-art video reasoning at minimal cost and latency
The Stack
- LLM & Vision → Gemini 3 Flash
- TTS → Inworld AI (expressive character voices)
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements (API Keys)
You’ll need API keys from:
- Google AI Studio (for Gemini 3 Flash)
- Inworld AI (TTS)
- Deepgram (STT)
- Stream (API key & secret for WebRTC)
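Once you have all five, a quick way to catch a missing or misspelled key is a plain-Python check after they're exported in your shell (or loaded from the `.env` file you'll create in Step 3). This is just a convenience sketch; the variable names match the ones the code in Step 2 reads:

```python
import os

# The credentials the agent expects to find in its environment
REQUIRED_KEYS = [
    "GOOGLE_API_KEY",    # Gemini 3 Flash
    "INWORLD_API_KEY",   # Inworld TTS
    "DEEPGRAM_API_KEY",  # Deepgram STT
    "STREAM_API_KEY",    # Stream WebRTC
    "STREAM_API_SECRET",
]

missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
if missing:
    print(f"Missing: {', '.join(missing)}")
else:
    print("All API keys found.")
```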
Step 1: Set Up the Project
```bash
uv init gemini-vision-agent
cd gemini-vision-agent
uv add vision-agents
uv add "vision-agents[getstream, gemini, inworld, deepgram, smart-turn]"
```
Step 2: Full Working Code (main.py)
```python
import asyncio
import os

from vision_agents import Agent, register
from vision_agents.llm import GeminiLLM
from vision_agents.tts import InworldTTS
from vision_agents.stt import DeepgramSTT
from vision_agents.turn_detection import SmartTurn
from vision_agents.stream import StreamVideoCall


async def main():
    # 1. Gemini 3 Flash as the multimodal brain
    llm = GeminiLLM(
        model="gemini-3-flash-preview",
        api_key=os.getenv("GOOGLE_API_KEY")
    )

    # 2. Voice pipeline
    tts = InworldTTS(api_key=os.getenv("INWORLD_API_KEY"))
    stt = DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY"))
    turn_detector = SmartTurn()

    # 3. Create the vision agent
    agent = Agent(
        llm=llm,
        tts=tts,
        stt=stt,
        turn_detector=turn_detector,
        name="Vision Assistant",
        system_prompt="""
        You are an expert vision AI assistant.
        Analyze the live camera feed in real time.
        Describe objects clearly, notice when they change,
        and answer questions accurately.
        Be concise, helpful, and speak naturally.
        """
    )

    register(agent)

    # 4. Launch real-time video call
    call = StreamVideoCall(
        api_key=os.getenv("STREAM_API_KEY"),
        api_secret=os.getenv("STREAM_API_SECRET"),
        call_type="default",
        call_id="gemini-vision-demo"
    )

    await call.join()
    print("Vision AI app ready! Open this URL in your browser:")
    print(call.url)

    # 5. Run the agent
    await agent.run(call)


if __name__ == "__main__":
    asyncio.run(main())
```
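The heavy lifting here is Gemini 3 Flash, so it can be worth sanity-checking your key and the model id on their own before launching the full pipeline. A minimal standalone request, assuming the google-genai SDK is available in your environment (it isn't used by the code above, so add it separately if needed):

```python
import os

from google import genai  # the google-genai SDK, installed separately if needed

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

# One-off text request against the same model id the agent uses.
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Reply with the single word: ready",
)
print(response.text)
```

If that prints a response, the Gemini side is good and any remaining issues are in the voice or transport layers.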
Step 3: Run It
Create a `.env` file in your project's root, store the following credentials in it, then start the agent:
```bash
# Create a .env file in your project's root and add these API credentials
touch .env

# .env contents
GOOGLE_API_KEY=...
INWORLD_API_KEY=...
DEEPGRAM_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

# Then start the agent
uv run main.py
```
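If your shell or runner doesn't load `.env` automatically, one option is to add python-dotenv (`uv add python-dotenv`) and load the file explicitly. This is a small optional addition, not part of the framework itself:

```python
# Optional: put this at the very top of main.py so the os.getenv()
# calls below find the keys from .env.
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads .env from the current working directory
```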
Just hold something up to your camera and ask away. Watch Gemini 3 Flash nail it.
Why We Love This Stack
Vision Agents handles live video streaming, frame sampling, turn detection, and orchestration in less than 100 lines.
Gemini 3 Flash gives you best-in-class video understanding at high speed and a tiny cost.
Inworld + Deepgram + Stream deliver production-ready voice and transport.
Plus, it’s fully open-source except for the API calls.
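That modularity is easy to see in practice: retargeting the agent to a different job is mostly a prompt change. Here's a quick sketch that reuses the exact constructors from Step 2 (llm, tts, stt, and turn_detector built the same way):

```python
# Same pipeline, different job: llm, tts, stt, and turn_detector
# are constructed exactly as in Step 2 above.
inventory_agent = Agent(
    llm=llm,
    tts=tts,
    stt=stt,
    turn_detector=turn_detector,
    name="Inventory Assistant",
    system_prompt="""
    You are an inventory assistant watching a live camera feed.
    Name each item shown to the camera, keep a running count,
    and report the tally whenever you're asked.
    """
)
```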
Links & Resources
- Run the full demo on GitHub
- Vision AI Agents repo
- Vision AI Agents docs
- Gemini plugin
Go try it: hold up something from your desk and see how accurately it describes it. 📷