Most LLMs are great at thinking, but making them speak naturally is a different challenge. Gemini 3.1 Pro changes that.
This new model from Google brings significantly improved reasoning, longer context, and better tool-use capabilities, making it one of the best choices (at the time of writing) for building conversational voice agents.
In this guide, we’ll use Gemini 3.1 Pro (both the standard preview and custom tools variant) as the brain for a real-time voice AI agent and travel advisor built with Vision Agents.
What You’ll Build
- A real-time voice AI agent that tells a charming story on command and gives travel advice about Europe
- Natural, coherent spoken responses with strong reasoning and storytelling ability
- Support for both Gemini 3.1 Pro Preview and the Custom Tools variant
- Low-latency voice pipeline using Vision Agents and Stream
The Stack
- LLM → Gemini 3.1 Pro (via Vision Agents Gemini plugin)
- TTS → ElevenLabs
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements
- Google API key (from Google AI Studio)
- ElevenLabs API key
- Deepgram API key
- Stream API key & secret
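Before wiring everything together, it can help to fail fast if any key is missing. Here's a minimal sketch (not part of the official setup) that checks the environment for the variable names used later in this guide:

```python
import os

# Keys required by this guide's stack; names match the export
# commands used later when running the agent.
REQUIRED_KEYS = [
    "GEMINI_API_KEY",
    "ELEVENLABS_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]

def missing_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    if env is None:
        env = os.environ
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing API keys: {', '.join(missing)}")
```

Run it once before starting the agent to catch typos in your shell exports early.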
Step 1: Install the Plugins
```shell
uv add vision-agents
uv add "vision-agents[getstream, gemini, deepgram, elevenlabs, smart-turn]"
```
Step 2: Full Working Code (main.py)
```python
from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, elevenlabs


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions=(
            "You're a helpful voice/vision AI assistant powered by Gemini 3.1 Pro. "
            "Keep replies short and conversational. Be concise and to the point. "
            "Always describe what you see in the user's video camera feed."
        ),
        stt=deepgram.STT(eager_turn_detection=True),
        tts=elevenlabs.TTS(),
        llm=gemini.LLM("gemini-3.1-pro-preview"),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Greet the user")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
Step 3: Run It
```shell
export GEMINI_API_KEY=...
export ELEVENLABS_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...

EXAMPLE_BASE_URL=https://demo.visionagents.ai uv run main.py run
```
Join the call and ask the agent to tell a story or give travel advice. You’ll hear Gemini 3.1 Pro respond clearly and naturally.
Why We Love This Setup
Gemini 3.1 Pro brings noticeably better reasoning and coherence to voice agents compared to previous versions.
The Vision Agents Gemini plugin makes switching between the standard preview and custom tools variant extremely easy: just change the model name.
You get strong storytelling, advice-giving, and conversational ability with minimal code.
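To illustrate the "just change the model name" point, here's a tiny sketch of keeping the variant choice in one place. The `gemini-3.1-pro-preview` id comes from the code above; the custom-tools id below is a hypothetical placeholder, since this guide doesn't spell out the exact string:

```python
# Only the model name changes between variants. "gemini-3.1-pro-preview"
# matches the earlier code; the custom-tools id is an assumed placeholder,
# not a confirmed model name.
GEMINI_MODELS = {
    "preview": "gemini-3.1-pro-preview",
    "custom-tools": "gemini-3.1-pro-custom-tools",  # assumed placeholder
}

def gemini_model(variant: str = "preview") -> str:
    """Return the model string to pass to gemini.LLM(...)."""
    return GEMINI_MODELS[variant]

# In create_agent(), you would then write something like:
#   llm=gemini.LLM(gemini_model("custom-tools"))
```

Centralizing the name like this makes A/B-testing the two variants a one-argument change.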
Try asking the agent for travel recommendations in Europe, and test both Gemini 3.1 Pro variants to see which one feels more natural in conversation. ✈️