
Build a Voice AI App in Python: Grok-4 + Fish Audio + Deepgram

Amos G.
Published January 16, 2026

xAI's Grok-4 delivers strong reasoning with a 256k context window, native tool use, and multimodal support. We love it for natural, low-latency voice conversations.

Pair it with Fish Audio's high-quality, expressive TTS (known for realistic prosody, emotion control, and voice cloning via short references) and Deepgram's fast, accurate STT, and you get a custom voice pipeline that's natural-sounding and responsive.

In this quick demo, the agent greets the user conversationally: "Hey there... Looks like the roles might be reversed here." It then introduces itself as Grok built by xAI, and handles the interaction smoothly with realistic voice output via Fish Audio.

Here's exactly how to build the same realistic voice AI app yourself in under five minutes.

What You'll Build

  • A conversational voice AI agent that introduces itself as Grok, handles natural dialogue, and responds with personality and helpfulness

  • Real-time voice input processed by Deepgram STT → Grok-4 reasoning → expressive Fish Audio TTS output

  • Smooth, interruption-friendly interactions with smart turn detection for fluid back-and-forth conversations

  • Fully custom pipeline orchestrated by Vision Agents, running over Stream's production-grade WebRTC for sub-second latency
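
The STT → LLM → TTS flow in the list above can be sketched with plain asyncio. This is only an illustration of the order in which data moves through one conversational turn: `fake_stt`, `fake_llm`, `fake_tts`, and `VoicePipeline` are hypothetical stand-ins, not part of Vision Agents, Deepgram, xAI, or Fish Audio.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


# Hypothetical stubs. The real stages are network calls to Deepgram,
# Grok-4, and Fish Audio; these only mimic their input/output shapes.
async def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


async def fake_llm(transcript: str) -> str:
    return f"You said: {transcript}"


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are synthesized audio


@dataclass
class VoicePipeline:
    stt: Callable[[bytes], Awaitable[str]]
    llm: Callable[[str], Awaitable[str]]
    tts: Callable[[str], Awaitable[bytes]]

    async def handle_turn(self, audio: bytes) -> bytes:
        transcript = await self.stt(audio)  # speech -> text
        reply = await self.llm(transcript)  # text -> reply text
        return await self.tts(reply)        # reply text -> speech


def run_demo() -> bytes:
    pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
    return asyncio.run(pipeline.handle_turn(b"hello"))
```

Because each stage is just an awaitable callable, swapping one provider for another only changes what is passed to `VoicePipeline`, which is roughly the kind of flexibility a custom pipeline is meant to give you.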

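The Stack

  • LLM: Grok-4 (xAI)

  • TTS: Fish Audio

  • STT: Deepgram

  • Turn detection: Smart Turn

  • Transport: Stream WebRTC

  • Orchestration: Vision Agents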

Requirements (API Keys) 

You'll need API keys from:

  • xAI (for Grok-4)

  • Fish Audio (TTS, supports reference voices for cloning)

  • Deepgram (STT)

  • Stream (API key & secret for WebRTC)
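
Before launching the agent, it can help to confirm that every key is actually set. The sketch below assumes the environment variable names exported in Step 3 of this post; `REQUIRED_KEYS` and `missing_keys` are hypothetical helper names, not part of any SDK.

```python
import os
from typing import Mapping

# Variable names taken from the export commands in Step 3.
REQUIRED_KEYS = (
    "XAI_API_KEY",
    "FISH_AUDIO_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
)


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Calling `missing_keys()` at the top of `main()` and exiting with a clear message when the list is non-empty tends to be easier to debug than an authentication error surfacing mid-call.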

Step 1: Set Up the Project

bash
uv init grok-voice-agent
cd grok-voice-agent
uv add vision-agents
uv add "vision-agents[getstream, deepgram, smart-turn]"
# Fish Audio plugin (assuming it's available; check pypi or visionagents.ai for latest)
uv pip install vision-agents-plugins-fish

Step 2: Full Working Code (main.py)

python
import asyncio
import os

from vision_agents import Agent, register
from vision_agents.llm import XAILLM  # or OpenRouterLLM for Grok-4
from vision_agents.tts import FishAudioTTS
from vision_agents.stt import DeepgramSTT
from vision_agents.turn_detection import SmartTurn
from vision_agents.stream import StreamVideoCall

async def main():
    # 1. Grok-4 LLM (use xAI direct or OpenRouter)
    llm = XAILLM(
        api_key=os.getenv("XAI_API_KEY"),
        model="grok-4"  # or "grok-4-fast" for optimized version
    )

    # 2. Voice components
    tts = FishAudioTTS(
        api_key=os.getenv("FISH_AUDIO_API_KEY"),
        reference_id="your_reference_voice_id"  # optional for cloned/custom voice
    )
    stt = DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY"))
    turn_detector = SmartTurn()

    # 3. Create the agent with Grok personality
    agent = Agent(
        llm=llm,
        tts=tts,
        stt=stt,
        turn_detector=turn_detector,
        name="Grok Voice Agent",
        system_prompt="""
        You are Grok, built by xAI. Be helpful, witty, and maximally truthful.
        Respond naturally in conversation, just like in the Grok chatbot.
        """
    )

    # 4. Register agent and launch Stream call
    register(agent)
    call = StreamVideoCall(
        api_key=os.getenv("STREAM_API_KEY"),
        api_secret=os.getenv("STREAM_API_SECRET"),
        call_type="default",
        call_id="grok-voice-demo"
    )

    await call.join()
    print("Voice AI app ready! Open this URL in your browser:")
    print(call.url)

    # 5. Run the agent
    await agent.run(call)

if __name__ == "__main__":
    asyncio.run(main())

Step 3: Run It

bash
export XAI_API_KEY=...
export FISH_AUDIO_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

uv run main.py

The script prints a call URL. Open it in your browser to load the video call UI, then join, speak, and hear Grok-4 respond in a natural, expressive voice powered by Fish Audio.

Example from the video:

Agent: "Hey there. Looks like the roles might be reversed here. I'm Grok, an AI built by xAI... What can I do for you?"

Why We Love This Stack 

Vision Agents lets you mix custom voice components (like Fish Audio TTS) in fewer than 100 lines of code, handling orchestration, interruptions, and streaming automatically.
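
To give a feel for what interruption handling involves, here is a minimal sketch of "barge-in" using asyncio task cancellation. It is not how Vision Agents implements it; `speak` and `demo_barge_in` are hypothetical names, and the sleeps stand in for audio playback and for the moment the user starts talking.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Hypothetical TTS playback that 'plays' one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # stands in for audio playback time


async def demo_barge_in() -> tuple[list[str], bool]:
    played: list[str] = []
    playback = asyncio.create_task(
        speak(["Hey", "there,", "I'm", "Grok"], played)
    )

    await asyncio.sleep(0.12)  # the user starts talking mid-reply
    playback.cancel()          # stop the agent's speech right away
    try:
        await playback
    except asyncio.CancelledError:
        pass                   # expected: playback was interrupted
    return played, playback.cancelled()
```

Running `asyncio.run(demo_barge_in())` shows that playback stops before the final chunk. A real agent would trigger the cancellation from a turn detector rather than a timer.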

Grok-4 brings sharp reasoning and personality, Fish Audio delivers lifelike speech with cloning support, and Deepgram + Stream keep everything low-latency and production-ready.

Everything except the hosted API services is open source, so you can move from prototype to deployment quickly.

Give it a try, and experiment with different Grok agent personalities. 🥸
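
One lightweight way to experiment with personalities is to keep the shared identity in a base prompt and swap only the style instructions. The preset names and wording below are illustrative assumptions. Only the base prompt and the "default" style come from the `system_prompt` in Step 2.

```python
BASE_PROMPT = "You are Grok, built by xAI. Be helpful, witty, and maximally truthful."

# Hypothetical presets: "tutor" and "concise" are made up for illustration.
PERSONALITIES = {
    "default": "Respond naturally in conversation, just like in the Grok chatbot.",
    "tutor": "Explain ideas step by step and check the listener's understanding.",
    "concise": "Keep every spoken reply to one or two short sentences.",
}


def build_system_prompt(style: str = "default") -> str:
    """Combine the shared identity with one personality preset."""
    if style not in PERSONALITIES:
        raise ValueError(f"Unknown personality: {style!r}")
    return f"{BASE_PROMPT} {PERSONALITIES[style]}"
```

The returned string can then be passed as the `system_prompt` argument when creating the agent.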
