Build a Voice AI App in Python: Grok-4 + Fish Audio + Deepgram

xAI's Grok-4 delivers strong reasoning with a 256k context window, native tool use, and multimodal support. We love it for natural, low-latency voice conversations.

Pair it with Fish Audio's high-quality, expressive TTS (known for realistic prosody, emotion control, and voice cloning via short references) and Deepgram's fast, accurate STT, and you get a custom voice pipeline that's natural-sounding and responsive.

In this quick demo, the agent greets the user conversationally: "Hey there... Looks like the roles might be reversed here." It then introduces itself as Grok built by xAI, and handles the interaction smoothly with realistic voice output via Fish Audio.

Here's exactly how to build the same realistic voice AI app yourself in under five minutes.

What You'll Build

A conversational voice AI agent that introduces itself as Grok, handles natural dialogue, and responds with personality and helpfulness
Real-time voice input processed by Deepgram STT → Grok-4 reasoning → expressive Fish Audio TTS output
Smooth, interruption-friendly interactions with smart turn detection for fluid back-and-forth conversations
Fully custom pipeline orchestrated by Vision Agents, running over Stream's production-grade WebRTC for sub-second latency
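
The turn-by-turn flow in the list above can be sketched in plain Python. This is an illustration only: `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for Deepgram, Grok-4, and Fish Audio, and they make no network calls. In the real app, Vision Agents wires these stages together for you.

```python
import asyncio

# Hypothetical stand-ins for the three services; no network calls are made.
async def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio (the bytes here are just UTF-8 text)."""
    return audio.decode("utf-8")

async def fake_llm(transcript: str) -> str:
    """Pretend to generate a reply to the transcript."""
    return f"You said: {transcript}"

async def fake_tts(text: str) -> bytes:
    """Pretend to synthesize speech (the text is simply encoded to bytes)."""
    return text.encode("utf-8")

async def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: speech in -> transcript -> reply -> speech out."""
    transcript = await fake_stt(audio_in)
    reply = await fake_llm(transcript)
    return await fake_tts(reply)

if __name__ == "__main__":
    audio_out = asyncio.run(handle_turn(b"hello Grok"))
    print(audio_out.decode("utf-8"))
```

Each stage is awaited in sequence, so per-turn latency is roughly the sum of the three stages. That is why streaming and fast turn detection matter so much in a voice pipeline.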

The Stack

LLM → Grok-4 (xAI)
TTS → Fish Audio (realistic, emotional synthesis)
STT → Deepgram
Turn Detection → Smart-Turn
Transport → Stream WebRTC
Framework → Vision Agents (open-source)

Requirements (API Keys)

You'll need API keys from:

xAI (for Grok-4)
Fish Audio (TTS, supports reference voices for cloning)
Deepgram (STT)
Stream (API key & secret for WebRTC)
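
Before running anything, it can help to confirm that all five keys are actually set. The snippet below is a minimal sketch using only the standard library. The variable names match the exports used in Step 3, and `missing_keys` is a hypothetical helper, not part of Vision Agents.

```python
import os

# Environment variable names used by the exports in Step 3 of this tutorial.
REQUIRED_KEYS = [
    "XAI_API_KEY",
    "FISH_AUDIO_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]

def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```

Failing fast here is friendlier than discovering a missing key halfway through a call, when one of the services rejects a request.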

Step 1: Set Up the Project

bash

uv init grok-voice-agent
cd grok-voice-agent
uv add vision-agents
uv add "vision-agents[getstream,deepgram,smart-turn]"
# Fish Audio plugin (assuming it's available; check PyPI or visionagents.ai for the latest)
uv pip install vision-agents-plugins-fish

Step 2: Full Working Code (main.py)

python

import asyncio
import os
from vision_agents import Agent, register
from vision_agents.llm import XAILLM  # or OpenRouterLLM for Grok-4
from vision_agents.tts import FishAudioTTS
from vision_agents.stt import DeepgramSTT
from vision_agents.turn_detection import SmartTurn
from vision_agents.stream import StreamVideoCall

async def main():
    # 1. Grok-4 LLM (use xAI direct or OpenRouter)
    llm = XAILLM(
        api_key=os.getenv("XAI_API_KEY"),
        model="grok-4"  # or "grok-4-fast" for optimized version
    )

    # 2. Voice components
    tts = FishAudioTTS(
        api_key=os.getenv("FISH_AUDIO_API_KEY"),
        reference_id="your_reference_voice_id"  # optional for cloned/custom voice
    )
    stt = DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY"))
    turn_detector = SmartTurn()

    # 3. Create the agent with Grok personality
    agent = Agent(
        llm=llm,
        tts=tts,
        stt=stt,
        turn_detector=turn_detector,
        name="Grok Voice Agent",
        system_prompt="""
        You are Grok, built by xAI. Be helpful, witty, and maximally truthful.
        Respond naturally in conversation, just like in the Grok chatbot.
        """
    )

    # 4. Register agent and launch Stream call
    register(agent)

    call = StreamVideoCall(
        api_key=os.getenv("STREAM_API_KEY"),
        api_secret=os.getenv("STREAM_API_SECRET"),
        call_type="default",
        call_id="grok-voice-demo"
    )

    await call.join()
    print("Voice AI app ready! Open this URL in your browser:")
    print(call.url)

    # 5. Run the agent
    await agent.run(call)

if __name__ == "__main__":
    asyncio.run(main())

Step 3: Run It

bash

export XAI_API_KEY=...
export FISH_AUDIO_API_KEY=...
export DEEPGRAM_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

uv run main.py

A browser tab opens with the video call UI. Join, speak, and experience Grok-4 responding in a natural, expressive voice powered by Fish Audio.

Example from the video:

Agent: "Hey there. Looks like the roles might be reversed here. I'm Grok, an AI built by xAI... What can I do for you?"

Why We Love This Stack

Vision Agents lets you mix custom voice components (like Fish Audio TTS) in under 100 lines of code, and it handles orchestration, interruptions, and streaming automatically.
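
To make "interruptions" concrete, here is a rough sketch of barge-in handling with plain `asyncio`. It shows the general idea only and is not how Vision Agents implements it. `speak` is a hypothetical stand-in for streamed TTS playback, and the event simulates turn detection noticing that the user has started talking.

```python
import asyncio

async def speak(text: str, spoken: list) -> None:
    """Hypothetical TTS playback: 'plays' one word at a time."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)  # stands in for streaming one audio chunk

async def speak_until_interrupted(text: str, user_started_talking: asyncio.Event) -> list:
    """Play a reply, but stop as soon as the user starts speaking."""
    spoken = []
    playback = asyncio.create_task(speak(text, spoken))
    interrupt = asyncio.create_task(user_started_talking.wait())
    # Wake up when either playback finishes or the user cuts in.
    await asyncio.wait({playback, interrupt}, return_when=asyncio.FIRST_COMPLETED)
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    # Collect both tasks so the cancellation is fully processed.
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return spoken

async def demo() -> list:
    user_started_talking = asyncio.Event()
    # Simulate the user cutting in shortly after playback begins.
    asyncio.get_running_loop().call_later(0.025, user_started_talking.set)
    return await speak_until_interrupted(
        "one two three four five six seven eight", user_started_talking
    )

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The list that comes back holds only the words played before the interruption. A real agent would use that information to tell the LLM how much of its reply the user actually heard.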

Grok-4 brings sharp reasoning and personality, Fish Audio delivers lifelike speech with cloning support, and Deepgram + Stream keep everything low-latency and production-ready.
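
If you want to check the low-latency claim on your own setup, a small timing helper is enough to see where each turn spends its time. This is a generic sketch: `timed` is a hypothetical helper, and the `time.sleep` calls are placeholders for the real STT, LLM, and TTS calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    timings = {}
    with timed("stt", timings):
        time.sleep(0.02)  # placeholder for a speech-to-text call
    with timed("llm", timings):
        time.sleep(0.03)  # placeholder for an LLM call
    with timed("tts", timings):
        time.sleep(0.02)  # placeholder for a text-to-speech call
    for stage, ms in timings.items():
        print(f"{stage}: {ms:.0f} ms")
    print(f"total: {sum(timings.values()):.0f} ms")
```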

The framework itself is open source; only the API calls go to hosted services. That makes it quick to go from prototype to deployment.

Give it a try, and experiment with different Grok agent personalities. 🥸
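
One simple way to experiment with personalities is to keep a few prompt variants side by side and pass the chosen one as the agent's system prompt. The presets below are hypothetical examples. Only the base sentence comes from the prompt used in Step 2.

```python
# Base identity taken from the system prompt in Step 2.
BASE_PROMPT = "You are Grok, built by xAI. Be helpful, witty, and maximally truthful."

# Hypothetical personality presets; names and wording are illustrative only.
PERSONALITIES = {
    "default": "Respond naturally in conversation.",
    "concise": "Keep every answer to one or two short sentences.",
    "tutor": "Explain step by step and check that the user is following.",
}

def build_system_prompt(personality: str = "default") -> str:
    """Combine the base identity with one personality preset."""
    if personality not in PERSONALITIES:
        raise ValueError(f"Unknown personality: {personality!r}")
    return f"{BASE_PROMPT}\n{PERSONALITIES[personality]}"

if __name__ == "__main__":
    print(build_system_prompt("concise"))
```

Short, spoken-style instructions tend to suit voice agents better than long written ones, since every extra sentence in a reply adds to what the user has to sit through.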