xAI's Grok-4 delivers strong reasoning with a 256k context window, native tool use, and multimodal support. We love it for natural, low-latency voice conversations.
Pair it with Fish Audio's high-quality, expressive TTS (known for realistic prosody, emotion control, and voice cloning via short references) and Deepgram's fast, accurate STT, and you get a custom voice pipeline that's natural-sounding and responsive.
In this quick demo, the agent greets the user conversationally: "Hey there... Looks like the roles might be reversed here." It then introduces itself as Grok, built by xAI, and handles the interaction smoothly with realistic voice output via Fish Audio.
Here's exactly how to build the same realistic voice AI app yourself in under five minutes.
What You'll Build
- A conversational voice AI agent that introduces itself as Grok, handles natural dialogue, and responds with personality and helpfulness
- Real-time voice input processed by Deepgram STT → Grok-4 reasoning → expressive Fish Audio TTS output
- Smooth, interruption-friendly interactions with smart turn detection for fluid back-and-forth conversations
- Fully custom pipeline orchestrated by Vision Agents, running over Stream's production-grade WebRTC for sub-second latency
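
The flow in the list above (speech in, text through the model, speech out, with room for interruptions) can be sketched without any real services. This is an illustration only: `StubPipeline`, `handle_turn`, and `demo_barge_in` are hypothetical names, not part of Vision Agents, and each stage returns canned values instead of calling Deepgram, Grok-4, or Fish Audio.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubPipeline:
    """Hypothetical stand-ins for the STT, LLM, and TTS stages (no network calls)."""

    spoken: list = field(default_factory=list)

    async def transcribe(self, audio: bytes) -> str:
        # A real STT service would decode audio; here the bytes already hold text.
        return audio.decode("utf-8")

    async def reason(self, transcript: str) -> str:
        # A real LLM call would go here.
        return f"You said: {transcript}"

    async def speak(self, text: str) -> None:
        # A real TTS stage would stream audio chunks; we record words instead.
        for word in text.split():
            await asyncio.sleep(0.01)
            self.spoken.append(word)

    async def handle_turn(self, audio: bytes) -> str:
        # One conversational turn: STT -> LLM -> TTS.
        transcript = await self.transcribe(audio)
        reply = await self.reason(transcript)
        await self.speak(reply)
        return reply


async def demo_barge_in() -> list:
    """Start speaking, then cancel playback as if the user interrupted."""
    pipeline = StubPipeline()
    playback = asyncio.create_task(pipeline.speak("one two three four five six"))
    await asyncio.sleep(0.025)  # the user starts talking mid-reply
    playback.cancel()
    try:
        await playback
    except asyncio.CancelledError:
        pass  # cancellation is the expected outcome here
    return pipeline.spoken


if __name__ == "__main__":
    print(asyncio.run(StubPipeline().handle_turn(b"hello there")))
    print(asyncio.run(demo_barge_in()))
```

Cancelling the playback task is one simple way to model barge-in. A real framework would also need to flush any queued audio and rely on the turn detector to decide when the user has actually started speaking.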
The Stack
- LLM → Grok-4 (xAI)
- TTS → Fish Audio (realistic, emotional synthesis)
- STT → Deepgram
- Turn Detection → Smart-Turn
- Transport → Stream WebRTC
- Framework → Vision Agents (open-source)
Requirements (API Keys)
You'll need API keys from:
- xAI (for Grok-4)
- Fish Audio (TTS, supports reference voices for cloning)
- Deepgram (STT)
- Stream (API key & secret for WebRTC)
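
Before running anything, it can help to confirm that all five credentials are set. The following is a minimal sketch, assuming the environment variable names exported in Step 3; `missing_keys` is a hypothetical helper, not part of any of these SDKs.

```python
import os

# Variable names taken from the export commands in Step 3.
REQUIRED_KEYS = [
    "XAI_API_KEY",
    "FISH_AUDIO_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]


def missing_keys(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All API keys are set.")
```

Failing early with the names of the missing variables is usually easier to debug than an authentication error raised partway through joining the call.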
Step 1: Set Up the Project
    uv init grok-voice-agent
    cd grok-voice-agent
    uv add vision-agents
    uv add "vision-agents[getstream, deepgram, smart-turn]"
    # Fish Audio plugin (assuming it's available; check PyPI or visionagents.ai for latest)
    uv pip install vision-agents-plugins-fish
Step 2: Full Working Code (main.py)
    import asyncio
    import os
    from vision_agents import Agent, register
    from vision_agents.llm import XAILLM  # or OpenRouterLLM for Grok-4
    from vision_agents.tts import FishAudioTTS
    from vision_agents.stt import DeepgramSTT
    from vision_agents.turn_detection import SmartTurn
    from vision_agents.stream import StreamVideoCall

    async def main():
        # 1. Grok-4 LLM (use xAI direct or OpenRouter)
        llm = XAILLM(
            api_key=os.getenv("XAI_API_KEY"),
            model="grok-4"  # or "grok-4-fast" for optimized version
        )

        # 2. Voice components
        tts = FishAudioTTS(
            api_key=os.getenv("FISH_AUDIO_API_KEY"),
            reference_id="your_reference_voice_id"  # optional for cloned/custom voice
        )
        stt = DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY"))
        turn_detector = SmartTurn()

        # 3. Create the agent with Grok personality
        agent = Agent(
            llm=llm,
            tts=tts,
            stt=stt,
            turn_detector=turn_detector,
            name="Grok Voice Agent",
            system_prompt="""
            You are Grok, built by xAI. Be helpful, witty, and maximally truthful.
            Respond naturally in conversation, just like in the Grok chatbot.
            """
        )

        # 4. Register agent and launch Stream call
        register(agent)

        call = StreamVideoCall(
            api_key=os.getenv("STREAM_API_KEY"),
            api_secret=os.getenv("STREAM_API_SECRET"),
            call_type="default",
            call_id="grok-voice-demo"
        )

        await call.join()
        print("Voice AI app ready! Open this URL in your browser:")
        print(call.url)

        # 5. Run the agent
        await agent.run(call)

    if __name__ == "__main__":
        asyncio.run(main())
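
The `system_prompt` string in main.py is where the agent's personality lives, so swapping personalities amounts to swapping that string. The following is a minimal sketch of one way to organize this. The persona names and wording are hypothetical examples, and `build_system_prompt` is an illustrative helper rather than anything provided by Vision Agents or xAI.

```python
# First sentence of the prompt used in main.py, shared by every persona.
BASE_PROMPT = "You are Grok, built by xAI. Be helpful, witty, and maximally truthful."

# Hypothetical persona presets; "default" mirrors the prompt in main.py.
PERSONAS = {
    "default": "Respond naturally in conversation, just like in the Grok chatbot.",
    "concise": "Keep every reply to one or two short sentences suited to speech.",
    "tutor": "Explain ideas step by step and check understanding with a brief question.",
}


def build_system_prompt(persona: str = "default") -> str:
    """Combine the shared base prompt with one persona-specific instruction."""
    if persona not in PERSONAS:
        raise ValueError(f"Unknown persona {persona!r}; choose from {sorted(PERSONAS)}")
    return f"{BASE_PROMPT}\n{PERSONAS[persona]}"


if __name__ == "__main__":
    print(build_system_prompt("concise"))
```

The returned string could then be passed as the `system_prompt` argument when constructing the agent. Short, speech-oriented instructions tend to suit voice output better than long written-style prompts.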
Step 3: Run It
    export XAI_API_KEY=...
    export FISH_AUDIO_API_KEY=...
    export DEEPGRAM_API_KEY=...
    export STREAM_API_KEY=...
    export STREAM_API_SECRET=...
    export EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

    uv run main.py
Open the printed URL in your browser to load the video call UI. Join, speak, and experience Grok-4 responding in a natural, expressive voice powered by Fish Audio.
Example from the video:
Agent: "Hey there. Looks like the roles might be reversed here. I'm Grok, an AI built by xAI... What can I do for you?"
Why We Love This Stack
Vision Agents lets you mix custom voice components (like Fish Audio TTS) in fewer than 100 lines of code, and it handles orchestration, interruptions, and streaming for you.
Grok-4 brings sharp reasoning and personality, Fish Audio delivers lifelike speech with cloning support, and Deepgram + Stream keep everything low-latency and production-ready.
Everything except the hosted API services is open-source, so you can move from prototype to deployment quickly.
Links & Resources
- Full source code
- xAI Grok-4 (or via OpenRouter)
Give it a try, and experiment with different Grok agent personalities. 🥸