The future of software is conversational and interactive. For developers, unlocking this frontier means moving beyond traditional text inputs to agents that can seamlessly see, hear, and speak.
Our goal is to demonstrate a powerful, flexible architecture that achieves this, allowing us to build truly expressive, low-latency, real-time AI applications.
To illustrate, consider our core demo: a next-generation sports coaching application built with the Vision Agents SDK and Inworld's Text-to-Speech (TTS) engine.
You'll build a digital companion that helps you complete different bodyweight exercises, giving you detailed feedback on your form and execution. It does this by receiving your live video feed and commenting in a nuanced, natural voice that reacts in real time. This level of responsiveness is delivered through dynamic, instantaneous voice synthesis.
Here’s a demo of what we’re building in this guide:
Why Use This Stack?
The goal of this project is to give developers an open platform for building expressive, real-time vision and audio applications rapidly. To get there, the stack focuses on performance, flexibility, and quality.
It uses the Vision Agents SDK for the multimodal processing backbone, Inworld for state-of-the-art conversational audio, and Next.js for the frontend.
Below, we’ll quickly review these tech choices.
Inworld: Voice Synthesis Designed for Conversation
When building an interactive agent, the quality and speed of the Text-to-Speech (TTS) engine directly influence the user experience. A high-latency or robotic voice breaks immersion and frustrates the user. Inworld was selected specifically to deliver a conversational experience that feels instantaneous and natural.
- State-of-the-Art Quality: Inworld's models consistently rank highly in industry evaluations. For instance, their performance is validated on leaderboards like the Hugging Face TTS Arena or Artificial Analysis Speech Arena Leaderboard, demonstrating superior voice quality (high Speaker Similarity/SIM score) and clarity (low Word Error Rate/WER). This ensures the agent's voice is virtually indistinguishable from human speech.
- New Models: Models like inworld-tts-1.5-mini and inworld-tts-1.5-max reach new levels of realism and immersion while maintaining incredible speed and precision, at the same pricing as their previous models ($5/M chars for Mini and $10/M chars for Max).
- Speed: P90 time-to-audio latency of <250ms for Max and <130ms for Mini (4x faster than prior generations), enabling natural back-and-forth conversations.
- Quality: 30% greater expressiveness and a 40% reduction in word error rate (WER), significantly reducing hallucinations, cutoffs, and artifacts vs. prior generations, even in complex, long-form conversations.
- Accessibility: Their language support now spans 15 languages, with the addition of Hindi, Arabic, and Hebrew. They also have an expanded voice library and improved instant voice cloning.
- Competitive Cost-Efficiency: Offering top-tier quality and performance at a highly competitive price point makes Inworld an accessible and scalable solution for developers looking to deploy real-time voice agents to production.
Vision Agents SDK: Powers the Conversation
The SDK is built to be provider-agnostic, allowing developers to integrate their preferred models for Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS), making it the ideal orchestration layer for this integration.
- Modular Architecture: Vision Agents is designed with a component-based system where elements like the LLM, VAD (Voice Activity Detection), STT, and TTS are loosely coupled and independently configurable. This modularity is essential, as it allows us to easily swap out parts of our stack without needing to rewrite the entire pipeline.
- Bring Your Own Keys (BYOK) for LLMs: The platform's commitment to allowing developers to "bring their own keys" for upstream AI services is a critical feature. This not only empowers developers with choice but also enables rapid experimentation and iteration. For complex multimodal use cases, being able to quickly test different LLM providers and models - optimizing for factors like latency, reasoning capability, or cost - is a significant competitive advantage.
- Video or Voice Agent mode: Vision Agents is built to support Video AI use cases by default, including support for running VLM models, arbitrary computer vision models as processors, and more. In addition to video, it also excels at voice agents, with support for phone calling (inbound and outbound), RAG, function calling, MCP, and more.
Frontend: Next.js for Performance and Velocity
The final component is the interface that connects the user to the agent. A performant and maintainable frontend framework is essential for rapid deployment and scaling.
- Proven, Fast Stack: Next.js, built on React, provides a robust and well-documented foundation for building modern web applications. Its file-based routing and established patterns allow for a fast development cycle and quick iteration.
- Real-Time Readiness: While Vision Agents handles the real-time socket connections for video/audio transport, Next.js provides the stability and performance needed to handle a high-volume, interactive web application. Features such as Server-Side Rendering (SSR) and Automatic Code Splitting ensure the application remains fast and responsive on the client side, which is crucial for a seamless user experience in a real-time agent application.
Building the Backend
The core of our interactive agent is the backend system, responsible for handling real-time audio streams, processing user input via a Large Language Model (LLM), and generating expressive voice responses using Inworld. We built this system using a Python-based setup leveraging the flexibility and speed of the Vision Agents SDK.
In this section, we’ll walk through the implementation.
1. Vision Agents: The Python-Based Foundation
Project Setup
The first step is to set up the project itself and install all the necessary dependencies. We're using uv as the project and package manager, and first initialize the project with this command:
```bash
uv init my-project
```
After that, we cd into the project folder and then install all the necessary dependencies (we’ll go over why we need each of them in the next section):
```bash
cd my-project
uv add "vision-agents[getstream,gemini,inworld,deepgram,smart_turn]" python-dotenv
```
With this, the project is set up properly. Inside the pyproject.toml file, we can see all the installed dependencies.
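For reference, the dependencies section could look roughly like the excerpt below (illustrative only; the exact version pins depend on what uv resolves on your machine):

```toml
# Illustrative pyproject.toml excerpt after running the uv add command above.
[project]
name = "my-project"
dependencies = [
    "vision-agents[getstream,gemini,inworld,deepgram,smart_turn]",
    "python-dotenv",
]
```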
Prerequisite Environment Setup
Before diving into the code, our setup requires several API keys, which are securely loaded from a .env file using the python-dotenv library.
- INWORLD_API_KEY: For the Inworld Text-to-Speech service.
- STREAM_API_KEY and STREAM_API_SECRET: For the underlying real-time communication provided by Stream (accessed via getstream.Edge).
- DEEPGRAM_API_KEY: For the Speech-to-Text service.
```python
from dotenv import load_dotenv

# ... other imports

load_dotenv()
```
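For reference, the corresponding .env file simply lists the keys described above with your own values (placeholders shown here):

```
INWORLD_API_KEY=your-inworld-api-key
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-api-secret
DEEPGRAM_API_KEY=your-deepgram-api-key
```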
2. The create_agent Factory Function
The primary logic resides within the create_agent asynchronous function, which instantiates and configures the Agent object. This is where all the chosen services are wired together.
Step 2.1: Configuring Communication and Core Services
We initialize the Agent and pass it instances of the chosen providers:
- Real-Time Edge: getstream.Edge() handles the live video/audio transport.
- Speech-to-Text: deepgram.STT() transcribes the user's spoken input into text.
- Turn Detection: smart_turn.TurnDetection() ensures the agent knows precisely when the user has finished speaking, preventing interruptions and ensuring a natural conversational flow.
Step 2.2: Integrating the Inworld TTS Plugin
The key integration is achieved by instantiating the inworld.TTS plugin. We provide a specific voice_id and the inworld-tts-1.5-max model for the highest voice quality while keeping latency low.
```python
# ... inside create_agent ...
tts=inworld.TTS(
    voice_id=voice_setup[current_voice]['voice_id'],
    model_id="inworld-tts-1.5-max"
),
# ...
```
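The voice_setup dictionary and current_voice selection come from the demo's surrounding code and aren't shown above. A minimal sketch of what such a mapping could look like (the voice ID below is a placeholder, not a real Inworld voice ID):

```python
# Hypothetical voice registry used by the snippet above; swap in real voice IDs
# from your Inworld workspace.
voice_setup = {
    "coach": {"voice_id": "your-inworld-voice-id"},
    # additional personas can be added here
}
current_voice = "coach"
```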
Step 2.3: Selecting the LLM
We use Google's Gemini 2.5 Flash as the agent's intelligence via the llm parameter. This model is fast, efficient, and excels at instruction following, making it a strong choice for real-time conversational tasks where latency is critical.
3. The Role of Instructions and Character Design
In a conversational agent, the instructions provided to the LLM are the "operating system" for the character: they keep the agent in character (in our case, a sports coach), enforce length limits, and, most importantly, leverage the advanced features of Inworld as the TTS provider. We also add the relevant information on how to properly execute the exercises as context, so the agent knows what to look for.
The agent's instructions serve two primary functions:
- Exercise Technique: We give the LLM knowledge of how to perform each of the exercises and workouts. This is kept in a separate Markdown file, which makes it easier to separate the different types of information we hand to the LLM. Here’s how the (simplified) content of that file looks:
# Exercise technique instruction library
Short, form-focused technique notes designed for concise coaching prompts.
## Pushups
**How it works**
- Start in a high plank: hands under shoulders, body in one straight line.
- Lower as a single unit by bending elbows (~30–45° from torso), then press the floor away to return.
**Common mistakes**
- Hips sag or pike (broken body line).
- Elbows flare straight out to the sides.
- Shrugging shoulders up toward ears / losing control at the bottom.
- Partial range: barely bending elbows or not getting chest close to the floor.
**Fix / cues**
- “Ribs down, glutes tight, brace like a plank.”
- “Elbows back at 30–45°, screw hands into the floor.”
- “Shoulders away from ears; control down, pause, press.”
- “Lower chest between hands; use knees-elevated/bench.”
## Squats
...
- Activating Inworld Audio Markups: To access Inworld's expressive voice capabilities (emotions, pauses, non-verbal cues), the LLM must be explicitly told how to format its output. We enforce this with the instruction: "Make heavy use of @inworld-audio-guide.md to generate speech." This forces the LLM to output specialized markup alongside the response text, which the Inworld TTS plugin then correctly parses and executes. Here’s an example of a possible Markdown file containing these instructions:
## Audio Markup Rules
### Emotion and Delivery Style Tags
Place these at the BEGINNING of text segments to control how the following text is spoken:
- [happy] - Use for positive, enthusiastic, or joyful responses
- [sad] - Use for empathetic, disappointing, or melancholic content
- [angry] - Use for firm corrections or expressing frustration
- [surprised] - Use for unexpected discoveries or amazement
- [fearful] - Use for warnings or expressing concern
- [disgusted] - Use for expressing strong disapproval
- [laughing] - Use when text should be delivered with laughter
- [whispering] - Use for secrets, quiet emphasis, or intimate tone
### Non-Verbal Vocalization Tags
Insert these EXACTLY WHERE the sound should occur in your text:
- [breathe] - Add between thoughts or before important statements
- [clear_throat] - Use before corrections or important announcements
- [cough] - Use sparingly for realism
- [laugh] - Insert after humor or when expressing amusement
- [sigh] - Use to express resignation, relief, or empathy
- [yawn] - Use when expressing tiredness or boredom
## Example Response Patterns
For a helpful response:
[happy] I'd be glad to help you with that! [breathe] Here's what you need to know...
For delivering bad news:
[sad] Unfortunately, that's not possible. [sigh] Let me explain why...
## Critical Rules
- **NEVER use multiple emotion tags in the same text segment** - only one at the beginning
- **NEVER place non-verbal tags at the beginning** - they go where the sound occurs
- **ALWAYS consider the emotional context** of the user's message
- **KEEP usage natural** - if unsure whether to add a tag, don't
- **REMEMBER these are experimental** and only work in English
Example Instruction String:
You are a highly motivated fitness coach. Answer short and precisely. You help a user to perform the following exercises: @exercise-technique.md. Give appropriate feedback on the form. Ensure always to respond using language features defined in @inworld-audio-instructions.md.
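If your setup doesn't resolve the @-file references automatically, one straightforward alternative is to inline the Markdown guides into the instruction string yourself. The sketch below assumes the two files live next to the script, and the helper name is our own, not part of the SDK:

```python
from pathlib import Path

def build_instructions() -> str:
    """Compose the coach persona with the technique and audio-markup guides."""
    technique = Path("exercise-technique.md").read_text(encoding="utf-8")
    audio_guide = Path("inworld-audio-instructions.md").read_text(encoding="utf-8")
    return (
        "You are a highly motivated fitness coach. Answer short and precisely. "
        "You help a user to perform the following exercises:\n"
        f"{technique}\n"
        "Give appropriate feedback on the form. "
        "Always respond using the language features defined below:\n"
        f"{audio_guide}"
    )
```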
4. Code Snapshot: Assembling the Agent
The final create_agent function encapsulates this entire configuration:
```python
async def create_agent(**kwargs) -> Agent:
    """Create the agent with Inworld AI TTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Character", id="agent"),
        instructions=f"You are a highly motivated fitness coach. Answer short and precise. You help a user to perform [...]",
        tts=inworld.TTS(model_id="inworld-tts-1.5-max"),
        stt=deepgram.STT(),
        llm=gemini.LLM("gemini-2.5-flash"),
        turn_detection=smart_turn.TurnDetection(),
    )
    return agent
```
5. Running the Agent: join_call and CLI
The join_call function handles the final steps: creating the agent user, initializing the call, and having the agent join the room where the conversation takes place. The cli utility from Vision Agents handles argument parsing, allowing the developer to run the agent with different call types and IDs from the command line (and deploy it easily).
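As a rough sketch of that entry point (the join/finish calls follow the patterns from the Vision Agents examples, but the exact helper names are assumptions and may differ between SDK versions; the demo uses the SDK's cli utility for argument parsing instead of the hard-coded defaults shown here):

```python
import asyncio
import uuid

async def join_call(call_type: str = "default", call_id: str | None = None) -> None:
    """Create the agent, register its user, and join the call with the human user."""
    agent = await create_agent()
    await agent.create_user()  # assumed helper: registers the agent as a call participant
    call = agent.edge.client.video.call(call_type, call_id or str(uuid.uuid4()))
    with await agent.join(call):  # assumed pattern: stream audio/video both ways
        await agent.finish()      # keep the agent in the call until it ends

if __name__ == "__main__":
    asyncio.run(join_call())
```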
The resulting system is a robust, low-latency, and highly expressive conversational agent, built quickly using a modular architecture.
Frontend
While the backend agent is responsible for intelligence and voice synthesis, the frontend is the interface that handles the user's real-time audio/video input and displays the agent's output. Given our choice of Next.js, we utilize the Stream Video React SDK to manage the low-latency, real-time communication layer with minimal boilerplate.
Integrating the Stream Video component into a React/Next.js application is streamlined and developer-friendly, abstracting away the complexities of WebRTC and SFU management.
Let’s look at the steps.
Step 1: Install the SDK
The first step is to incorporate the necessary package. The @stream-io/video-react-sdk provides the core client logic, hooks, and a comprehensive set of composable UI components needed to build the video experience.
```bash
npm install @stream-io/video-react-sdk
```
Step 2: Initialize the StreamVideoClient
The StreamVideoClient acts as the connection hub for all video and audio interactions. It must be initialized with the application's API credentials and the current user's identity. In a typical Next.js application, this initialization occurs once, often at the highest-level component (e.g., within a context provider) where the user's presence is required.
The client instance is created using a robust singleton pattern (getOrCreateInstance), ensuring resources are managed efficiently.
```tsx
import { StreamVideoClient, User } from '@stream-io/video-react-sdk';

// Define the user (authentication/session logic provides the token and ID)
const user: User = {
  id: 'user-id',
  name: 'Demo User',
  image: 'https://getstream.io/random_png/?name=Demo+User',
};

// Initialize the client with credentials
const client = StreamVideoClient.getOrCreateInstance({
  apiKey: 'your-api-key',
  user,
  token: 'user-token',
});
```
Step 3: Instantiate and Join the Call
Once the client is ready, we instantiate the specific call. The client.call() method takes a call type ('default' for basic video calls) and a unique call ID, which corresponds to the room the Python agent is configured to join. The subsequent call.join() method handles the essential networking tasks, connecting the client to the Selective Forwarding Unit (SFU) and facilitating media transport.
```tsx
// Create a call instance that matches the agent's room ID
const call = client.call('default', 'my-call-id');

// Connect to the call. `create: true` handles persistence.
await call.join({ create: true });
```
Step 4: Structuring the Real-Time Context
To make the client and call instances accessible throughout the component tree, we wrap the application with the appropriate providers. We also ensure the necessary CSS is imported for a functional and clean interface.
- <StreamVideo client={client}>: Provides the global client context to all child components.
- <StreamCall call={call}>: Provides the context for the active call, making stream data, participant lists, and controls available to the video UI components.
This final step creates the necessary data flow for the user's browser to send their audio to the Vision Agents backend and receive the real-time, expressive voice synthesized by Inworld.
```tsx
import { StreamVideo, StreamCall, StreamTheme } from '@stream-io/video-react-sdk';
import '@stream-io/video-react-sdk/dist/css/styles.css';

export default function MyVideoApp() {
  return (
    <StreamVideo client={client}>
      <StreamTheme>
        <StreamCall call={call}>
          {/* Our custom video layout or Stream's UI components go here */}
        </StreamCall>
      </StreamTheme>
    </StreamVideo>
  );
}
```
Summary
This completes the full stack: the Next.js frontend handles the presentation, the Stream layer manages real-time transport, the Vision Agents SDK orchestrates the multimodal pipeline, and Inworld delivers the expressive, low-latency voice.
Helpful links:
- Inworld TTS: Documentation on how to use Inworld’s TTS capabilities
- Vision Agents Documentation: all you need to know to use the Vision Agents SDK
- Inworld Next-Gen Voice: Explore Inworld's expressive voice models and instant voice cloning to bring real-time conversational applications and characters to life.
- Vision Agents SDK: Fork the repository, explore the examples, and see how easy it is to build your own real-time vision and audio applications.
There are truly no limits to your creativity when you combine an open vision platform with next-generation conversational voice.
Start building the future of multimodal agents today. 🚀
