
Tutorials

How to Build a Local AI Voice Agent with Pocket TTS

Voice agents are getting better, but most text-to-speech pipelines still assume you’re okay with cloud APIs, large models, and unpredictable latency. If you want fast, natural-sounding speech that runs entirely on your own hardware (no GPU, no network calls), you need a different approach. In this tutorial, you’ll build a real-time AI voice agent that
Read more ->
9 min read

Add Life-Like Voices to Your AI Apps with Inworld and Vision Agents

The future of software is conversational and interactive. For developers, unlocking this frontier means moving beyond traditional text inputs to agents that can seamlessly see, hear, and speak. Our goal is to demonstrate a powerful, flexible architecture for building such agents, one that enables truly expressive, low-latency AI applications. To illustrate, consider our core
Read more ->
11 min read

Build a Gemini 3 Flash-Powered AI App in Python

Google dropped Gemini 3 Flash, a fast multimodal model that excels at video understanding, live frame analysis, and object detection. Plus, it’s cost-effective and offers low latency. In this quick demo, we use it to build a vision AI app in under five minutes that watches your camera feed in real time, accurately describes what
Read more ->
3 min read

Build a Voice AI App in Python: Grok-4 + Fish Audio + Deepgram

xAI's Grok-4 delivers strong reasoning with a 256k context window, native tool use, and multimodal support. We love it for natural, low-latency voice conversations. Pair it with Fish Audio's high-quality, expressive TTS (known for realistic prosody, emotion control, and voice cloning via short references) and Deepgram's fast, accurate STT, and you get a custom voice
Read more ->
3 min read

Clone MedTalk: HIPAA-Ready Video and Chat Consultations in Flutter

Telehealth is transforming the way patients and providers connect, offering faster access to care and reducing barriers caused by distance or scheduling. A critical part of this experience is enabling secure, real-time video consultations alongside features like chat messaging for sharing updates, questions, and follow-ups. With Stream's healthcare chat solution, developers can build HIPAA-ready communication
Read more ->
24 min read

Build a Voice-Controlled GitHub Agent in Python (MCP + Vision Agents)

Turn any GitHub repo into a voice assistant: ask about branches, open issues, create pull requests, and list contributors, all via natural conversation. The agent is powered by OpenAI's Realtime API for low-latency voice, GitHub's Model Context Protocol (MCP) server for secure repo actions, and Vision Agents for seamless orchestration. In the demo, the agent understands spoken repo names (even when
Read more ->
4 min read

Build a Drive-Thru Voice AI Ordering System With Gemini Live Speech-to-Speech

Drive-thru ordering is a deceptively hard real-time problem. Background noise, interruptions, fast-paced conversations, and the need for low-latency responses all push traditional voice systems to their limits. Modern speech-to-speech models change that equation by making natural, interruptible conversations possible without stitching together separate STT, LLM, and TTS pipelines. In this tutorial, you’ll create a real-time
Read more ->
9 min read

Build a Realtime Video Restyling Agent with Gemini 3 + Decart AI

Google's Gemini 3, released November 18, 2025, gives you multimodal reasoning and tool use for building accurate, responsive AI applications. Let's combine it with Decart AI and other leading AI services to turn casual voice commands into artistic live video style changes, no extra scaffolding required. Pair it with Decart AI's Mirage LSD, the first live-stream diffusion
Read more ->
4 min read