Tutorials: Vision

Build a Voice-Controlled GitHub Agent in Python (MCP + Vision Agents)

Turn any GitHub repo into a voice assistant: ask about branches, open issues, create pull requests, and list contributors, all via natural conversation. The agent is powered by OpenAI's Realtime API for low-latency voice, GitHub's Model Context Protocol (MCP) server for secure repo actions, and Vision Agents for seamless orchestration. In the demo, the agent understands spoken repo names (even when
Read more ->
4 min read

Build a Drive-Thru Voice AI Ordering System With Gemini Live Speech-to-Speech

Drive-thru ordering is a deceptively hard real-time problem. Background noise, interruptions, fast-paced conversations, and the need for low-latency responses all push traditional voice systems to their limits. Modern speech-to-speech models change that equation by making natural, interruptible conversations possible without stitching together separate STT, LLM, and TTS pipelines. In this tutorial, you’ll create a real-time
Read more ->
9 min read

Build a Realtime Video Restyling Agent with Gemini 3 + Decart AI

Google's Gemini 3, released November 18, 2025, gives you multimodal reasoning and tool use for building accurate, responsive AI applications. Let's combine it with Decart AI and other leading AI services to turn casual voice commands into artistic live video style changes, no extra scaffolding required. Pair it with Decart AI's Mirage LSD, the first live-stream diffusion
Read more ->
4 min read

Build an AI Math & Physics Agent with DeepSeek-V3.2

DeepSeek recently released a powerful new model, DeepSeek-V3.2, that's now instantly accessible via OpenRouter. In under 5 minutes, you can turn it into a real-time, voice-enabled math and physics agent that not only solves problems but also explains its reasoning out loud. DeepSeek's latest open-source reasoning and agent-AI model, V3.2, leverages the new DeepSeek Sparse
Read more ->
4 min read

Build a Vision AI Agent with Gemini 3 in < 3 Minutes

We released support for Google's new Gemini 3 models inside Vision Agents — the open-source Python framework for building real-time voice and video AI applications. In this 3-minute video demo, you'll see how to spin up a fully functional vision-enabled voice agent that can see your screen (or webcam), reason with Gemini 3 Pro Preview,
Read more ->
2 min read

Build an Electronics Setup & Repair Assistant Using Baseten and Qwen3-VL

This tutorial demonstrates how to build an electronic device setup and repair assistant in Python with voice capabilities using Qwen3-VL hosted on Baseten. The assistant analyzes what a user shows on camera (like cables, ports, device components, or error states) and guides them step-by-step through setup or repair tasks. It’s designed to reduce confusion during
Read more ->
8 min read

Build an AI Voice Yoga Instructor in Python

Large language models (LLMs) have improved rapidly and now power many conversational applications that handle speech and transcription. From answering location-based questions to managing a work calendar, voice AI assistants are becoming an everyday part of both personal and professional life. In this tutorial, we’ll take those same technologies a step further, using
Read more ->
8 min read