
Tutorials: Vision

Build a Restaurant Reservation AI Agent With Turbopuffer and Twilio

Let’s build a restaurant reservation system that lets callers speak with a voice agent over a real-time phone call. The service will have three main features. Agent outbound call: the agent can act as both a customer helper and a restaurant assistant; for example, it can be configured as an AI restaurant employee that calls customers back
Read more
12 min read

Grok TTS + Vision: Build a Healthcare Appointment Agent

This step-by-step guide will help you build an AI front-desk receptionist that interacts with patients through conversations, assesses their conditions, and advises whether to visit a doctor or seek online medical advice. When an agent can see the patient’s condition in real time, it can make a smarter recommendation, saving patients an unnecessary trip to
Read more
12 min read

Build a Local AI Agent with Qwen 3.5 Small on macOS

Qwen 3.5 Small is a new family of lightweight, high-performance models from Alibaba (0.8B, 2B, 4B, and 9B parameters), now available on Ollama. These models support multimodal input, native tool calling, and strong reasoning, all while running efficiently on laptops and even mobile/IoT devices. In this demo, the agent runs completely locally
Read more
3 min read

Using Opus 4.6: Vibe Code a Custom Python Plugin for Vision Agents

Vision Agents has out-of-the-box support for the LLM services and providers developers need to build voice, vision, and video AI applications. The framework also makes it easy to integrate custom AI services — either by following a step-by-step guide or by vibe coding them using SoTA models. Let’s use Claude Opus 4.6 to create a
Read more
9 min read

Build an AI Travel Advisor That Speaks with Gemini 3.1 Pro

Most LLMs are great at thinking, but making them speak naturally is a different challenge. Gemini 3.1 Pro changes that. This new model from Google brings significantly improved reasoning, longer context, and better tool-use capabilities, making it one of the best choices (at the time of writing) for building conversational voice agents. In this guide,
Read more
2 min read

Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents

Realistic text-to-speech has long been one of the hardest parts of building voice agents. Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping. Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from
Read more
2 min read

ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code

ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices available. Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human. And with the ElevenLabs plugin for Vision Agents, integration takes only a few lines: import, initialize (with optional voice/model tweaks), and pass it to your agent. No messing around
Read more
3 min read

Kimi K2.5: Build a Video & Vision Agent in Python

Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice. That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and
Read more
3 min read