Tutorials: Vision
Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents
Realistic text-to-speech has long been one of the hardest parts of building voice agents. Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping. Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from
Read more
2 min read
ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code
ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there. Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human. And with the ElevenLabs plugin for Vision Agents, integration is a one-liner: import, initialize (with optional voice/model tweaks), and pass it to your agent. No messing around
Read more
3 min read
Kimi K2.5: Build a Video & Vision Agent in Python
Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice. That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and
Read more
3 min read
Create Speech-to-Text Experiences with ElevenLabs Scribe v2 Realtime & Vision Agents
ElevenLabs released Scribe v2 Realtime, an ultra-low-latency speech-to-text model with ~150 ms end-to-end transcription, support for 90+ languages, and what the company claims is the lowest word error rate in benchmarks across major languages and accents. It's built specifically for agentic apps, live meetings, note-taking, and conversational AI, where every millisecond and every word matters. In this demo, Scribe v2
Read more
2 min read
How to Build a Local AI Voice Agent with Pocket TTS
Voice agents are getting better, but most text-to-speech pipelines still assume you’re okay with cloud APIs, large models, and unpredictable latency. If you want fast, natural-sounding speech that runs entirely on your own hardware (no GPU, no network calls), you need a different approach. In this tutorial, you’ll build a real-time AI voice agent that
Read more
9 min read
Add Life-Like Voices to Your AI Apps with Inworld and Vision Agents
The future of software is conversational and interactive. For developers, unlocking this frontier means moving beyond traditional text inputs to agents that can seamlessly see, hear, and speak. Our goal is to demonstrate a powerful, flexible architecture that achieves this, letting us build truly expressive, real-time AI applications with low latency. To illustrate, consider our core
Read more
11 min read
Build a Gemini 3 Flash-Powered AI App in Python
Google dropped Gemini 3 Flash, a fast multimodal model that excels at video understanding, live frame analysis, and object detection. Plus, it's cost-effective and offers low latency. In this quick demo, we use it to build a vision AI app in under five minutes: one that watches your camera feed in real time and accurately describes what
Read more
3 min read
Build a Voice AI App in Python: Grok-4 + Fish Audio + Deepgram
xAI's Grok-4 delivers strong reasoning with a 256k context window, native tool use, and multimodal support. We love it for natural, low-latency voice conversations. Pair it with Fish Audio's high-quality, expressive TTS (known for realistic prosody, emotion control, and voice cloning via short references) and Deepgram's fast, accurate STT, and you get a custom voice
Read more
3 min read