Tutorials
Build an AI Travel Advisor That Speaks with Gemini 3.1 Pro
Most LLMs are great at thinking, but making them speak naturally is a different challenge. Gemini 3.1 Pro changes that. This new model from Google brings significantly improved reasoning, longer context, and better tool-use capabilities, making it one of the best choices (at the time of writing) for building conversational voice agents. In this guide,
2 min read
Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents
Realistic text-to-speech has long been one of the hardest parts of building voice agents. Most models sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping. Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from
2 min read
ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code
ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices available. Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human. With the ElevenLabs plugin for Vision Agents, integration takes only a few lines: import, initialize (with optional voice/model tweaks), and pass it to your agent. No messing around
3 min read
Kimi K2.5: Build a Video & Vision Agent in Python
Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice. That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and
3 min read
Build an Instagram-Style For-You Feed in React Native
Personalized feeds keep users engaged by surfacing content they’re most likely to enjoy. In this tutorial, you’ll build an Instagram-style “For You” feed in React Native with Expo that recommends images and videos to users based on their interests and content popularity. Get a free Stream account and use your API credentials to get started.
9 min read
Create Speech-to-Text Experiences with ElevenLabs Scribe v2 Realtime & Vision Agents
ElevenLabs released Scribe v2 Realtime, an ultra-low-latency speech-to-text model with ~150 ms end-to-end transcription, support for 90+ languages, and a claimed lowest Word Error Rate in benchmarks for major languages and accents. It's built specifically for agentic apps, live meetings, note-taking, and conversational AI, where every millisecond and every word matters. In this demo, Scribe v2
2 min read
Building A2UI-Powered Interfaces with Stream Chat
A2UI (Agent-to-UI) is a protocol designed by Google to standardize how AI agents communicate with user interfaces. Instead of tightly coupling agents to specific frontends, A2UI defines a clear contract for intent, state, and actions, making it easier to build interactive, agent-driven experiences that are portable, composable, and UI-agnostic. As AI systems move from
9 min read
How to Build a Local AI Voice Agent with Pocket TTS
Voice agents are getting better, but most text-to-speech pipelines still assume you’re okay with cloud APIs, large models, and unpredictable latency. If you want fast, natural-sounding speech that runs entirely on your own hardware (no GPU, no network calls), you need a different approach. In this tutorial, you’ll build a real-time AI voice agent that
9 min read