Stream Blog
Open Vision Agents by Stream: Open Source SDK for Building Low-Latency Vision AI Apps
The 8 Best Platforms To Build Voice AI Agents
The 6 Best LLM Tools To Run Models Locally
Using Stream to Build a Livestream Chat App in Next.js
Developer’s Guide to Ultralytics YOLO: From Theory to Real-Time Pose Detection
In most of the world, if you’re YOLO’ing, you’re jumping out of a plane, asking out your future spouse, or eating gas station sushi. In vision AI, You’re Only Looking Once. Ultralytics’ YOLO is a real-time object detection framework with a simple premise: instead of scanning an image multiple times to find and classify objects,
Build a Local AI Agent with Qwen 3.5 Small on macOS
Qwen 3.5 Small is a new family of lightweight, high-performance models from Alibaba (0.8B, 2B, 4B, and 9B parameters) that are now available on Ollama. These models support multimodal input, native tool calling, and strong reasoning, all while running efficiently on laptops, Macs, and even mobile/IoT devices. In this demo, the agent runs completely locally
Using Opus 4.6: Vibe Code a Custom Python Plugin for Vision Agents
Vision Agents has out-of-the-box support for the LLM services and providers developers need to build voice, vision, and video AI applications. The framework also makes it easy to integrate custom AI services, either by following a step-by-step guide or by vibe coding them using state-of-the-art (SoTA) models. Let’s use Claude Opus 4.6 to create a
Developer’s Guide to Building Vision AI Pipelines Using Grok
Grok tends to fly under the radar. While ChatGPT, Claude, and Gemini have found their footing in enterprise workflows and agentic toolchains, Grok remains mostly associated with X, which has overshadowed some genuinely strong capabilities. Chief among them is vision: Grok can understand and generate images, produce entire videos from a single prompt, and with
Build an AI Travel Advisor That Speaks with Gemini 3.1 Pro
Most LLMs are great at thinking, but making them speak naturally is a different challenge. Gemini 3.1 Pro changes that. This new model from Google brings significantly improved reasoning, longer context, and better tool-use capabilities, making it one of the best choices (at the time of writing) for building conversational voice agents. In this guide,
coches.net (Formerly Adevinta) Increases Buyer-Seller Transactions
coches.net is an expert mobility marketplace and market leader in Spain. At the time of its integration with Stream, coches.net operated as part of Adevinta, alongside other major European marketplace apps, including Milanuncios and Fotocasa. As a high-volume consumer marketplace, coches.net connects buyers and sellers at moments of high intent. Questions, negotiations, coordination, and decisions
Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents
Realistic text-to-speech has long been one of the hardest parts of building voice agents. Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping. Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from
ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code
ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there. Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human. And, with the ElevenLabs plugin for Vision Agents, integration is a one-liner affair: import, initialize (with optional voice/model tweaks), and pass it to your agent. No messing around
