Stream Blog
Open Vision Agents by Stream: Open Source SDK for Building Low-Latency Vision AI Apps
The 8 Best Platforms To Build Voice AI Agents
The 6 Best LLM Tools To Run Models Locally
Using Stream to Build a Livestream Chat App in Next.js
Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents
Realistic text-to-speech was one of the hardest parts of building voice agents. Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping. Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from
ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code
ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there. Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human. And, with the ElevenLabs plugin for Vision Agents, integration is a one-liner affair: import, initialize (with optional voice/model tweaks), and pass it to your agent. No messing around
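The teaser describes a plugin pattern: construct a TTS object once (optionally tweaking voice or model), then hand it to the agent. A self-contained toy sketch of that shape, where `FakeTTS` and `Agent` are stand-ins and not the real Vision Agents or ElevenLabs API:

```python
# Toy illustration of the "build a TTS plugin, pass it to your agent" pattern.
# FakeTTS and Agent are hypothetical stand-ins, not the actual plugin classes.
class FakeTTS:
    def __init__(self, voice: str = "default"):
        self.voice = voice  # optional voice tweak, set once at construction

    def speak(self, text: str) -> str:
        # A real plugin would stream synthesized audio; we just tag the text.
        return f"[{self.voice}] {text}"


class Agent:
    def __init__(self, tts: FakeTTS):
        # The agent takes the TTS plugin as a constructor argument,
        # so swapping providers means changing one line.
        self.tts = tts

    def reply(self, text: str) -> str:
        return self.tts.speak(text)


agent = Agent(tts=FakeTTS(voice="Rachel"))
print(agent.reply("Hello!"))  # → "[Rachel] Hello!"
```

The value of this dependency-injection shape is that the agent never needs to know which vendor is behind the voice; only the one construction line changes.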
Lokal Scales Chat, Video, & Audio—Reaching 100M Downloads
For teams building social and community apps, speed is often the difference between learning early and falling behind. Lokal understands this better than most. Over the past seven years, the company has launched more than 60 apps across 10 categories, reaching nearly 100 million downloads across its ecosystem. Rather than betting everything on a single
When To Choose Long Polling vs Websockets for Real-Time Feeds
Real-time feeds are practically table stakes for modern applications. Users expect instant messaging, activity streams, and collaboration throughout a product. Product managers see competitors shipping these features and add them to the roadmap. Developers reach for WebSockets or start polling an endpoint. But real-time infrastructure that works in development often fails in production. Connections drop
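As a minimal illustration of the long-polling half of that trade-off, here is a toy server-side handler: it holds the request open until an event arrives or a timeout expires, then returns whatever is queued. The `long_poll` helper and its queue-based event source are illustrative assumptions, not Stream's implementation:

```python
import queue
import threading
import time

def long_poll(events: "queue.Queue[str]", timeout: float = 2.0) -> "list[str]":
    """Block up to `timeout` seconds for the first event, then drain any
    others already queued -- the server side of a long-poll endpoint in
    miniature. An empty list means the client should reconnect."""
    batch: "list[str]" = []
    try:
        # Hold the "request" open until an event arrives or we time out.
        batch.append(events.get(timeout=timeout))
    except queue.Empty:
        return batch  # timed out empty; client immediately re-polls
    while True:
        try:
            batch.append(events.get_nowait())  # drain without blocking
        except queue.Empty:
            return batch

# Simulate a producer publishing an event 0.1 s after the poll starts.
feed: "queue.Queue[str]" = queue.Queue()
threading.Thread(target=lambda: (time.sleep(0.1), feed.put("new_post"))).start()
print(long_poll(feed))  # → ['new_post']
```

The key difference from a WebSocket is visible here: each batch costs a full request/response cycle, whereas a WebSocket keeps one connection open and pushes events as they happen.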
Kimi K2.5: Build a Video & Vision Agent in Python
Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice. That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and
Build an Instagram-Style For-You Feed in React Native
Personalized content feeds keep users engaged by surfacing the content they’re most likely to enjoy. In this tutorial, you’ll build an Instagram-style “For You” feed in React Native with Expo that recommends images and videos to users based on their interests and content popularity. Get a free Stream account and use your API credentials to get started.
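A "For You" feed of this kind typically blends how well a post matches the user's interests with how popular it is. A toy scoring sketch of that blend; the `score` function, its `alpha` weight, and the log normalization are illustrative assumptions, not the tutorial's actual ranking algorithm:

```python
import math

def score(post_tags, user_interests, likes, alpha=0.7):
    """Hypothetical ranking blend: interest overlap vs. popularity."""
    # Fraction of the post's tags that match the user's interests.
    overlap = len(set(post_tags) & set(user_interests)) / max(len(post_tags), 1)
    # Log-scaled like count, roughly normalized into [0, 1].
    popularity = math.log1p(likes) / math.log1p(10_000)
    return alpha * overlap + (1 - alpha) * popularity

posts = [
    {"id": 1, "tags": ["travel", "food"], "likes": 120},
    {"id": 2, "tags": ["coding"], "likes": 9_000},
    {"id": 3, "tags": ["travel"], "likes": 15},
]
interests = ["travel"]
ranked = sorted(posts, key=lambda p: score(p["tags"], interests, p["likes"]),
                reverse=True)
print([p["id"] for p in ranked])  # → [3, 1, 2]
```

With `alpha=0.7`, the niche post that exactly matches the user's interests (id 3) outranks the viral but off-topic one (id 2); lowering `alpha` would tilt the feed back toward raw popularity.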
Create Speech-to-Text Experiences with ElevenLabs Scribe v2 Realtime & Vision Agents
ElevenLabs released Scribe v2 Realtime, an ultra-low-latency speech-to-text model with ~150 ms end-to-end transcription latency, support for 90+ languages, and what the company claims is the lowest word error rate in benchmarks across major languages and accents. It’s built specifically for agentic apps, live meetings, note-taking, and conversational AI, where every millisecond and every word matters. In this demo, Scribe v2
How Text-to-Speech Works: Neural Models, Latency, and Deployment
Not long ago, text-to-speech (TTS) was a laughingstock: robotic, obviously synthetic output made customer-service jokes write themselves and relegated TTS to accessibility contexts where users had no alternative. Now, you may have listened to text-to-speech today without even realizing it. AI-generated podcasts, automated customer service calls, voice assistants that actually sound like assistants.
