
Stream Blog

Open Vision Agents by Stream: Open Source SDK for Building Low-Latency Vision AI Apps

Vision Agents is a new open-source framework from Stream that helps developers quickly build low-latency vision AI applications. It ships with more than ten out-of-the-box integrations, including day-one support for leading real-time voice and video models such as OpenAI Realtime and Gemini Live. Text-to-speech, speech-to-text, and speech-to-speech models are also natively supported…
Read more
4 min read

The 8 Best Platforms To Build Voice AI Agents

Voice assistants like Siri and Alexa are great for everyday personal assistive tasks. However, they are limited when it comes to answering complex questions accurately, providing real-time information, and handling turn-taking and user interruptions. Try asking Siri about the best things…
Read more
13 min read

The 6 Best LLM Tools To Run Models Locally

Running large language models (LLMs) like DeepSeek Chat, ChatGPT, and Claude usually involves sending data to servers managed by DeepSeek, OpenAI, and other AI model providers. While these services are secure, some businesses prefer to keep their data offline for greater privacy.
Read more
12 min read

Using Stream to Build a Livestream Chat App in Next.js

I always wondered how to create the dynamic chat experience of livestreams, like those found on YouTube, but with the added convenience of allowing anyone to participate without logging in. With Next.js and Stream, I was able to create that experience.
Read more
8 min read

Add Text-to-Speech to Apps with Cartesia Sonic 3 & Vision Agents

Realistic text-to-speech has been one of the hardest parts of building voice agents. Most models either sounded robotic, introduced noticeable latency, or required complex integration that slowed down prototyping. Cartesia Sonic 3 changes that equation. Released in late 2025, it combines sub-200 ms first-chunk latency, strong emotional expressiveness, multilingual support, and the ability to clone voices from…

Read more
2 min read

ElevenLabs with Vision Agents: Add Text-to-Speech in a Few Lines of Code

ElevenLabs delivers some of the most lifelike and expressive text-to-speech voices out there. Its natural intonation, emotion, and multilingual support make your AI agents sound genuinely human. And, with the ElevenLabs plugin for Vision Agents, integration is a one-liner affair: import, initialize (with optional voice/model tweaks), and pass it to your agent. No messing around…

Read more
3 min read

Lokal Scales Chat, Video, & Audio—Reaching 100M Downloads

For teams building social and community apps, speed is often the difference between learning early and falling behind. Lokal understands this better than most. Over the past seven years, the company has launched more than 60 apps across 10 categories, reaching nearly 100 million downloads across its ecosystem. Rather than betting everything on a single…

Read more
5 min read

When To Choose Long Polling vs. WebSockets for Real-Time Feeds

Real-time feeds are practically table stakes for modern applications. Users expect instant messaging, activity streams, and collaboration throughout a product. Product managers see competitors shipping these features and add them to the roadmap. Developers reach for WebSockets or start polling an endpoint. But real-time infrastructure that works in development often fails in production. Connections drop…
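The core trade-off the article examines can be seen in the long-polling pattern itself: the server holds each request open until an event arrives or a timeout expires, and the client simply re-polls. A minimal illustrative sketch in Python (an in-memory stand-in, not a real HTTP server):

```python
import time
from collections import deque

class FeedServer:
    """Toy in-memory stand-in for a long-polling feed endpoint (illustrative only)."""

    def __init__(self):
        self.events = deque()

    def poll(self, timeout=2.0, interval=0.05):
        """Hold the 'request' open until an event arrives or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if self.events:
                return self.events.popleft()
            time.sleep(interval)  # in a real server: await a condition, don't spin
        return None  # timed out; the client reconnects and polls again

server = FeedServer()
server.events.append({"type": "message", "text": "hi"})
print(server.poll())                 # event already queued: returns immediately
print(server.poll(timeout=0.1))     # queue empty: held open, then None
```

A WebSocket replaces this reconnect loop with one persistent connection the server pushes into, which is exactly where the production failure modes (dropped connections, reconnection storms) that the article discusses come from.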

Read more
18 min read

Kimi K2.5: Build a Video & Vision Agent in Python

Imagine pointing your webcam at everyday objects (or even sharing your screen with code) and having an AI instantly understand what it sees, reason through it step by step, and explain everything back to you in a natural voice. That’s what Kimi K2.5 from Moonshot AI makes possible when accessed via its OpenAI-compatible API and
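Because the API is OpenAI-compatible, a webcam frame can be sent as a standard OpenAI-style chat message with an inline base64 image. A minimal sketch of building that payload (the base URL and model id below are assumptions; check Moonshot AI's documentation for the actual values):

```python
import base64
import json

# Assumed values -- verify against Moonshot AI's docs before use.
BASE_URL = "https://api.moonshot.ai/v1"
MODEL = "kimi-k2.5"

def vision_message(prompt: str, jpeg_bytes: bytes) -> dict:
    """Build an OpenAI-style chat message pairing text with a base64-encoded image."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# A frame captured from the webcam would go here; fake bytes for illustration.
payload = {
    "model": MODEL,
    "messages": [vision_message("What object am I holding?", b"\xff\xd8fake-jpeg")],
}
print(json.dumps(payload)[:80])
```

The same payload shape works with the official OpenAI Python client pointed at `BASE_URL`, which is what makes "OpenAI-compatible" useful: no provider-specific SDK is required.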

Read more
3 min read

Build an Instagram-Style For-You Feed in React Native

Personalized content feeds keep users engaged by surfacing content they're most likely to enjoy. In this tutorial, you'll build an Instagram-style "For You" feed in React Native with Expo that recommends images and videos to users based on their interests and content popularity. Get a free Stream account and use your API credentials to get started.
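The recommendation idea behind such a feed can be reduced to a scoring function that blends interest match with popularity. A toy sketch (illustrative only; the weights, the popularity cap, and the function itself are assumptions, not Stream's actual ranking):

```python
def for_you_score(post, user_interests, w_interest=0.7, w_pop=0.3):
    """Toy 'For You' ranking: blend interest overlap with capped popularity."""
    tags = set(post["tags"])
    # Fraction of the post's tags that match the user's interests.
    interest = len(tags & user_interests) / max(len(tags), 1)
    # Normalize likes and cap at 1.0 so virality can't drown out relevance.
    popularity = min(post["likes"] / 1000, 1.0)
    return w_interest * interest + w_pop * popularity

posts = [
    {"id": 1, "tags": ["travel", "food"], "likes": 120},
    {"id": 2, "tags": ["ml"], "likes": 5000},
]
ranked = sorted(posts, key=lambda p: for_you_score(p, {"travel"}), reverse=True)
print([p["id"] for p in ranked])  # → [1, 2]: relevance outweighs raw likes
```

Weighting interest above popularity is what makes the feed feel "for you" rather than a global trending list; the tutorial's real implementation delegates this to Stream's feed ranking.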

Read more
9 min read

Create Speech-to-Text Experiences with ElevenLabs Scribe v2 Realtime & Vision Agents

ElevenLabs released Scribe v2 Realtime, an ultra-low-latency speech-to-text model with ~150 ms end-to-end transcription, support for 90+ languages, and what the company claims is the lowest Word Error Rate in benchmarks for major languages and accents. It's built specifically for agentic apps, live meetings, note-taking, and conversational AI, where every millisecond and every word matters. In this demo, Scribe v2…

Read more
2 min read

How Text-to-Speech Works: Neural Models, Latency, and Deployment

Not long ago, text-to-speech (TTS) was a punchline: robotic, obviously synthetic output that made customer-service jokes write themselves and relegated TTS to accessibility contexts where users had no alternative. Now you may have listened to text-to-speech today without even realizing it: AI-generated podcasts, automated customer service calls, voice assistants that actually sound like assistants.

Read more
17 min read