
Stream Blog

Open Vision Agents by Stream: Open Source SDK for Building Low-Latency Vision AI Apps

Vision Agents is a new, open-source framework from Stream that helps developers quickly build low-latency vision AI applications. The project is completely open source and ships with over ten out-of-the-box integrations, including day-one support for leading real-time voice and video models like OpenAI Realtime and Gemini Live. Text-to-speech, speech-to-text, and speech-to-speech models are also natively
Read more ->
4 min read

The 8 Best Platforms To Build Voice AI Agents

Voice assistants like Siri and Alexa are great for non-trivial everyday personal assistive tasks. However, they are limited in providing accurate answers to complex questions, real-time information, handling turns, and user interruptions. Try asking Siri about the best things
Read more ->
13 min read

The 6 Best LLM Tools To Run Models Locally

Running large language models (LLMs) like DeepSeek Chat, ChatGPT, and Claude usually involves sending data to servers managed by DeepSeek, OpenAI, and other AI model providers. While these services are secure, some businesses prefer to keep their data offline for greater privacy.
Read more ->
12 min read

Using Stream to Build a Livestream Chat App in Next.js

I always wondered how to create the dynamic chat experience of livestreams, like those found on YouTube, but with an added convenience of allowing anyone to participate without logging in. With Next.js and Stream, I was able to successfully create that experience.
Read more ->
8 min read

Add Life-Like Voices to Your AI Apps with Inworld and Vision Agents

The future of software is conversational and interactive. For developers, unlocking this frontier means moving beyond traditional text inputs to agents that can seamlessly see, hear, and speak. Our goal is to demonstrate a powerful, flexible architecture that achieves this. This allows us to build truly expressive, realtime-latency AI applications. To illustrate, consider our core

Read more ->
11 min read

Furnished Finder Builds Trusted Rental Marketplace with Stream Chat & AI Moderation

Furnished Finder is the leading two-sided marketplace for monthly furnished rentals, connecting landlords with traveling professionals, remote workers, and relocating families seeking stays of 30 days or more. With a growing network of over 300,000 listings and more than 240,000 landlords nationwide, Furnished Finder connects over 6 million renters annually to real homes in real

Read more ->
5 min read

Build a Gemini 3 Flash-Powered AI App in Python

Google dropped Gemini 3 Flash, a fast multimodal model that excels at video understanding, live frame analysis, and object detection. Plus, it’s cost-effective and offers low latency. In this quick demo, we use it to build a vision AI app in under five minutes that watches your camera feed in real time, accurately describes what

Read more ->
3 min read

Vision Agents v0.3: Deployments, HTTP Support, & 10 New Plugins

Two months after v0.2, we’re excited to share Vision Agents v0.3—our next significant milestone towards running agents in production at scale. While v0.2 introduced the foundation for building realtime multimodal AI agents, v0.3 takes these agents from prototype to production. This release brings the infrastructure you need to deploy agents at scale: HTTP APIs, observability,

Read more ->
7 min read

Peerspace Scales Messaging Safely With Stream Chat & AI Moderation

Peerspace is the leading marketplace for booking unique spaces for meetings, productions, and events. The platform connects guests with hosts through real-time, in-app messaging, enabling seamless coordination, faster bookings, and stronger trust on both sides of the marketplace. For Peerspace, keeping conversations inside the platform is a strategic priority. In-app messaging reduces reliance on third-party

Read more ->
4 min read

Build a Voice AI App in Python: Grok-4 + Fish Audio + Deepgram

xAI’s Grok-4 delivers strong reasoning with a 256k context window, native tool use, and multimodal support. We love it for natural, low-latency voice conversations. Pair it with Fish Audio’s high-quality, expressive TTS (known for realistic prosody, emotion control, and voice cloning via short references) and Deepgram’s fast, accurate STT, and you get a custom voice

Read more ->
3 min read

The 2026 Python Libraries for Real-Time Multimodal Agents

Every vision-language model tutorial shows you the same thing: send an image to GPT-4o, get a description back. Ten lines of Python. Done.

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }]
    )

Real applications need something different. A security camera

Read more ->
20 min read

Seeing with GPT‑4o: Building with OpenAI’s Vision Capabilities

Over the last few years, developers have gone from using language models for text-only chat to relying on them as general-purpose perception systems. You’re not only building chatbots; you’re building apps that use text, audio, and vision to understand and act on the world around them. GPT-4o is the most capable step yet: a single

Read more ->
14 min read