
Vision Agents v0.5.0 Release: Local Hardware I/O, Anam Avatars, and Faster Deepgram TTS

Nash R.
Published April 8, 2026

It's been a busy period since our last release, and now it’s time to share Vision Agents v0.5.0 — a step toward making production-grade multimodal AI agents easy to build and deploy.

While previous versions laid the groundwork for real-time voice, video, and vision agents, v0.5.0 focuses on stability at scale and even more expressive integrations. This release introduces support for Anam avatars, fixes long-running resource leaks that could impact memory in production deployments, and brings several new plugins and examples to help you build richer experiences faster.

Anam Avatar Integration: Bring Your Agents to Life with Synchronized Video

One of the highlights of v0.5.0 is native support for Anam avatars. You can now stream real-time audio from your agent (or any TTS) directly to an Anam avatar and receive synchronized video frames back — all with proper interruption handling when the user starts speaking.

The new vision-agents-plugins-anam package makes it easy to add expressive, human-like animated video avatars to your agents. It handles audio resampling to Anam’s requirements, video resolution configuration, and graceful session management.
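In spirit, wiring an avatar into a pipeline looks something like the sketch below. The import path, class name, and parameters here are illustrative assumptions, not the exact plugin API — the bundled anam example shows the real usage.

```
# Illustrative sketch only — names are assumptions, not the exact plugin API.
from vision_agents.plugins import anam

avatar = anam.AvatarPublisher(avatar_id="...", resolution=(512, 512))

# Conceptually, the plugin then:
#  - resamples the agent's TTS audio to Anam's expected sample rate,
#  - streams it to the avatar session,
#  - publishes the returned synchronized video frames into the call,
#  - and interrupts in-flight avatar audio when the user starts speaking.
```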

Check out the included anam example to see it in action with a complete agent pipeline.

For developers, this opens up exciting possibilities for coaching agents, virtual assistants, customer support, and any experience where a visual avatar makes the interaction feel more natural and engaging.

Improved Memory Management

Running agents at scale requires rock-solid resource management. In v0.5.0 we addressed a significant stability issue where HTTP clients and WebSocket connections from STT, TTS, and Stream Edge plugins were not being closed when an agent shut down.

This could lead to gradual memory growth and orphaned connections in long-running or high-throughput deployments. We've now ensured proper shutdown across the board:

  • The StreamEdge now correctly closes the AsyncStream client and handles connection cleanup defensively.
  • ElevenLabs TTS and Deepgram STT plugins properly close their underlying HTTP clients.
  • Added tests to verify cleanup behavior.
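The fix comes down to tying every client's lifetime to the agent's shutdown path. The snippet below is not the plugin code itself, just a stdlib sketch of the pattern, with `FakeClient` standing in for an HTTP or WebSocket client:

```python
import asyncio
from contextlib import AsyncExitStack

class FakeClient:
    """Stands in for an HTTP or WebSocket client (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.closed = False

    async def aclose(self):
        self.closed = True

async def run_agent():
    # AsyncExitStack guarantees every registered client is closed on exit,
    # even if setup or the session itself raises.
    async with AsyncExitStack() as stack:
        stt = FakeClient("stt")
        tts = FakeClient("tts")
        for client in (stt, tts):
            stack.push_async_callback(client.aclose)
        # ... run the session ...
    return stt, tts

stt, tts = asyncio.run(run_agent())
```

Registering cleanup at creation time, rather than in a trailing `finally` block, is what makes the teardown defensive: a failure halfway through setup still closes everything created so far.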

If you've been monitoring memory usage in production, this change should deliver a noticeable improvement and make your agents far more reliable over extended runs.

LocalEdge: Run Agents on Your Machine

The new LocalEdge lets you swap out the cloud-based Stream edge for direct local I/O — your microphone for input, your speakers for output, and optionally your camera for video. Media stays on your machine; only the LLM, STT, and TTS calls go over the network.

It works as a drop-in replacement for getstream.Edge(). When you start the agent, it walks you through selecting your input, output, and video devices interactively, or you can configure them in code. This makes it the fastest way to prototype a new agent (no call setup, no browser, no Stream account needed to get started), and it's also the right foundation for desktop agents, kiosks, and on-device deployments where a cloud video call doesn't make sense.
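The swap itself is a one-liner. In the sketch below, the import paths and constructor arguments are assumptions for illustration; see the example linked below for the real setup.

```
# Illustrative sketch — import paths and names are assumptions.
from vision_agents.core import Agent
from vision_agents.plugins import local

# Before: cloud-based edge
#   agent = Agent(edge=getstream.Edge(), llm=..., stt=..., tts=...)

# After: local mic/speaker/camera I/O, everything else unchanged
agent = Agent(edge=local.LocalEdge(), llm=..., stt=..., tts=...)
```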

Check out examples/10_local_transport_example to try it.

Production Deployment with Helm Charts and Monitoring

For teams taking Vision Agents to production on Kubernetes, v0.5.0 now includes a Helm chart to get you started. The chart packages a Vision Agent deployment with an optional Redis dependency for state and caching, configurable via a single redis.deploy.enabled flag. It's a clean starting point — sensible defaults, easy to extend with your own values, and ready to slot into an existing cluster or CI/CD pipeline.
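For example, a minimal values override that turns on the bundled Redis might look like this. Only the `redis.deploy.enabled` flag is named above; the file name and surrounding structure are illustrative assumptions.

```yaml
# values.override.yaml — only redis.deploy.enabled is documented above;
# everything else in your override is up to your deployment.
redis:
  deploy:
    enabled: true
```

You would pass this to the chart the usual Helm way, e.g. with a `-f values.override.yaml` flag on install or upgrade.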

Check out examples/07_k8s_deploy_example for the full setup.

Massively Improved Deepgram TTS Latency

If you've been using Deepgram for text-to-speech, v0.5.0 delivers a major latency improvement. We replaced the previous HTTP-based per-call synthesis with a persistent WebSocket connection that streams audio per utterance rather than generating it all at once.

The WebSocket stays open and gets reused across calls, eliminating the connection overhead that added up in back-and-forth conversations. We also removed unnecessary silence padding that was being inserted before and after each TTS response, and added proper support for interrupting and cancelling in-flight synthesis. The net result is noticeably faster audio responses — your agent starts speaking sooner after generating its reply, and turn-taking feels tighter overall.
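Schematically, the move from per-call connections to a persistent socket looks like the sketch below. This is a stdlib illustration with a stub socket, not the actual Deepgram plugin code:

```python
import asyncio

class StubSocket:
    """Stands in for a TTS WebSocket; counts how often we reconnect."""
    connects = 0

    def __init__(self):
        StubSocket.connects += 1

    async def synth(self, text):
        return f"audio:{text}"

class PersistentTTS:
    """Reuses one socket across utterances instead of dialing per call."""
    def __init__(self):
        self._sock = None

    async def speak(self, text):
        if self._sock is None:  # connect lazily, exactly once
            self._sock = StubSocket()
        return await self._sock.synth(text)

async def main():
    tts = PersistentTTS()
    return [await tts.speak(t) for t in ("hi", "there", "again")]

results = asyncio.run(main())
# StubSocket.connects is 1: one connection serves all three utterances.
```

With per-call HTTP, the connection count would equal the utterance count; amortizing one handshake across the whole conversation is where the latency win comes from.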

Expanded Plugin Ecosystem and New Examples

v0.5.0 continues to grow the plugin library and the collection of ready-to-run examples.

New provider integrations:

  • AssemblyAI — STT via Universal-3 Pro with real-time streaming speaker diarization, configurable silence thresholds, and key terms prompting. Great for meeting assistants or any agent that needs to know who is speaking.
  • Anam avatar plugin for synchronized video avatars.

Plugin updates:

  • HuggingFace — new detection processor for vision models, plus function calling support.
  • Enhanced integrations for Anthropic (Claude), AWS Bedrock/Polly, Cartesia, Deepgram (STT + TTS), and ElevenLabs.

These additions, combined with improvements to transcript buffering, turn detection, audio queuing, and reconnection logic, give you more flexible building blocks for multimodal agents in production.

One More Thing: New Splash Screen!

We saved the best for last. Vision Agents now greets you with a splash screen when your agent starts up — a pixel-art banner and the version number. That's it. It doesn't do anything. It just looks cool, and you'll know your code is running. We should have done it in v0.1.

(If you hate fun, you can disable this by running with --no-splash. It also won't appear in non-interactive terminals.)

Getting Started

Trying Vision Agents v0.5.0 is straightforward:

```bash
uv add vision-agents

# or, to include all plugins:
uv add "vision-agents[all-plugins]"
```

Then explore the new examples in the repository — many come with clear READMEs, .env.example files, and architecture notes to help you get up and running quickly.

Want to try the Anam avatar right away? Head to the Anam avatar plugin and run the example on GitHub.

Join the Community

Vision Agents is open source and community-driven. Whether you're building customer support agents, real-time coaches, moderation tools, or something completely new, we'd love to see what you create.

Try the examples, deploy an agent, and let us know what you build. Share your projects on X, open a discussion on GitHub, engage on Discord, or reach out if you'd like to integrate Stream’s real-time video and chat components into your app.
