Two months after v0.2, we're excited to share Vision Agents v0.3—our next significant milestone towards running agents in production at scale. While v0.2 introduced the foundation for building realtime multimodal AI agents, v0.3 takes these agents from prototype to production. This release brings the infrastructure you need to deploy agents at scale: HTTP APIs, observability, session management, and real-world integrations.
With 10 new plugins spanning AWS, NVIDIA, HuggingFace, and more, plus powerful phone integration and RAG capabilities, v0.3 lets you build production-grade AI systems that power everything from customer support lines to AI yoga instructors.
Agent HTTP Server: Production Deployments
Starting with v0.3, you can deploy your agents as HTTP services with a built-in FastAPI server and REST endpoints.
```python
from vision_agents.core import Runner, AgentLauncher

# create_agent and join_call are your own factory/handler functions (see the full example)
launcher = AgentLauncher(create_agent=create_agent, join_call=join_call)
runner = Runner(launcher)

if __name__ == "__main__":
    runner.cli()
```
Then start your server:
```bash
uv run agent_example.py serve --host 0.0.0.0 --port 8000
```
The HTTP API provides clean REST endpoints for managing agent sessions:
Start a session (POST /sessions):
1234{ "call_id": "my-call-123", "call_type": "default" }
The server includes session management, CORS configuration, health checks, and authentication hooks. You can customize permissions with dependency injection:
```python
from fastapi import Request
from vision_agents.core.runner.http.dependencies import can_start_session

# Add your auth logic
async def my_auth_check(request: Request) -> bool:
    # Your authorization logic here
    return True
```
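How the check gets wired in depends on your setup. One sketch, assuming the Runner exposes its underlying FastAPI app (the runner.app attribute below is hypothetical), uses FastAPI's standard dependency overrides:

```python
# Hypothetical wiring: swap the default permission check for your own.
# `runner.app` is an assumed attribute; see the full example for the real hook-up.
runner.app.dependency_overrides[can_start_session] = my_auth_check
```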
See the full example at examples/08_agent_server_example/agent_server_example.py.
Metrics & Observability: Production Monitoring
Production requires visibility. v0.3 includes Prometheus integration with metrics for LLM, STT, TTS, and turn detection.
Setup is straightforward—configure OpenTelemetry before importing Vision Agents:
```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Start Prometheus HTTP server
start_http_server(9464)

# Configure OpenTelemetry
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

# Now import vision_agents
from vision_agents.core import Agent
from vision_agents.core.observability import MetricsCollector

# Create your agent
agent = Agent(...)

# Opt-in to metrics collection
collector = MetricsCollector(agent)
```
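To sanity-check the exporter, you can fetch the endpoint that start_http_server exposes. A minimal sketch, assuming the port from the example above:

```python
import urllib.request

# Fetch the raw Prometheus exposition output from the exporter started above.
with urllib.request.urlopen("http://localhost:9464/metrics") as resp:
    text = resp.read().decode()

# Print only the agent metric lines (exact exported names may carry unit suffixes).
for line in text.splitlines():
    if line.startswith(("llm_", "stt_", "tts_", "turn_")):
        print(line)
```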
Metrics are automatically tracked:
- LLM metrics: llm_latency_ms, llm_time_to_first_token_ms, llm_tokens_input, llm_tokens_output, llm_tool_calls, llm_tool_latency_ms
- STT metrics: stt_latency_ms, stt_audio_duration_ms, stt_errors
- TTS metrics: tts_latency_ms, tts_audio_duration_ms, tts_characters, tts_errors
- Turn metrics: turn_duration_ms, turn_trailing_silence_ms
You can also query metrics via the HTTP API (GET /sessions/{session_id}/metrics):
12345678910111213{ "session_id": "abc-def-123", "call_id": "my-call-123", "metrics": { "llm_latency_ms": 450, "llm_tokens_input": 1234, "llm_tokens_output": 567, "stt_latency_ms": 180, "tts_latency_ms": 220 }, "session_started_at": "2026-01-19T10:30:00Z", "metrics_generated_at": "2026-01-19T10:35:00Z" }
Hook these up to Grafana for real-time dashboards. Check out the complete example at examples/06_prometheus_metrics_example/prometheus_metrics_example.py.
Phone Integration + RAG: AI That Answers Calls
v0.3 adds support for TurboPuffer and Twilio, pairing phone integration with RAG capabilities. It's perfect for customer support, reservations, and information hotlines.
The integration supports both inbound (receive calls) and outbound (make calls) modes via Twilio. Combined with RAG, your agent can answer questions using your company's knowledge base.
Two RAG Backend Options
Option 1: Gemini File Search (default)—Gemini's built-in RAG with automatic chunking and indexing:
```python
from vision_agents.plugins import gemini

# Create file search store
file_search_store = await gemini.create_file_search_store(
    name="stream-product-knowledge",
    knowledge_dir=KNOWLEDGE_DIR,
    extensions=[".md"],
)

# Use with LLM
llm = gemini.LLM(
    "gemini-2.5-flash-lite",
    tools=[gemini.tools.FileSearch(file_search_store)],
)
```
Option 2: TurboPuffer with function calling for hybrid search:
```python
from vision_agents.plugins import turbopuffer

rag = await turbopuffer.create_rag(
    namespace="stream-product-knowledge",
    knowledge_dir=KNOWLEDGE_DIR,
    extensions=[".md"],
)

@llm.register_function(
    description="Search Stream's product knowledge base"
)
async def search_knowledge(query: str) -> str:
    return await rag.search(query, top_k=3)
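```

Once registered, the LLM can call search_knowledge on its own. When tuning retrieval it also helps to query the index directly, in the same async context where rag was created (the question below is just a placeholder):

```python
# Debugging sketch: query the TurboPuffer-backed index directly to inspect
# what the model will see. Runs in the same async context as the setup above.
results = await rag.search("How does Stream price video calling?", top_k=3)
print(results)
```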
Phone Call Workflow
The phone integration handles the full lifecycle:
- Twilio triggers a webhook on /twilio/voice
- The server starts preparing the agent and call
- A bi-directional media stream connects via WebSocket
- The agent attaches to the phone user and starts responding
- The session runs until the call ends
```python
# Excerpt from the full example; imports and the call_registry setup are omitted here.
@app.post("/twilio/voice")
async def twilio_voice_webhook(
    data: twilio.CallWebhookInput = Depends(twilio.CallWebhookInput.as_form),
):
    call_id = str(uuid.uuid4())

    async def prepare_call():
        agent = await create_agent()
        # `sanitized` is the caller's phone number, cleaned up for use as a user id
        phone_user = User(name=f"Call from {data.from_number}", id=f"phone-{sanitized}")
        stream_call = await agent.create_call("default", call_id=call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, data, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    return twilio.create_media_stream_response(url)


@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()

    # `twilio_call` is looked up from the call registry by call_id and token
    # (see the full example for the lookup and token validation)
    agent, phone_user, stream_call = await twilio_call.await_prepare()
    await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

    async with agent.join(stream_call, participant_wait_timeout=0):
        await agent.llm.simple_response(
            "Greet the caller warmly and ask what kind of app they're building."
        )
        await twilio_stream.run()
```
The phone integration uses mulaw audio encoding at 8kHz for compatibility with telephony networks. Combine this with Deepgram STT and Cartesia TTS for natural conversations.
See the full implementation at examples/03_phone_and_rag_example/inbound_phone_and_rag_example.py.
Gemini Tools Ecosystem
Beyond File Search, Gemini offers a tools ecosystem for production agents:
- GoogleSearch: Ground responses with current web data
- CodeExecution: Run Python code for calculations
- URLContext: Read specific web pages
- GoogleMaps: Location-aware queries (Preview)
- ComputerUse: Interact with browser UIs (Preview)
These tools make it easy to build agents that go beyond simple Q&A.
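As a rough sketch of combining them, assuming the other tools follow the same gemini.tools wrapper pattern shown for FileSearch above:

```python
from vision_agents.plugins import gemini

# Hypothetical combination: the exact tool constructors are assumptions,
# following the gemini.tools.FileSearch(...) pattern from the RAG example.
llm = gemini.LLM(
    "gemini-2.5-flash-lite",
    tools=[
        gemini.tools.GoogleSearch(),   # ground answers in current web results
        gemini.tools.CodeExecution(),  # let the model run Python for calculations
        gemini.tools.URLContext(),     # read specific pages the user mentions
    ],
)
```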
Security Camera: Real-World AI Vision
Want to see Vision Agents in action? The security camera example showcases face recognition, package detection, and event-driven architecture.
Features:
- Real-time face detection from camera feed
- 30-minute sliding window of detected faces
- Package theft detection with wanted poster generation
- Video overlay with visitor count and face thumbnails
- LLM integration for natural language queries
```python
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

security_processor = SecurityCameraProcessor(
    fps=5,
    time_window=1800,  # 30 minutes
    thumbnail_size=80,
    detection_interval=2.0,
    model_path="weights_custom.pt",
    package_conf_threshold=0.7,
    max_tracked_packages=1,
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Security AI", id="agent"),
    instructions="Read @instructions.md",
    processors=[security_processor],
    llm=gemini.LLM("gemini-2.5-flash-lite"),
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(eager_turn_detection=True),
)
```
The processor uses YOLO for package detection and face_recognition for visitor tracking. When a package disappears with a visitor in frame, the system automatically generates a wanted poster and optionally posts it to X.
Register functions for natural language interaction:
```python
from typing import Any, Dict

@llm.register_function(
    description="Get the number of unique visitors detected in the last 30 minutes."
)
async def get_visitor_count() -> Dict[str, Any]:
    count = security_processor.get_visitor_count()
    return {
        "unique_visitors": count,
        "time_window": "30 minutes",
    }


@llm.register_function(
    description="Register the current person's face with a name"
)
async def remember_my_face(name: str) -> Dict[str, Any]:
    return security_processor.register_current_face_as(name)
```
Users can ask questions like "How many visitors came by today?" or say "Remember me as John" to register their face.
Full code at examples/05_security_camera_example/security_camera_example.py.
10 New Plugins: Expanding the Ecosystem
v0.3 brings 10 new plugins across realtime LLMs, vision models, voice, and infrastructure.
These include out-of-the-box support for AWS Nova 2 (vision_agents.plugins.aws), Amazon Bedrock's native realtime speech-to-speech model with bidirectional streaming:
```python
from vision_agents.plugins import aws

agent = Agent(
    llm=aws.Realtime(model="amazon.nova-2-sonic-v1:0"),
    edge=getstream.Edge(),
    agent_user=User(name="Nova Agent", id="agent"),
)
```
Nova 2 handles audio I/O directly with server-side VAD and low latency. The Vision Agents Bedrock plugin also automatically handles session management and resumption beyond AWS Bedrock's default session limit, enabling long-lived agent interactions.
Vision model support also got a big upgrade in v0.3, with out-of-the-box support for NVIDIA Cosmos 2 (vision_agents.plugins.nvidia), NVIDIA's new video understanding VLM, and for Roboflow (vision_agents.plugins.roboflow) object detection with cloud and local RF-DETR models.
```python
from vision_agents.plugins import nvidia

processor = nvidia.CosmosProcessor(
    model="cosmos-2",
    prompt="Describe what you see in detail",
)

agent = Agent(processors=[processor], ...)
```
```python
from vision_agents.plugins import roboflow

# Local detection with RF-DETR
processor = roboflow.RoboflowLocalDetectionProcessor(
    model_id="rfdetr-base",
    confidence_threshold=0.5,
    draw_bboxes=True,
)

agent = Agent(processors=[processor], ...)

@agent.events.subscribe
async def on_detection_completed(event: roboflow.DetectionCompletedEvent):
    for obj in event.detected_objects:
        print(f"Detected {obj.class_name} at confidence {obj.confidence}")
```
HuggingFace Inference (vision_agents.plugins.huggingface)
HuggingFace hosts one of the world's largest communities of AI builders and open-source models. Starting in v0.3, developers can run their favourite open-weight models directly in Vision Agents using our HuggingFace Inference package. Whether you're after the latest Llama release or the newest VLM, any model deployed to HuggingFace Inference is available through a single unified API:
```python
from vision_agents.plugins import huggingface

llm = huggingface.LLM(model="meta-llama/Llama-3.3-70B-Instruct")

# Or use VLM
vlm = huggingface.VLM(model="meta-llama/Llama-3.2-11B-Vision-Instruct")
```
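Either model drops into the same Agent setup used throughout this post. A sketch, reusing the Edge, STT, and TTS pieces from earlier examples (the User import path is an assumption here):

```python
from vision_agents.core import Agent, User  # User import path is assumed
from vision_agents.plugins import huggingface, getstream, deepgram, elevenlabs

# Sketch: the same Agent wiring as the earlier examples, with an open-weight
# LLM served via HuggingFace Inference.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Open Model Agent", id="agent"),
    instructions="You are a helpful voice assistant.",
    llm=huggingface.LLM(model="meta-llama/Llama-3.3-70B-Instruct"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```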
Getting Started
Try Vision Agents v0.3 in minutes:
```bash
# Clone the repo
git clone https://github.com/GetStream/Vision-Agents

# Sync deps
cd vision-agents
uv venv --python 3.12
uv sync

# Run example of your choice
cd examples/simple_agent_example
uv run simple_agent_example.py
```
Join the Community
Vision Agents is open source and community-driven. This release includes 79 commits from multiple contributors building the future of multimodal AI.
- ⭐ Star the repo on GitHub
- 📖 Read the documentation
- 💬 Join our Discord community
- 🔧 Contribute a plugin
Try the examples, deploy an agent, and let us know what you build. We're excited to see what you create with v0.3.
Built by the team at Stream with contributions from the community.
