Two months after v0.2, we're excited to share Vision Agents v0.3—our next significant milestone towards running agents in production at scale. While v0.2 introduced the foundation for building realtime multimodal AI agents, v0.3 takes these agents from prototype to production. This release brings the infrastructure you need to deploy agents at scale: HTTP APIs, observability, session management, and real-world integrations.
With 10 new plugins spanning AWS, NVIDIA, HuggingFace, and more, plus powerful phone integration and RAG capabilities, v0.3 lets you build production-grade AI systems that power everything from customer support lines to AI yoga instructors.
Agent HTTP Server: Production Deployments
Starting with v0.3, you can deploy your agents as HTTP services with a built-in FastAPI server and REST endpoints.
```python
from vision_agents.core import Runner, AgentLauncher

# create_agent and join_call are your own factory/handler functions (see the full example)
launcher = AgentLauncher(create_agent=create_agent, join_call=join_call)
runner = Runner(launcher)

if __name__ == "__main__":
    runner.cli()
```
Then start your server:
```bash
uv run agent_example.py serve --host 0.0.0.0 --port 8000
```
The HTTP API provides clean REST endpoints for managing agent sessions:
Start a session (POST /sessions):
1234{ "call_id": "my-call-123", "call_type": "default" }
The server includes session management, CORS configuration, health checks, and authentication hooks. You can customize permissions with dependency injection:
```python
from fastapi import Request
from vision_agents.core.runner.http.dependencies import can_start_session

# Add your auth logic
async def my_auth_check(request: Request) -> bool:
    # Your authorization logic here
    return True
```
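How the check gets wired in depends on your setup. One sketch, assuming the Runner exposes its underlying FastAPI app (the runner.app attribute below is hypothetical), uses FastAPI's standard dependency overrides:

```python
# Hypothetical wiring: swap the default permission check for your own.
# `runner.app` is an assumed attribute; see the full example for the real hook-up.
runner.app.dependency_overrides[can_start_session] = my_auth_check
```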
See the full example at examples/08_agent_server_example/agent_server_example.py.
Metrics & Observability: Production Monitoring
Production requires visibility. v0.3 includes Prometheus integration with metrics for LLM, STT, TTS, and turn detection.
Setup is straightforward—configure OpenTelemetry before importing Vision Agents:
```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Start Prometheus HTTP server
start_http_server(9464)

# Configure OpenTelemetry
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

# Now import vision_agents
from vision_agents.core import Agent
from vision_agents.core.observability import MetricsCollector

# Create your agent
agent = Agent(...)

# Opt-in to metrics collection
collector = MetricsCollector(agent)
```
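To sanity-check the exporter, you can fetch the endpoint that start_http_server exposes. A minimal sketch, assuming the port from the example above:

```python
import urllib.request

# Fetch the raw Prometheus exposition output from the exporter started above.
with urllib.request.urlopen("http://localhost:9464/metrics") as resp:
    text = resp.read().decode()

# Print only the agent metric lines (exact exported names may carry unit suffixes).
for line in text.splitlines():
    if line.startswith(("llm_", "stt_", "tts_", "turn_")):
        print(line)
```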
Metrics are automatically tracked:
- LLM metrics: llm_latency_ms, llm_time_to_first_token_ms, llm_tokens_input, llm_tokens_output, llm_tool_calls, llm_tool_latency_ms
- STT metrics: stt_latency_ms, stt_audio_duration_ms, stt_errors
- TTS metrics: tts_latency_ms, tts_audio_duration_ms, tts_characters, tts_errors
- Turn metrics: turn_duration_ms, turn_trailing_silence_ms
You can also query metrics via the HTTP API (GET /sessions/{session_id}/metrics):
12345678910111213{ "session_id": "abc-def-123", "call_id": "my-call-123", "metrics": { "llm_latency_ms": 450, "llm_tokens_input": 1234, "llm_tokens_output": 567, "stt_latency_ms": 180, "tts_latency_ms": 220 }, "session_started_at": "2026-01-19T10:30:00Z", "metrics_generated_at": "2026-01-19T10:35:00Z" }
Hook these up to Grafana for real-time dashboards. Check out the complete example at examples/06_prometheus_metrics_example/prometheus_metrics_example.py.
Phone Integration + RAG: AI That Answers Calls
v0.3 adds support for TurboPuffer and Twilio, pairing phone integration with RAG capabilities. It's perfect for customer support, reservations, and information hotlines.
The integration supports both inbound (receive calls) and outbound (make calls) modes via Twilio. Combined with RAG, your agent can answer questions using your company's knowledge base.
Two RAG Backend Options
Option 1: Gemini File Search (default)—Gemini's built-in RAG with automatic chunking and indexing:
```python
from vision_agents.plugins import gemini

# Create file search store
file_search_store = await gemini.create_file_search_store(
    name="stream-product-knowledge",
    knowledge_dir=KNOWLEDGE_DIR,
    extensions=[".md"],
)

# Use with LLM
llm = gemini.LLM(
    "gemini-2.5-flash-lite",
    tools=[gemini.tools.FileSearch(file_search_store)],
)
```
Option 2: TurboPuffer with function calling for hybrid search:
```python
from vision_agents.plugins import turbopuffer

rag = await turbopuffer.create_rag(
    namespace="stream-product-knowledge",
    knowledge_dir=KNOWLEDGE_DIR,
    extensions=[".md"],
)

@llm.register_function(
    description="Search Stream's product knowledge base"
)
async def search_knowledge(query: str) -> str:
    return await rag.search(query, top_k=3)
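```

Once registered, the LLM can call search_knowledge on its own. When tuning retrieval it also helps to query the index directly, in the same async context where rag was created (the question below is just a placeholder):

```python
# Debugging sketch: query the TurboPuffer-backed index directly to inspect
# what the model will see. Runs in the same async context as the setup above.
results = await rag.search("How does Stream price video calling?", top_k=3)
print(results)
```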
Phone Call Workflow
The phone integration handles the full lifecycle:
- Twilio triggers a webhook on /twilio/voice
- The server starts preparing the agent and call
- A bi-directional media stream connects via WebSocket
- The agent attaches to the phone user and starts responding
- The session runs until the call ends
```python
# Excerpt from the full example; imports and the call_registry setup are omitted here.
@app.post("/twilio/voice")
async def twilio_voice_webhook(
    data: twilio.CallWebhookInput = Depends(twilio.CallWebhookInput.as_form),
):
    call_id = str(uuid.uuid4())

    async def prepare_call():
        agent = await create_agent()
        # `sanitized` is the caller's phone number, cleaned up for use as a user id
        phone_user = User(name=f"Call from {data.from_number}", id=f"phone-{sanitized}")
        stream_call = await agent.create_call("default", call_id=call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, data, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    return twilio.create_media_stream_response(url)


@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()

    # `twilio_call` is looked up from the call registry by call_id and token
    # (see the full example for the lookup and token validation)
    agent, phone_user, stream_call = await twilio_call.await_prepare()
    await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

    async with agent.join(stream_call, participant_wait_timeout=0):
        await agent.llm.simple_response(
            "Greet the caller warmly and ask what kind of app they're building."
        )
        await twilio_stream.run()
```
The phone integration uses mulaw audio encoding at 8kHz for compatibility with telephony networks. Combine this with Deepgram STT and Cartesia TTS for natural conversations.
See the full implementation at examples/03_phone_and_rag_example/inbound_phone_and_rag_example.py.
Gemini Tools Ecosystem
Beyond File Search, Gemini offers a tools ecosystem for production agents:
- GoogleSearch: Ground responses with current web data
- CodeExecution: Run Python code for calculations
- URLContext: Read specific web pages
- GoogleMaps: Location-aware queries (Preview)
- ComputerUse: Interact with browser UIs (Preview)
These tools make it easy to build agents that go beyond simple Q&A.
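As a rough sketch of combining them, assuming the other tools follow the same gemini.tools wrapper pattern shown for FileSearch above:

```python
from vision_agents.plugins import gemini

# Hypothetical combination: the exact tool constructors are assumptions,
# following the gemini.tools.FileSearch(...) pattern from the RAG example.
llm = gemini.LLM(
    "gemini-2.5-flash-lite",
    tools=[
        gemini.tools.GoogleSearch(),   # ground answers in current web results
        gemini.tools.CodeExecution(),  # let the model run Python for calculations
        gemini.tools.URLContext(),     # read specific pages the user mentions
    ],
)
```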
Security Camera: Real-World AI Vision
Want to see Vision Agents in action? The security camera example showcases face recognition, package detection, and event-driven architecture.
Features:
- Real-time face detection from camera feed
- 30-minute sliding window of detected faces
- Package theft detection with wanted poster generation
- Video overlay with visitor count and face thumbnails
- LLM integration for natural language queries
```python
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

security_processor = SecurityCameraProcessor(
    fps=5,
    time_window=1800,  # 30 minutes
    thumbnail_size=80,
    detection_interval=2.0,
    model_path="weights_custom.pt",
    package_conf_threshold=0.7,
    max_tracked_packages=1,
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Security AI", id="agent"),
    instructions="Read @instructions.md",
    processors=[security_processor],
    llm=gemini.LLM("gemini-2.5-flash-lite"),
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(eager_turn_detection=True),
)
```
The processor uses YOLO for package detection and face_recognition for visitor tracking. When a package disappears with a visitor in frame, the system automatically generates a wanted poster and optionally posts it to X.
Register functions for natural language interaction:
```python
from typing import Any, Dict

@llm.register_function(
    description="Get the number of unique visitors detected in the last 30 minutes."
)
async def get_visitor_count() -> Dict[str, Any]:
    count = security_processor.get_visitor_count()
    return {
        "unique_visitors": count,
        "time_window": "30 minutes",
    }


@llm.register_function(
    description="Register the current person's face with a name"
)
async def remember_my_face(name: str) -> Dict[str, Any]:
    return security_processor.register_current_face_as(name)
```
Users can ask questions like "How many visitors came by today?" or say "Remember me as John" to register their face.
Full code at examples/05_security_camera_example/security_camera_example.py.
10 New Plugins: Expanding the Ecosystem
v0.3 brings 10 new plugins across realtime LLMs, vision models, voice, and infrastructure.
These include out-of-the-box support for AWS Nova 2 (vision_agents.plugins.aws), Amazon Bedrock's native realtime speech-to-speech model with bidirectional streaming:
```python
from vision_agents.plugins import aws

agent = Agent(
    llm=aws.Realtime(model="amazon.nova-2-sonic-v1:0"),
    edge=getstream.Edge(),
    agent_user=User(name="Nova Agent", id="agent"),
)
```
Nova 2 handles audio I/O directly with server-side VAD and low latency. The Vision Agents Bedrock plugin also automatically handles session management and resumption beyond AWS Bedrock's default session limit, enabling long-lived agent interactions.
Vision model support also got a big upgrade in v0.3, with out-of-the-box support for NVIDIA Cosmos 2 (vision_agents.plugins.nvidia), NVIDIA's new video understanding VLM, and for Roboflow (vision_agents.plugins.roboflow) object detection with cloud and local RF-DETR models.
```python
from vision_agents.plugins import nvidia

processor = nvidia.CosmosProcessor(
    model="cosmos-2",
    prompt="Describe what you see in detail",
)

agent = Agent(processors=[processor], ...)
```
```python
from vision_agents.plugins import roboflow

# Local detection with RF-DETR
processor = roboflow.RoboflowLocalDetectionProcessor(
    model_id="rfdetr-base",
    confidence_threshold=0.5,
    draw_bboxes=True,
)

agent = Agent(processors=[processor], ...)

@agent.events.subscribe
async def on_detection_completed(event: roboflow.DetectionCompletedEvent):
    for obj in event.detected_objects:
        print(f"Detected {obj.class_name} at confidence {obj.confidence}")
```
HuggingFace Inference (vision_agents.plugins.huggingface)
HuggingFace hosts one of the world's largest communities of AI builders and open-source models. Starting in v0.3, developers can run their favourite open-weight models directly in Vision Agents using our HuggingFace Inference package. Whether you're after the latest Llama release or the newest VLM, any model deployed to HuggingFace Inference is available through a single unified API:
```python
from vision_agents.plugins import huggingface

llm = huggingface.LLM(model="meta-llama/Llama-3.3-70B-Instruct")

# Or use VLM
vlm = huggingface.VLM(model="meta-llama/Llama-3.2-11B-Vision-Instruct")
```
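Either model drops into the same Agent setup used throughout this post. A sketch, reusing the Edge, STT, and TTS pieces from earlier examples (the User import path is an assumption here):

```python
from vision_agents.core import Agent, User  # User import path is assumed
from vision_agents.plugins import huggingface, getstream, deepgram, elevenlabs

# Sketch: the same Agent wiring as the earlier examples, with an open-weight
# LLM served via HuggingFace Inference.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Open Model Agent", id="agent"),
    instructions="You are a helpful voice assistant.",
    llm=huggingface.LLM(model="meta-llama/Llama-3.3-70B-Instruct"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```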
Getting Started
Try Vision Agents v0.3 in minutes:
```bash
# Clone the repo
git clone https://github.com/GetStream/Vision-Agents

# Sync deps
cd vision-agents
uv venv --python 3.12
uv sync

# Run example of your choice
cd examples/simple_agent_example
uv run simple_agent_example.py
```
Join the Community
Vision Agents is open source and community-driven. This release includes 79 commits from multiple contributors building the future of multimodal AI.
- ⭐ Star the repo on GitHub
- 📖 Read the documentation
- 💬 Join our Discord community
- 🔧 Contribute a plugin
Try the examples, deploy an agent, and let us know what you build. We're excited to see what you create with v0.3.
Built by the team at Stream with contributions from the community.
