
Build a Local AI Agent with Qwen 3.5 Small on macOS

3 min read
Amos G.
Published March 17, 2026

Qwen 3.5 Small is a new family of lightweight, high-performance models from Alibaba (0.8B, 2B, 4B, and 9B parameters), now available on Ollama.

These models support multimodal input, native tool calling, and strong reasoning, all while running efficiently on laptops, Macs, and even mobile/IoT devices.
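Native tool calling means you can hand the model a function schema and let it decide when to call it. As a sketch, a tool definition in the OpenAI-style JSON format that Ollama accepts might look like this (the `get_weather` function is a hypothetical example, not part of this tutorial's code):

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": { "type": "string", "description": "City name" }
      },
      "required": ["city"]
    }
  }
}
```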

In this demo, the agent runs completely locally using the qwen3.5:2b model and accurately describes what it sees in the live camera feed:

“I see a man in a black cap and a dark jacket. He appears to be looking down and not engaging with the camera.”

Here’s how to build the same locally running vision + voice agent in Python using Qwen 3.5 Small and Vision Agents in less than five minutes.

What You’ll Build

  • A fully local vision + voice agent that analyzes your camera feed in real time
  • Runs entirely on your Mac, laptop, or even mobile device using Ollama + Qwen 3.5 Small
  • Combines local multimodal understanding with natural speech input and output
  • No cloud LLM calls; everything stays on your machine

The Stack

  • Qwen 3.5 Small running locally via Ollama (multimodal LLM)
  • Stream (getstream) for real-time video and audio transport
  • Deepgram for speech-to-text
  • ElevenLabs for text-to-speech
  • Smart Turn for turn detection

Requirements

  • Ollama installed and running (ollama pull qwen3.5:2b for the lightweight demo, or qwen3.5:9b to match the code below)
  • API keys for: Stream, Deepgram, ElevenLabs
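The script in Step 2 loads these keys from a .env file at the project root, so a minimal .env for this setup would contain (values elided):

```
STREAM_API_KEY=...
STREAM_API_SECRET=...
DEEPGRAM_API_KEY=...
ELEVENLABS_API_KEY=...
```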

Step 1: Install Vision Agents + Plugins

```bash
uv add vision-agents
uv add "vision-agents[getstream, deepgram, elevenlabs, smart-turn]"
```
Step 2: Full Working Code (main.py)

```python
"""Video analysis agent using Ollama (qwen3.5:9b).

Run from the project root so the installed vision_agents is used, e.g.:
    uv run python plugins/ollama/video_analysis_agent.py run
"""

import logging
from pathlib import Path

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, deepgram, elevenlabs, smart_turn
from vision_agents.plugins.ollama import VLM as OllamaVLM

# Load .env from project root so STREAM_* and other keys are always found
_project_root = Path(__file__).resolve().parent.parent.parent
load_dotenv(_project_root / ".env")

logger = logging.getLogger(__name__)


def create_agent(**kwargs) -> Agent:
    """Create a video analysis agent using Ollama with qwen3.5:9b."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Video Analyst", id="agent"),
        instructions=(
            "You are a video analysis assistant. Analyze the video feed and "
            "answer questions about what you see. Be detailed and descriptive "
            "in your observations."
        ),
        llm=OllamaVLM(
            model="qwen3.5:9b",
            fps=1,
            frame_buffer_seconds=10,
        ),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        turn_detection=smart_turn.TurnDetection(),
    )
    agent._audio_buffer_limit_ms = 90_000
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join a call and start video analysis."""
    try:
        await agent.create_user()
        call = await agent.create_call(call_type, call_id)
        async with agent.join(call):
            await agent.simple_response("Tell the user a story about the video.")
            await agent.finish()
    except Exception as e:  # noqa: BLE001
        from getstream.video.rtc.connection_utils import SfuConnectionError

        if isinstance(e, SfuConnectionError):
            cause = e.__cause__ or e
            logger.error(
                "GetStream SFU connection failed: %s. Check STREAM_API_KEY and "
                "STREAM_API_SECRET in .env, and that your network allows WebRTC.",
                cause,
            )
        raise


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
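The `fps=1, frame_buffer_seconds=10` settings mean the VLM sees roughly one frame per second over a ten-second sliding window, so at most 10 frames are held at once. Here is a minimal sketch of that sampling behavior in plain Python; this is an illustration of the idea, not the plugin's actual internals:

```python
from collections import deque


class FrameBuffer:
    """Sliding window of sampled frames, mirroring fps / frame_buffer_seconds."""

    def __init__(self, fps: int, buffer_seconds: int):
        self.interval = 1.0 / fps  # minimum seconds between kept frames
        self.frames = deque(maxlen=fps * buffer_seconds)  # holds fps*seconds frames
        self._last_kept = float("-inf")

    def offer(self, frame, timestamp: float) -> bool:
        """Keep the frame only if enough time passed since the last kept one."""
        if timestamp - self._last_kept >= self.interval:
            self.frames.append(frame)
            self._last_kept = timestamp
            return True
        return False


buf = FrameBuffer(fps=1, buffer_seconds=10)
# Simulate a 30 fps camera running for 20 seconds: 600 frames offered.
kept = sum(buf.offer(f"frame-{i}", i / 30) for i in range(600))
print(kept, len(buf.frames))  # 20 frames sampled; only the last 10 retained
```

Raising `fps` gives the model a finer-grained view of motion at the cost of more inference work per window.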

Step 3: Run It

```bash
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://demo.visionagents.ai
export DEEPGRAM_API_KEY=...
export ELEVENLABS_API_KEY=...
uv run main.py run
```

A browser tab will open. Join the call, point your camera at something, and ask what it sees.

All model inference happens locally on your device; speech-to-text, text-to-speech, and call transport go through Deepgram, ElevenLabs, and Stream.

Why We Love This Setup

Qwen 3.5 Small gives you surprisingly strong multimodal performance for its size, and Ollama makes running it locally easy.

The Vision Agents Ollama plugin lets you use these tiny models with the same clean API as cloud LLMs.

You get full privacy, zero cloud costs, and the ability to run vision and voice agents even on a MacBook or lightweight hardware.

Try it on your Mac! Pull the 0.8B or 2B model and see how well it runs locally. 💻
