
Build a Local AI Agent with Qwen 3.5 Small on macOS

3 min read
Amos G.
Published March 17, 2026

Qwen 3.5 Small is a new family of lightweight, high-performance models from Alibaba (0.8B, 2B, 4B, and 9B parameters), now available on Ollama.

These models support multimodal input, native tool calling, and strong reasoning, all while running efficiently on laptops, Macs, and even mobile/IoT devices.
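Native tool calling means you can hand the model a function schema and let it decide when to call it. As a sketch, a tool definition in the OpenAI-style JSON format that Ollama accepts might look like this (the `get_weather` function is a hypothetical example, not part of this tutorial's code):

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": { "type": "string", "description": "City name" }
      },
      "required": ["city"]
    }
  }
}
```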

In this demo, the agent runs completely locally using the qwen3.5:2b model and accurately describes what it sees in the live camera feed:

“I see a man in a black cap and a dark jacket. He appears to be looking down and not engaging with the camera.”

Here’s how to build the same locally running vision + voice agent in Python using Qwen 3.5 Small and Vision Agents in less than five minutes.

What You’ll Build

  • A fully local vision + voice agent that analyzes your camera feed in real time
  • Runs entirely on your Mac, laptop, or even mobile device using Ollama + Qwen 3.5 Small
  • Combines local multimodal understanding with natural speech input and output
  • No cloud LLM calls; everything stays on your machine

The Stack

  • Qwen 3.5 Small running locally via Ollama (multimodal LLM)
  • Stream (getstream) for real-time video and audio transport
  • Deepgram for speech-to-text
  • ElevenLabs for text-to-speech
  • Smart Turn for turn detection

Requirements

  • Ollama installed and running (ollama pull qwen3.5:2b for the lightweight demo, or qwen3.5:9b to match the code below)
  • API keys for: Stream, Deepgram, ElevenLabs
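The script in Step 2 loads these keys from a .env file at the project root, so a minimal .env for this setup would contain (values elided):

```
STREAM_API_KEY=...
STREAM_API_SECRET=...
DEEPGRAM_API_KEY=...
ELEVENLABS_API_KEY=...
```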

Step 1: Install Vision Agents + Plugins

```bash
uv add vision-agents
uv add "vision-agents[getstream, deepgram, elevenlabs, smart-turn]"
```
Step 2: Full Working Code (main.py)

```python
"""Video analysis agent using Ollama (qwen3.5:9b).

Run from the project root so the installed vision_agents is used, e.g.:
    uv run python plugins/ollama/video_analysis_agent.py run
"""

import logging
from pathlib import Path

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, deepgram, elevenlabs, smart_turn
from vision_agents.plugins.ollama import VLM as OllamaVLM

# Load .env from project root so STREAM_* and other keys are always found
_project_root = Path(__file__).resolve().parent.parent.parent
load_dotenv(_project_root / ".env")

logger = logging.getLogger(__name__)


def create_agent(**kwargs) -> Agent:
    """Create a video analysis agent using Ollama with qwen3.5:9b."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Video Analyst", id="agent"),
        instructions=(
            "You are a video analysis assistant. Analyze the video feed and "
            "answer questions about what you see. Be detailed and descriptive "
            "in your observations."
        ),
        llm=OllamaVLM(
            model="qwen3.5:9b",
            fps=1,
            frame_buffer_seconds=10,
        ),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        turn_detection=smart_turn.TurnDetection(),
    )
    agent._audio_buffer_limit_ms = 90_000
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join a call and start video analysis."""
    try:
        await agent.create_user()
        call = await agent.create_call(call_type, call_id)
        async with agent.join(call):
            await agent.simple_response("Tell the user a story about the video.")
            await agent.finish()
    except Exception as e:  # noqa: BLE001
        from getstream.video.rtc.connection_utils import SfuConnectionError

        if isinstance(e, SfuConnectionError):
            cause = e.__cause__ or e
            logger.error(
                "GetStream SFU connection failed: %s. Check STREAM_API_KEY and "
                "STREAM_API_SECRET in .env, and that your network allows WebRTC.",
                cause,
            )
        raise


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
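The `fps=1, frame_buffer_seconds=10` settings mean the VLM sees roughly one frame per second over a ten-second sliding window, so at most 10 frames are held at once. Here is a minimal sketch of that sampling behavior in plain Python; this is an illustration of the idea, not the plugin's actual internals:

```python
from collections import deque


class FrameBuffer:
    """Sliding window of sampled frames, mirroring fps / frame_buffer_seconds."""

    def __init__(self, fps: int, buffer_seconds: int):
        self.interval = 1.0 / fps  # minimum seconds between kept frames
        self.frames = deque(maxlen=fps * buffer_seconds)  # holds fps*seconds frames
        self._last_kept = float("-inf")

    def offer(self, frame, timestamp: float) -> bool:
        """Keep the frame only if enough time passed since the last kept one."""
        if timestamp - self._last_kept >= self.interval:
            self.frames.append(frame)
            self._last_kept = timestamp
            return True
        return False


buf = FrameBuffer(fps=1, buffer_seconds=10)
# Simulate a 30 fps camera running for 20 seconds: 600 frames offered.
kept = sum(buf.offer(f"frame-{i}", i / 30) for i in range(600))
print(kept, len(buf.frames))  # 20 frames sampled; only the last 10 retained
```

Raising `fps` gives the model a finer-grained view of motion at the cost of more inference work per window.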

Step 3: Run It

```bash
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://demo.visionagents.ai
export DEEPGRAM_API_KEY=...
export ELEVENLABS_API_KEY=...
uv run main.py run
```

A browser tab will open. Join the call, point your camera at something, and ask what it sees.

All model inference happens locally on your device; speech-to-text, text-to-speech, and call transport go through Deepgram, ElevenLabs, and Stream.

Why We Love This Setup

Qwen 3.5 Small gives you surprisingly strong multimodal performance for its size, and Ollama makes running it locally easy.

The Vision Agents Ollama plugin lets you use these tiny models with the same clean API as cloud LLMs.

You get full privacy, zero cloud costs, and the ability to run vision and voice agents even on a MacBook or lightweight hardware.

Try it on your Mac! Pull the 0.8B or 2B model and see how well it runs locally. 💻
