Engineering: AI
The 2026 Python Libraries for Real-Time Multimodal Agents
Every vision-language model tutorial shows you the same thing: send an image to GPT-4o, get a description back. Ten lines of Python. Done.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }]
    )
Real applications need something different. A security camera
Read more ->
20 min read
Seeing with GPT‑4o: Building with OpenAI’s Vision Capabilities
Over the last few years, developers have gone from using language models for text-only chat to relying on them as general-purpose perception systems. You're not only building chatbots; you're building apps that use text, audio, and vision to understand and act on the world around them. GPT-4o is the most capable step yet: a single
Read more ->
14 min read
From Cameras to Action: Real‑World Applications of Vision and Speech AI
You're working in a warehouse when you see an automated forklift barreling towards a coworker. You whip out your phone and type "STOP!" into the app controlling the vehicle. You add another exclamation point to make sure it knows it's an emergency. That's not good enough, and it's not how things have to be. AI
Read more ->
9 min read
Lessons from Building an AI Football Commentator
Vision Agents is our open source framework for quickly building low-latency video AI applications on the edge. It runs on Stream’s global edge network by default, supports any edge provider and integrates with 25+ leading voice and video AI models. To put the framework to the test, we built a real-time sports commentator using stock
Read more ->
10 min read
How Machines See: Inside Vision Models and Visual Understanding APIs
Before we read, before we write, we see. The human brain devotes more processing power to vision than to any other sense. We navigate the world through sight first, and a single glance tells us more than paragraphs of description ever could. For decades, this kind of visual understanding eluded machines. Computer vision could detect
Read more ->
8 min read
Seeing Like Gemini: Building Vision Applications with Google’s Multimodal Models
Google just dropped Gemini 3. The early impression is that it's impressive, and not just with words. The coolest concepts making the rounds are the ones that showcase the fundamental trait of the Gemini family of models: multimodality. From their inception, the Gemini models have been built different. Unlike GPT-4o or Claude, which bolt vision encoders onto
Read more ->
11 min read
Staying Competitive in a Rapid-Fire AI Landscape
Velocity is one of those words that shows up in every leadership deck and every product kickoff. But in practice, it behaves more like bubbles escaping a can of La Croix. The moment you try to hold onto it, it's gone. What remains is a backlog that looks less like a roadmap and more like
Read more ->
3 min read
What is MCP: The Infrastructure Powering Agentic AI
You might not be using AI agents yet, but you will soon. They'll schedule your meetings, analyze your data, write your code, and automate your workflows. However, to accomplish any of this, they need to access your calendar, data, code, and systems. Large language models can do this the same way any software talks to
Read more ->
10 min read