
Engineering: AI

How Text-to-Speech Works: Neural Models, Latency, and Deployment

Not long ago, text-to-speech (TTS) was a laughingstock: robotic, obviously synthetic output that made customer service jokes write themselves and relegated TTS to accessibility contexts where users had no alternative. Now, you may have listened to text-to-speech today without even realizing it. AI-generated podcasts, automated customer service calls, voice assistants that actually sound like assistants.
17 min read

Edge-Optimized Speech Workflows: Combining Deepgram Nova-3 STT with Fish Speech V1.5 TTS

AI won’t stay online. It won’t stay on your laptop. It won’t stay centralized. It will move to every device and to the edge of every network, into your earbuds, your car, your factory floor, and your doorbell. This opens up a remarkable number of use cases. A fitness coach who listens continuously, counts your
15 min read
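
The excerpt only hints at the architecture, so here is a minimal sketch of the loop such an edge deployment runs. Everything in it is a hypothetical stand-in: transcribe_stream and synthesize are placeholders, not the actual Deepgram Nova-3 or Fish Speech V1.5 client APIs, which the full post covers.

from typing import Callable, Iterable, Iterator

def transcribe_stream(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical STT stand-in: yield one final transcript per chunk."""
    for chunk in audio_chunks:
        yield f"<transcript of {len(chunk)} audio bytes>"

def synthesize(text: str) -> bytes:
    """Hypothetical TTS stand-in: return synthesized audio for `text`."""
    return text.encode()

def speech_loop(audio_chunks: Iterable[bytes],
                respond: Callable[[str], str]) -> Iterator[bytes]:
    """The edge pattern: audio stays on-device; each transcript drives
    application logic, and the reply is spoken back locally."""
    for transcript in transcribe_stream(audio_chunks):
        yield synthesize(respond(transcript))

# A toy "fitness coach" built on the loop: count reps, talk back.
reps = 0
def coach(transcript: str) -> str:
    global reps
    reps += 1
    return f"That's rep {reps}. Keep going!"

for audio_out in speech_loop([b"\x00" * 320] * 3, coach):
    pass  # play audio_out on the device speaker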

Visual Intelligence in Claude: Interpreting Documents and Structured Content

Claude isn’t the model most users turn to when they need visual capabilities. Rather than optimizing primarily for object detection or scene description, Claude processes visual content through the same reasoning architecture it uses for text. This design choice has significant implications for developers: Claude excels at tasks requiring interpretation and explanation rather than pure perception.
15 min read
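
The claim about interpretation over perception is easy to see in a call. Here is a minimal sketch using the Anthropic Python SDK; the model name, file name, and prompt are illustrative assumptions, not taken from the post.

import base64
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# "invoice.png" is a placeholder document image.
with open("invoice.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: any vision-capable Claude model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": img_b64,
            }},
            # Ask for interpretation over the document, not object detection.
            {"type": "text",
             "text": "Summarize this invoice and flag any line items that look inconsistent."},
        ],
    }],
)
print(message.content[0].text)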

Advanced Visual Reasoning with DeepSeek-VL and InternVL3

There's an obvious tendency to reach for the latest proprietary model when you need advanced AI. These are the frontier models, after all, and thus deemed the “best.” But best really depends on what you're optimizing for. Proprietary APIs charge per request. For video workloads, that means per frame, and costs compound fast. They also
17 min read
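
To make "costs compound fast" concrete, here is a back-of-envelope sketch; the per-image price, frame rate, and camera count are illustrative assumptions, not quoted rates.

# Back-of-envelope: per-frame API pricing vs. video volume.
# All numbers below are illustrative assumptions, not quoted rates.
price_per_image = 0.002      # assumed $/image for a proprietary vision API
fps_analyzed = 1             # sample one frame per second
cameras = 10
hours_per_day = 24

frames_per_day = fps_analyzed * 3600 * hours_per_day * cameras
daily_cost = frames_per_day * price_per_image
print(f"{frames_per_day:,} frames/day -> ${daily_cost:,.2f}/day "
      f"(${daily_cost * 30:,.2f}/month)")
# 864,000 frames/day -> $1,728.00/day ($51,840.00/month)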

Vision Agents v0.3: Deployments, HTTP Support, & 10 New Plugins

Two months after v0.2, we're excited to share Vision Agents v0.3, our next significant milestone toward running agents in production at scale. While v0.2 introduced the foundation for building realtime multimodal AI agents, v0.3 takes these agents from prototype to production. This release brings the infrastructure you need to deploy agents at scale: HTTP APIs, observability,
7 min read

The 2026 Python Libraries for Real-Time Multimodal Agents

Every vision-language model tutorial shows you the same thing: send an image to GPT-4o, get a description back. Ten lines of Python. Done.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "What's in this image?"}
        ]
    }]
)

Real applications need something different. A security camera
20 min read

Seeing with GPT-4o: Building with OpenAI’s Vision Capabilities

Over the last few years, developers have gone from using language models for text-only chat to relying on them as general-purpose perception systems. You're not only building chatbots; you're building apps that use text, audio, and vision to understand and act on the world around them. GPT-4o is the most capable step yet: a single
14 min read

From Cameras to Action: Real-World Applications of Vision and Speech AI

You're working in a warehouse when you see an automated forklift barreling towards a coworker. You whip out your phone and type "STOP!" into the app controlling the vehicle. You add another exclamation point to make sure it knows it's an emergency. That's not good enough, and it's not how things have to be. AI
9 min read