Tutorials: Vision

New

Build a Voice Agent That Calls to Confirm Fraud Alerts

Build a Python voice agent with Vision Agents that places an outbound call to a cardholder, reads back a suspicious transaction, and either clears it or freezes the card, using Deepgram, an LLM, Cartesia, and Twilio.

16 min read

How to Build a Background Removal Tool with Segment Anything & Vision Agents

A step-by-step guide to building a real-time background removal tool with SAM 2, YOLO, and Vision Agents. Runs on a CPU, no GPU required.

22 min read

Gemini Live API & Lyria 3: Generate Music From Text, Phone & Video Calls

The instrumental background music in the video below is AI-generated using Lyria 3 by Google DeepMind. Lyria 3 allows anyone to generate AI music from text and image prompts. The music demos in this article take it further by adding another input modality: your voice. Let's proceed to generate your first music with Lyria

17 min read

How to Clone Any Voice in Minutes Using Voxtral TTS

What You Will Build: This tutorial demonstrates how to build an AI speech app with in-app voice cloning support. You can clone your favorite voice by supplying a reference audio clip of about 3 seconds.

11 min read

How To Design AI Voices in Minutes Using Qwen3-TTS

Before You Start: To begin, ensure that you meet these requirements and have the following credentials. Python 3.13 or a later version. An Apple Silicon Mac (recommended) or any modern laptop. Stream API credentials (for real-time audio and video communication). A Hugging Face account and access token (HF_TOKEN). A Deepgram API key (for speech-to-text). A Google

14 min read

The 6 Best On-Device TTS Models for Voice AI

When building voice AI applications, you have industry-leading cloud options for text-to-speech (TTS), such as Cartesia Sonic 3 and Grok TTS. For privacy, and to avoid sharing your business's data with these commercial providers, your team may want to use free, open-source solutions that run locally on mobile and desktop devices. Continue reading to

27 min read

Build a Restaurant Reservation AI Agent With Turbopuffer and Twilio

Let's build a restaurant reservation system that lets callers speak with a voice agent over a real-time phone call. The service will have three main features: Agent Outbound Call: The agent can act as both a customer helper and a restaurant assistant. For example, it can be configured as an AI restaurant employee that calls customers back

14 min read

Grok TTS + Vision: Build a Healthcare Appointment Agent

This step-by-step guide will help you build an AI front-desk receptionist that talks with patients, assesses their conditions, and advises whether to visit a doctor or seek online medical advice. When an agent can see the patient's condition in real time, it can make a smarter recommendation, saving patients an unnecessary trip to

15 min read