Build an AI Voice Yoga Instructor in Python

Build a real-time voice and video AI yoga instructor in Python using Vision Agents. The agent processes video from a live call to analyze users’ poses and provides feedback through spoken conversation.

Amos G.
Published November 10, 2025
Yoga AI Instructor

Large Language Models (LLMs) have improved rapidly in recent years and are widely used to build conversational applications for speech and transcription. From answering location-based questions to managing a work calendar, voice AI assistants are becoming an everyday part of both personal and professional life.

In this tutorial, we’ll take those same technologies a step further, using LLMs, real-time video analysis, and speech-to-speech APIs to create an AI yoga instructor. This intelligent voice agent can see your poses through your webcam, analyze your form in real-time, and speak personalized feedback to help improve your practice.

By the end of the tutorial, you’ll have a fully interactive AI fitness companion in Python.

The AI Yoga Instructor Overview

As the diagram below illustrates, the yoga assistant will be developed using three primary technologies.

Vision Agents will provide the backend and frontend architectures to run the AI assistant. The Gemini Live API will enable users to engage in real-time speech conversations with the yoga agent. The Ultralytics YOLO model will detect and visualize users' poses, allowing the voice agent to see them, correct them, and provide instant feedback.

AI technologies

Think of it as a friendly assistant and coach for your daily workout in the gym and at home.

How It Works

The sample project in this tutorial demonstrates the use of the Gemini Live API for speech-to-speech conversations. The voice agent/instructor uses WebSockets to establish a persistent connection with Google's Gemini servers, enabling real-time, bidirectional audio streaming.

In the demo app, the Yoga AI instructor will:

  • Guide users through their yoga practice with real-time voice instructions, introducing poses, transitions, and breathing cues.
  • Watch video from the user’s camera feed via Stream Video. Alternatively, you can implement a custom audio and video transport mechanism using WebRTC or WebSockets.
  • Use Ultralytics YOLO pose detection to analyze and correct users' body positions and movements.
  • Process video in real time using a Gemini LLM. Similarly, you can use models from OpenAI, Anthropic, or any other AI provider you prefer.
  • Provide voice feedback on a particular pose and encourage users to make workouts a habit.

The demo combines a fast object detection model (YOLO) with full real-time AI capabilities. You can apply this pattern to other video AI use cases, such as sports coaching, physical therapy, and drone monitoring, among others.

Prerequisites

To create the sample yoga instructor project in this tutorial, you will need the following tools and API credentials.

  • Python 3.13: Vision Agents requires Python 3.13 or a later version.
  • Stream API key and secret for the real-time audio and video infrastructure. Create a free Stream account, sign in to your dashboard, and generate your API key and secret.
  • Gemini API key to access an audio and text generation model from the Gemini Live API. Sign in to Google AI Studio and generate a new API key.
    • Alternatively, you can use a real-time model from OpenAI.
  • Vision Agents: An open-source video AI platform for building low-latency voice and video applications in Python. Clone the repo to try example use cases such as golf coach, geoguesser, and more.

Underlying Technologies and Frameworks

When you run the Python script for the AI yoga instructor, it takes a user's voice input from the transport mechanism in Vision Agents. The Gemini Live API processes the raw input audio for speech-to-speech and transcription interactions. Because the API is real-time, a single Gemini model handles speech-to-text, turn detection, and text-to-speech. Another approach is to use the OpenAI Realtime API or Nova Sonic, a foundation speech-to-speech model from Amazon.
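
As a minimal sketch of what that choice looks like in code, the snippet below mirrors the agent configuration used later in this tutorial: the Gemini plugin is the default, and the OpenAI option is the commented-out alternative from the full example. Importing the openai plugin assumes its extra is installed.

# Minimal sketch: choosing a realtime speech-to-speech model for the agent.
# The plugin names mirror the agent configuration later in this tutorial;
# the openai import assumes the corresponding plugin extra is installed.
from vision_agents.plugins import gemini  # , openai

# Gemini Live: one model covers speech-to-text, turn detection, and text-to-speech
llm = gemini.Realtime()

# Alternative (the commented-out option from the full example below):
# llm = openai.Realtime(fps=10, model="gpt-4o-realtime-preview")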

When practicing yoga, your device's camera captures your video and sends it to Ultralytics YOLO, which processes the video frames.
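
To get a feel for what the YOLO pose model sees, you can run it directly on your webcam with the ultralytics package, outside of Vision Agents. This standalone sketch is not part of the agent; it assumes ultralytics and opencv-python are installed.

# Standalone sketch: run YOLO11 pose estimation on webcam frames to inspect
# the keypoints the agent's processor works with.
# Assumes `pip install ultralytics opencv-python`; the model weights are
# downloaded automatically on first use.
import cv2
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")  # lightweight pose-estimation model
cap = cv2.VideoCapture(0)        # default webcam

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # Run pose detection on the current frame
    results = model(frame, conf=0.5, verbose=False)

    # results[0].keypoints holds (x, y) coordinates for each detected person
    annotated = results[0].plot()  # draw the skeleton overlay for inspection
    cv2.imshow("YOLO pose", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()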

Vision Agents Overview

Vision Agents is a free, fully open-source, Python-based AI framework that combines speech and video to build natural-sounding voice AI interaction experiences.

Vision Agents provides developers with two main architectures for building voice AI apps.

Developers can use its integrated real-time speech-to-speech (STS) APIs, such as Gemini Live and OpenAI Realtime, or STS models like Amazon Nova Sonic. Aside from the STS approach, developers can assemble custom pipelines for speech recognition and synthesis, vision, and video processing using models from leading AI providers such as ElevenLabs, Deepgram, Cartesia, Moondream, Ultralytics, and Fish Audio, among others.
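
A custom pipeline might look roughly like the hypothetical sketch below, with separate STT, LLM, and TTS components instead of one realtime model. The plugin class names (deepgram.STT, elevenlabs.TTS, openai.LLM) are illustrative assumptions, not the documented API; check the Vision Agents plugin docs for the exact names and constructor arguments.

# Hypothetical sketch of the custom-pipeline approach. The plugin class names
# below are assumptions for illustration only.
from vision_agents.core import agents
from vision_agents.core.edge.types import User
from vision_agents.plugins import getstream, deepgram, elevenlabs, openai

agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Pipeline Agent", id="pipeline_agent"),
    instructions="You are a helpful voice assistant.",
    stt=deepgram.STT(),                    # speech recognition
    llm=openai.LLM(model="gpt-4o-mini"),   # text reasoning
    tts=elevenlabs.TTS(),                  # speech synthesis
)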

Video Image Processors Overview

Processors offer powerful capabilities for transforming audio, images, and video into valuable real-time insights in AI applications. They help analyze images and video for use in vision and physical AI projects.

To integrate processors with voice AI, you can use models from industry-leading AI platforms like Ultralytics, Roboflow, and Moondream. Gemini Robotics is another model that can be used for labeling objects in images and plotting their 2D points.

The following are some of the application areas of processors:

  • API Calls and State: Use them to maintain additional state, such as the score or stats of a video game or sports match.
  • Analyzing Video Images: Detect poses and recognize objects in real-time video (see the pose-analysis sketch at the end of this section).
  • Image and Video Capture: Support AI-driven image and video capture in your app.

The Vision Agents AI framework makes it easy to integrate any of the above processor solutions with your app. Check out the how-to guide to learn more.
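
To turn raw keypoints into actionable feedback, such as checking whether a front knee is bent at roughly 90 degrees in Warrior II, a processor ultimately needs some geometry on the detected joints. The helper below is a hypothetical, framework-agnostic example of that idea using plain NumPy; it is not part of the Vision Agents processor API, and the sample coordinates are made up.

# Hypothetical helper: compute the angle at a joint from three 2D keypoints
# (e.g. hip-knee-ankle) returned by a pose model. Pure NumPy; independent of
# any specific processor API.
import numpy as np

def joint_angle(a, b, c) -> float:
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# Example with made-up pixel coordinates for hip, knee, and ankle
hip, knee, ankle = (320, 240), (320, 340), (420, 345)
angle = joint_angle(hip, knee, ankle)
print(f"Front knee angle: {angle:.1f} degrees")
if abs(angle - 90) > 15:
    print("Feedback: bend the front knee closer to a right angle.")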

Set up a New Python Project and Install Dependencies

  1. Ensure you have Python 3.13 or a later version installed on your machine and initialize a new project with uv.

uv init

The above command will generate a new Python project and configure your virtual environment.

  2. Add a .env file to the project's root to store the following API credentials.

# Set API Credentials
GEMINI_API_KEY=...
ULTRALYTICS_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://pronto-staging.getstream.io

You can also use the export command to store the above credentials in your system's shell profile.

export YOUR_API_CREDENTIAL=...
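
Before wiring up the agent, it can save debugging time to confirm the credentials are visible to Python. The optional snippet below uses python-dotenv and os.getenv, the same loading approach the main script uses later; it is only a sanity check, not part of the required code.

# Optional sanity check: confirm the required credentials load from .env
# (or from your shell profile) before starting the agent.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

for key in ("GEMINI_API_KEY", "STREAM_API_KEY", "STREAM_API_SECRET"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")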

  3. Next, install the core Vision Agents SDK and the required project plugins for the Yoga AI agent with the following commands.

uv add vision-agents
uv add "vision-agents[getstream, gemini]"
uv pip install vision-agents-plugins-ultralytics

We now have the audio/video edge transport, real-time voice LLM, and pose detection plugins installed.
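
You can quickly confirm the installation resolved correctly by importing the plugins; the import paths below match the ones used in the agent script later in this tutorial.

# Quick check that the SDK and plugins import correctly after installation.
from vision_agents.core import agents
from vision_agents.plugins import getstream, gemini, ultralytics

print("Vision Agents and plugins imported successfully")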

Create Instructions for the Yoga Voice Agent

When working with Vision Agents, you can steer the voice assistant's output to suit a specific need by adding instructions when instantiating an agent programmatically. For a basic voice demo, you can embed the instructions directly in the agent's definition. Since the Yoga AI instructor requires detailed instructions for both beginner and advanced poses, as well as safety and accuracy guidance, it is recommended to define these instructions in a separate Markdown file.

In your project's root directory, create a new Markdown file named yoga_instructor_guide.md and fill it with content that defines the voice agent's core responsibilities and how it should interact with users.
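
The exact content is up to you. Below is a short, hypothetical example of what yoga_instructor_guide.md could contain, covering the instructor's role, responsibilities, and safety rules.

# Yoga Instructor Guide

## Role
You are a calm, encouraging yoga instructor. Guide users through poses,
transitions, and breathing in short, clear sentences.

## Responsibilities
- Introduce each pose by name and describe how to enter it step by step.
- Use the pose keypoints you receive to comment on alignment (knees, hips,
  shoulders, spine) and suggest one correction at a time.
- Offer beginner and advanced variations when asked.

## Safety
- Never push users into pain; remind them to back off if anything hurts.
- Encourage rest in Child's Pose whenever a user seems fatigued.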

In the next section, you can reference yoga_instructor_guide.md when you create a new agent.

agent = agents.Agent(
    instructions="Read @yoga_instructor_guide.md"
)

Configure the Yoga Voice Instructor in Python

In the root directory of your Python project, find main.py and replace its content with the following sample code.

import asyncio
import logging
from uuid import uuid4

from dotenv import load_dotenv

from vision_agents.core.edge.types import User
from vision_agents.core import agents
from vision_agents.plugins import getstream, ultralytics, gemini

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Read the yoga instructor guide
with open("yoga_instructor_guide.md", "r") as f:
    YOGA_INSTRUCTOR_INSTRUCTIONS = f.read()


async def start_yoga_instructor() -> None:
    """
    Start the yoga instructor agent with real-time pose detection
    """
    logger.info("🧘 Starting Yoga AI Instructor Agent...")

    # Create the agent with YOLO pose detection
    agent = agents.Agent(
        edge=getstream.Edge(),  # Stream's edge for low-latency video transport
        agent_user=User(name="AI Yoga Instructor", id="yoga_instructor_agent"),
        instructions="Read @yoga_instructor_guide.md",
        # Choose your LLM - uncomment your preferred option:
        # Option 1: Gemini Realtime (good for vision analysis)
        llm=gemini.Realtime(),
        # Option 2: OpenAI Realtime (alternative option)
        # llm=openai.Realtime(fps=10, model="gpt-4o-realtime-preview"),
        # Add YOLO pose detection processor
        processors=[
            ultralytics.YOLOPoseProcessor(
                model_path="../../yolo11n-pose.pt",  # YOLO pose detection model
                conf_threshold=0.5,  # Confidence threshold for detection
                enable_hand_tracking=True,  # Enable hand keypoint detection for detailed feedback
            )
        ],
    )

    logger.info("✅ Agent created successfully")

    # Create the agent user in the system
    await agent.create_user()
    logger.info("✅ Agent user created")

    # Create a call (room) for the video session
    call_id = str(uuid4())
    call = agent.edge.client.video.call("default", call_id)
    logger.info(f"✅ Call created with ID: {call_id}")

    # Open the demo UI in browser
    await agent.edge.open_demo(call)
    logger.info("🌐 Demo UI opened in browser")

    # Join the call and start the session
    logger.info("🎥 Agent joining the call...")
    with await agent.join(call):
        # Initial greeting and instructions
        await agent.llm.simple_response(
            text=(
                "Namaste! 🧘‍♀️ I'm your AI yoga instructor with a soft Scottish accent. "
                "I'll be guiding you through your practice today with the help of pose analysis. "
                "I can help you with standing poses, seated poses, transitions, and breathing. "
                "Just step onto your mat and show me what you'd like to work on. "
                "Remember to breathe, ground yourself, and listen to your body. Let's begin!"
            )
        )

        logger.info("🧘 Session active - providing real-time yoga feedback...")

        # Run until the call ends
        await agent.finish()

    logger.info("👋 Session ended - Namaste!")


if __name__ == "__main__":
    print("=" * 70)
    print("🧘 AI YOGA INSTRUCTOR 🧘")
    print("=" * 70)
    print("\n📋 Features:")
    print("   ✓ Real-time pose detection and alignment tracking")
    print("   ✓ Expert guidance on yoga asanas (poses)")
    print("   ✓ Personalized feedback on form and technique")
    print("   ✓ Breath synchronization coaching")
    print("   ✓ Safe transition guidance")
    print("\n🎯 Poses Supported:")
    print("   • Standing Poses (Mountain, Warrior, Tree, Triangle, etc.)")
    print("   • Seated Poses (Lotus, Pigeon, Forward Bends, etc.)")
    print("   • Balance Poses (Eagle, Half Moon, Standing Splits, etc.)")
    print("   • Flexibility Poses (Splits, Bound Angle, etc.)")
    print("\n🚀 Starting agent...\n")

    asyncio.run(start_yoga_instructor())

In summary, we:

  • Configure a new agent with edge=getstream.Edge() for low-latency audio and video transport.

  • Instruct the agent to read yoga_instructor_guide.md for detailed guidance with instructions="Read @yoga_instructor_guide.md".

  • Configure the voice agent to use the Gemini Live API via llm=gemini.Realtime() for real-time communication.

  • Finally, equip the agent with Ultralytics YOLO for workout pose detection.

processors=[
    ultralytics.YOLOPoseProcessor(
        model_path="../../yolo11n-pose.pt",  # YOLO pose detection model
        conf_threshold=0.5,  # Confidence threshold for detection
        enable_hand_tracking=True,  # Enable hand keypoint detection for detailed feedback
    )
],

Run the Yoga Voice Instructor

Navigate to the project's root directory and run main.py using uv.

uv run main.py

When the Python script runs successfully, you should see output similar to the video preview above, and you can engage with the Yoga AI instructor in real time as it detects and guides you through beginner and advanced poses.

Join the Open-Source Community

In this tutorial, you saw how easy it is to build a yoga voice AI instructor in Python using Vision Agents. Beyond yoga, the same setup can be easily adapted for other use cases by swapping the Ultralytics YOLO processor with alternatives like Roboflow or Moondream to detect and track objects in live video.

For instance, you could use these plugins to identify players in a soccer match or analyze movement in other sports.
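
As a rough illustration of that swap, only the processors list in the agent configuration needs to change. The roboflow and moondream plugin class names and arguments below are assumptions for illustration, so check the Vision Agents plugin docs for the real ones; the YOLOPoseProcessor entry matches the configuration used earlier.

# Hypothetical sketch: swapping the pose processor for an object-detection
# processor to track players in a soccer match. The commented-out plugin
# class names and arguments are illustrative assumptions, not the documented API.
from vision_agents.plugins import ultralytics  # , roboflow, moondream

processors = [
    # Yoga tutorial: body keypoints for pose feedback
    ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", conf_threshold=0.5),

    # Sports analysis (illustrative alternatives):
    # roboflow.DetectionProcessor(model_id="soccer-players/1"),
    # moondream.VisionProcessor(prompt="Track the players and the ball."),
]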

Vision Agents supports more than 30 AI service integrations, and its ecosystem of third-party plugins continues to grow every week, making it a powerful platform for developing speech and video AI experiences.

Join the Discord community for help with any questions or problems you encounter with the framework while building your voice AI app. And check out PyPI for the available Python packages to extend your Vision Agents implementation.

If you are interested in the future of Vision Agents, refer to the roadmap in its repository, watch for updates, and give it a ⭐️.
