
Create Speech-to-Text Experiences with ElevenLabs Scribe v2 Realtime & Vision Agents

2 min read
Amos G.
Published February 6, 2026

ElevenLabs released Scribe v2 Realtime, an ultra-low latency speech-to-text model with ~150ms end-to-end transcription, supporting 90+ languages and claiming the lowest Word Error Rate in benchmarks for major languages and accents.

It's built specifically for agentic apps, live meetings, note-taking, and conversational AI, where every millisecond and every word matters.

In this demo, Scribe v2 Realtime transcribes both user speech and the agent's own voice output, enabling perfect real-time note-taking:

"I just want to use it for real-time note taking." → "OK, that's a great use case. Real-time note-taking can be super helpful."

— all with natural flow and no noticeable lag.

Here’s how to build the same speech-to-text experience yourself in under five minutes.

What You’ll Build

  • A voice AI app with ultra-accurate, real-time transcription for both user input and agent responses
  • Ideal for AI meeting assistants, live note-taking, interviews, or any app needing reliable live captions and understanding
  • Smooth, natural conversations with low-latency speech-to-text (STT) that captures every nuance, accent, and filler word
  • Custom pipeline using ElevenLabs Scribe v2 Realtime for STT, paired with Gemini LLM and ElevenLabs TTS

The Stack

Requirements (API Keys)

You’ll need API keys from:

  • ElevenLabs (for Scribe v2 Realtime STT and TTS)
  • Google AI Studio (for Gemini LLM)
  • Stream (API key & secret)
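Before wiring up the agent, it helps to confirm that all four keys are actually set; the missing-key errors you get otherwise can be cryptic. Here is a minimal sketch (the `missing_keys` helper and `REQUIRED` list are our own, not part of any SDK) that checks the environment variables Step 3 below exports:

```python
import os

# The environment variables the tutorial's Step 3 exports.
REQUIRED = [
    "ELEVENLABS_API_KEY",
    "GOOGLE_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
]

def missing_keys(env=os.environ):
    """Return the names of any required keys that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        raise SystemExit(f"Missing environment variables: {', '.join(absent)}")
    print("All API keys present.")
```

Run it once before `uv run main.py run` to fail fast with a readable message instead of a mid-call authentication error.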

Step 1: Set Up the Project

```shell
uv init scribe-realtime-agent
cd scribe-realtime-agent
uv add vision-agents
uv add "vision-agents[getstream, gemini, elevenlabs, smart-turn]"
```

Step 2: Full Working Code (main.py)

```python
from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini, elevenlabs


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Be concise.",
        llm=gemini.LLM("gemini-2.5-flash"),
        stt=elevenlabs.STT(model_id="scribe_v2_realtime"),
        tts=elevenlabs.TTS(),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Greet the user")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

Step 3: Run It

```shell
export ELEVENLABS_API_KEY=...
export GOOGLE_API_KEY=...
export STREAM_API_KEY=...
export STREAM_API_SECRET=...
export EXAMPLE_BASE_URL=https://demo.visionagents.ai/

uv run main.py run

# Or
uv run main.py serve
```

Join the call and speak naturally. Scribe v2 Realtime will transcribe everything instantly and accurately.

Why We Love This Stack

Vision Agents makes it seamless to swap in ElevenLabs Scribe v2 Realtime as your STT in under 100 lines, handling streaming, turn detection, and dual transcription (user + agent audio) automatically.

Scribe v2 delivers accuracy at ultra-low latency, making it perfect for production voice AI.

Everything here is open source except the hosted model APIs.

Give it a spin for your next meeting app or voice tool! 📝
