Transcribing Calls

Transcription converts spoken conversations into readable text that can be saved, searched, and processed. Enabling transcription makes video calls more accessible to deaf or hard-of-hearing participants while also creating valuable documentation of meetings: once your calls are transcribed, you can easily search past discussions and reference specific points that were made. It is particularly helpful for participants who miss parts of the conversation or need to multitask during calls.

In-Call Transcription processes audio in real-time as the conversation happens, enabling live captions and immediate AI responses. The Stream Python AI SDK specializes in this approach for low-latency interactions.

Post-Call Transcription processes recorded audio after the call ends, offering higher accuracy but without the immediate benefits of real-time transcription.

The Stream Python AI SDK makes it easy to add transcription capabilities to your video calls using various Speech-to-Text (STT) providers like Deepgram and Moonshine. These providers can handle different accents, languages, and speaking styles while maintaining high accuracy in real-time transcription. This opens up possibilities for AI integration, allowing automated systems to understand and respond to spoken content in your applications.
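
For example, switching providers is typically just a matter of constructing a different plugin. The Deepgram import below matches the example later in this section; the Moonshine module path is an assumption that mirrors the Deepgram plugin’s layout, so verify it against the SDK:

from getstream.plugins.deepgram.stt import DeepgramSTT
# Assumed module path, mirroring the Deepgram plugin (verify against the SDK):
# from getstream.plugins.moonshine.stt import MoonshineSTT

stt = DeepgramSTT()     # reads its API key from the environment
# stt = MoonshineSTT()  # assumed on-device alternative with the same event interface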

In this section, we explore transcribing a video call using an AI plugin available in the SDK.

Basics

The SDK contains an abstract STT class that ensures all transcription plugins expose the same interface.

You as the developer are responsible for passing along the audio that you want transcribed. Let’s look at the methods available for transcription providers, as well as the events that accompany them.
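
To make the contract concrete, here is a simplified sketch of the shape such a plugin takes. This is illustrative only, not the SDK’s actual class definition:

from abc import ABC, abstractmethod

class STT(ABC):
    """Every transcription plugin accepts audio the same way and emits the
    same events: "transcript", "partial_transcript", and "error"."""

    @abstractmethod
    async def process_audio(self, pcm, user) -> None:
        """Feed a chunk of PCM audio from a participant into the provider."""

    @abstractmethod
    async def close(self) -> None:
        """Flush any buffered audio and release the provider connection."""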

Methods

Process Audio

Once you join the call, you can listen to the connection for audio events. You can then pass along the audio events for the STT class to process:

from getstream.video import rtc
from getstream.video.rtc.track_util import PcmData

async with await rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        # Forward each audio frame to the STT provider for transcription
        await stt.process_audio(pcm, user)

Close

You can close the STT connection with the close() coroutine:

await stt.close()
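
Since a call can end with an exception in flight, one reasonable pattern (also used in the full example below) is to close the provider in a finally block so it is released either way:

try:
    # ... join the call and feed audio to the provider ...
    await connection.wait()
finally:
    await stt.close()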

Events

Transcript Event

The transcript event is emitted per utterance:

@stt.on("transcript")
async def on_transcript(text: str, user: any, metadata: dict):
    # Process transcript event here
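
Since this event fires once per finalised utterance, it is a natural hook for building a saved record of the call. A minimal sketch, where the transcript_lines list is our own illustrative addition:

transcript_lines = []

@stt.on("transcript")
async def on_transcript(text: str, user, metadata: dict):
    # Keep every finalised utterance so the conversation can be saved later
    transcript_lines.append(f"{user}: {text}")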

Partial Transcript Event

The partial transcript event is fired in real time as intermediate (partial) transcriptions are generated:

@stt.on("partial_transcript")
async def on_partial_transcript(text: str, user: any, metadata: dict):
    # Process partial transcript event here

Error Event

If an error occurs, an error event is fired:

@stt.on("error")
async def on_stt_error(error):
    # Process error event here
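
In production you will likely want these errors in your logs rather than printed to stdout. A small sketch using Python’s standard logging module:

import logging

logger = logging.getLogger("transcription")

@stt.on("error")
async def on_stt_error(error):
    # Record the provider error for later inspection
    logger.error("STT error: %s", error)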

Example

To transcribe a call, we need to complete these steps:

  1. Initialise the Stream client
  2. Create a call
  3. Initialise the STT provider (Deepgram is used here)
  4. Listen for audio events and pass them along to the STT provider for processing

This is how it looks in code:

import asyncio
import time
import uuid
from dotenv import load_dotenv

from getstream.stream import Stream
from getstream.models import UserRequest
from getstream.video import rtc
from getstream.video.rtc.track_util import PcmData
from getstream.plugins.deepgram.stt import DeepgramSTT

async def main():
    # Load environment variables
    load_dotenv()

    # Initialize Stream client from ENV
    client = Stream.from_env()

    # Create a unique call ID for this session
    call_id = str(uuid.uuid4())

    # Create a user and a token so a participant can join the call
    user_id = f"user-{uuid.uuid4()}"
    client.upsert_users(UserRequest(id=user_id, name="My User"))
    user_token = client.create_token(user_id, expiration=3600)

    bot_user_id = f"transcription-bot-{uuid.uuid4()}"
    client.upsert_users(UserRequest(id=bot_user_id, name="Transcription Bot"))

    # Create the call
    call = client.video.call("default", call_id)
    call.get_or_create(data={"created_by_id": bot_user_id})

    # Initialize Deepgram STT (api_key comes from .env)
    stt = DeepgramSTT()

    try:
        async with await rtc.join(call, bot_user_id) as connection:

            # Set up transcription handlers
            @connection.on("audio")
            async def on_audio(pcm: PcmData, user):
                # Process audio through Deepgram STT
                await stt.process_audio(pcm, user)

            @stt.on("transcript")
            async def on_transcript(text: str, user, metadata: dict):
                timestamp = time.strftime("%H:%M:%S")
                user_info = user if user else "unknown"
                print(f"[{timestamp}] {user_info}: {text}")
                if metadata.get("confidence"):
                    print(f"    └─ confidence: {metadata['confidence']:.2%}")

            @stt.on("partial_transcript")
            async def on_partial_transcript(text: str, user, metadata: dict):
                if text.strip():  # Only show non-empty partial transcripts
                    user_info = user if user else "unknown"
                    print(f"    {user_info} (partial): {text}", end="\r")  # Overwrite line

            @stt.on("error")
            async def on_stt_error(error):
                print(f"\n❌ STT Error: {error}")

            # Keep the connection alive and wait for audio
            await connection.wait()

    except asyncio.CancelledError:
        print("\n⏹️  Stopping transcription bot...")
    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
    finally:
        await stt.close()
        client.delete_users([user_id, bot_user_id])

if __name__ == "__main__":
    asyncio.run(main())

With this, you now have a call with transcription enabled through Deepgram. All transcription providers in the SDK work the same way, though the customisation options they expose through parameters may differ.
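
As one possible extension that is not part of the example above, you could persist the collected utterances once the call ends, assuming you accumulated them in a transcript_lines list as sketched earlier:

from pathlib import Path

def save_transcript(lines, call_id):
    # Write one utterance per line to a plain-text file named after the call
    Path(f"transcript-{call_id}.txt").write_text("\n".join(lines), encoding="utf-8")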
