Silero

Silero VAD is a high-accuracy, lightweight Voice Activity Detection (VAD) model designed to detect human speech in real-time audio streams.
It is optimized for low-latency environments and works efficiently on both CPU and GPU, with optional ONNX Runtime support for accelerated inference.

The Silero plugin in the Stream Python AI SDK uses this model to detect when a person starts and stops speaking.
This is especially useful for streaming applications where you need to know when to begin transcription, cut off silence, or manage conversational turns in real time.

Initialisation

The Silero plugin for Stream exists in the form of the SileroVAD class:

vad = SileroVAD()

Parameters

These are the parameters available in the SileroVAD plugin for you to customise:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `sample_rate` | `int` | `48000` | Audio sample rate in Hz expected from the input stream. |
| `activation_th` | `float` | `0.4` | Threshold (0.0–1.0) for detecting the start of speech. |
| `deactivation_th` | `float` | `0.2` | Threshold (0.0–1.0) for detecting the end of speech. Defaults to `0.7 * activation_th` if not set. |
| `speech_pad_ms` | `int` | `300` | Milliseconds of padding to add before and after detected speech segments. |
| `min_speech_ms` | `int` | `250` | Minimum length (in ms) of audio required to emit a speech segment. |
| `max_speech_ms` | `int` | `30000` | Maximum allowed speech duration before forcing a flush. |
| `model_rate` | `int` | `16000` | Sample rate required by the Silero model (typically 16000 Hz). |
| `window_samples` | `int` | `512` | Number of samples per processing window (512 for 16 kHz, 256 for 8 kHz). |
| `device` | `str` | `"cpu"` | Device to run inference on (e.g., `"cpu"`, `"cuda"`, `"cuda:0"`). |
| `partial_frames` | `int` | `10` | Number of frames to accumulate before emitting a "partial" speech event (for real-time feedback). |
| `use_onnx` | `bool` | `False` | Whether to use ONNX Runtime for inference instead of PyTorch. |
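The two thresholds form a hysteresis band: speech starts once the per-window speech probability rises above `activation_th`, and only ends once it drops below the lower `deactivation_th`, which stops brief dips mid-utterance from splitting a segment. A minimal sketch of that logic (the probabilities here are made-up example values, not model output):

```python
def track_speech(probs, activation_th=0.4, deactivation_th=None):
    """Toy hysteresis tracker mirroring the two-threshold idea above.

    This is an illustration of the concept, not the plugin's code.
    """
    if deactivation_th is None:
        deactivation_th = 0.7 * activation_th  # default noted in the table
    speaking = False
    states = []
    for p in probs:
        if not speaking and p >= activation_th:
            speaking = True          # crossed the activation threshold
        elif speaking and p < deactivation_th:
            speaking = False         # fell below the deactivation threshold
        states.append(speaking)
    return states

# Values between the two thresholds keep the current state:
print(track_speech([0.1, 0.5, 0.3, 0.3, 0.2, 0.1]))
# speech starts at 0.5, survives the 0.3 dips, ends below 0.28
```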

Functionality

Process Audio

Once you join the call, you can listen to the connection for audio events. You can then pass along the audio events for the VAD class to process:

from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def _on_pcm(pcm: PcmData, user):
        await vad.process_audio(pcm, user)
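Note that the input stream (`sample_rate`, 48000 Hz by default) and the model (`model_rate`, 16000 Hz) run at different rates, so incoming audio is resampled down by a factor of 3 before inference. As a rough illustration of that 3:1 ratio only (a real resampler low-pass filters first to avoid aliasing, and this is not the plugin's internal code):

```python
def decimate_48k_to_16k(samples):
    """Naive 3:1 decimation (48000 Hz -> 16000 Hz): keep every 3rd sample.

    Sketch only; proper resampling applies an anti-aliasing filter first.
    """
    return samples[::3]

pcm_48k = list(range(12))          # 12 samples at 48 kHz
pcm_16k = decimate_48k_to_16k(pcm_48k)
print(pcm_16k)                     # [0, 3, 6, 9] -- one third of the input
```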

Events

Audio Event

The audio event is emitted when Silero detects a complete speech segment; the PcmData payload contains the detected speech audio:

@vad.on("audio")
async def _on_audio(pcm: PcmData, user):
    # Process the completed speech segment here
    ...
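Whether a detected region is emitted at all is governed by `min_speech_ms`, and each emitted segment is extended by `speech_pad_ms` on both sides. A sketch of that arithmetic using the defaults from the parameter table (this mirrors the documented behaviour, not the plugin's actual implementation):

```python
def emit_segment(num_samples, sample_rate=16000,
                 min_speech_ms=250, speech_pad_ms=300):
    """Return the padded segment length in samples, or None if the
    region is too short to emit. Illustration of the defaults only."""
    duration_ms = num_samples * 1000 / sample_rate
    if duration_ms < min_speech_ms:
        return None                              # too short: discarded
    pad_samples = speech_pad_ms * sample_rate // 1000
    return num_samples + 2 * pad_samples         # padding on both sides

print(emit_segment(1600))   # 100 ms of speech -> None (below 250 ms)
print(emit_segment(8000))   # 500 ms -> 8000 + 2*4800 = 17600 samples
```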

Partial Event

The partial event is fired in real time while speech is still ongoing, carrying intermediate chunks of the speech segment detected so far. The partial_frames parameter set at initialisation controls how many frames are accumulated before each partial event is emitted.

@vad.on("partial")
async def _on_partial(pcm: PcmData, user):
    # Process the partial speech chunk here
    ...
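With the defaults, a partial event corresponds to `partial_frames * window_samples` samples at the model rate, which sets the latency of real-time feedback. A quick back-of-the-envelope check (assuming one frame per processing window, which is an illustration rather than the plugin's internal bookkeeping):

```python
def partial_chunk_ms(partial_frames=10, window_samples=512, model_rate=16000):
    """Approximate audio duration carried by one 'partial' event,
    using the defaults from the parameter table."""
    samples = partial_frames * window_samples   # 10 * 512 = 5120 samples
    return samples * 1000 / model_rate          # convert to milliseconds

print(partial_chunk_ms())  # 5120 samples -> 320.0 ms at 16 kHz
```

Lowering `partial_frames` gives more frequent (but smaller) partial updates.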

Close

You can release the VAD's resources with the close() method:

vad.close()

Example

Check out our Silero example to see a practical implementation of the plugin and get inspiration for your own projects.
