Silero
Silero VAD is a high-accuracy, lightweight Voice Activity Detection (VAD) model designed to detect human speech in real-time audio streams.
It is optimized for low-latency environments and works efficiently on both CPU and GPU, with optional ONNX Runtime support for accelerated inference.
The Silero plugin in the Stream Python AI SDK uses this model to detect when a person starts and stops speaking.
This is especially useful for streaming applications where you need to know when to begin transcription, cut off silence, or manage conversational turns in real time.
Initialisation
The Silero plugin for Stream exists in the form of the SileroVAD class:

```python
vad = SileroVAD()
```
Parameters
These are the parameters available in the SileroVAD plugin for you to customise:
Name | Type | Default | Description |
---|---|---|---|
sample_rate | int | 48000 | Audio sample rate in Hz expected from the input stream. |
activation_th | float | 0.4 | Threshold (0.0–1.0) for detecting the start of speech. |
deactivation_th | float | 0.2 | Threshold (0.0–1.0) for detecting the end of speech. Defaults to 0.7 * activation_th if not set. |
speech_pad_ms | int | 300 | Milliseconds of padding to add before and after detected speech segments. |
min_speech_ms | int | 250 | Minimum length (in ms) of audio required to emit a speech segment. |
max_speech_ms | int | 30000 | Maximum allowed speech duration before forcing a flush. |
model_rate | int | 16000 | Sample rate required by the Silero model (typically 16000 Hz). |
window_samples | int | 512 | Number of samples per processing window (512 for 16kHz, 256 for 8kHz). |
device | str | "cpu" | Device to run inference on (e.g., "cpu", "cuda", "cuda:0"). |
partial_frames | int | 10 | Number of frames to accumulate before emitting a “partial” speech event (for real-time feedback). |
use_onnx | bool | False | Whether to use ONNX Runtime for inference instead of PyTorch. |
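The activation_th and deactivation_th pair form a simple hysteresis: speech starts once the model's per-window speech probability rises above activation_th, and only ends once it drops below the lower deactivation_th, which prevents flickering on borderline windows. The sketch below illustrates that logic only; it is not the plugin's actual implementation, and the function name is ours:

```python
def track_speech(probs, activation_th=0.4, deactivation_th=0.28):
    """Map per-window speech probabilities to in-speech flags
    using two-threshold hysteresis (illustration only)."""
    in_speech = False
    states = []
    for p in probs:
        if not in_speech and p >= activation_th:
            in_speech = True   # probability crossed the start threshold
        elif in_speech and p < deactivation_th:
            in_speech = False  # probability fell below the stop threshold
        states.append(in_speech)
    return states

# Windows at 0.35 and 0.30 stay "in speech" because they are above
# the deactivation threshold, even though they are below activation.
print(track_speech([0.1, 0.5, 0.35, 0.3, 0.2, 0.1]))
```

The gap between the two thresholds is what keeps a brief dip in confidence mid-sentence from splitting one utterance into two segments.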
Functionality
Process Audio
Once you join the call, you can listen to the connection for audio events. You can then pass along the audio events for the VAD class to process:
```python
from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:
    @connection.on("audio")
    async def _on_pcm(pcm: PcmData, user):
        await vad.process_audio(pcm, user)
```
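Incoming audio arrives at sample_rate (48 kHz by default) but the model expects model_rate (16 kHz), so the audio is downsampled before Silero sees it; the plugin is expected to handle this internally. Conceptually, integer-factor downsampling can be as simple as the naive decimation below (an illustration only; production code should low-pass filter first to avoid aliasing):

```python
def decimate(samples, in_rate=48000, out_rate=16000):
    """Keep every n-th sample to drop from in_rate to out_rate.
    Naive: no anti-aliasing filter is applied."""
    assert in_rate % out_rate == 0, "only integer-factor decimation"
    step = in_rate // out_rate  # 3 for 48 kHz -> 16 kHz
    return samples[::step]

pcm_48k = list(range(12))   # 12 samples at 48 kHz
print(decimate(pcm_48k))    # every 3rd sample survives
```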
Events
Audio Event
The audio event is triggered when Silero detects a complete speech segment:
```python
@vad.on("audio")
async def _on_audio(pcm: PcmData, user):
    # Process the completed speech segment here
    ...
```
Partial Event
The partial event is fired in real time while Silero is still detecting ongoing speech, before the segment is finalised.
The partial_frames parameter, set at initialisation, controls how many frames are accumulated before each partial event is emitted.
```python
@vad.on("partial")
async def _on_partial(pcm: PcmData, user):
    # Process the partial speech audio here
    ...
```
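With the defaults above you can estimate how often partial events arrive: each model window covers window_samples / model_rate seconds, and a partial event spans partial_frames windows (assuming partials are emitted after exactly that many full windows):

```python
# Default parameter values from the table above
window_samples = 512
model_rate = 16000
partial_frames = 10

window_ms = 1000 * window_samples / model_rate  # duration of one window
partial_ms = partial_frames * window_ms         # duration covered by a partial event
print(window_ms, partial_ms)
```

So with the defaults, each window is 32 ms of audio and a partial event fires roughly every 320 ms of continuous speech; lower partial_frames for snappier feedback at the cost of more events.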
Close
You can close the VAD connection with the close() method:

```python
vad.close()
```
Example
Check out our Silero example to see a practical implementation of the plugin and get inspiration for your own projects.