Silero
Silero VAD is a high-accuracy, lightweight Voice Activity Detection (VAD) model designed to detect human speech in real-time audio streams.
It is optimized for low-latency environments and works efficiently on both CPU and GPU, with optional ONNX Runtime support for accelerated inference.
The Silero plugin in the Stream Python AI SDK uses this model to detect when a person starts and stops speaking.
This is especially useful for streaming applications where you need to know when to begin transcription, cut off silence, or manage conversational turns in real time.
Initialisation
The Silero plugin for Stream exists in the form of the SileroVAD class:

```python
vad = SileroVAD()
```
Parameters
These are the parameters available in the SileroVAD plugin for you to customise:
Name | Type | Default | Description |
---|---|---|---|
sample_rate | int | 48000 | Audio sample rate in Hz expected from the input stream. |
activation_th | float | 0.4 | Threshold (0.0–1.0) for detecting the start of speech. |
deactivation_th | float | 0.2 | Threshold (0.0–1.0) for detecting the end of speech. Defaults to 0.7 * activation_th if not set. |
speech_pad_ms | int | 300 | Milliseconds of padding to add before and after detected speech segments. |
min_speech_ms | int | 250 | Minimum length (in ms) of audio required to emit a speech segment. |
max_speech_ms | int | 30000 | Maximum allowed speech duration before forcing a flush. |
model_rate | int | 16000 | Sample rate required by the Silero model (typically 16000 Hz). |
window_samples | int | 512 | Number of samples per processing window (512 for 16kHz, 256 for 8kHz). |
device | str | "cpu" | Device to run inference on (e.g., "cpu", "cuda", "cuda:0"). |
partial_frames | int | 10 | Number of frames to accumulate before emitting a “partial” speech event (for real-time feedback). |
use_onnx | bool | False | Whether to use ONNX Runtime for inference instead of PyTorch. |
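The activation_th and deactivation_th pair form a simple hysteresis: speech starts once the model's per-window speech probability rises above activation_th, and only ends once it drops below the lower deactivation_th, which prevents flickering on borderline windows. The sketch below illustrates that logic only; it is not the plugin's actual implementation, and the function name is ours:

```python
def track_speech(probs, activation_th=0.4, deactivation_th=0.28):
    """Map per-window speech probabilities to in-speech flags
    using two-threshold hysteresis (illustration only)."""
    in_speech = False
    states = []
    for p in probs:
        if not in_speech and p >= activation_th:
            in_speech = True   # probability crossed the start threshold
        elif in_speech and p < deactivation_th:
            in_speech = False  # probability fell below the stop threshold
        states.append(in_speech)
    return states

# Windows at 0.35 and 0.30 stay "in speech" because they are above
# the deactivation threshold, even though they are below activation.
print(track_speech([0.1, 0.5, 0.35, 0.3, 0.2, 0.1]))
```

The gap between the two thresholds is what keeps a brief dip in confidence mid-sentence from splitting one utterance into two segments.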
Functionality
Process Audio
Once you join the call, you can listen to the connection for audio events. You can then pass along the audio events for the VAD class to process:
```python
from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:
    @connection.on("audio")
    async def _on_pcm(pcm: PcmData, user):
        await vad.process_audio(pcm, user)
```
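Incoming audio arrives at sample_rate (48 kHz by default) but the model expects model_rate (16 kHz), so the audio is downsampled before Silero sees it; the plugin is expected to handle this internally. Conceptually, integer-factor downsampling can be as simple as the naive decimation below (an illustration only; production code should low-pass filter first to avoid aliasing):

```python
def decimate(samples, in_rate=48000, out_rate=16000):
    """Keep every n-th sample to drop from in_rate to out_rate.
    Naive: no anti-aliasing filter is applied."""
    assert in_rate % out_rate == 0, "only integer-factor decimation"
    step = in_rate // out_rate  # 3 for 48 kHz -> 16 kHz
    return samples[::step]

pcm_48k = list(range(12))   # 12 samples at 48 kHz
print(decimate(pcm_48k))    # every 3rd sample survives
```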
Events
Audio Event
The audio event is triggered when Silero detects a complete speech segment:
```python
@vad.on("audio")
async def _on_audio(pcm: PcmData, user):
    # Process the completed speech segment here
    ...
```
Partial Event
The partial event is fired in real time while Silero is still detecting ongoing speech, before the segment is finalised.
The partial_frames parameter, set at initialisation, controls how many frames are accumulated before each partial event is emitted.
```python
@vad.on("partial")
async def _on_partial(pcm: PcmData, user):
    # Process the partial speech audio here
    ...
```
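With the defaults above you can estimate how often partial events arrive: each model window covers window_samples / model_rate seconds, and a partial event spans partial_frames windows (assuming partials are emitted after exactly that many full windows):

```python
# Default parameter values from the table above
window_samples = 512
model_rate = 16000
partial_frames = 10

window_ms = 1000 * window_samples / model_rate  # duration of one window
partial_ms = partial_frames * window_ms         # duration covered by a partial event
print(window_ms, partial_ms)
```

So with the defaults, each window is 32 ms of audio and a partial event fires roughly every 320 ms of continuous speech; lower partial_frames for snappier feedback at the cost of more events.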
Close
You can close the VAD connection with the close() method:

```python
vad.close()
```
Example
Check out our Silero example to see a practical implementation of the plugin and get inspiration for your own projects.