explaingit

snakers4/silero-vad

9,033 | Python | Audience · developer | Complexity · 2/5 | License · MIT | Setup · easy

TLDR

Silero VAD is a tiny (about 2 MB) pre-trained model that detects when speech is present in audio and returns timestamps marking where people are talking. It was trained on recordings covering more than 6,000 languages, processes a chunk in under 1 ms on CPU, and installs with a one-line pip command.

Mindmap

mindmap
  root((Silero VAD))
    What it does
      Detects speech in audio
      Returns timestamps
      Filters silence
    Model Details
      2MB tiny model
      6000 plus languages
      Under 1ms per chunk
    Runtimes
      PyTorch
      ONNX Runtime
      C++ Rust Go wrappers
    Use Cases
      Pre-transcription filter
      Voice bot trigger
      Dataset cleaning
    Audience
      ML developers
      Voice app builders

Things people build with this

USE CASE 1

Strip silence from audio files before sending them to a transcription model to reduce cost and processing time

USE CASE 2

Build a voice bot that only activates its audio pipeline when someone is actually speaking, cutting CPU and API costs

USE CASE 3

Clean up call-center or podcast recording datasets by automatically tagging speech versus silence segments before training

USE CASE 4

Add a lightweight voice-activity trigger to an IoT or mobile device so it only wakes up and processes audio when speech is detected
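
Use case 3 (tagging speech versus silence) can be sketched in a few lines once the speech ranges are known. The helper below is a hypothetical illustration and not part of Silero VAD: it assumes the speech ranges arrive as non-overlapping dictionaries with `start` and `end` keys in seconds, and that the total duration of the recording is known.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[str, float]]


def tag_segments(speech: List[Dict[str, float]], total: float) -> List[Segment]:
    """Label a whole recording as alternating speech and silence segments.

    `speech` is assumed to hold non-overlapping ranges in seconds;
    `total` is the recording length in seconds.
    """
    segments: List[Segment] = []
    cursor = 0.0  # end of the last segment emitted so far
    for r in sorted(speech, key=lambda r: r["start"]):
        if r["start"] > cursor:
            # Gap before this speech range: mark it as silence.
            segments.append({"label": "silence", "start": cursor, "end": r["start"]})
        segments.append({"label": "speech", "start": r["start"], "end": r["end"]})
        cursor = max(cursor, r["end"])
    if cursor < total:
        # Trailing silence after the last speech range.
        segments.append({"label": "silence", "start": cursor, "end": total})
    return segments
```

The returned list can be written out with `json.dump` to produce one tag file per recording, or filtered down to the speech entries before sending audio on to a transcription model.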

Tech stack

Python · PyTorch · ONNX

Getting it running

Difficulty · easy | Time to first run · 5 min
Use freely for any purpose including commercial products, with no restrictions beyond keeping the copyright notice (MIT license).

In plain English

Silero VAD is a pre-trained machine learning model that detects when speech is present in an audio recording. Given an audio file or a stream, it returns timestamps marking where people are talking and where they are not. This is useful any time you need to separate speech from silence or background noise before passing audio on to a transcription system or another processing step.

The model is small, around 2 megabytes, and fast: it processes a 30-millisecond audio chunk in under 1 millisecond on a single CPU core. It was trained on recordings covering more than 6,000 languages, which means it does not need to be retrained or adjusted for a specific language. It works at 8,000 Hz and 16,000 Hz sampling rates.

Installation is a one-line pip command, and the model is available on PyPI as a standard Python package. The Python API is minimal: load the model, read an audio file, call get_speech_timestamps, and get back a list of time ranges where speech was detected. The underlying model can also run through ONNX Runtime, which allows it to work in environments without PyTorch. Community-maintained wrappers exist for C++, Rust, Go, Java, and C#, as well as browser-based use through ONNX Runtime Web.

Common use cases listed in the README include processing call-center recordings, building voice bots that only activate when someone speaks, cleaning audio datasets before training other models, and adding voice interfaces to IoT or mobile devices.

The model is published under the MIT license. The README notes it has no telemetry, no registration requirement, no expiration date, and no vendor lock-in.
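
The Python flow described above (load the model, read an audio file, call get_speech_timestamps) can be sketched as follows. This is a minimal sketch and not a definitive recipe: it assumes the `silero-vad` package from PyPI is installed and exposes `load_silero_vad`, `read_audio`, and `get_speech_timestamps` with a `return_seconds` option. The `detect_speech` and `format_ranges` helpers are hypothetical names introduced only for illustration.

```python
from typing import Dict, List


def detect_speech(path: str) -> List[Dict[str, float]]:
    """Return the speech ranges, in seconds, found in one audio file."""
    # Imported here so format_ranges stays usable without the package installed.
    from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

    model = load_silero_vad()
    wav = read_audio(path)  # assumed to load the file at the default 16,000 Hz
    # return_seconds=True asks for start/end in seconds rather than samples.
    return get_speech_timestamps(wav, model, return_seconds=True)


def format_ranges(ranges: List[Dict[str, float]]) -> List[str]:
    """Render each speech range as a short human-readable line."""
    return [f"{r['start']:.2f}s - {r['end']:.2f}s" for r in ranges]
```

Calling `format_ranges(detect_speech("example.wav"))`, where `example.wav` is a placeholder path, would give one line per talking interval. Check the function names and parameters against the repo's README before relying on them.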

Copy-paste prompts

Prompt 1
Install Silero VAD and show me how to load an audio file, detect speech segments, and print the start and end timestamps of each talking interval.
Prompt 2
Help me integrate Silero VAD into a Python voice bot so it only sends audio chunks to Whisper for transcription when speech is actually detected.
Prompt 3
Show me how to run Silero VAD using ONNX Runtime instead of PyTorch so I can use it in an environment without a GPU or a full PyTorch installation.
Prompt 4
I have a folder of call-center recordings. Write a Python script using Silero VAD that generates a JSON file for each recording listing all speech and silence time ranges.
Prompt 5
Help me use Silero VAD in real time with a microphone input stream, printing a message each time speech starts and stops.

Verify against the repo before relying on details.