snakers4/silero-vad

★ 9,033 | Python | Audience · developer | Complexity · 2/5 | License · MIT | Setup · easy

Mindmap

mindmap
  root((Silero VAD))
    What it does
      Detects speech in audio
      Returns timestamps
      Filters silence
    Model Details
      2MB tiny model
      6000 plus languages
      Under 1ms per chunk
    Runtimes
      PyTorch
      ONNX Runtime
      C++ Rust Go wrappers
    Use Cases
      Pre-transcription filter
      Voice bot trigger
      Dataset cleaning
    Audience
      ML developers
      Voice app builders


Things people build with this

USE CASE 1

Strip silence from audio files before sending them to a transcription model to reduce cost and processing time

USE CASE 2

Build a voice bot that only activates its audio pipeline when someone is actually speaking, cutting CPU and API costs

USE CASE 3

Clean up call-center or podcast recording datasets by automatically tagging speech versus silence segments before training

USE CASE 4

Add a lightweight voice-activity trigger to an IoT or mobile device so it only wakes up and processes audio when speech is detected
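As a rough illustration of the dataset-tagging use case, the sketch below turns a list of detected speech ranges into a full speech/silence labelling of a recording. The helper name `tag_segments` and the input shape (a sorted list of `{"start", "end"}` dicts in seconds) are assumptions made for this example, not part of the Silero VAD API.

```python
from typing import Dict, List


def tag_segments(speech: List[Dict[str, float]], total: float) -> List[Dict[str, object]]:
    """Label the span [0, total] as alternating speech and silence ranges.

    Hypothetical helper: `speech` is assumed to hold sorted, non-overlapping
    ranges in seconds, and `total` is the recording length in seconds.
    """
    tagged: List[Dict[str, object]] = []
    cursor = 0.0  # end of the last range emitted so far
    for seg in speech:
        # Any gap before this speech range is silence.
        if seg["start"] > cursor:
            tagged.append({"label": "silence", "start": cursor, "end": seg["start"]})
        tagged.append({"label": "speech", "start": seg["start"], "end": seg["end"]})
        cursor = seg["end"]
    # Trailing silence after the last speech range, if any.
    if cursor < total:
        tagged.append({"label": "silence", "start": cursor, "end": total})
    return tagged
```

The resulting list can be written out with `json.dump` to get one label file per recording.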

Tech stack

Python · PyTorch · ONNX

Getting it running

Difficulty · easy | Time to first run · 5 min

Free to use for any purpose, including commercial products. The MIT license only requires keeping the copyright and license notice.

In plain English

Silero VAD is a pre-trained machine learning model that detects when speech is present in audio. Given an audio file or a stream, it returns timestamps marking where people are talking and where they are not. This is useful whenever you need to separate speech from silence or background noise before passing audio to a transcription system or another processing step.

The model is small, around 2 megabytes, and fast: it processes a 30-millisecond audio chunk in under 1 millisecond on a single CPU core. It was trained on recordings covering more than 6,000 languages, so it does not need to be retrained or adjusted for a specific language. It supports 8,000 Hz and 16,000 Hz sampling rates.

Installation is a one-line pip command, since the model is published on PyPI as a standard Python package. The Python API is minimal: load the model, read an audio file, call get_speech_timestamps, and get back a list of time ranges where speech was detected. The model can also run through ONNX Runtime, which lets it work in environments without PyTorch. Community-maintained wrappers exist for C++, Rust, Go, Java, and C#, and browser-based use is possible through ONNX Runtime Web.

Common use cases listed in the README include processing call-center recordings, building voice bots that only activate when someone speaks, cleaning audio datasets before training other models, and adding voice interfaces to IoT or mobile devices. The model is published under the MIT license. The README notes that it has no telemetry, no registration requirement, no expiration date, and no vendor lock-in.
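The Python flow described above (load the model, read a file, call get_speech_timestamps) might look roughly like the sketch below. It assumes the `silero-vad` PyPI package exposes `load_silero_vad`, `read_audio`, and `get_speech_timestamps` with a `return_seconds` option, as recent releases do; check the repo for the current signatures. The `describe` helper and the file name in the usage note are hypothetical.

```python
from typing import Dict, List


def detect_speech(path: str, sampling_rate: int = 16000) -> List[Dict[str, float]]:
    """Return speech ranges (in seconds) for one audio file."""
    # Imported lazily so the rest of this module works without the package.
    # These names are assumed from the silero-vad package; verify against the repo.
    from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

    model = load_silero_vad()
    wav = read_audio(path, sampling_rate=sampling_rate)
    return get_speech_timestamps(
        wav, model, sampling_rate=sampling_rate, return_seconds=True
    )


def describe(segments: List[Dict[str, float]]) -> List[str]:
    """Format each range as 'start - end' in seconds (illustrative helper)."""
    return [f"{s['start']:.1f}s - {s['end']:.1f}s" for s in segments]
```

A typical call would be `print(describe(detect_speech("example.wav")))`, where `example.wav` stands in for any recording at a supported sampling rate.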

Copy-paste prompts

Prompt 1

Install Silero VAD and show me how to load an audio file, detect speech segments, and print the start and end timestamps of each talking interval.

Prompt 2

Help me integrate Silero VAD into a Python voice bot so it only sends audio chunks to Whisper for transcription when speech is actually detected.

Prompt 3

Show me how to run Silero VAD using ONNX Runtime instead of PyTorch so I can use it in an environment without a GPU or a full PyTorch installation.

Prompt 4

I have a folder of call-center recordings. Write a Python script using Silero VAD that generates a JSON file for each recording listing all speech and silence time ranges.

Prompt 5

Help me use Silero VAD in real time with a microphone input stream, printing a message each time speech starts and stops.


Verify against the repo before relying on details.