explaingit

m-bain/whisperx

21,726 | Python | Audience · developer | Complexity · 3/5 | Maintained | License | Setup · moderate

TLDR

Fast speech-to-text with precise word-level timestamps and speaker identification, built on OpenAI's Whisper model.

Mindmap

mindmap
  root((WhisperX))
    What it does
      Transcribes audio to text
      Word-level timestamps
      Speaker identification
    How it works
      Voice activity detection
      Whisper transcription
      Forced alignment
      Speaker diarization
    Use cases
      Generate subtitles
      Interview analysis
      Meeting transcription
    Tech stack
      Python
      OpenAI Whisper
      Phoneme models
      GPU acceleration

Things people build with this

USE CASE 1

Generate accurate subtitles for videos with precise timing for each word.

USE CASE 2

Transcribe interviews or podcasts and identify which speaker said each part.

USE CASE 3

Process meeting recordings to create searchable transcripts with exact timestamps.

USE CASE 4

Analyze spoken content where you need to know exactly when each word was said.
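As a rough illustration of the subtitle use case, the sketch below groups timed words into cues and renders them in the SRT format. It is not taken from WhisperX. The input shape (a list of dicts with `word`, `start` and `end` keys, times in seconds) is an assumption about what a word-level transcript looks like, and `srt_timestamp`, `words_to_srt`, `max_words` and `max_gap` are hypothetical names chosen for this example. It also assumes every word carries both timings; any word without them would need to be handled before calling it.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7, max_gap: float = 0.8) -> str:
    """Group timed words into cues and render them as SRT text.

    A new cue starts once the current cue holds max_words words, or when
    the silence before the next word is longer than max_gap seconds.
    """
    cues: List[List[Dict]] = []
    current: List[Dict] = []
    for word in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or long_pause):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w["word"].strip() for w in cue)
        start = srt_timestamp(cue[0]["start"])
        end = srt_timestamp(cue[-1]["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical word timings, for illustration only.
    sample = [
        {"word": "Hello", "start": 0.5, "end": 0.9},
        {"word": "world", "start": 1.0, "end": 1.4},
        {"word": "Again", "start": 3.0, "end": 3.5},
    ]
    print(words_to_srt(sample))
```

Because each cue takes its start from its first word and its end from its last word, cue boundaries follow the word timings rather than whole-sentence boundaries.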

Tech stack

Python, OpenAI Whisper, PyTorch, CUDA

Getting it running

Difficulty · moderate | Time to first run · 30 min

CUDA/GPU setup and PyTorch installation can be time-consuming; CPU fallback available but slow.

Use freely for any purpose, including commercial use, as long as you keep the copyright notice and license text.

In plain English

WhisperX is an automatic speech recognition tool: software that converts spoken audio into written text. It builds on Whisper, an open speech-recognition model developed by OpenAI, and adds three things Whisper alone does not do well. It runs much faster, it provides accurate timestamps for every individual word rather than for whole sentences, and it can tell different speakers apart and label who said what.

The README explains the pipeline. Audio first passes through voice activity detection (VAD), which finds the segments where someone is actually speaking; this both reduces hallucinated transcriptions and allows the audio to be processed in batches. Those batches are then transcribed by Whisper itself, running on a faster backend called faster-whisper for high throughput. The README cites roughly 70x real-time speed using the large-v2 model. The resulting words are then aligned to the audio using a phoneme-level model (wav2vec2.0) to produce word-level timestamps that are much more precise than Whisper's native sentence-level ones. Finally, optional speaker diarization (powered by pyannote-audio) partitions the audio by speaker so each line in the transcript can be labelled with a speaker ID.

You would use WhisperX whenever you need a high-quality transcript with reliable word timings: for example, to generate subtitle files where the text needs to land on the right frame, to build a searchable index of meetings or lectures, or to feed downstream tools that act on individual words.

WhisperX is written in Python, distributed on PyPI, and can run on a GPU (with CUDA) or fall back to the CPU. Enabling speaker diarization requires a Hugging Face token and accepting the model's user agreement.
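The transcription and alignment stages described above can be sketched in Python roughly as follows. The `whisperx` calls mirror the usage pattern shown in the project's README as far as I know, but signatures can change between releases, so check them against the repo before relying on this. `pick_compute_type` and `transcribe_with_word_timestamps` are hypothetical helper names, the `float16`/`int8` choice is an assumed default, and the diarization step is left out.

```python
def pick_compute_type(device: str) -> str:
    """Pick a numeric precision for the device (assumed defaults).

    float16 is a common choice on a CUDA GPU; int8 is the usual
    fallback when running on the CPU.
    """
    return "float16" if device == "cuda" else "int8"


def transcribe_with_word_timestamps(
    audio_path: str,
    device: str = "cuda",
    model_name: str = "large-v2",
    batch_size: int = 16,
):
    """Transcribe one file, then align it to get word-level timestamps."""
    # Imported here so this module still loads where whisperx is not installed.
    import whisperx

    # 1. Batched transcription with the Whisper model.
    model = whisperx.load_model(
        model_name, device, compute_type=pick_compute_type(device)
    )
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    # 2. Forced alignment with a language-specific phoneme model.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    aligned = whisperx.align(
        result["segments"],
        align_model,
        metadata,
        audio,
        device,
        return_char_alignments=False,
    )
    return aligned["segments"]


if __name__ == "__main__":
    # "audio.mp3" is a placeholder path; running this needs whisperx installed.
    for segment in transcribe_with_word_timestamps("audio.mp3", device="cpu"):
        print(segment)
```

Lowering `batch_size` or choosing a smaller model is the usual way to trade speed for lower memory use.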

Copy-paste prompts

Prompt 1
How do I use WhisperX to transcribe an audio file and get word-level timestamps for subtitle generation?
Prompt 2
Show me how to set up WhisperX with speaker diarization to identify who is speaking in a meeting recording.
Prompt 3
What GPU memory do I need to run WhisperX, and how do I optimize it for faster transcription?
Prompt 4
How do I transcribe audio in a language other than English using WhisperX?
Prompt 5
Can you help me integrate WhisperX into a Python script to batch process multiple audio files?

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.