explaingit

m-bain/whisperx

Analysis updated 2026-06-21

21,726PythonAudience · developerComplexity · 3/5Setup · moderate

TLDR

WhisperX turns audio recordings into text transcripts with precise word-level timestamps and optional speaker labels, adding fast batched processing on top of OpenAI Whisper.

Mindmap

mindmap
  root((WhisperX))
    What it does
      Speech to text
      Word timestamps
      Speaker labels
    Pipeline Steps
      Voice detection
      Batch transcription
      Forced alignment
      Diarization
    Use Cases
      Video subtitles
      Podcast archives
      Meeting notes
    Tech Stack
      Python
      faster-whisper
      pyannote-audio
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

What do people build with it?

USE CASE 1

Generate subtitle files for videos with accurate word-by-word timing synced to the audio.

USE CASE 2

Build a searchable archive of podcasts or meeting recordings with precise timestamps.

USE CASE 3

Automatically label which speaker said what in a multi-person conversation or interview.

USE CASE 4

Feed precisely timed transcripts into downstream tools like video editors or search engines.

What is it built with?

PythonPyTorchCUDAfaster-whisperpyannote-audio

How does it compare?

m-bain/whisperxgraphdeco-inria/gaussian-splattingrecommenders-team/recommenders
Stars21,72621,67321,669
LanguagePythonPythonPython
Setup difficultymoderatehardmoderate
Complexity3/54/53/5
Audiencedeveloperdeveloperresearcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate Time to first run · 30min

Requires a Hugging Face access token for speaker diarization, GPU with CUDA recommended for fast performance.

In plain English

WhisperX is an open-source tool that turns recorded speech into text and tags each word with a precise timestamp, so a transcript lines up tightly with the audio. It builds on Whisper, a speech-recognition model from OpenAI that is good at the words themselves but only gives rough timing for whole utterances and cannot natively process audio in batches. WhisperX adds the missing pieces: accurate per-word timing, fast batched inference, and the ability to tell different speakers apart in a conversation. The README describes a small pipeline. First, Voice Activity Detection finds the chunks of audio that contain speech and skips silence, which speeds things up and cuts down on the model "hallucinating" words that were never said. Those chunks are then transcribed in batches using a faster-whisper backend, which the project says reaches around 70x realtime with the large-v2 model. Next, a phoneme-based recognition model (a sibling family of models trained on the smallest sound units in language, such as the p in "tap") is used for forced alignment: lining up the transcript against the audio so each word gets an exact start and end time. An optional speaker-diarization step from pyannote-audio then splits the audio by speaker and attaches speaker labels to the transcript. You would use WhisperX when the timestamps matter as much as the text: making subtitle files for video, building searchable archives of podcasts or meetings, or feeding precisely timed transcripts into downstream tools. It is a Python package installable from PyPI, runs on GPU through CUDA or on CPU, and uses a Hugging Face token to enable diarization.

Copy-paste prompts

Prompt 1
I am using WhisperX to transcribe a podcast. Write Python code to transcribe an audio file and output an SRT subtitle file with word-level timestamps.
Prompt 2
Help me set up WhisperX with speaker diarization using pyannote-audio, I need to label which speaker said what in my meeting recording.
Prompt 3
My WhisperX transcription is slow. How do I enable batched GPU inference with CUDA to reach faster than realtime processing?
Prompt 4
Write a Python script using WhisperX that takes an audio file and outputs JSON with each word, its start time, end time, and speaker label.
Prompt 5
How do I install WhisperX from PyPI and run it on a CPU instead of a GPU?

Frequently asked questions

What is whisperx?

WhisperX turns audio recordings into text transcripts with precise word-level timestamps and optional speaker labels, adding fast batched processing on top of OpenAI Whisper.

What language is whisperx written in?

Mainly Python. The stack also includes Python, PyTorch, CUDA.

How hard is whisperx to set up?

Setup difficulty is rated moderate, with roughly 30min to a first successful run.

Who is whisperx for?

Mainly developer.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Scan in gitsafehub Deploy in gitdeployhub m-bain on gitmyhub

Verify against the repo before relying on details.