mahmoudashraf97/whisper-diarization

★ 5,519 | Jupyter Notebook | Audience · researcher | Complexity · 3/5 | Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Transcribe audio
      Label each speaker
    Pipeline stages
      Vocal extraction
      Whisper transcription
      Speaker identification
      Punctuation fix
    Tech stack
      Python
      OpenAI Whisper
      FFMPEG
    Requirements
      GPU recommended
      10GB VRAM ideal

Things people build with this

USE CASE 1

Transcribe a recorded meeting or interview and get back a labeled transcript showing which speaker said each line.

USE CASE 2

Process podcast audio into a text script with speaker labels automatically, without manual editing.

USE CASE 3

Analyze a multi-speaker audio recording to extract what each person said and when.

Tech stack

Python | Jupyter Notebook | FFMPEG | Cython | OpenAI Whisper

Getting it running

Difficulty · hard | Time to first run · 1h+

For best results, use a GPU with at least 10 GB of VRAM. FFMPEG and Cython must be installed before the Python dependencies.
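A quick preflight check can confirm the non-GPU prerequisites before you install the Python dependencies. The sketch below is not part of the repository; the function name `check_prerequisites` is hypothetical, and it assumes FFMPEG is exposed on the PATH as `ffmpeg` and that Cython is importable under the module name `Cython`. It uses only the Python standard library.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        # The summary on this page states Python 3.10 or later is needed.
        "python": sys.version_info >= min_python,
        # Assumption: the FFMPEG executable is named "ffmpeg" on the PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Assumption: Cython is installed in the current environment.
        "cython": importlib.util.find_spec("Cython") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

This does not check GPU memory; doing so would require a GPU library such as PyTorch, which is installed later with the Python dependencies.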

In plain English

Whisper Diarization is a Python tool that transcribes audio and labels each sentence with the speaker who said it. Standard transcription tools turn speech into text but treat everyone as one voice. This project goes further by identifying different speakers and assigning each spoken segment to the right one, a task called speaker diarization.

The pipeline works in several stages. First, it extracts just the vocal track from the audio, which improves speaker identification accuracy. Next, it uses OpenAI's Whisper model to turn the audio into text with timestamps. A forced-alignment step then corrects those timestamps at the word level to reduce timing errors. After that, a voice activity detection model removes silences, and a speaker embedding model identifies each unique voice. The final step combines the word timestamps with the speaker identities to produce a labeled transcript, and a punctuation model applies a final timing correction.

To run it, you need Python 3.10 or later, along with FFMPEG (a standard audio-processing tool available on most operating systems) and Cython, both installed first. After installing the Python dependencies, you point the script at an audio file with a single command. Command-line options let you choose which Whisper model size to use, specify the language if auto-detection fails, adjust the batch size if you run low on GPU memory, and disable the vocal-extraction step.

The tool works best with a GPU that has at least 10 GB of video memory. A parallel variant of the script is available for faster processing on high-memory machines, though the README notes it is still experimental. The main known limitation is that overlapping speech (two people talking at once) is not yet handled well.

The project is open source. It builds on OpenAI Whisper, Faster Whisper, Nvidia NeMo, and Facebook's Demucs, combining them into one working pipeline.
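The step that combines word timestamps with speaker identities can be illustrated with a minimal sketch. This is not the repository's code: the `Word` and `Turn` types, the function names, and the "largest time overlap wins" rule are all assumptions chosen for illustration. It assumes you already have word-level timestamps and a non-empty list of speaker turns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Assign each word to the speaker turn it overlaps the most.

    Assumes `turns` is non-empty. A word overlapping no turn falls back
    to the first turn in the list.
    """
    labeled = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w.start, w.end, t.start, t.end))
        labeled.append((best.speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Merge consecutive words from the same speaker into one line each."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(ws)}" for speaker, ws in lines]


# Hypothetical input data, for illustration only.
words = [Word("Hello", 0.0, 0.4), Word("there.", 0.4, 0.9), Word("Hi!", 1.1, 1.4)]
turns = [Turn("Speaker 0", 0.0, 1.0), Turn("Speaker 1", 1.0, 2.0)]

for line in to_transcript(label_words(words, turns)):
    print(line)
# Speaker 0: Hello there.
# Speaker 1: Hi!
```

Because each word is given to exactly one speaker, a rule like this cannot represent two people talking at once, which is consistent with the overlapping-speech limitation noted above.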

Copy-paste prompts

Prompt 1

How do I install Whisper Diarization and run it on a local audio file? Give me the full setup steps including FFMPEG, Cython, and Python dependencies.

Prompt 2

Show me the command to run whisper-diarization on an MP3 file using the medium Whisper model with Spanish as the specified language.

Prompt 3

What GPU memory do I need to run this tool, and what command-line options should I adjust if I only have 8GB VRAM instead of 10GB?

Prompt 4

How does the speaker diarization pipeline work in this project? Walk me through each stage from vocal extraction to the final labeled transcript.

Verify against the repo before relying on details.