Transcribe a recorded meeting or interview and get back a labeled transcript showing which speaker said each line.
Process podcast audio into a text script with speaker labels automatically, without manual editing.
Analyze a multi-speaker audio recording to extract what each individual person said at what time.
Requires a GPU with at least 10 GB of VRAM for best results. FFmpeg and Cython must be installed before the Python dependencies.
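A quick way to confirm these prerequisites before installing anything is a small preflight check. The sketch below is not part of the project: the function name `check_prerequisites` and the result keys are hypothetical, and it assumes FFmpeg is exposed on the PATH as `ffmpeg` and that Cython is importable as the `Cython` module. It also checks for the Python 3.10 minimum mentioned in the project description, and does not attempt to measure GPU memory.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a mapping of prerequisite name to True/False.

    Hypothetical helper for illustration; not shipped with the project.
    """
    return {
        # The project description asks for Python 3.10 or later.
        "python>=3.10": sys.version_info >= (3, 10),
        # Assumes the FFmpeg executable is on PATH under the name "ffmpeg".
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Assumes Cython is installed into the active Python environment.
        "cython": importlib.util.find_spec("Cython") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as missing should be installed before the Python dependencies.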
Whisper Diarization is a Python tool that transcribes audio and labels each sentence with the speaker who said it. Standard transcription tools turn speech into text but treat everyone as one voice. This project goes further by distinguishing the speakers and assigning each spoken segment to the right one, a task called speaker diarization.

The pipeline works in several stages. First it isolates the vocal track from the audio, which improves how accurately it can identify speakers. Then it uses OpenAI's Whisper model to turn the audio into text with timestamps. A forced-alignment step corrects those timestamps at the word level to reduce timing errors. Next, a voice activity detection model removes silences and a speaker embedding model identifies each unique voice. The final step combines the word timestamps with the speaker identities to produce a labeled transcript, and a punctuation model applies a final timing correction.

To run it, you need Python 3.10 or later, along with FFmpeg (a standard audio-processing tool available on most operating systems) and Cython, both installed first. After installing the Python dependencies, you point the script at an audio file with a single command. Command-line options let you choose which Whisper model size to use, specify the language if auto-detection fails, reduce the batch size if you run low on GPU memory, and disable the vocal-extraction step. The tool works best with a GPU that has at least 10 GB of video memory. A parallel variant of the script is available for faster processing on high-memory machines, though the README notes it is still experimental.

The main known limitation is that overlapping speech, where two people talk at once, is not yet handled well. The project is open source. It builds on OpenAI Whisper, Faster Whisper, Nvidia NeMo, and Facebook's Demucs, combining them into one working pipeline.
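The step that combines word timestamps with speaker identities can be illustrated with a small, self-contained sketch. This is not the project's code: the `Word` and `Turn` types, the function names, the "Speaker 0" label format, and the largest-overlap rule are all assumptions chosen for illustration, and the real pipeline may use a different matching rule. Each word goes to the speaker turn it overlaps most in time, with ties (including zero overlap) broken by the nearest turn midpoint.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Pair each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w in words:
        mid = (w.start + w.end) / 2
        best = max(
            turns,
            key=lambda t: (
                overlap(w.start, w.end, t.start, t.end),
                -abs((t.start + t.end) / 2 - mid),  # tie-break: nearest turn
            ),
        )
        labeled.append((best.speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Merge consecutive words from the same speaker into one line."""
    groups = []
    for speaker, text in labeled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(text)
        else:
            groups.append((speaker, [text]))
    return [f"{speaker}: {' '.join(ws)}" for speaker, ws in groups]


# Made-up example data, not output from the tool.
words = [
    Word("Hello", 0.0, 0.4),
    Word("there.", 0.4, 0.9),
    Word("Hi,", 1.1, 1.3),
    Word("thanks.", 1.3, 1.9),
]
turns = [Turn("Speaker 0", 0.0, 1.0), Turn("Speaker 1", 1.0, 2.0)]

for line in to_transcript(assign_speakers(words, turns)):
    print(line)
# Speaker 0: Hello there.
# Speaker 1: Hi, thanks.
```

A rule like this also shows why overlapping speech is hard: when two turns cover the same stretch of time, a word can only be given to one of them.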
Repository author: mahmoudashraf97.
Verify against the repo before relying on details.