Transcribe long podcast or interview recordings to text in minutes on a GPU-equipped machine.
Generate word-level or chunk-level timestamps from audio for subtitles or searchable transcripts.
Identify different speakers in a conversation recording using the built-in diarization integration.
Integrate fast Whisper transcription into a Python data pipeline without calling an external API.
Requires a compatible NVIDIA GPU or Apple Silicon Mac. Speaker diarization needs an additional setup step with a separate model.
Insanely Fast Whisper is a command-line tool that converts audio files into text transcripts, using OpenAI's Whisper speech recognition model. The headline claim in the README is that it can transcribe two and a half hours of audio in under 98 seconds when run on a high-end GPU, which it achieves by combining several speed optimization techniques.

Whisper is an AI model that listens to audio and writes down what was said. It is quite accurate and supports many languages. The standard way to run Whisper is slow, especially for long recordings. This tool wraps Whisper with several acceleration techniques from the Hugging Face ecosystem, including half-precision math (fp16), batch processing, and an optional feature called Flash Attention 2, which makes the attention calculations inside the model faster. The README includes benchmark numbers showing how each combination of optimizations affects speed.

Using it is meant to be simple. You install it with a single command and then point it at an audio file. It runs entirely on your own machine, so your audio never leaves your computer. It works on NVIDIA graphics cards and on Macs with Apple Silicon chips.

The tool can transcribe audio or translate it into English from another language. It can also produce timestamps at the word level or by chunks, which is useful if you want to know exactly when each part of the transcript occurred.

For situations where you need to identify who is speaking in a conversation, the tool integrates with a separate speaker diarization model. Diarization means labeling each part of the transcript with which speaker said it. This requires an additional step to set up.

The project started as a benchmark demonstration and grew into a practical tool based on community interest. It is not affiliated with OpenAI. The code can also be used as a Python snippet rather than through the CLI if you prefer to integrate it into your own scripts.
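As an illustration of the subtitle use case, here is a minimal sketch that turns chunk-level timestamps into SRT subtitle text. It assumes the transcript is available as a list of chunks shaped like {"timestamp": (start, end), "text": "..."}, which is modeled on the timestamped output of the Hugging Face speech recognition pipeline; check the tool's actual output before relying on that shape. The function names format_srt_time and chunks_to_srt are hypothetical helpers, not part of the tool.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list) -> str:
    """Build SRT text from chunks shaped like
    {"timestamp": (start, end), "text": str} (assumed shape)."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # Assumption: the last chunk may lack an end time; fall back to its start.
        if end is None:
            end = start
        text = chunk["text"].strip()
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical sample data standing in for a real transcript.
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " Welcome back."},
]

if __name__ == "__main__":
    print(chunks_to_srt(sample_chunks))
```

Keeping the conversion separate from transcription means the same helper works whether the chunks come from the CLI's saved output or from a Python script, provided the chunk shape matches the assumption above.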
Verify against the repo before relying on details.