Transcribe podcast episodes or meeting recordings into searchable text.
Translate foreign-language videos or interviews into English subtitles.
Build accessibility features that caption live audio in near real time by transcribing short chunks as they arrive.
Extract speech from video files and generate transcripts for documentation.
Requires ffmpeg as a system dependency and a PyTorch installation; setup time and transcription speed depend on whether a GPU is available.
Whisper is a general-purpose speech recognition model from OpenAI. According to the README, it is trained on a large dataset of diverse audio and is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. In everyday terms, you give it an audio file and it returns text: either a transcript of the speech in the original language or an English translation of speech in another language.

How it works: a Transformer sequence-to-sequence model is trained on several speech tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens predicted by the decoder, so a single model replaces many stages of a traditional speech-processing pipeline. The transcribe method processes the audio with a sliding 30-second window.

The README lists six model sizes: tiny, base, small, medium, large, and turbo. Four of them (tiny, base, small, and medium) also come in English-only versions. Sizes range from tiny, with 39M parameters, about 1 GB of VRAM, and roughly 10× the speed of large, up to large, with 1550M parameters and about 10 GB of VRAM. The turbo model is an optimized version of large-v3 that the README says offers faster transcription with minimal accuracy loss; it is not trained for translation.

You install it with pip install -U openai-whisper and also need ffmpeg installed on the system. After installing, you can transcribe from the command line (whisper audio.flac --model turbo), specify a language with --language, or translate non-English speech to English with --task translate and a multilingual model other than turbo. You can also call whisper.load_model and model.transcribe from Python, or drop down to lower-level helpers for language detection and decoding. The repository is written in Python, and the full README is longer than the excerpt summarized here.
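The Python usage described above can be sketched roughly as follows. This is a minimal illustration, not code from the repository: whisper.load_model and model.transcribe are the calls named in the README, while build_transcribe_kwargs, check_model_choice, and transcribe_file are hypothetical helper names introduced here. The check that rejects English-only (.en) models for translation is this sketch's own assumption about a sensible guard, not documented library behavior.

```python
def build_transcribe_kwargs(translate=False, language=None):
    """Assemble keyword arguments for model.transcribe (hypothetical helper)."""
    # "task" selects between same-language transcription and translation to English.
    kwargs = {"task": "translate" if translate else "transcribe"}
    # Leaving language out lets Whisper detect it from the audio.
    if language is not None:
        kwargs["language"] = language
    return kwargs


def check_model_choice(model_name, translate=False):
    """Reject model/task pairs the summary warns against (hypothetical helper)."""
    if translate and model_name == "turbo":
        # The README says turbo is not trained for translation.
        raise ValueError("turbo is not trained for translation; "
                         "pick a multilingual model such as medium or large")
    if translate and model_name.endswith(".en"):
        # Assumption of this sketch: treat English-only models as unsuitable.
        raise ValueError("translation needs a multilingual model")
    return model_name


def transcribe_file(path, model_name="turbo", translate=False, language=None):
    """Load a Whisper model and return the recognized text for one audio file."""
    # Imported lazily so the helpers above work without Whisper installed.
    import whisper

    model = whisper.load_model(check_model_choice(model_name, translate))
    result = model.transcribe(path, **build_transcribe_kwargs(translate, language))
    return result["text"]
```

With Whisper and ffmpeg installed, transcribe_file("audio.flac") would transcribe in the spoken language, and transcribe_file("interview.mp3", model_name="medium", translate=True, language="ja") would request an English translation of Japanese speech. Both file names are placeholders. Keeping the model check separate from the call makes it possible to catch an unsupported combination before a large model is downloaded.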
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.