explaingit

openai/whisper

99,006 | Python | Audience · developer | Complexity · 2/5 | Maintained | License | Setup · moderate

TLDR

OpenAI's speech recognition model that transcribes audio to text in multiple languages or translates speech to English. Install via pip, run from command line or Python code.

Mindmap

mindmap
  root((Whisper))
    What it does
      Transcribe speech
      Translate to English
      Identify language
    How to use
      Command line
      Python API
      Six model sizes
    Model options
      Tiny to large
      English only
      Multilingual
    Tech details
      Transformer network
      30-second windows
      VRAM requirements
    Use cases
      Subtitles
      Meeting notes
      Accessibility

Things people build with this

USE CASE 1

Transcribe podcast episodes or meeting recordings into searchable text.

USE CASE 2

Translate foreign-language videos or interviews into English subtitles.

USE CASE 3

Build accessibility features that caption live audio streams in real time.

USE CASE 4

Extract speech from video files and generate transcripts for documentation.

Tech stack

Python, PyTorch, Transformer, ffmpeg

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires ffmpeg as a system dependency plus a PyTorch installation, which can take a while to set up, particularly if you want GPU support.
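As a quick preflight before the first run, the requirements above can be checked from Python. This is a minimal sketch and not part of Whisper: the function name check_whisper_prerequisites is hypothetical, and it assumes the openai-whisper package is importable as whisper and PyTorch as torch. It only checks that each piece can be found; it does not verify versions or GPU support.

```python
import importlib.util
import shutil


def check_whisper_prerequisites() -> dict:
    """Report which Whisper prerequisites are visible from this environment.

    Hypothetical helper for illustration; not part of the whisper package.
    """
    return {
        # ffmpeg is a system binary, so look for it on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # torch and whisper are Python packages; find_spec locates them
        # without actually importing them.
        "torch": importlib.util.find_spec("torch") is not None,
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for name, found in check_whisper_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING needs to be installed before the whisper command or the Python API will work.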

Use freely for any purpose, including commercial use, as long as you keep the copyright notice and license text.

In plain English

Whisper is a general-purpose speech recognition model from OpenAI. According to the README, it is trained on a large dataset of diverse audio and is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. In everyday terms, you give it an audio file and it gives you back text: either a transcript of the speech in the original language or an English translation of speech in another language.

How it works: a Transformer sequence-to-sequence neural network is trained on several speech tasks at once, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. Those tasks are jointly represented as a sequence of tokens predicted by the decoder, so one model replaces many stages of a traditional speech-processing pipeline. The transcribe method processes audio with a sliding 30-second window.

The README lists six model sizes: tiny, base, small, medium, large, and turbo. Four of them (tiny, base, small, and medium) also come in English-only versions; large and turbo are multilingual only. Sizes range from tiny, with 39M parameters, about 1 GB of VRAM, and roughly 10× the speed of large, up to large, with 1550M parameters and about 10 GB of VRAM. The turbo model is an optimized version of large-v3 that the README says offers faster transcription with minimal accuracy loss but is not trained for translation.

Install it with pip install -U openai-whisper; ffmpeg must also be installed on the system. After installing, you can transcribe at the command line (whisper audio.flac --model turbo), specify a language, or translate non-English speech to English with a multilingual model. You can also call whisper.load_model and model.transcribe from Python, or drop to lower-level helpers for language detection and decoding.

The repository is written in Python, and this summary covers only part of the full README.
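The Python usage described above can be sketched roughly as follows. The calls whisper.load_model, model.transcribe, whisper.load_audio, whisper.pad_or_trim, whisper.log_mel_spectrogram, and model.detect_language follow the README's own examples; the wrapper functions (transcribe_to_text, detect_spoken_language, most_likely_language, transcript_path) are hypothetical names introduced here for illustration. Treat this as a sketch to check against the installed version, not as the definitive API.

```python
from pathlib import Path


def transcript_path(audio_path: str) -> Path:
    """Return the .txt path next to the audio file (hypothetical helper)."""
    return Path(audio_path).with_suffix(".txt")


def most_likely_language(probs: dict) -> str:
    """Pick the language code with the highest probability."""
    return max(probs, key=probs.get)


def transcribe_to_text(audio_path: str, model_name: str = "turbo",
                       task: str = "transcribe") -> str:
    """Transcribe one audio file and return the recognized text.

    Pass task="translate" with a multilingual model other than turbo to get
    English output; the README says turbo is not trained for translation.
    """
    import whisper  # lazy import so the pure helpers work without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, task=task)
    return result["text"]


def detect_spoken_language(audio_path: str, model_name: str = "turbo") -> str:
    """Detect the spoken language from the first 30 seconds of audio."""
    import whisper

    model = whisper.load_model(model_name)
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)  # pad or cut to one 30-second window
    # Assumption: a recent release where log_mel_spectrogram accepts n_mels.
    mel = whisper.log_mel_spectrogram(
        audio, n_mels=model.dims.n_mels
    ).to(model.device)
    _, probs = model.detect_language(mel)
    return most_likely_language(probs)
```

For example, with a placeholder file audio.mp3, calling transcript_path("audio.mp3").write_text(transcribe_to_text("audio.mp3"), encoding="utf-8") would save the transcript next to the audio file. The whisper import sits inside the functions so that the two pure helpers can be used and tested on a machine where the model is not installed.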

Copy-paste prompts

Prompt 1
Show me how to transcribe an MP3 file using Whisper's Python API and save the output as a text file.
Prompt 2
I have a Spanish-language video. How do I use Whisper to translate it to English subtitles?
Prompt 3
What's the smallest Whisper model I can run on a laptop with 4GB of VRAM, and how do I install and use it?
Prompt 4
How do I detect what language is spoken in an audio file using Whisper before transcribing it?
Prompt 5
Set up a command-line script that batch-processes all MP3 files in a folder and transcribes them with the turbo model.

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.