explaingit

openai/whisper

Analysis updated 2026-06-20

Stars · 99,006
Language · Python
Audience · developer
Complexity · 3/5
Setup · moderate

TLDR

Whisper is OpenAI's open-source speech recognition model. It converts audio files to text, supports 99+ languages, and can translate speech directly into English. It installs with a single pip command.

Mindmap

mindmap
  root((whisper))
    What it does
      Speech to text
      Translation
      Language detection
    Tech stack
      Python
      PyTorch
      ffmpeg
    Model sizes
      Tiny and base
      Small and medium
      Large and turbo
    Use cases
      Transcribe audio
      Add subtitles
      Translate speech

What do people build with it?

USE CASE 1

Transcribe a recorded meeting, podcast, or lecture audio file to text without a paid cloud transcription service.

USE CASE 2

Add subtitles to a video by feeding the audio to Whisper and formatting the timestamped output as SRT.

USE CASE 3

Build a multilingual speech-to-text feature in your app that handles dozens of languages with a single model.

USE CASE 4

Translate spoken foreign-language audio directly into English text without a separate translation step.
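The subtitle use case can be sketched in a few lines. The helper names below (`format_srt_timestamp`, `segments_to_srt`) are hypothetical and not part of Whisper. The sketch assumes each segment is a dict with `start`, `end` (seconds), and `text` keys, which is the shape of the `segments` list in the result of the library's `transcribe` call. The command-line tool is understood to be able to write SRT on its own via an output-format option, so treat this as an illustration of the format rather than a required step.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render an iterable of {'start', 'end', 'text'} dicts as SRT text.

    Assumption: the segments follow the shape of result["segments"]
    returned by Whisper's transcribe call.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        text = seg["text"].strip()
        # Each SRT cue: running number, time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

With a real transcription, `segments_to_srt(result["segments"])` would produce text that can be written to a `.srt` file next to the video.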

What is it built with?

Python, PyTorch, ffmpeg

How does it compare?

Repo              openai/whisper   pytorch/pytorch   fastapi/fastapi
Stars             99,006           99,692            97,946
Language          Python           Python            Python
Setup difficulty  moderate         moderate          easy
Complexity        3/5              4/5               2/5
Audience          developer        researcher        developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate
Time to first run · 30 min

Requires ffmpeg installed on your system. The larger models need a GPU with enough VRAM for the chosen model size; the tiny and base models can run on CPU.
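Matching a model size to the available VRAM can be sketched as a small lookup. The function name `pick_model` is hypothetical, and the per-model figures are approximate values taken from the table in the Whisper README at the time of writing; treat them as assumptions and check the repo before relying on them. The rule "largest model that fits" is a simplification, since larger models are also slower.

```python
from typing import Optional

# Approximate VRAM needs in GB per model size (assumption: values as listed
# in the Whisper README; verify against the repo).
APPROX_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large": 10,
}


def pick_model(vram_gb: float, need_translation: bool = False) -> Optional[str]:
    """Return the most VRAM-hungry model that still fits, or None if none fit.

    turbo is skipped when translation is needed, because it is described
    as not trained for translation.
    """
    best = None
    # Walk the sizes from smallest to largest VRAM need and keep the last fit.
    for name, needed in sorted(APPROX_VRAM_GB.items(), key=lambda kv: kv[1]):
        if needed > vram_gb:
            continue
        if need_translation and name == "turbo":
            continue
        best = name
    return best
```

For example, with 4 GB of VRAM this sketch settles on the small model, since medium is listed at roughly 5 GB.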

In plain English

Whisper is OpenAI's speech recognition system. The README describes it as a general-purpose speech recognition model trained on a large dataset of diverse audio, and as a multitasking model that can handle multilingual speech recognition, speech translation, and language identification. In everyday terms, you hand it a sound file and it gives you back the text of what was said, either as a transcript in the original language or translated into English.

Under the hood it is a Transformer sequence-to-sequence model: the audio is turned into a numerical representation and the model predicts the corresponding text. Several different speech tasks are encoded as a single sequence of tokens, with special control tokens marking which task is being asked. That is how one model can replace what used to be several separate stages of a speech pipeline. When transcribing, the audio is processed with a sliding 30-second window.

Whisper comes in six model sizes (tiny, base, small, medium, large, and turbo), ranging from 39 million up to 1.55 billion parameters and from roughly 1 GB up to 10 GB of required VRAM, with corresponding tradeoffs between speed and accuracy. Four of the sizes also have English-only versions that tend to perform better on English. The turbo model is described as an optimized version of large-v3, faster than large but not trained for translation.

You install it as a Python package with pip install -U openai-whisper. It depends on PyTorch and on the ffmpeg command-line tool to read audio files. There is a whisper command for transcribing from the shell and a small Python library for using the model from code.
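Using the Python library from code can be sketched as below. `whisper.load_model` and `model.transcribe` are the library's documented entry points; the wrapper names `build_options` and `transcribe_file` are hypothetical, and the guard that rejects turbo for translation is a design choice of this sketch, not something the library is known to enforce. The import is deferred so that the helpers can be loaded without the package installed.

```python
def build_options(translate: bool = False, language=None) -> dict:
    """Build keyword options for transcribe: the task and an optional language."""
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        # Assumption: a language code such as "fr"; if omitted, Whisper
        # detects the spoken language itself.
        options["language"] = language
    return options


def transcribe_file(path, model_name="turbo", translate=False, language=None) -> str:
    """Return the text of an audio file, optionally translated into English."""
    if translate and model_name == "turbo":
        # Guard added by this sketch: turbo is described as not trained
        # for translation, so ask for another size instead.
        raise ValueError("turbo is not trained for translation; try medium or large")
    import whisper  # deferred: needs openai-whisper, PyTorch, and ffmpeg installed

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **build_options(translate, language))
    return result["text"]
```

A call such as `transcribe_file("talk.mp3", model_name="medium", translate=True, language="fr")` would then return English text for a French recording, where `talk.mp3` is a placeholder file name.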

Copy-paste prompts

Prompt 1
Using the openai-whisper Python library, transcribe an MP3 file to text and save the output with timestamps in SRT subtitle format.
Prompt 2
How do I pick the right Whisper model size for my use case? I need good accuracy but only have 4 GB of VRAM.
Prompt 3
Write a Python script that uses Whisper to transcribe every audio file in a folder and save each transcript as a .txt file next to the original.
Prompt 4
How do I use Whisper's translation mode to convert a French-language audio file into English text?
Prompt 5
Show me how to run Whisper from the command line and get word-level timestamps in the output JSON.

Frequently asked questions

What is whisper?

Whisper is OpenAI's open-source speech recognition model. It converts audio files to text, supports 99+ languages, and can translate speech directly into English. It installs with a single pip command.

What language is whisper written in?

Mainly Python. The stack also includes PyTorch and ffmpeg.

How hard is whisper to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is whisper for?

Mainly developers.

Verify against the repo before relying on details.