Transcribe a long audio file into text at up to 70 times the speed of the original Whisper model by running it on a TPU.
Batch-process audio files by splitting them into 30-second chunks and processing them in parallel across multiple accelerators.
Translate spoken audio from any supported language into English text by changing a single pipeline parameter.
Generate a timestamped transcript of a recording that marks which passages were spoken at which point in the audio.
Requires installing a compatible version of JAX separately before pip-installing this library; TPU setup is easiest via the provided Kaggle notebook.
Whisper JAX is a faster reimplementation of OpenAI's Whisper speech-to-text model, rewritten in JAX rather than the PyTorch framework used by the original. OpenAI's Whisper converts audio files into text transcripts across many languages; this project takes the same model and makes it run dramatically faster, up to 70 times quicker than the original, particularly on Google's TPU hardware. In practice, 30 minutes of audio can be transcribed in roughly 30 seconds.

The core idea is that JAX compiles the transcription function the first time it runs, then caches the compiled version so that every subsequent call is much faster. The first call on a new pipeline therefore carries a one-time compilation wait; later calls reuse the compiled function.

The library also supports batching, which splits a long audio file into 30-second chunks and processes them in parallel across multiple hardware accelerators. The project reports that this gives roughly a 10x speedup over transcribing the chunks one after another, with less than a 1% reduction in accuracy.

Using the library amounts to loading a pipeline, pointing it at an audio file, and getting text back. The same pipeline can transcribe speech in its original language or translate it into English by changing a single parameter. It can also return timestamps alongside the transcript, marking when each passage of the recording was spoken.

The library works with any of the official Whisper model sizes, from the tiny version with 39 million parameters up to the large model, including the multilingual variants. It runs on CPU, GPU, and TPU, though the largest speed gains come from TPU environments. For users who want more control, the library also exposes lower-level building blocks that match the structure of the Hugging Face Transformers library.

Installing it calls for Python 3.9 (the version the project reports testing against), a compatible version of the JAX package installed separately, and then a pip install from the GitHub repository.
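As a rough illustration of the usage described above, the sketch below loads a pipeline once and reuses it. The class name `FlaxWhisperPipline` (spelled that way), the `task` and `return_timestamps` arguments, the `batch_size` and `dtype` options, and the `"text"` / `"chunks"` output keys are recalled from the project's README and should be checked against the repository. The helper names `task_name`, `load_pipeline`, and `transcribe` are hypothetical and exist only for this example.

```python
def task_name(translate: bool) -> str:
    """Pick the pipeline task: translation into English, or plain transcription."""
    return "translate" if translate else "transcribe"


def load_pipeline(checkpoint: str = "openai/whisper-large-v2", batch_size: int = 16):
    """Build the pipeline once so later calls can reuse the compiled function.

    Assumption: the constructor accepts `dtype` and `batch_size`, as in the README.
    """
    # Imported lazily so this file can be loaded on machines without JAX installed.
    import jax.numpy as jnp
    from whisper_jax import FlaxWhisperPipline

    return FlaxWhisperPipline(checkpoint, dtype=jnp.bfloat16, batch_size=batch_size)


def transcribe(pipeline, path: str, translate: bool = False, timestamps: bool = False):
    """Run the pipeline on one audio file and return (text, chunks_or_None)."""
    outputs = pipeline(path, task=task_name(translate), return_timestamps=timestamps)
    # "chunks" is only expected when timestamps were requested.
    return outputs["text"], outputs.get("chunks")
```

Under these assumptions, the first `transcribe(pipe, "audio.mp3")` call after `pipe = load_pipeline()` pays the compilation cost, and repeated calls on the same `pipe` reuse the cached function. Building a new pipeline for every file would give up that benefit.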
A Kaggle notebook is provided to demonstrate the full setup on a cloud TPU environment.
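The 30-second chunking behind the batching feature can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the 16 kHz sample rate is Whisper's usual input rate, and the optional overlap between neighbouring chunks is an assumption about how chunk boundaries might be smoothed.

```python
from typing import List, Sequence, Tuple


def chunk_bounds(num_samples: int, sample_rate: int = 16_000,
                 chunk_s: float = 30.0, overlap_s: float = 0.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover the audio in fixed-length windows."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and longer than the overlap")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last window reached the end of the audio
            break
        start += step
    return bounds


def make_batches(items: Sequence, batch_size: int) -> List[Sequence]:
    """Group chunks so that each batch can be processed in parallel."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 30 minutes of 16 kHz audio, no overlap.
bounds = chunk_bounds(30 * 60 * 16_000)
groups = make_batches(bounds, batch_size=16)
print(len(bounds), len(groups))  # 60 chunks in 4 batches
```

With 60 chunks and a batch size of 16, the model runs four forward passes instead of sixty sequential ones, which is where the reported batching speedup comes from.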
Verify against the repo before relying on details.