Analysis updated 2026-06-24
Generate plain text transcripts of YouTube, TikTok, or Instagram Reels videos from the command line
Batch transcribe local podcast or meeting recordings offline without sending audio to a cloud service
Produce timestamped subtitle drafts for editing in a video tool
| | kouhxp/yapsnap | helpmeeadice/bandori-pet-rev | hkust-c4g/domainshuttle |
|---|---|---|---|
| Stars | 167 | 156 | 156 |
| Language | Python | Python | Python |
| Setup difficulty | easy | moderate | hard |
| Complexity | 2/5 | 3/5 | 4/5 |
| Audience | developer | general | researcher |
Figures from each repo's GitHub metadata at analysis time.
Needs a separately installed system ffmpeg and a one-time 80 MB model download on first run.
yapsnap is a command-line tool that turns a video URL or local audio file into a plain text transcript. It runs entirely on the CPU, with no GPU and no cloud calls. The first run downloads an 80 MB model; after that, everything works offline and your audio never leaves your machine. Basic usage is one line: yapsnap followed by a YouTube URL writes a .txt file containing the transcript.
Under the hood it chains three pieces. yt-dlp fetches audio from any URL it understands, which covers YouTube, YouTube Shorts, X (formerly Twitter), TikTok, Instagram Reels, and direct media links. ffmpeg decodes the audio to 16 kHz mono PCM and can speed it up without changing pitch using an atempo filter. The default speed factor is 1.5x, which the author says cuts about a third off transcription time with little accuracy loss. Finally, a streaming Zipformer2 transducer from the Kroko ASR project, in INT8 ONNX format, processes the PCM in chunks via sherpa-onnx and produces text.
Local files in common formats also work, including mp3, mp4, m4a, wav, webm, mov, mkv, aac, opus, ogg, and flac, since anything ffmpeg can decode is acceptable input.
By default the output goes to a transcripts/ folder under the current directory, with a filename derived from the input or video ID. Passing -o sets a custom output path. Passing --timestamps switches the output from a single paragraph to one sentence per line with [MM:SS] prefixes, and the timestamps stay in original-audio time even when the audio was sped up before transcription.
Installation is pip install yapsnap from PyPI, plus a system ffmpeg installed via brew, apt, dnf, winget, or choco depending on the operating system. Two equivalent commands are installed: yapsnap and an alias called transcribe. The whole tool is a single Python module with three dependencies (sherpa-onnx, numpy, yt-dlp). Python 3.9 or newer is required and the license is Apache 2.0.
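The decode step and the timestamp behavior described above can be sketched in a few lines. This is a minimal illustration, not yapsnap's actual code: the function names build_ffmpeg_cmd, to_original_time, and format_stamp are hypothetical, and it is an assumption that the tool passes exactly these flags. The ffmpeg options themselves (-vn, -ac, -ar, -f s16le, -filter:a atempo=...) are standard ffmpeg options.

```python
from typing import List

SPEED = 1.5  # default speed factor described above


def build_ffmpeg_cmd(src: str, speed: float = SPEED) -> List[str]:
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM to stdout.

    Hypothetical helper: the exact flags yapsnap uses are an assumption.
    """
    cmd = ["ffmpeg", "-nostdin", "-loglevel", "error", "-i", src, "-vn"]
    if speed != 1.0:
        # atempo changes tempo without changing pitch
        cmd += ["-filter:a", f"atempo={speed}"]
    # mono, 16 kHz, raw signed 16-bit little-endian samples on stdout
    cmd += ["-ac", "1", "-ar", "16000", "-f", "s16le", "-"]
    return cmd


def to_original_time(sped_up_seconds: float, speed: float = SPEED) -> float:
    """Map a position in the sped-up audio back to original-audio time."""
    return sped_up_seconds * speed


def format_stamp(seconds: float) -> str:
    """Render seconds as an [MM:SS] prefix."""
    total = int(seconds)
    return f"[{total // 60:02d}:{total % 60:02d}]"
```

With a 1.5x speed factor, a sentence the recognizer hears at 40 seconds into the sped-up stream began at 60 seconds in the original recording, so format_stamp(to_original_time(40.0)) gives [01:00]. The same ratio explains the quoted saving: audio played at 1.5x lasts two thirds as long, which is about a third less to transcribe.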
The default model is English, but the same code can transcribe other languages by pointing --model at a different folder or setting the KROKO_MODEL environment variable. Kroko publishes streaming models for Dutch, French, German, Hebrew, Italian, Portuguese, Spanish, Swedish, Swiss German, and Turkish on Hugging Face, and any other sherpa-onnx streaming transducer with the standard encoder, decoder, joiner, and tokens.txt layout also works.
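The model selection described above can be sketched as follows. This is a hedged illustration rather than the tool's real logic: resolve_model_dir and missing_parts are hypothetical helpers, the precedence (--model value first, then KROKO_MODEL, then a default folder) is an assumption, and so is matching the encoder, decoder, and joiner by file name with an .onnx suffix.

```python
import os
from pathlib import Path
from typing import List, Optional


def resolve_model_dir(cli_model: Optional[str], default_dir: str) -> Path:
    """Pick the model folder.

    Assumed precedence: --model value, then KROKO_MODEL, then the default.
    """
    chosen = cli_model or os.environ.get("KROKO_MODEL") or default_dir
    return Path(chosen).expanduser()


def missing_parts(model_dir: Path) -> List[str]:
    """Return which of the four expected pieces are absent from the folder."""
    names = [p.name for p in model_dir.iterdir()] if model_dir.is_dir() else []
    missing = []
    for part in ("encoder", "decoder", "joiner"):
        # Assumed naming: the part name appears in an .onnx file name
        if not any(part in n and n.endswith(".onnx") for n in names):
            missing.append(part)
    if "tokens.txt" not in names:
        missing.append("tokens.txt")
    return missing
```

A check like missing_parts is a cheap way to fail early with a readable message when a downloaded model folder is incomplete, before any audio is fetched or decoded.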
Command-line tool that transcribes any video URL or local audio file to text offline on CPU, using yt-dlp, ffmpeg, and a small streaming Zipformer2 ONNX model.
Mainly Python. The stack also includes ffmpeg and yt-dlp.
Apache 2.0 permits commercial and personal use with attribution and a patent grant.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Aimed mainly at developers.
Verify against the repo before relying on details.