explaingit

kouhxp/yapsnap

167 · Python

TLDR

yapsnap is a command-line tool that turns any video URL or local audio file into a plain text transcript.

In plain English

yapsnap is a command-line tool that turns any video URL or local audio file into a plain text transcript. It runs entirely on your CPU with no GPU and no cloud calls. The first run downloads an 80 MB model; after that, everything works offline and your audio never leaves your machine. The basic usage is one line, for example yapsnap followed by a YouTube URL, which writes a .txt file with the transcription.
Under the hood it chains three pieces. yt-dlp fetches audio from any URL it understands, which covers YouTube, YouTube Shorts, X (formerly Twitter), TikTok, Instagram Reels, and direct media links. ffmpeg decodes the audio to 16 kHz mono PCM and optionally speeds it up without changing pitch using an atempo filter. The default speed factor is 1.5x, which the author says cuts about a third off transcription time with little accuracy loss. Then a streaming Zipformer2 transducer from the Kroko ASR project, in INT8 ONNX format, processes the PCM in chunks via sherpa-onnx and produces text.
Local files in common formats also work, including mp3, mp4, m4a, wav, webm, mov, mkv, aac, opus, ogg, and flac, since anything ffmpeg can decode is acceptable input.
By default the output goes to a transcripts/ folder under the current directory, with a filename derived from the input or video ID. Passing -o sets a custom output path. Passing --timestamps switches the output from one paragraph to one sentence per line with [MM:SS] prefixes, and the timestamps stay in original-audio time even when the audio was sped up before transcription.
Installation is pip install yapsnap from PyPI, plus a system ffmpeg via brew, apt, dnf, winget, or choco depending on the operating system. Two equivalent commands are installed: yapsnap and an alias called transcribe. The whole tool is a single Python module with three dependencies (sherpa-onnx, numpy, yt-dlp). Python 3.9 or newer is required and the license is Apache 2.0.
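
The decode step and the timestamp behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not yapsnap's actual code: the function names build_ffmpeg_cmd and original_timestamp are hypothetical, and the exact ffmpeg arguments the tool passes may differ. The sketch only shows how a 16 kHz mono PCM stream with an atempo speed-up could be requested, and why multiplying a sped-up time by the speed factor recovers original-audio time.

```python
from typing import List


def build_ffmpeg_cmd(source: str, speed: float = 1.5) -> List[str]:
    """Build an ffmpeg argv that writes 16 kHz mono 16-bit PCM to stdout.

    Hypothetical helper: yapsnap's real command line may differ.
    """
    # Kept to 0.5-2.0 here, a range a single atempo filter accepts
    # on both older and newer ffmpeg builds.
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    cmd = ["ffmpeg", "-nostdin", "-loglevel", "error", "-i", source, "-vn"]
    if speed != 1.0:
        # atempo changes tempo without changing pitch.
        cmd += ["-filter:a", f"atempo={speed}"]
    # Mono, 16 kHz, raw signed 16-bit little-endian samples on stdout.
    cmd += ["-ac", "1", "-ar", "16000", "-f", "s16le", "-"]
    return cmd


def original_timestamp(sped_up_seconds: float, speed: float = 1.5) -> str:
    """Map a time in the sped-up stream back to original-audio [MM:SS]."""
    # One second of sped-up audio covers `speed` seconds of the original.
    total = int(sped_up_seconds * speed)
    minutes, seconds = divmod(total, 60)
    return f"[{minutes:02d}:{seconds:02d}]"
```

With the default 1.5x factor, a sentence that the recognizer sees at 60 seconds into the sped-up stream would be labeled [01:30]. The argv from build_ffmpeg_cmd could be run with subprocess and its stdout read as 16-bit integers, for instance through numpy.frombuffer, before being handed to the recognizer in chunks.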
The default model is English, but the same code can transcribe other languages by pointing --model at a different folder or setting the KROKO_MODEL environment variable. Kroko publishes streaming models for Dutch, French, German, Hebrew, Italian, Portuguese, Spanish, Swedish, Swiss German, and Turkish on Hugging Face, and any other sherpa-onnx streaming transducer with the standard encoder, decoder, joiner, and tokens.txt layout also works.
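
As a rough sketch of the model selection just described, the snippet below picks a model folder and checks it for the expected layout. Several things here are assumptions rather than facts from the repo: the helper names are hypothetical, the precedence (the --model value first, then KROKO_MODEL, then a default) is a guess at a sensible order, and the .onnx suffix on the encoder, decoder, and joiner files is assumed from common sherpa-onnx model packaging.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Assumed layout: encoder*.onnx, decoder*.onnx, joiner*.onnx, tokens.txt
REQUIRED_PREFIXES = ("encoder", "decoder", "joiner")


def resolve_model_dir(
    cli_model: Optional[str],
    default_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the model folder. Precedence is an assumption, not confirmed."""
    env = os.environ if env is None else env
    chosen = cli_model or env.get("KROKO_MODEL") or default_dir
    return Path(chosen).expanduser()


def missing_model_parts(model_dir: Path) -> List[str]:
    """List the expected pieces that are absent from model_dir."""
    names = [p.name for p in model_dir.iterdir()] if model_dir.is_dir() else []
    missing = [
        prefix
        for prefix in REQUIRED_PREFIXES
        if not any(n.startswith(prefix) and n.endswith(".onnx") for n in names)
    ]
    if "tokens.txt" not in names:
        missing.append("tokens.txt")
    return missing
```

A check like missing_model_parts makes it easier to tell a wrong folder apart from a genuinely incompatible model before any audio is decoded.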

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.