Analysis updated 2026-05-18
Extract keyframes from a product demo video and paste them into Claude to ask which features were demonstrated.
Transcribe a lecture video and save the text to a notes folder for later reference without sending the video to any cloud service.
Run the tool with `--why` to pull frames from a competitor's announcement video focused on finding their pricing strategy.
Install it as a Claude Code skill so Claude can automatically process any video URL you paste into your coding session.
| | huangchihhungleo/claude-real-video | bytedance/lance | sapientinc/hrm-text |
|---|---|---|---|
| Stars | 637 | 637 | 617 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | hard | hard |
| Complexity | 2/5 | 5/5 | 5/5 |
| Audience | developer | researcher | researcher |
Figures from each repo's GitHub metadata at analysis time.
Requires ffmpeg installed separately via your system package manager before the pip package will work.
Claude Real Video is a Python command-line tool that extracts the meaningful frames from a video and transcribes its audio, so you can hand that material to an AI assistant and ask questions about what is actually in the video. The problem it solves is that most AI tools cannot genuinely watch a video. When you paste a YouTube link into a chatbot, it usually reads the transcript rather than seeing the images. This tool does the visual processing on your own computer and gives you files you can then share with whatever AI you choose.
The key difference from simpler approaches is how it selects frames. A naive method grabs one frame per second, which wastes context on repetitive shots from a static screencast and misses important moments in a fast-cut video. This tool detects scene changes instead, pulling a frame whenever the image meaningfully shifts. It also compares each candidate frame against the recent ones already kept and discards near-duplicates, so a shot that appears multiple times only gets included once. A 58-second clip that naive sampling would represent with 58 frames might reduce to 26 meaningfully distinct ones.
Beyond frames, it optionally runs Whisper, a speech recognition tool, to produce a text transcript of the audio. If the video already has subtitle files attached, it uses those instead, which is faster and more accurate. You can also save the full audio track so a model that can process audio directly gets the actual sound rather than just the words.
The output is a folder of image files, an optional transcript, and a summary file that an AI assistant can read to understand the material. You point a tool like Claude or ChatGPT at that folder and ask your questions from there. A `--why` flag lets you state your reason for watching, such as finding the pricing strategy in a product demo, so the summary focuses on what matters to you rather than producing a generic description.
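The frame selection described above (keep a frame on a scene change, then drop it if it closely matches a recently kept frame) can be illustrated with a small sketch. This is not the tool's actual code: the function names, the mean-absolute-difference metric, the thresholds, and the window size are all assumptions chosen for illustration, and frames are modeled as flat lists of grayscale pixel values rather than decoded video.

```python
from typing import List, Optional, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, 0-255 (illustrative model)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(
    frames: Sequence[Frame],
    scene_threshold: float = 30.0,  # hypothetical: how big a jump counts as a scene change
    dup_threshold: float = 10.0,    # hypothetical: how close counts as a near-duplicate
    window: int = 5,                # hypothetical: how many recent keyframes to compare against
) -> List[int]:
    """Return indices of frames kept as keyframes."""
    kept: List[int] = []
    previous: Optional[Frame] = None
    for index, frame in enumerate(frames):
        # Step 1: a candidate is the first frame or one that differs
        # sharply from the frame immediately before it.
        is_scene_change = (
            previous is None or mean_abs_diff(frame, previous) > scene_threshold
        )
        previous = frame
        if not is_scene_change:
            continue
        # Step 2: discard the candidate if it nearly matches any of the
        # most recently kept keyframes (a shot that reappears).
        recent = kept[-window:]
        if any(mean_abs_diff(frame, frames[k]) < dup_threshold for k in recent):
            continue
        kept.append(index)
    return kept
```

With seven synthetic frames in the pattern A, A, A, B, B, A, C, the sketch keeps only the first A, the first B, and C: the return to A is detected as a scene change but rejected as a near-duplicate of a keyframe already kept.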
Installation requires Python 3.10 or later plus ffmpeg installed separately. The tool works from a YouTube or other public video URL or from a local file. It runs on macOS, Windows, and Linux, and is MIT licensed.
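Before installing, it can help to confirm both prerequisites are in place. The following is a minimal sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper written for this example and is not part of the tool.

```python
import shutil
import sys
from typing import List, Tuple

MIN_PYTHON: Tuple[int, int] = (3, 10)  # minimum version stated above


def check_prerequisites(min_python: Tuple[int, int] = MIN_PYTHON) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems: List[str] = []
    if sys.version_info[:2] < min_python:
        found = ".".join(str(part) for part in sys.version_info[:2])
        needed = ".".join(str(part) for part in min_python)
        problems.append(f"Python {needed} or later is required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append(
            "ffmpeg not found on PATH; install it with your system package manager"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Running the script prints one line per missing prerequisite and nothing if the environment looks ready.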
A local Python tool that extracts scene-change keyframes and audio transcripts from a video so you can paste the results into Claude, ChatGPT, or any AI to ask questions about what the video shows.
Mainly Python. The stack also includes ffmpeg and yt-dlp.
MIT license: use, modify, and distribute freely for any purpose, including commercial use, as long as you keep the copyright notice.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly developers.
Verify against the repo before relying on details.