Give Claude Code the ability to watch a local video and answer questions about it
Transcribe a YouTube lecture and pull representative frames in one step
Run a fully offline video understanding pipeline with local Whisper
Adapt frame sampling rate per request between long lectures and short clips
Needs Node.js 20 or newer and ffmpeg, optionally yt-dlp for YouTube URLs, and one transcription backend (Gemini API key, local Whisper, or the OpenAI Whisper API).
vision-link is a tool that lets an AI assistant watch and understand a video. It is built as an MCP server (Model Context Protocol, the way Claude and a few other AI clients plug into outside tools) and ships with an optional plugin for Claude Code. When you point it at a video file or a YouTube link, it extracts still frames with ffmpeg and runs the audio through one of three transcription backends. The frames and the timestamped transcript are then handed back to your AI client so the model can see and read what is in the clip. The README is clear that the server is a perception layer only: it gathers the raw evidence (frames and text) and leaves all interpretation to the AI client on the other end.

Three audio backends are offered: the Gemini API (free tier of 1500 requests a day), local Whisper running fully offline through whisper.cpp or the Python openai-whisper package, and the paid OpenAI Whisper API. All three pair with the same ffmpeg-based frame extraction.

Installation in Claude Code takes two slash commands, /plugin marketplace add and /plugin install, followed by a /vision-link:setup-video-vision wizard that offers Quick, Advanced, and Custom modes. A /vision-link:doctor command checks that ffmpeg, whisper, and yt-dlp are present, verifies the configuration, and tries to fix common issues automatically. The project also supports Claude Desktop, Cursor IDE, and any generic MCP client.

Once set up, you can either trigger it with a slash command such as /vision-link:watch-video path/to/file.mp4 or mention a video file or YouTube URL in conversation and let the assistant pick it up. The server adapts to the request, for example pulling frames at a low rate for a long lecture, or at a higher resolution over a narrow time window for a question about on-screen text. Requirements are Node.js 20 or newer, ffmpeg, and optionally yt-dlp for YouTube URLs. Settings live in a config file under ~/.0labs-vision/.
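
To make the per-request adaptation concrete (a low frame rate for a long lecture, a higher resolution and narrow window for on-screen text), here is a minimal TypeScript sketch. It is not vision-link's actual code: the names planSampling and ffmpegArgs, the 30-frame budget, the 2 fps cap, and the 640-pixel default width are all assumptions made for illustration. Only the ffmpeg options used (-ss, -t, -i, -vf with the fps and scale filters, -q:v) are real ffmpeg features.

```typescript
// Hypothetical sketch: these types and functions are illustrative,
// not part of vision-link's API.
interface SamplingRequest {
  durationSec: number; // total length of the video
  startSec?: number;   // optional start of the window of interest
  endSec?: number;     // optional end of the window of interest
  maxFrames?: number;  // frame budget (assumed default: 30)
  width?: number;      // output width in pixels (assumed default: 640)
}

interface SamplingPlan {
  fps: number;
  startSec: number;
  windowSec: number;
  width: number;
}

// Spread a fixed frame budget over the requested window, so a long
// lecture gets sparse frames and a short window gets dense ones.
function planSampling(req: SamplingRequest): SamplingPlan {
  const startSec = Math.max(0, req.startSec ?? 0);
  const endSec = Math.min(req.durationSec, req.endSec ?? req.durationSec);
  const windowSec = Math.max(1, endSec - startSec);
  const maxFrames = req.maxFrames ?? 30;
  // Cap at 2 fps so short windows do not produce near-duplicate frames.
  const fps = Math.min(2, maxFrames / windowSec);
  return { fps, startSec, windowSec, width: req.width ?? 640 };
}

// Build an ffmpeg argument list that extracts JPEG frames for a plan.
function ffmpegArgs(input: string, outPattern: string, plan: SamplingPlan): string[] {
  return [
    "-ss", String(plan.startSec),  // seek to the window start
    "-t", String(plan.windowSec),  // read only the window's duration
    "-i", input,
    // fps sets the sampling rate; scale=W:-2 keeps the aspect ratio
    // while forcing an even height.
    "-vf", `fps=${plan.fps.toFixed(4)},scale=${plan.width}:-2`,
    "-q:v", "3",                   // JPEG quality (lower is better)
    outPattern,
  ];
}

// A one-hour lecture: sparse sampling at the default width.
const lecturePlan = planSampling({ durationSec: 3600 });
// A 10-second window for reading on-screen text: dense, higher resolution.
const textPlan = planSampling({ durationSec: 600, startSec: 120, endSec: 130, width: 1280 });
console.log(ffmpegArgs("lecture.mp4", "frame_%03d.jpg", lecturePlan).join(" "));
console.log(ffmpegArgs("clip.mp4", "frame_%03d.jpg", textPlan).join(" "));
```

The argument array is meant to be passed to a process launcher such as Node's child_process.spawn("ffmpeg", args). Keeping the planning step separate from the ffmpeg invocation makes the sampling policy easy to test without touching any video file.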
The npm package is at version 1.4.0 with 146 passing tests.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.