explaingit

0labs-in/vision-link

4 | TypeScript | Audience · developer | Complexity · 3/5 | Active | Setup · moderate

TLDR

MCP server that extracts frames with ffmpeg and transcribes audio via Gemini, Whisper, or OpenAI so an AI client like Claude Code can watch a video file or YouTube link.

Mindmap

mindmap
  root((vision-link))
    Inputs
      Video files
      YouTube URLs
      MCP client requests
    Outputs
      Extracted frames
      Timestamped transcript
      MCP tool responses
    Use Cases
      Let Claude watch a lecture
      Search on-screen text in a clip
      Summarize a YouTube video
      Offline video review
    Tech Stack
      TypeScript
      Node
      ffmpeg
      Whisper
      MCP

Things people build with this

USE CASE 1

Give Claude Code the ability to watch a local video and answer questions about it

USE CASE 2

Transcribe a YouTube lecture and pull representative frames in one step

USE CASE 3

Run a fully offline video understanding pipeline with local Whisper

USE CASE 4

Adapt frame sampling rate per request between long lectures and short clips
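
As a rough illustration of use case 4, the sketch below spreads a fixed frame budget across a video's duration, so a long lecture is sampled sparsely and a short clip densely. The names `planSampling` and `SamplingPlan`, the 60-frame default budget, and the 1 fps ceiling are all assumptions made for this example; they are not taken from vision-link's code.

```typescript
// Hypothetical sketch: these names and defaults are illustrative,
// not part of vision-link's actual API.
interface SamplingPlan {
  fps: number;       // frames to extract per second of video
  maxFrames: number; // frame budget the plan was computed for
}

function planSampling(durationSeconds: number, maxFrames = 60): SamplingPlan {
  if (!Number.isFinite(durationSeconds) || durationSeconds <= 0) {
    throw new RangeError("durationSeconds must be a positive number");
  }
  // Spread the budget evenly over the video, but never exceed 1 fps,
  // so short clips are not oversampled.
  const fps = Math.min(1, maxFrames / durationSeconds);
  return { fps, maxFrames };
}
```

With these assumed defaults, a 30-second clip is sampled at 1 frame per second, while a one-hour lecture drops to one frame per minute.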

Tech stack

TypeScript · Node · ffmpeg · Whisper · MCP · yt-dlp

Getting it running

Difficulty · moderate | Time to first run · 30 min

Needs Node 20, ffmpeg, and optionally yt-dlp plus a transcription backend choice (Gemini key, local Whisper, or OpenAI).
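
The Node requirement above can be checked before installing anything else. The helper below is a minimal sketch, assuming a version string in the form Node reports through `process.version` (for example `v20.11.0`); `meetsNodeRequirement` is a hypothetical name, not something vision-link provides.

```typescript
// Hypothetical helper: checks a Node version string against a minimum major version.
// Accepts strings like "v20.11.0" or "20.11.0"; anything else is treated as failing.
function meetsNodeRequirement(version: string, minMajor = 20): boolean {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) return false;
  return Number(match[1]) >= minMajor;
}
```

In a Node script this would typically be called as `meetsNodeRequirement(process.version)`. It does not cover ffmpeg or yt-dlp, which need their own checks.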

In plain English

vision-link is a tool that lets an AI assistant watch and understand a video. It is built as an MCP server (Model Context Protocol, the way Claude and a few other AI clients plug into outside tools) and ships with an optional plugin for Claude Code. When you point it at a video file or a YouTube link, it pulls still frames out of the video using ffmpeg and runs the audio through one of three transcription backends. The frames and the timestamped transcript are then handed back to your AI client so the model can actually see and read what is in the clip.

The README is clear that the server is a perception layer only. It gathers the raw evidence, frames and text, and leaves all the interpretation to the AI client on the other end.

Three audio backends are offered: the Gemini API (free tier of 1500 requests a day), local Whisper that runs fully offline through whisper.cpp or the Python openai-whisper package, or the paid OpenAI Whisper API. All three pair with the same ffmpeg-based frame extraction.

Installation in Claude Code is two slash commands, /plugin marketplace add and /plugin install, followed by a /vision-link:setup-video-vision wizard that offers Quick, Advanced, and Custom modes. A /vision-link:doctor command checks that ffmpeg, whisper, and yt-dlp are present, verifies the configuration, and tries to fix common issues automatically. The project also supports Claude Desktop, Cursor IDE, and any generic MCP client.

Once set up, you can either trigger it with a slash command like /vision-link:watch-video path/to/file.mp4 or just mention a video file or YouTube URL in conversation and let the assistant pick it up. The server adapts to the request, for example pulling frames at a low rate for a long lecture or at a higher resolution and narrow time window for a question about on-screen text.

Requirements are Node.js 20 or newer, ffmpeg, and optionally yt-dlp for YouTube URLs. Settings live in a config file under ~/.0labs-vision/. The npm package is at version 1.4.0 with 146 passing tests.
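
To make the frame-extraction step described above more concrete, here is a minimal sketch of how a request could be turned into an ffmpeg argument list. It uses only standard ffmpeg options (`-ss`, `-t`, `-i`, `-vf` with the `fps` and `scale` filters). The `FrameRequest` shape and the `buildFfmpegArgs` name are assumptions made for illustration, and the command vision-link actually runs may look quite different.

```typescript
// Hypothetical sketch: the request shape and function name are illustrative only.
interface FrameRequest {
  input: string;            // path to the video file
  fps: number;              // frames to extract per second of video
  outputPattern: string;    // e.g. "frame_%04d.jpg"
  width?: number;           // optional output width in pixels
  startSeconds?: number;    // optional start of the time window
  durationSeconds?: number; // optional length of the time window
}

function buildFfmpegArgs(req: FrameRequest): string[] {
  const args: string[] = ["-hide_banner", "-y"];
  // -ss and -t placed before -i act as input options, limiting what is read.
  if (req.startSeconds !== undefined) args.push("-ss", String(req.startSeconds));
  if (req.durationSeconds !== undefined) args.push("-t", String(req.durationSeconds));
  args.push("-i", req.input);
  // fps=N sets the sampling rate; scale=W:-2 keeps the aspect ratio with an even height.
  const filters = [`fps=${req.fps}`];
  if (req.width !== undefined) filters.push(`scale=${req.width}:-2`);
  args.push("-vf", filters.join(","), req.outputPattern);
  return args;
}
```

Under these assumptions, a long lecture might use a low `fps` with no time window, while a question about on-screen text might use a higher `fps`, a larger `width`, and a short window set by `startSeconds` and `durationSeconds`. The resulting array can be passed to a process spawner together with the `ffmpeg` executable name.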

Copy-paste prompts

Prompt 1
Install vision-link in Claude Code and walk me through the /vision-link:setup-video-vision Advanced flow
Prompt 2
Add a Cursor IDE preset to vision-link that points at OpenAI Whisper and saves transcripts to a project folder
Prompt 3
Write a Dockerfile for vision-link that bundles ffmpeg, yt-dlp, and whisper.cpp for headless servers
Prompt 4
Extend vision-link with a tool that returns OCR text from extracted frames using tesseract
Prompt 5
Build a CI test that runs /vision-link:doctor in a clean container and fails on missing dependencies

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.