0labs-in/vision-link

Analysis updated 2026-06-24

★ 4 | TypeScript | Audience · developer | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((vision-link))
    Inputs
      Video files
      YouTube URLs
      MCP client requests
    Outputs
      Extracted frames
      Timestamped transcript
      MCP tool responses
    Use Cases
      Let Claude watch a lecture
      Search on-screen text in a clip
      Summarize a YouTube video
      Offline video review
    Tech Stack
      TypeScript
      Node
      ffmpeg
      Whisper
      MCP

What do people build with it?

USE CASE 1

Give Claude Code the ability to watch a local video and answer questions about it

USE CASE 2

Transcribe a YouTube lecture and pull representative frames in one step

USE CASE 3

Run a fully offline video understanding pipeline with local Whisper

USE CASE 4

Adapt frame sampling rate per request between long lectures and short clips

What is it built with?

TypeScript, Node, ffmpeg, Whisper, MCP, yt-dlp

How does it compare?

	0labs-in/vision-link	arviahq/arvia	ashchanance/3d-companion-animation
Stars	4	4	4
Language	TypeScript	TypeScript	TypeScript
Setup difficulty	moderate	moderate	moderate
Complexity	3/5	3/5	3/5
Audience	developer	developer	vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Needs Node.js 20 or newer, ffmpeg, and optionally yt-dlp (for YouTube URLs), plus a choice of transcription backend: a Gemini API key, local Whisper, or the OpenAI Whisper API.
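The requirements above can be expressed as a small preflight check. This is a minimal sketch and not vision-link's own code: the function name missingRequirements, the Environment shape, and the backend labels are hypothetical, and the caller is assumed to have already probed which executables are on PATH.

```typescript
// Hypothetical preflight check; names and shapes are illustrative,
// not part of vision-link's actual API.
type Backend = "gemini" | "local-whisper" | "openai";

interface Environment {
  nodeVersion: string;      // e.g. process.version, such as "v20.11.0"
  executables: Set<string>; // binaries the caller found on PATH
  needsYouTube: boolean;    // yt-dlp is only needed for YouTube URLs
  backend: Backend;         // chosen transcription backend
  apiKeyPresent: boolean;   // whether a key for a hosted backend is configured
}

// Returns a human-readable list of unmet requirements (empty when all are met).
function missingRequirements(env: Environment): string[] {
  const missing: string[] = [];

  // Parse the major version out of strings like "v20.11.0" or "20.11.0".
  const major = Number.parseInt(env.nodeVersion.replace(/^v/, "").split(".")[0], 10);
  if (!(major >= 20)) missing.push("Node.js 20 or newer");

  if (!env.executables.has("ffmpeg")) missing.push("ffmpeg");
  if (env.needsYouTube && !env.executables.has("yt-dlp")) missing.push("yt-dlp");

  // Hosted backends need a key; local Whisper runs offline and does not.
  if (env.backend !== "local-whisper" && !env.apiKeyPresent) {
    missing.push(`API key for ${env.backend}`);
  }
  return missing;
}
```

The check is kept as a pure function so it can be exercised without touching the real system. The project's own /vision-link:doctor command is the authoritative way to verify an installation.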

In plain English

vision-link is a tool that lets an AI assistant watch and understand a video. It is built as an MCP server (Model Context Protocol, the way Claude and a few other AI clients plug into outside tools) and ships with an optional plugin for Claude Code. When you point it at a video file or a YouTube link, it pulls still frames out of the video using ffmpeg and runs the audio through one of three transcription backends. The frames and the timestamped transcript are then handed back to your AI client so the model can actually see and read what is in the clip.

The README is clear that the server is a perception layer only. It gathers the raw evidence, frames and text, and leaves all the interpretation to the AI client on the other end.

Three audio backends are offered: the Gemini API (free tier of 1,500 requests a day), local Whisper that runs fully offline through whisper.cpp or the Python openai-whisper package, or the paid OpenAI Whisper API. All three pair with the same ffmpeg-based frame extraction.

Installation in Claude Code is two slash commands, /plugin marketplace add and /plugin install, followed by a /vision-link:setup-video-vision wizard that offers Quick, Advanced, and Custom modes. A /vision-link:doctor command checks that ffmpeg, whisper, and yt-dlp are present, verifies the configuration, and tries to fix common issues automatically. The project also supports Claude Desktop, Cursor IDE, and any generic MCP client.

Once set up, you can either trigger it with a slash command like /vision-link:watch-video path/to/file.mp4 or just mention a video file or YouTube URL in conversation and let the assistant pick it up. The server adapts to the request: for example, it pulls frames at a low rate for a long lecture, or at a higher resolution within a narrow time window for a question about on-screen text.

Requirements are Node.js 20 or newer, ffmpeg, and optionally yt-dlp for YouTube URLs. Settings live in a config file under ~/.0labs-vision/. The npm package is at version 1.4.0 with 146 passing tests.
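The per-request adaptation described above (a low frame rate for a long lecture, a higher resolution and narrow window for on-screen text) can be illustrated by building ffmpeg arguments from a frame budget. This is a sketch under stated assumptions, not the project's implementation: FrameRequest, buildFfmpegArgs, and all of their fields are hypothetical names. Only the ffmpeg options themselves (-ss, -t, -i, -vf with the fps and scale filters) are real.

```typescript
// Hypothetical request shape; vision-link's real tool parameters may differ.
interface FrameRequest {
  input: string;       // path to a local video file
  durationSec: number; // total clip length, e.g. obtained from ffprobe
  maxFrames: number;   // frame budget for the whole request
  startSec?: number;   // optional window start (defaults to 0)
  endSec?: number;     // optional window end (defaults to the full duration)
  width?: number;      // optional output width in pixels
}

// Builds an argument list for ffmpeg that spreads the frame budget
// evenly across the requested time window.
function buildFfmpegArgs(req: FrameRequest, outPattern: string): string[] {
  const start = req.startSec ?? 0;
  const end = req.endSec ?? req.durationSec;
  if (!(end > start)) throw new RangeError("endSec must be greater than startSec");

  const span = end - start;
  // Long window => low fps; short window => high fps. Rounded to 4 decimals.
  const fps = Number((req.maxFrames / span).toFixed(4));

  const filters = [`fps=${fps}`];
  // scale=W:-2 keeps the aspect ratio and forces an even height.
  if (req.width !== undefined) filters.push(`scale=${req.width}:-2`);

  return [
    "-ss", String(start), // seek to the window start (input option)
    "-t", String(span),   // read only the window's duration
    "-i", req.input,
    "-vf", filters.join(","),
    outPattern,           // e.g. "frames/frame_%04d.jpg"
  ];
}
```

A one-hour lecture with a budget of 60 frames yields roughly one frame per minute, while a 10-second window with a budget of 20 frames yields two frames per second. The resulting array could be passed to Node's child_process.spawn("ffmpeg", args).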

Copy-paste prompts

Prompt 1

Install vision-link in Claude Code and walk me through the /vision-link:setup-video-vision Advanced flow

Prompt 2

Add a Cursor IDE preset to vision-link that points at OpenAI Whisper and saves transcripts to a project folder

Prompt 3

Write a Dockerfile for vision-link that bundles ffmpeg, yt-dlp, and whisper.cpp for headless servers

Prompt 4

Extend vision-link with a tool that returns OCR text from extracted frames using tesseract

Prompt 5

Build a CI test that runs /vision-link:doctor in a clean container and fails on missing dependencies

Frequently asked questions

What is vision-link?

vision-link is an MCP server that extracts frames with ffmpeg and transcribes audio via the Gemini API, local Whisper, or the OpenAI Whisper API, so an AI client like Claude Code can watch a video file or YouTube link.

What language is vision-link written in?

Mainly TypeScript. The stack also includes Node, ffmpeg, Whisper, MCP, and yt-dlp.

How hard is vision-link to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is vision-link for?

Mainly developers.

Verify against the repo before relying on details.