explaingit

lxfater/video-to-doc-stepfun

Analysis updated 2026-05-18

Stars · 8 | Language · Python | Audience · developer | Complexity · 3/5 | License · MIT | Setup · moderate

TLDR

Converts tutorial or screen-recording videos into step-by-step guides with screenshots, Markdown, and PDF output using Whisper transcription and an AI model to identify actions.

Mindmap

mindmap
  root((video-to-doc))
    What it does
      Transcribe audio with Whisper
      Identify steps from subtitles
      Extract screenshots via ffmpeg
      Generate Markdown and PDF guide
    Tech Stack
      Python
      FastAPI
      React
      Whisper
      ffmpeg
      StepFun API
    Use Cases
      Software tutorial documentation
      Onboarding screen recordings
      How-to guides from demos
    Audience
      Content creators
      Technical writers
      PMs and founders

What do people build with it?

USE CASE 1

Turn a screen-recording of a software walkthrough into a written how-to guide with embedded screenshots.

USE CASE 2

Generate onboarding documentation from a recorded demo without manually writing each step.

USE CASE 3

Create a PDF tutorial from a training video to share with non-technical teammates.

What is it built with?

Python, FastAPI, React, Whisper, ffmpeg, StepFun API

How does it compare?

| | lxfater/video-to-doc-stepfun | adam-s/car-diagnosis | bongobongo2020/krea2-character-lora-trainer |
| --- | --- | --- | --- |
| Stars | 8 | 8 | 8 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | moderate |
| Complexity | 3/5 | 3/5 | 3/5 |
| Audience | developer | researcher | vibe coder |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Requires ffmpeg installed on the system and a StepFun platform API key.
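Both prerequisites can be checked before a first run. The snippet below is a minimal sketch, not code from the repository: the function name `check_prerequisites` and the environment variable name `STEPFUN_API_KEY` are assumptions made for illustration, so consult the repo's example environment file for the actual key name.

```python
import os
import shutil


def check_prerequisites(env=None, key_name="STEPFUN_API_KEY", which=shutil.which):
    """Return a list of problems; an empty list means both prerequisites are met.

    key_name is a hypothetical variable name, not confirmed by the repo.
    """
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # Treat a missing or whitespace-only value as "not set".
    if not env.get(key_name, "").strip():
        problems.append(f"{key_name} is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.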

MIT license: use freely for any purpose, including commercial projects, as long as you keep the copyright notice.

In plain English

This tool takes a screen-recording or tutorial video and automatically converts it into a step-by-step written guide with screenshots embedded. The output is a Markdown document and a PDF, both structured around the actions visible in the video.

The default workflow uses subtitles to drive the process. A local Whisper model listens to the audio and produces a transcript with timestamps. An AI model from StepFun (step-3.7-flash) reads those subtitles, figures out which actions are happening and when, and picks the best frame to capture as a screenshot for each step. The tool then uses ffmpeg to pull those frames out of the video. For steps the AI is less sure about, it sends the screenshot back to the AI for a second look and a corrected description.

You can also turn on a mode where the full video is uploaded to the AI and it watches the screen directly, rather than working from audio alone. That approach costs more in API calls but can catch steps that have no spoken narration. A web-search option is available to supplement the final document with additional context from the internet.

The project includes a small web application so you can do everything through a browser. You upload a video, watch the processing stages in real time, then edit the draft document in a split-screen editor (Markdown on the left, preview on the right) before exporting the PDF. The backend is FastAPI, the frontend is React, and they communicate via a server-sent-events stream for live progress updates.

Setup requires Python, ffmpeg installed on your system, and an API key from the StepFun platform. The configuration is minimal: copy the example environment file, paste in your key, and run the script. The README is in Chinese, but the code and configuration options are straightforward. All outputs land in a folder named after the original video file.
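The frame-extraction step described above can be illustrated with a short sketch. It assumes that one screenshot is taken per step at a timestamp chosen by the model; the function names `build_frame_command` and `extract_frame` are hypothetical, and the repository's actual ffmpeg invocation may differ. Only standard ffmpeg options are used (`-ss` to seek, `-frames:v 1` to write a single frame, `-y` to overwrite).

```python
import subprocess
from pathlib import Path


def build_frame_command(video_path, timestamp_s, out_path):
    """Build an ffmpeg command that saves one frame at timestamp_s seconds."""
    if timestamp_s < 0:
        raise ValueError("timestamp must be non-negative")
    return [
        "ffmpeg",
        "-y",                         # overwrite an existing output file
        "-ss", f"{timestamp_s:.3f}",  # seek before the input is opened
        "-i", str(video_path),
        "-frames:v", "1",             # write exactly one video frame
        str(out_path),
    ]


def extract_frame(video_path, timestamp_s, out_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_frame_command(video_path, timestamp_s, out_path),
        check=True,
        capture_output=True,
    )
    return Path(out_path)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(build_frame_command("demo.mp4", 12.5, "demo/step_01.png")))
```

Keeping the command builder separate from the subprocess call makes the argument list easy to inspect without ffmpeg installed.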

Copy-paste prompts

Prompt 1
Using video-to-doc-stepfun, convert my screen-recording tutorial.mp4 into a step-by-step Markdown guide with screenshots for each action.
Prompt 2
I have a .srt subtitle file for my video. Use it with video-to-doc-stepfun (--srt_path) to skip Whisper transcription and go straight to generating the operation document.
Prompt 3
Set up the FastAPI + React web interface for video-to-doc-stepfun so I can upload videos and edit the output document in the browser before exporting to PDF.
Prompt 4
Explain what the --use_video flag does in video-to-doc-stepfun and when I should use it instead of the default subtitle-driven mode.
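Prompt 2 relies on an existing .srt file supplying the timestamped subtitles that Whisper would otherwise produce. As a rough illustration of what that input contains, here is a minimal SRT parser sketch. The names `Cue` and `parse_srt` are hypothetical and this is not the repository's parser; it handles only the common `HH:MM:SS,mmm --> HH:MM:SS,mmm` cue layout.

```python
import re
from typing import NamedTuple


class Cue(NamedTuple):
    start_s: float
    end_s: float
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' to seconds as a float."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content):
    """Parse SRT text into a list of Cue tuples, skipping blocks without timing."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        timing = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if timing is None:
            continue
        start, end = lines[timing].split("-->")
        # Join multi-line subtitle text into a single line.
        text = " ".join(ln.strip() for ln in lines[timing + 1:]).strip()
        cues.append(Cue(_to_seconds(start), _to_seconds(end), text))
    return cues


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:03,500\nOpen the settings menu\n"
    print(parse_srt(sample))
```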

Frequently asked questions

What is video-to-doc-stepfun?

A tool that converts tutorial or screen-recording videos into step-by-step guides with screenshots, producing Markdown and PDF output. It uses Whisper transcription and an AI model to identify actions.

What language is video-to-doc-stepfun written in?

Mainly Python. The stack also includes FastAPI, React, Whisper, ffmpeg, and the StepFun API.

What license does video-to-doc-stepfun use?

MIT license: use freely for any purpose, including commercial projects, as long as you keep the copyright notice.

How hard is video-to-doc-stepfun to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is video-to-doc-stepfun for?

Mainly developers.

Verify against the repo before relying on details.