explaingit

jason-create-cmd/video-caption-generator

1 · TypeScript · Audience · developer · Complexity · 4/5 · Active · Setup · hard

TLDR

Cloudflare Pages plus Worker app that uploads a video, transcribes it with Soniox, polishes captions with DeepSeek or Gemini, and returns SRT, ASS, and JSON subtitle files.

Mindmap

mindmap
  root((video-caption-generator))
    Inputs
      Uploaded video
      Custom prompts
    Outputs
      SRT subtitles
      ASS subtitles
      JSON transcript
    Use Cases
      Chinese video subtitles
      Soft track or burn-in
      ASR polishing pipeline
    Tech Stack
      TypeScript
      Cloudflare Pages
      Cloudflare Workers
      R2
      D1
      Soniox
      DeepSeek
      Gemini
      ffmpeg

Things people build with this

USE CASE 1

Self-host a Chinese subtitle generator that produces SRT and ASS files from uploaded videos.

USE CASE 2

Reuse the Soniox plus LLM polishing pipeline as a template for any timestamp-anchored caption job.

USE CASE 3

Drop the output into ffmpeg to burn subtitles directly into a final video locally.

USE CASE 4

Study a working Cloudflare Pages plus Worker plus R2 plus D1 stack for async media jobs.
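For the ffmpeg use case, the two merge modes (soft track and burn-in) can be sketched as argument lists. This is a minimal sketch, not the commands from the repository's README: the function names `softTrackArgs` and `burnInArgs` and the file names are hypothetical, and the sketch assumes an ffmpeg build with libass and plain file names that need no filter escaping.

```typescript
// Hypothetical helpers: they only build argument arrays for the ffmpeg CLI.
// The README's actual commands may use different options.

// Reject characters that would need escaping inside an ffmpeg filter string.
function assertSimplePath(path: string): void {
  if (/[:\\',;\[\]\s]/.test(path)) {
    throw new Error(`path needs filter escaping, not handled here: ${path}`);
  }
}

// Soft track: mux the SRT file into an MP4 as a selectable subtitle stream.
function softTrackArgs(video: string, srt: string, output: string): string[] {
  return ["-i", video, "-i", srt, "-c", "copy", "-c:s", "mov_text", output];
}

// Burn-in: render the ASS file onto the frames (video is re-encoded, audio copied).
function burnInArgs(video: string, ass: string, output: string): string[] {
  assertSimplePath(ass);
  return ["-i", video, "-vf", `ass=${ass}`, "-c:a", "copy", output];
}

console.log(["ffmpeg", ...burnInArgs("input.mp4", "captions.ass", "burned.mp4")].join(" "));
```

The soft-track variant keeps the original video stream untouched, while burn-in always re-encodes the video because the captions become part of the picture.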

Tech stack

TypeScript · Cloudflare Workers · R2 · D1 · Soniox · DeepSeek · Gemini · ffmpeg

Getting it running

Difficulty · hard
Time to first run · 1h+

Requires a Cloudflare account with R2, D1, and Workers, a paid Soniox account, and an API key for at least one of DeepSeek or Gemini before the pipeline will run end to end.

In plain English

This project is an open-source tool for generating subtitles from videos. The README is written in Chinese. You upload a video through a simple web page, the system transcribes it, an AI model rewrites the raw transcript into clean subtitles, and you get back standard subtitle files (SRT and ASS) plus a JSON transcript. Final merging of subtitles into the video, either as a soft track or burned in, is done locally with ffmpeg on your own machine.

The hosted parts run on Cloudflare. A static Pages site serves the single-page front end, and a Worker handles the API at /api/*, covering login, upload signing, job status, the Soniox webhook, subtitle generation, and cleanup. Files (original video, subtitles, transcript) live in an R2 bucket called video-caption-files, while job metadata, video dimensions, and any per-job custom prompts live in a D1 database.

Transcription uses the Soniox Async Transcription service, defaulting to its stt-async-v4 model. The polishing step calls a language model: DeepSeek by default, with Gemini as a fallback if the DeepSeek key or call fails.

The pipeline gives each step a strictly limited responsibility. The browser uploads the video directly to R2 using a signed URL, the Worker creates a D1 job, Soniox transcribes asynchronously and calls back through a webhook, and the system then splits the transcript into initial segments. The LLM rewrites only the text of those segments and must cite the original sourceSegmentIds, so timestamps stay anchored to what Soniox produced. If JSON parsing fails, an id is invalid, the text is empty, or the call times out, the job still finishes using the unpolished captions. Old jobs are not retroactively rerun.

The default polishing prompt aims for short, natural Chinese captions: it removes filler words, fixes ASR letter-spacing artifacts like "A P I", chooses between Arabic and Chinese numerals by context, drops leading and trailing punctuation, and keeps each line between roughly 8 and 18 Chinese characters.
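The fallback rule described above (polished text is accepted only when it parses, cites known segment ids, and is non-empty) can be sketched roughly as follows. The type names, field names other than `sourceSegmentIds`, and the functions `applyPolish` and `rawCaptions` are assumptions for illustration; the repository's real data shapes may differ, and timeout handling around the LLM call is left out.

```typescript
// Hypothetical shapes; only `sourceSegmentIds` is named in the project description.
interface Segment {
  id: string;
  startMs: number;
  endMs: number;
  text: string;
}

interface PolishedItem {
  sourceSegmentIds: string[];
  text: string;
}

interface Caption {
  startMs: number;
  endMs: number;
  text: string;
}

// Unpolished captions: one caption per original segment.
function rawCaptions(segments: Segment[]): Caption[] {
  return segments.map(({ startMs, endMs, text }) => ({ startMs, endMs, text }));
}

// Parse the LLM reply and re-anchor each polished line to the original timestamps.
// Bad JSON, an unknown id, or empty text makes the whole job fall back to raw captions.
function applyPolish(segments: Segment[], llmReply: string): Caption[] {
  const byId = new Map(segments.map((s) => [s.id, s] as const));
  try {
    const parsed: unknown = JSON.parse(llmReply);
    if (!Array.isArray(parsed) || parsed.length === 0) return rawCaptions(segments);

    const captions: Caption[] = [];
    for (const item of parsed as PolishedItem[]) {
      const ids = item?.sourceSegmentIds;
      const text = typeof item?.text === "string" ? item.text.trim() : "";
      if (!Array.isArray(ids) || ids.length === 0 || text === "") {
        return rawCaptions(segments);
      }
      const sources: Segment[] = [];
      for (const id of ids) {
        const seg = byId.get(id);
        if (!seg) return rawCaptions(segments);
        sources.push(seg);
      }
      // Timing comes only from the cited segments, never from the LLM.
      captions.push({
        startMs: Math.min(...sources.map((s) => s.startMs)),
        endMs: Math.max(...sources.map((s) => s.endMs)),
        text,
      });
    }
    return captions;
  } catch {
    return rawCaptions(segments);
  }
}
```

Because the model only ever returns text plus ids, a wrong or hallucinated reply can change wording but cannot shift a caption's timing.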
The ASS output writes the real video PlayResX and PlayResY so libass does not fall back to 384x288 and blow up the font size.

The README documents local development, Cloudflare resource setup (R2 bucket, D1 database, CORS, lifecycle), the full list of Worker secrets (Soniox, DeepSeek, Gemini, admin password, webhook secret, and R2 keys), deployment commands, the two ffmpeg merge commands, and a retention policy that deletes raw video after 24 hours and subtitles after 30 days.
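To illustrate the PlayResX and PlayResY point, here is a minimal sketch of writing the video's real dimensions into the ASS header, plus a timestamp formatter. The helper names `assScriptInfo` and `assTime` are hypothetical and not taken from the repository; a complete ASS file would also need style and event sections.

```typescript
// Hypothetical helper: builds the [Script Info] block of an ASS file
// using the real video dimensions instead of relying on a renderer default.
function assScriptInfo(width: number, height: number): string {
  if (!Number.isInteger(width) || !Number.isInteger(height) || width <= 0 || height <= 0) {
    throw new RangeError("width and height must be positive integers");
  }
  return [
    "[Script Info]",
    "ScriptType: v4.00+",
    `PlayResX: ${width}`,
    `PlayResY: ${height}`,
    "",
  ].join("\n");
}

// Format milliseconds as an ASS timestamp, H:MM:SS.cc (centisecond precision).
function assTime(ms: number): string {
  const totalCs = Math.round(ms / 10);
  const hours = Math.floor(totalCs / 360000);
  const minutes = Math.floor((totalCs % 360000) / 6000);
  const seconds = Math.floor((totalCs % 6000) / 100);
  const cs = totalCs % 100;
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${hours}:${pad(minutes)}:${pad(seconds)}.${pad(cs)}`;
}

console.log(assScriptInfo(1920, 1080));
console.log(assTime(3725450)); // 1:02:05.45
```

Font sizes in ASS styles are interpreted relative to the script resolution, which is why a header that matches the video keeps the rendered text at the intended size.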

Copy-paste prompts

Prompt 1
Deploy video-caption-generator on my Cloudflare account, including R2 bucket, D1 database, CORS, and all the Worker secrets.
Prompt 2
Trace the job lifecycle in video-caption-generator from browser upload to R2 to Soniox webhook to LLM polishing.
Prompt 3
Swap DeepSeek for Claude as the polishing model in video-caption-generator and keep Gemini as fallback.
Prompt 4
Adjust the default polishing prompt in video-caption-generator to produce English captions with line lengths between 30 and 60 characters.
Prompt 5
Write the ffmpeg commands documented in this repo to burn the generated ASS subtitles into the final video at the original PlayResX and PlayResY.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.