Self-host a Chinese subtitle generator that produces SRT and ASS files from uploaded videos.
Reuse the Soniox plus LLM polishing pipeline as a template for any timestamp-anchored caption job.
Drop the output into ffmpeg to burn subtitles directly into a final video locally.
Study a working Cloudflare Pages plus Worker plus R2 plus D1 stack for async media jobs.
Requires a Cloudflare account with R2, D1, and Workers, a paid Soniox account, and an API key for at least one of DeepSeek or Gemini before the pipeline will run end to end.
This project is an open-source tool for generating subtitles from videos. The README is written in Chinese. You upload a video through a simple web page, the system transcribes it, an AI model rewrites the raw transcript into clean subtitles, and you get back standard subtitle files (SRT and ASS) plus a JSON transcript. Final merging of subtitles into the video, either as a soft track or burned in, is done locally with ffmpeg on your own machine.

The hosted parts run on Cloudflare. A static Pages site serves the single-page front end, and a Worker handles the API at /api/*, covering login, upload signing, job status, the Soniox webhook, subtitle generation, and cleanup. Files (original video, subtitles, transcript) live in an R2 bucket called video-caption-files, while job metadata, video dimensions, and any per-job custom prompts live in a D1 database.

Transcription uses the Soniox Async Transcription service, defaulting to its stt-async-v4 model. The polishing step calls a language model: DeepSeek by default, with Gemini as a fallback if the primary key is missing or the call fails.

The pipeline keeps responsibilities strictly separated. The browser uploads the video directly to R2 using a signed URL, and the Worker creates a D1 job. Soniox transcribes asynchronously and calls back through a webhook, after which the system splits the transcript into initial segments. The LLM rewrites only the text of those segments and must cite the original sourceSegmentIds, so timestamps stay anchored to what Soniox produced. If JSON parsing fails, an id is invalid, the text is empty, or the call times out, the job still finishes using the unpolished captions. Old jobs are not retroactively rerun.

The default polishing prompt aims for short, natural Chinese captions: it removes filler words, fixes ASR letter-spacing artifacts like "A P I", chooses between Arabic and Chinese numerals by context, drops leading and trailing punctuation, and keeps each line between roughly 8 and 18 Chinese characters.
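The anchoring and fallback rule for the polishing step can be illustrated with a minimal Python sketch. This is not the project's code (the real logic runs in a Cloudflare Worker). The `Segment` shape, the `text` field name, and the `apply_polish` helper are hypothetical; only `sourceSegmentIds` comes from the description above. A timeout is not modeled here and would be handled by the caller in the same way, by keeping the original segments.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape for one caption segment derived from the ASR output.
    id: int
    start_ms: int
    end_ms: int
    text: str


def apply_polish(segments: list[Segment], llm_reply: str) -> list[Segment]:
    """Return polished captions, or the original segments if the reply is unusable."""
    by_id = {s.id: s for s in segments}
    try:
        items = json.loads(llm_reply)  # raises ValueError on malformed JSON
        polished = []
        for item in items:
            ids = item["sourceSegmentIds"]
            text = item["text"].strip()  # "text" is an assumed field name
            # Reject the whole reply on an empty citation, empty text, or unknown id.
            if not ids or not text or any(i not in by_id for i in ids):
                return segments
            polished.append(Segment(
                id=len(polished) + 1,
                # Timestamps are copied from the cited segments, never from the LLM.
                start_ms=min(by_id[i].start_ms for i in ids),
                end_ms=max(by_id[i].end_ms for i in ids),
                text=text,
            ))
    except (ValueError, KeyError, TypeError, AttributeError):
        return segments  # any structural problem falls back to unpolished captions
    return polished or segments
```

Because the model only supplies text and ids, a bad reply can cost polish quality but cannot shift a caption in time.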
The ASS output writes the real video PlayResX and PlayResY so libass does not fall back to 384x288 and blow up the font size. The README documents local development, Cloudflare resource setup (R2 bucket, D1 database, CORS, lifecycle), the full list of Worker secrets including Soniox, DeepSeek, Gemini, admin password, webhook secret, and R2 keys, deployment commands, the two ffmpeg merge commands, and a retention policy that deletes raw video after 24 hours and subtitles after 30 days.
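To make the PlayResX/PlayResY point concrete, here is a minimal Python sketch of an ASS file whose header declares the real video resolution, plus argument lists for the two kinds of ffmpeg merge. Everything here is illustrative: the helper names, the font name, the style values, and the exact ffmpeg arguments are assumptions, not the commands or code from the README.

```python
def ass_time(ms: int) -> str:
    """Format milliseconds as an ASS timestamp (H:MM:SS.cc, centiseconds)."""
    cs = ms // 10
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def build_ass(width: int, height: int, cues: list[tuple[int, int, str]],
              font_size: int = 48) -> str:
    """Build a small ASS document; cues are (start_ms, end_ms, text) tuples."""
    lines = [
        "[Script Info]",
        "ScriptType: v4.00+",
        # Declaring the real resolution keeps font sizes in video pixels
        # instead of being scaled up from the 384x288 fallback.
        f"PlayResX: {width}",
        f"PlayResY: {height}",
        "",
        "[V4+ Styles]",
        "Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, "
        "OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, "
        "ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, "
        "Alignment, MarginL, MarginR, MarginV, Encoding",
        # Font and style values are placeholders chosen for this sketch.
        f"Style: Default,Noto Sans CJK SC,{font_size},&H00FFFFFF,&H000000FF,"
        "&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,0,2,10,10,40,1",
        "",
        "[Events]",
        "Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, "
        "Effect, Text",
    ]
    for start, end, text in cues:
        lines.append(
            f"Dialogue: 0,{ass_time(start)},{ass_time(end)},Default,,0,0,0,,{text}"
        )
    return "\n".join(lines) + "\n"


def soft_track_args(video: str, srt: str, out: str) -> list[str]:
    """ffmpeg argv that muxes an SRT file into an MP4 as a selectable track."""
    return ["ffmpeg", "-i", video, "-i", srt, "-c", "copy", "-c:s", "mov_text", out]


def burn_in_args(video: str, ass: str, out: str) -> list[str]:
    """ffmpeg argv that renders the ASS file into the picture (re-encodes video)."""
    return ["ffmpeg", "-i", video, "-vf", f"ass={ass}", "-c:a", "copy", out]
```

The argument lists can be passed to `subprocess.run(args, check=True)` on a machine with ffmpeg installed. Paths containing characters such as `:` or `'` need extra escaping inside the `-vf` filter string, which this sketch does not attempt.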
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.