Analysis updated 2026-05-18
Run text-driven video editing on a short clip using a pre-trained LiveEdit checkpoint, providing a source video and text instruction in a JSON file
Reproduce the streaming video editing results from the ECCV 2026 LiveEdit paper using the official training and inference scripts
Experiment with the AR-Oriented Mask Cache for efficient chunk-level computation reuse and visualize which regions are being recalculated
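The first use case above refers to a JSON file that names the source video and the text instruction. A minimal sketch of producing such a file is shown here. The field names `video_path` and `instruction`, the list-of-jobs layout, and the example clip path are assumptions for illustration only; the actual schema should be taken from the repo's own example files.

```python
import json
from pathlib import Path


def build_edit_job(video_path, instruction):
    """Return a one-entry job list. Field names are hypothetical, not the repo's schema."""
    if not instruction.strip():
        raise ValueError("instruction must not be empty")
    return [{"video_path": str(video_path), "instruction": instruction.strip()}]


def write_edit_job(out_path, video_path, instruction):
    """Serialize the job list to a JSON file and return the parsed structure."""
    job = build_edit_job(video_path, instruction)
    Path(out_path).write_text(json.dumps(job, indent=2), encoding="utf-8")
    return job


# Hypothetical clip path; the instruction is the example quoted in the description.
job = build_edit_job(
    "clips/currants.mp4",
    "change the red currants to deep black grapes",
)
print(json.dumps(job, indent=2))
```

Calling `write_edit_job("edit_job.json", ...)` with the same arguments would save the structure to disk, ready to be adapted to whatever format the inference script actually expects.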
| | cp-cp/liveedit | zhw040803-glitch/uav-gps-dqn-detection | 0xh4ku/manga-pdf-to-epub |
|---|---|---|---|
| Stars | 59 | 59 | 60 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 5/5 | 3/5 | 2/5 |
| Audience | researcher | researcher | general |
Figures from each repo's GitHub metadata at analysis time.
Requires an NVIDIA GPU with CUDA. Both the Wan2.1 base model and the LiveEdit checkpoint must be downloaded from Hugging Face before running inference.
LiveEdit is an academic research project (accepted to ECCV 2026) from Tsinghua University and HKUST that approaches video editing differently from most AI video tools. Where typical AI video editing requires the entire clip to be processed at once before showing any results, LiveEdit edits video in small, overlapping chunks processed one after another, similar to how a livestream works. This allows it to begin showing edited output much sooner than batch approaches.
The workflow takes two inputs: a source video and a text instruction describing what to change, such as "change the red currants to deep black grapes." The model keeps untouched parts of the frame (backgrounds, people, objects not mentioned in the instruction) as close to the original as possible while applying the transformation only to relevant regions. A mask cache optimization skips recalculation for regions that have not changed between chunks, reducing the computation required per chunk.
LiveEdit is built on top of an existing video generation model called Wan2.1. The training procedure has three stages: first it teaches the model to edit video well in the standard offline whole-clip setting, then it adapts it to process chunks sequentially, then it applies a distillation step to compress the number of denoising steps required per chunk, which is what produces real-time-oriented performance.
Running inference requires downloading the Wan2.1 base model weights and the LiveEdit checkpoint from Hugging Face, writing a small JSON file specifying your source video and instruction, then running a shell script. Training requires multiple NVIDIA GPUs and additional setup for dataset paths.
This project is primarily for AI researchers studying video editing, diffusion models, or streaming inference. General users would need significant technical setup and GPU hardware to run it.
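The chunk-by-chunk processing described above can be illustrated with a small sketch that splits a clip's frame indices into overlapping windows. This is not LiveEdit's code: the function name `overlapping_chunks` and the `chunk_size` and `overlap` parameters are hypothetical, and the real chunk length and overlap are set by the repo's configuration.

```python
def overlapping_chunks(num_frames, chunk_size, overlap):
    """Yield (start, end) frame ranges, end exclusive, with `overlap` shared frames.

    Illustrative only: parameter names and values are assumptions,
    not taken from the LiveEdit codebase.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each new chunk advances
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        yield (start, end)
        if end == num_frames:  # the last chunk reached the end of the clip
            break
        start += stride


# Each chunk could be edited and shown as soon as it is ready,
# instead of waiting for the whole clip.
for start, end in overlapping_chunks(num_frames=10, chunk_size=4, overlap=1):
    print(f"edit frames {start}..{end - 1}")
```

With these example values, the first chunk covers frames 0 to 3, so edited output can be displayed after processing four frames rather than all ten.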
An ECCV 2026 research codebase that edits video in real time using a diffusion model, processing footage chunk-by-chunk from a text instruction while keeping unchanged regions intact.
Mainly Python. The stack also includes PyTorch and CUDA.
Setup difficulty is rated hard, with roughly 1h+ to a first successful run.
Audience: mainly researchers.
Verify against the repo before relying on details.