Analysis updated 2026-07-03
Generate a short video that continues a scene from a single photo using a text description or camera movement instructions.
Produce low-latency video at 16 frames per second using the Fast model variant for near-real-time outputs.
Control the generated video's virtual camera with pose files or action strings specifying movements like forward, left, and jump.
| | robbyant/lingbot-world | bytedance/byteps | allenai/open-instruct |
|---|---|---|---|
| Stars | 3,718 | 3,717 | 3,720 |
| Language | Python | Python | Python |
| Setup difficulty | hard | hard | hard |
| Complexity | 5/5 | 4/5 | 5/5 |
| Audience | researcher | researcher | researcher |
Figures from each repo's GitHub metadata at analysis time.
Inference examples specify 8 GPUs, and the flash-attn and CUDA setup adds significant configuration overhead.
LingBot-World is an open-source AI system for generating realistic video from a single image and a text description. It is described as a world model, meaning it learns to simulate how environments look and change over time rather than just producing a static output. Given an image and a prompt, it generates a video that continues the scene in a physically plausible way across a range of environments, including realistic settings, cartoon styles, and scientific visualizations.
The system offers two main model variants. The Base model takes a starting image and either camera movement instructions or action commands (like pressing a game controller direction) and generates a video that follows those instructions. The Fast variant is optimized for lower latency, producing output at 16 frames per second with under one second of delay. Both variants support 480p and 720p output and can generate videos up to several minutes long while maintaining consistency.
Running the model requires significant hardware. The inference scripts are designed to run across multiple GPUs using torchrun, PyTorch's distributed launcher, and the example commands specify eight GPUs. Model weights are downloaded from HuggingFace or ModelScope before running. Installation builds on a base called Wan2.2 and requires a recent version of PyTorch along with flash-attn, a library for faster attention computation.
Camera control works by supplying camera pose files that describe how the virtual camera should move through the scene. Action control uses either structured data files or a simple action string format in which you specify movements such as forward, left, jump, and look directions, each with a duration.
The project is released under the Apache 2.0 license. A technical report is available on arXiv, and a demo page with video examples is linked from the repository.
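To make the action-string idea concrete, here is a minimal sketch of how movements with durations could be expanded into per-frame actions at the 16 frames per second quoted for the Fast variant. The `name:seconds` comma-separated syntax, the `ActionSegment` class, and both function names are hypothetical and chosen only for illustration. The repository's real action string format may differ, so check its README before relying on this.

```python
from dataclasses import dataclass

FPS = 16  # frame rate quoted for the Fast variant in the text above


@dataclass(frozen=True)
class ActionSegment:
    name: str       # e.g. "forward", "left", "jump"
    seconds: float  # how long the action is held


def parse_actions(spec: str) -> list[ActionSegment]:
    """Parse a HYPOTHETICAL 'name:seconds,name:seconds' string.

    This syntax is an assumption for illustration, not the repo's format.
    """
    segments = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, duration = part.partition(":")
        if not sep:
            raise ValueError(f"missing duration in {part!r}")
        seconds = float(duration)
        if seconds <= 0:
            raise ValueError(f"duration must be positive in {part!r}")
        segments.append(ActionSegment(name.strip(), seconds))
    return segments


def to_frame_actions(segments, fps=FPS):
    """Expand each segment into one action label per generated frame."""
    frames = []
    for seg in segments:
        frames.extend([seg.name] * round(seg.seconds * fps))
    return frames


if __name__ == "__main__":
    segs = parse_actions("forward:2.0, left:1.0, jump:0.5")
    frames = to_frame_actions(segs)
    print(len(frames), frames[0], frames[-1])  # 56 forward jump
```

At 16 frames per second, 3.5 seconds of actions map to 56 frames, which gives a rough sense of how many frames a given action sequence asks the model to produce.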
An open-source AI world model that generates realistic video from a single image and a text prompt, simulating how a scene continues over time across realistic, cartoon, and scientific visual styles.
Mainly Python. The stack also includes PyTorch and HuggingFace.
Open-source under Apache 2.0: free to use, including commercially, with attribution.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.