robbyant/lingbot-world

Analysis updated 2026-07-03

★ 3,718 · Python · Audience · researcher · Complexity · 5/5 · License · Apache 2.0 · Setup · hard

Mindmap

mindmap
  root((LingBot-World))
    Input types
      Single starting image
      Text prompt
      Camera pose files
    Model variants
      Base model
      Fast low-latency
    Output formats
      480p video
      720p video
    Controls
      Camera movement
      Action commands
    Requirements
      Multi-GPU cluster
      PyTorch and CUDA


What do people build with it?

USE CASE 1

Generate a short video that continues a scene from a single photo using a text description or camera movement instructions.

USE CASE 2

Produce low-latency video at 16 frames per second using the Fast model variant for near-real-time outputs.

USE CASE 3

Control the generated video's virtual camera with pose files or action strings specifying movements like forward, left, and jump.
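As a rough illustration of action strings with durations, the sketch below parses a made-up `name:seconds` format and converts each duration to a frame count at the 16 frames per second quoted for the Fast variant. The string syntax, the `parse_actions` helper, and the `ActionSegment` type are hypothetical and are not taken from the repository, so check the README for the real format.

```python
from dataclasses import dataclass

FPS = 16  # frame rate quoted for the Fast variant

# Only the action names mentioned in this summary; the repository's
# actual vocabulary may differ.
KNOWN_ACTIONS = {"forward", "left", "jump"}


@dataclass(frozen=True)
class ActionSegment:
    name: str
    seconds: float
    frames: int


def parse_actions(spec: str, fps: int = FPS) -> list[ActionSegment]:
    """Parse a hypothetical 'name:seconds,name:seconds' action string."""
    segments = []
    for item in spec.split(","):
        name, _, duration = item.strip().partition(":")
        if name not in KNOWN_ACTIONS:
            raise ValueError(f"unknown action: {name!r}")
        seconds = float(duration)  # raises ValueError if the duration is missing
        if seconds <= 0:
            raise ValueError(f"duration must be positive: {item!r}")
        # Convert the duration to a whole number of frames.
        segments.append(ActionSegment(name, seconds, round(seconds * fps)))
    return segments


segments = parse_actions("forward:2,left:0.5,jump:0.25")
for segment in segments:
    print(segment.name, segment.frames)
print("total frames:", sum(s.frames for s in segments))
```

Working in frames rather than seconds makes it easy to see how long a clip a given action sequence would need at a fixed frame rate.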

What is it built with?

Python · PyTorch · HuggingFace · torchrun · flash-attn

How does it compare?

	robbyant/lingbot-world	bytedance/byteps	allenai/open-instruct
Stars	3,718	3,717	3,720
Language	Python	Python	Python
Setup difficulty	hard	hard	hard
Complexity	5/5	4/5	5/5
Audience	researcher	researcher	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard · Time to first run · 1 day+

The inference examples specify 8 GPUs, and the flash-attn and CUDA setup adds significant configuration overhead.
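A minimal sketch of what an 8-GPU launch might look like when assembled from Python. Only `torchrun` and its `--nproc_per_node` option are real here. The script name `generate.py` and the `--image` and `--prompt` flags are placeholders chosen for illustration, so take the actual entry point and arguments from the repository README.

```python
import shlex


def build_torchrun_command(script, script_args, gpus=8):
    """Build a single-node torchrun command line as an argument list."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    # --nproc_per_node tells torchrun how many worker processes to start,
    # typically one per GPU.
    return ["torchrun", f"--nproc_per_node={gpus}", script, *script_args]


# Hypothetical script name and flags, used only to show the shape of the call.
cmd = build_torchrun_command(
    "generate.py",
    ["--image", "input.jpg", "--prompt", "a quiet forest path"],
)

# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

Building the command as a list keeps arguments with spaces intact, and the same list can be passed to `subprocess.run(cmd, check=True)` once the real script and flags are filled in.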

Open source under Apache 2.0: free to use, including commercially, with attribution.

In plain English

LingBot-World is an open-source AI system that generates realistic video from a single image and a text description. It is described as a world model, meaning it learns to simulate how environments look and change over time rather than producing a static output. Given an image and a prompt, it generates a video that continues the scene in a physically plausible way across a range of environments, including realistic settings, cartoon styles, and scientific visualizations.

The system offers two main model variants. The Base model takes a starting image and either camera movement instructions or action commands (like pressing a direction on a game controller) and generates a video that follows those instructions. The Fast variant is optimized for lower latency, producing output with under one second of delay at 16 frames per second. Both variants support 480p and 720p output and can generate videos up to several minutes long while maintaining consistency.

Running the model requires significant hardware. The inference scripts are designed to run across multiple GPUs using torchrun, PyTorch's distributed launcher, and the example commands specify eight GPUs. Model weights are downloaded from HuggingFace or ModelScope before running. Installation builds on a base called Wan2.2 and requires a recent version of PyTorch along with flash-attn, a library for faster attention computation.

Camera control works by supplying camera pose files that describe how the virtual camera should move through the scene. Action control uses either structured data files or a simple action string format in which you specify movements such as forward, left, jump, and look directions, each with a duration.

The project is released under the Apache 2.0 license. A technical report is available on arXiv, and a demo page with video examples is linked from the repository.

Copy-paste prompts

Prompt 1

I'm running LingBot-World on an 8-GPU server. Write the torchrun command to generate a 720p video from a single input image using camera pose files.

Prompt 2

Using LingBot-World's action control format, create an action string that makes the virtual camera walk forward for 2 seconds, turn left, then jump.

Prompt 3

Help me install flash-attn and the Wan2.2 base dependencies for LingBot-World on Ubuntu 22.04 with CUDA 12.1, including any version pins needed to avoid compatibility errors.

Prompt 4

I want to call LingBot-World inference from a Python script. Show me how to load the Base model weights from HuggingFace and pass an image plus a text prompt to generate a video clip.

Frequently asked questions

What is lingbot-world?

An open-source AI world model that generates realistic video from a single image and a text prompt, simulating how a scene continues over time across realistic, cartoon, and scientific visual styles.

What language is lingbot-world written in?

Mainly Python. The stack also includes PyTorch, HuggingFace, torchrun, and flash-attn.

What license does lingbot-world use?

Open source under Apache 2.0: free to use, including commercially, with attribution.

How hard is lingbot-world to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is lingbot-world for?

Mainly researchers.


Verify against the repo before relying on details.