explaingit

largeworldmodel/lwm

7,411 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

TLDR

An AI model that processes up to one million tokens of text, images, and video at once, letting you ask questions about very long documents or hour-long video clips.

Mindmap

mindmap
  root((Large World Model))
    Capabilities
      One million token context
      Text understanding
      Video understanding
      Image chat
    Model Variants
      Text only
      Text and video
      Base and chat tuned
    Training
      RingAttention technique
      Books and video data
      7 billion parameters
    Requirements
      TPU for vision models
      GPU for text models
      Ubuntu only

Things people build with this

USE CASE 1

Ask questions about facts buried anywhere inside a very long document such as a full book or lengthy legal contract.

USE CASE 2

Query the content of an hour-long YouTube video by feeding it directly to the model.

USE CASE 3

Use the chat-tuned variant to have a conversation about a set of images without splitting them into smaller batches.

Tech stack

Python, JAX, PyTorch, TPU, CUDA

Getting it running

Difficulty · hard | Time to first run · 1 day+

Vision-language models require TPUs and JAX, and text models need a multi-GPU Ubuntu machine. Windows and macOS have not been tested.

In plain English

Large World Model (LWM) is an AI system that can read and understand very long pieces of text, images, and video all at once. Most AI models can only look at a limited amount of information at a time, similar to reading just the first few pages of a book before answering questions about it. LWM extends that window to one million tokens, which is roughly the equivalent of several full-length novels or about an hour of video, allowing it to answer questions about content that appears anywhere in a very long document or clip.

The project trains a 7-billion-parameter neural network on a large collection of books and diverse videos. The training process uses a technique called RingAttention, which distributes the work of processing very long sequences across many processors simultaneously. Without this, training on such long inputs would exceed the memory limits of any single piece of hardware. The team gradually increased the context size during training, starting at 4,000 tokens and working up to one million.

The released models come in several variants. Some are text-only, while others understand both text and video. Some are base models, while others are chat-tuned versions you can have a conversation with. The vision-language models run on TPUs using a framework called JAX, while the text-only versions also work with PyTorch on standard GPUs. The README includes setup instructions, a table listing each available model along with its context size and download links, and guidance on configuration parameters that control how computation is split across hardware.

Practical capabilities demonstrated in the README include retrieving specific facts buried inside a one-million-token document with high accuracy, answering questions about the content of a one-hour YouTube video, chatting about individual images, and generating images or short video clips from text prompts. The code is supported on Ubuntu. Windows and macOS have not been tested.
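To make the RingAttention idea mentioned above more concrete, here is a minimal pure-Python sketch. It is not LWM's implementation: it simulates the "devices" with an ordinary loop, uses a single attention head, and omits the causal mask a language model would need. The function names `full_attention` and `ring_attention` are hypothetical, chosen for illustration. What it shows is the core trick: each device keeps its own block of queries, sees one key/value block per step as the blocks travel around the ring, and merges partial results with a running softmax, so no device ever holds the full attention matrix.

```python
import math
import random


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, with all keys visible at once."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def ring_attention(q, k, v, num_devices):
    """Simulated ring: every device sees only one key/value block per step."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    assert n % num_devices == 0, "sequence must split evenly across devices"
    size = n // num_devices

    def split(x):
        return [x[i * size:(i + 1) * size] for i in range(num_devices)]

    q_blocks, k_blocks, v_blocks = split(q), split(k), split(v)

    # Per-device running softmax state for each local query row.
    run_max = [[-math.inf] * size for _ in range(num_devices)]
    run_sum = [[0.0] * size for _ in range(num_devices)]
    run_out = [[[0.0] * dv for _ in range(size)] for _ in range(num_devices)]

    for step in range(num_devices):
        for dev in range(num_devices):
            # Index of the key/value block that has rotated onto this device.
            src = (dev + step) % num_devices
            kb, vb = k_blocks[src], v_blocks[src]
            for i, qi in enumerate(q_blocks[dev]):
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in kb]
                new_max = max(run_max[dev][i], max(scores))
                # Rescale earlier partial sums to the new maximum
                # (exp(-inf) is 0.0, so the first step starts from zero).
                scale = math.exp(run_max[dev][i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[dev][i] = run_sum[dev][i] * scale + sum(weights)
                run_out[dev][i] = [
                    run_out[dev][i][c] * scale
                    + sum(w * vj[c] for w, vj in zip(weights, vb))
                    for c in range(dv)
                ]
                run_max[dev][i] = new_max

    # Normalize and concatenate the device-local outputs in sequence order.
    return [[x / run_sum[dev][i] for x in run_out[dev][i]]
            for dev in range(num_devices) for i in range(size)]


if __name__ == "__main__":
    random.seed(0)
    q, k, v = ([[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
               for _ in range(3))
    ref, out = full_attention(q, k, v), ring_attention(q, k, v, num_devices=4)
    print(max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb)))
```

Running the script prints the largest difference between the two results, which should be at floating-point rounding level. On real hardware the inner loop over devices runs in parallel and the key/value blocks are passed between neighbouring devices rather than indexed from a shared list, so the per-device memory grows with the block size rather than the full sequence length.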

Copy-paste prompts

Prompt 1
I have a 500-page PDF. Using the Large World Model, how do I load it as a one-million-token context and ask it to find every mention of a specific clause?
Prompt 2
Walk me through setting up the LWM text-only model on a multi-GPU Ubuntu machine and running the provided example to answer questions about a long document.
Prompt 3
What is RingAttention and how does LWM use it to train on sequences that are too long to fit in a single GPU's memory?
Prompt 4
I want to use the LWM video model to summarize an hour-long lecture video. What format should I convert the video to and what hardware do I need?

Verify against the repo before relying on details.