explaingit

chili-lab/lt2

17 Python Audience · researcher Complexity · 5/5 Setup · hard

TLDR

Research code for a new language model architecture that replaces slow attention with faster alternatives and reuses layers in loops. According to the paper, the hybrid variant decodes roughly 2.7x faster than a standard looped-transformer baseline.

Mindmap

mindmap
  root((LT2))
    What it does
      Linear-time attention
      Layer parameter sharing
      Faster decoding
    Variants
      LT2-linear
      LT2-sparse
      LT2-hybrid
    Tech
      PyTorch
      SLURM cluster
      Custom CUDA kernels
    Experiments
      600M parameters
      1.3B parameters
      FineWeb-Edu dataset

Things people build with this

USE CASE 1

Reproduce the LT2 paper experiments at 600M or 1.3B parameter scale on a GPU cluster using the provided configs.

USE CASE 2

Train a looped transformer variant with linear, sparse, or hybrid attention and compare it to a standard transformer baseline.

Tech stack

Python · PyTorch · CUDA · SLURM · Mamba2 · FlashAttention

Getting it running

Difficulty · hard Time to first run · 1 day+

Requires multiple GPUs, launched either through SLURM on a cluster or through torchrun on a local multi-GPU machine, plus pre-training on the FineWeb-Edu dataset.

In plain English

LT2 is the research codebase accompanying an academic paper about a new language model architecture called Linear-Time Looped Transformers. The problem it addresses is a well-known inefficiency in standard transformer models: the attention mechanism, which lets the model weigh how relevant each word is to every other word, compares all words pairwise, so its compute cost grows with the square of the text length. LT2 replaces that attention step with alternatives that scale more efficiently with sequence length.
The "looped" part of the name refers to parameter sharing. Instead of having many distinct layers, each with its own learned values, LT2 reuses the same set of layers multiple times in sequence. A model with 20 physical layers run through 4 loops has the effective depth of an 80-layer model but stores only 20 layers' worth of parameters. The compute per token is still that of 80 layer applications; the saving is in parameter memory. Looping is a known technique, and the contribution here is applying it specifically to the faster attention alternatives.
Three variants are included. LT2-linear replaces attention with linear-attention methods such as Mamba2, DeltaNet, and RetNet, which process tokens using a small fixed-size memory state rather than comparing all tokens pairwise. LT2-sparse uses sliding-window attention, where each token attends only to nearby tokens rather than the whole sequence. LT2-hybrid mixes a small number of standard attention layers in with the faster linear-attention layers. According to the paper, this hybrid reaches better quality than a standard looped transformer while decoding about 2.7 times as fast.
The repository is built on Meta's Lingua pre-training framework and is structured for training on GPUs, either via SLURM job scheduling on a cluster or via torchrun for local multi-GPU setups. It includes configuration files for reproducing the paper's experiments at 600 million and 1.3 billion parameter scales, training on the FineWeb-Edu dataset. Custom GPU kernel code is included for the performance-critical parts.
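The 20-layers-times-4-loops arithmetic above can be made concrete with a toy sketch. This is not code from the lt2 repository: `ToyLayer` and `LoopedStack` are hypothetical names invented for illustration, and each "layer" is just a two-parameter function standing in for a full transformer block.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ToyLayer:
    """Hypothetical stand-in for one transformer block (2 learned values)."""
    scale: float
    bias: float
    calls: int = 0  # counts how often this layer's weights are reused

    def __call__(self, x: Vector) -> Vector:
        self.calls += 1
        # Residual-style update: input plus a small learned transformation.
        return [v + self.scale * v + self.bias for v in x]

    def num_params(self) -> int:
        return 2


class LoopedStack:
    """Applies the same list of layers `num_loops` times in sequence."""

    def __init__(self, layers: List[ToyLayer], num_loops: int) -> None:
        self.layers = layers
        self.num_loops = num_loops

    def forward(self, x: Vector) -> Vector:
        for _ in range(self.num_loops):   # outer loop: reuse the whole stack
            for layer in self.layers:     # inner loop: the physical layers
                x = layer(x)
        return x

    def num_params(self) -> int:
        # Parameters are counted once, no matter how many loops run.
        return sum(layer.num_params() for layer in self.layers)

    def effective_depth(self) -> int:
        # Number of layer applications a single forward pass performs.
        return len(self.layers) * self.num_loops


layers = [ToyLayer(scale=0.01, bias=0.0) for _ in range(20)]
model = LoopedStack(layers, num_loops=4)
out = model.forward([1.0, 2.0])

print(model.effective_depth())  # 80 layer applications
print(model.num_params())       # 40 parameters, i.e. 20 layers' worth
```

Each physical layer ends up with `calls == 4`, which is the sense in which weights are shared across loops: depth grows with the loop count while the stored parameters do not.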
This is a research-oriented project aimed at people studying language model architecture. Running it requires significant GPU resources and familiarity with distributed training tooling.
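To illustrate the two ideas behind the linear and sparse variants, here is a minimal pure-Python sketch. It assumes the simplest form of linear attention (unnormalized, with no decay or gating), which is a simplification and not the Mamba2, DeltaNet, or RetNet update rule, and it is unrelated to the repository's CUDA kernels. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear_attention_recurrent(qs: Matrix, ks: Matrix, vs: Matrix) -> Matrix:
    """Causal linear attention using one fixed-size state of shape d_k x d_v."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # size independent of length
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Fold the new token into the state: state += outer(k, v).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the query: out = q^T state.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_pairwise(qs: Matrix, ks: Matrix, vs: Matrix) -> Matrix:
    """Same result computed the quadratic way, comparing all token pairs."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * d_v
        for s in range(t + 1):  # every earlier token, plus the current one
            weight = sum(a * b for a, b in zip(q, ks[s]))
            for j in range(d_v):
                out[j] += weight * vs[s][j]
        outputs.append(out)
    return outputs


def sliding_window_positions(t: int, window: int) -> List[int]:
    """Positions token t may attend to under a causal sliding window."""
    return list(range(max(0, t - window + 1), t + 1))


qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
vs = [[1.0], [2.0], [3.0]]

recurrent = linear_attention_recurrent(qs, ks, vs)
pairwise = linear_attention_pairwise(qs, ks, vs)
window = sliding_window_positions(5, 3)
```

Under these assumptions the recurrent and pairwise versions return the same outputs, but the recurrent one only ever keeps a `d_k x d_v` state, which is why decoding cost per token does not grow with sequence length. The sliding-window helper shows the sparse alternative: token 5 with a window of 3 sees only positions 3, 4, and 5.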

Copy-paste prompts

Prompt 1
I want to reproduce the LT2-hybrid experiment from the lt2 repository on a 4-GPU machine using torchrun. Walk me through which config file to use, the torchrun command, and what looped layers means in this context.
Prompt 2
Explain the difference between LT2-linear, LT2-sparse, and LT2-hybrid in the lt2 repo. Which one should I start with if I want the best quality-to-speed tradeoff on a single A100?
Prompt 3
I am reading the lt2 codebase and do not understand how parameter sharing across loops works. Explain how a model with 20 physical layers run through 4 loops behaves like an 80-layer model and where in the code this looping happens.

Verify against the repo before relying on details.