explaingit

ehoogeboom/discrete-diffusion-lm

4 | Python | Audience · researcher | Complexity · 5/5 | Active | Setup · hard

TLDR

Unofficial PyTorch port of D-MMD that distills a masked discrete diffusion language model into a few-step student on OpenWebText.

Mindmap

mindmap
  root((discrete-diffusion-lm))
    Inputs
      OpenWebText
      GPT-2 BPE tokens
      Teacher checkpoint
    Outputs
      AR teacher
      MD teacher
      Few-step student
    Use Cases
      Reproduce D-MMD
      Train small LM
      Benchmark samplers
    Tech Stack
      PyTorch
      CUDA
      RunPod
      tiktoken

Things people build with this

USE CASE 1

Train a masked discrete diffusion language model from scratch on OpenWebText

USE CASE 2

Distill a slow diffusion teacher into a 16 or 32 step student

USE CASE 3

Benchmark autoregressive vs diffusion samplers with the centered_gm metric

USE CASE 4

Reproduce a vibe-coded port of the D-MMD paper on a single A100
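To make the "16 or 32 step student" in use case 2 concrete, here is a minimal sketch of few-step sampling for a generic masked diffusion model. It is not taken from the repo and is not the D-MMD sampler itself: `few_step_unmask`, the `predict` callback, the `MASK` value, and the linear reveal schedule are all assumptions chosen for illustration. The point it shows is that the number of network calls equals the number of steps, which is why a few-step student is cheaper to sample from than a many-step teacher.

```python
import random

MASK = -1  # hypothetical mask id; a real tokenizer would reserve a vocabulary slot


def few_step_unmask(length, num_steps, predict, seed=0):
    """Generic few-step sampler for a masked diffusion model (illustrative only).

    `predict(tokens)` is a hypothetical stand-in for the network: it returns
    one proposed token id for every position of `tokens`.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for step in range(1, num_steps + 1):
        # Assumed linear schedule: after step i, length * (1 - i / num_steps)
        # positions are still masked, so the last step reveals everything left.
        target_masked = round(length * (1 - step / num_steps))
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        to_reveal = len(masked) - target_masked
        proposals = predict(tokens)  # exactly one network call per step
        for i in rng.sample(masked, to_reveal):
            tokens[i] = proposals[i]
    return tokens
```

With a real model, `predict` would run the transformer on the partially masked sequence and sample a token per position; passing `num_steps=16` or `num_steps=32` then corresponds to the two student settings mentioned above.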

Tech stack

Python, PyTorch, CUDA, tiktoken, RunPod, WandB

Getting it running

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA GPU plus OpenWebText prep, and full runs assume a RunPod A100 or H100 with the bundle/ssh workflow.

In plain English

This repository is an unofficial PyTorch reimplementation of D-MMD, a research paper by Hoogeboom et al. The paper describes a way to take a masked discrete diffusion language model and distill it into a faster student model that needs only a few sampling steps to produce text. The author calls the work vibe-coded and third-party, meaning it is an informal personal port of the paper rather than the authors' own reference release.

To keep the experiments self-contained, the repo also includes the supporting trainers needed to reproduce the paper's setup end to end on the OpenWebText dataset. Three methods share the same transformer backbone and differ only in the loss and a few internal details. An autoregressive trainer produces a normal GPT-style next-token model. A masked diffusion trainer produces the teacher model. A distillation script then takes that teacher and trains a few-step student using the D-MMD method.

The results section is honest about scale. Runs use a small backbone of about 30 million non-embedding parameters, much smaller than the GPT-2-small-sized backbones in the original papers. The author presents the numbers as evidence that the pipeline works end to end, not as a replication of the paper. Tables compare teacher samples at different top_p settings against student runs with 16 and 32 sampling steps, using a reference-based metric called centered_gm computed with gpt2-large. Early student results sit below the teacher's score by step 2,000, though training was stopped before convergence.

The README documents how to install the package, prepare the OpenWebText data, train a teacher, and run the distillation step. It covers single-GPU runs, multi-GPU runs with torchrun, and a RunPod cloud workflow with bundle creation scripts.
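The statement that the autoregressive and masked diffusion trainers share a backbone and differ mainly in the loss can be illustrated with how each one builds a training example from the same token sequence. This is a minimal sketch and not the repo's code: `ar_example`, `md_example`, and `MASK_ID` are hypothetical names, and drawing the masking rate uniformly from [0, 1] is an assumed convention common in masked diffusion work.

```python
import random

MASK_ID = 50257  # assumption: one id past the 50,257-token GPT-2 BPE vocabulary


def ar_example(tokens):
    """Next-token setup: position i sees tokens[: i + 1] and predicts tokens[i + 1]."""
    inputs, targets = tokens[:-1], tokens[1:]
    loss_positions = list(range(len(targets)))  # every position is scored
    return inputs, targets, loss_positions


def md_example(tokens, rng):
    """Masked-diffusion setup: hide a random fraction, score only the hidden slots."""
    t = rng.uniform(0.0, 1.0)  # masking rate for this example (assumed schedule)
    hidden = [rng.random() < t for _ in tokens]
    inputs = [MASK_ID if h else tok for tok, h in zip(tokens, hidden)]
    loss_positions = [i for i, h in enumerate(hidden) if h]
    return inputs, tokens, loss_positions  # targets are the clean tokens
```

In both cases the same network can map `inputs` to per-position logits. Only the corruption applied to the inputs and the set of positions that contribute to the cross-entropy change, which is consistent with one backbone serving several trainers.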

Copy-paste prompts

Prompt 1
Walk me through setting up discrete-diffusion-lm on a RunPod A100 and training the md teacher on OpenWebText
Prompt 2
Show me how to point distill.py at an md checkpoint and run the k=16 D-MMD student
Prompt 3
Explain how the AR, MD, and D-MMD trainers share the transformer backbone in src/discrete_diffusion_lm/model.py
Prompt 4
Help me add a new SampleEval metric alongside GenerativeLLMMetricConfig for gpt2-large
Prompt 5
Scale this repo from the 6L/8H/512d backbone up to the GPT-2-small sized 12L/12H/768d setup

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.