ehoogeboom/discrete-diffusion-lm

Analysis updated 2026-06-24

★ 4 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((discrete-diffusion-lm))
    Inputs
      OpenWebText
      GPT-2 BPE tokens
      Teacher checkpoint
    Outputs
      AR teacher
      MD teacher
      Few-step student
    Use Cases
      Reproduce D-MMD
      Train small LM
      Benchmark samplers
    Tech Stack
      PyTorch
      CUDA
      RunPod
      tiktoken

What do people build with it?

USE CASE 1

Train a masked discrete diffusion language model from scratch on OpenWebText

USE CASE 2

Distill a slow diffusion teacher into a 16 or 32 step student

USE CASE 3

Benchmark autoregressive vs diffusion samplers with the centered_gm metric

USE CASE 4

Reproduce a vibe-coded port of the D-MMD paper on a single A100
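
Use case 2 hinges on the sampling step budget. As a back-of-the-envelope sketch, assume a sampler that starts from a fully masked sequence and reveals tokens evenly across steps. That is a common choice for masked diffusion samplers but is not confirmed for this repo. The function name and the sequence length of 1024 are illustrative assumptions.

```python
def tokens_revealed_per_step(seq_len, num_steps):
    """Split seq_len tokens as evenly as possible over num_steps steps."""
    base, extra = divmod(seq_len, num_steps)
    # The first `extra` steps reveal one more token so the counts sum to seq_len.
    return [base + 1 if step < extra else base for step in range(num_steps)]


# Assumed sequence length of 1024, for illustration only.
for k in (16, 32):
    schedule = tokens_revealed_per_step(1024, k)
    print(k, schedule[0], sum(schedule))
# prints "16 64 1024" then "32 32 1024"
```

Under this assumption a 16 step student has to commit to 64 tokens per step, and a 32 step student to 32.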

What is it built with?

Python · PyTorch · CUDA · tiktoken · RunPod · WandB

How does it compare?

	ehoogeboom/discrete-diffusion-lm	adeliox/klein-head-swap	ats4321/ragit
Stars	4	4	4
Language	Python	Python	Python
Setup difficulty	hard	moderate	moderate
Complexity	5/5	3/5	2/5
Audience	researcher	designer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA GPU plus OpenWebText prep, and full runs assume a RunPod A100 or H100 with the bundle/ssh workflow.

In plain English

This repository is an unofficial PyTorch reimplementation of D-MMD, a research paper by Hoogeboom et al. The paper describes how to take a masked discrete diffusion language model and distill it into a faster student model that needs only a few sampling steps to produce text. The author calls the work vibe-coded and third-party, meaning it is an informal personal port of the paper rather than the paper authors' own reference release.

To keep the experiments self-contained, the repo also includes the supporting trainers needed to reproduce the paper's setup end to end on the OpenWebText dataset. Three methods share the same transformer backbone and differ only in the loss and a few internal details. An autoregressive trainer produces a standard GPT-style next-token model. A masked diffusion trainer produces the teacher model. A distillation script then takes that teacher and trains a few-step student using the D-MMD method.

The results section is candid about scale. Runs use a small backbone of about 30 million non-embedding parameters, much smaller than the GPT-2-small-sized backbones in the original papers. The author presents the numbers as evidence that the pipeline works end to end, not as a replication of the paper. Tables compare teacher samples at different top_p settings against student runs with 16 and 32 sampling steps, using a reference-based metric called centered_gm computed with gpt2-large. Early student results sit below the teacher's score by step two thousand, though training was stopped before convergence.

The README documents how to install the package, prepare the OpenWebText data, train a teacher, and run the distillation step, covering single-GPU runs, torchrun multi-GPU runs, and a RunPod cloud workflow with bundle creation scripts.
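
As a rough illustration of how the autoregressive and masked diffusion trainers can share a backbone and differ only in the loss, the sketch below builds the inputs and targets each objective would use for one token sequence. It is plain Python with no PyTorch. The `MASK` id and both helper names are hypothetical. Masking each token independently with probability `t` is the textbook masked diffusion setup, not necessarily what the repo's trainers do.

```python
import random

MASK = -1  # hypothetical mask token id, chosen only for this illustration


def ar_example(tokens):
    """Next-token setup: the input drops the last token, and the target at
    each position is the token that follows it."""
    return tokens[:-1], tokens[1:]


def md_example(tokens, t, rng):
    """Masked diffusion setup: each token is replaced by MASK independently
    with probability t, and the loss is taken only at masked positions."""
    corrupted = [MASK if rng.random() < t else tok for tok in tokens]
    loss_positions = [i for i, tok in enumerate(corrupted) if tok == MASK]
    return corrupted, loss_positions


tokens = [11, 42, 7, 99, 3]
ar_inputs, ar_targets = ar_example(tokens)
corrupted, loss_positions = md_example(tokens, t=0.5, rng=random.Random(0))
```

In both cases the same network would map a token sequence to per-position logits. Only the corruption of the input and the set of positions that contribute to the loss change.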

Copy-paste prompts

Prompt 1

Walk me through setting up discrete-diffusion-lm on a RunPod A100 and training the md teacher on OpenWebText

Prompt 2

Show me how to point distill.py at an md checkpoint and run the k=16 D-MMD student

Prompt 3

Explain how the AR, MD, and D-MMD trainers share the transformer backbone in src/discrete_diffusion_lm/model.py

Prompt 4

Help me add a new SampleEval metric alongside GenerativeLLMMetricConfig for gpt2-large

Prompt 5

Scale this repo from the 6L/8H/512d backbone up to the GPT-2-small sized 12L/12H/768d setup
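
To give a feel for what Prompt 5 asks for, the sketch below compares the two backbone shapes. `BackboneConfig` and `relative_block_params` are hypothetical names, not the repo's config classes. The ratio assumes per-layer weights scale with the square of the model width, which holds for standard attention and MLP matrices but ignores embeddings, norms, and any conditioning layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneConfig:
    layers: int
    heads: int
    d_model: int

    @property
    def head_dim(self):
        # Width of each attention head; d_model must split evenly across heads.
        if self.d_model % self.heads != 0:
            raise ValueError("d_model must be divisible by heads")
        return self.d_model // self.heads


def relative_block_params(small, large):
    """Ratio of transformer-block weights, assuming they scale as
    layers * d_model ** 2."""
    return (large.layers * large.d_model ** 2) / (small.layers * small.d_model ** 2)


small = BackboneConfig(layers=6, heads=8, d_model=512)
gpt2_small_sized = BackboneConfig(layers=12, heads=12, d_model=768)

print(small.head_dim, gpt2_small_sized.head_dim)       # 64 64
print(relative_block_params(small, gpt2_small_sized))  # 4.5
```

Both shapes keep a head width of 64, so the scale-up mainly costs roughly 4.5 times the block weights under the stated assumption.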

Frequently asked questions

What is discrete-diffusion-lm?

Unofficial PyTorch port of D-MMD that distills a masked discrete diffusion language model into a few-step student on OpenWebText.

What language is discrete-diffusion-lm written in?

Mainly Python. The stack also includes PyTorch and CUDA.

How hard is discrete-diffusion-lm to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is discrete-diffusion-lm for?

Mainly researchers.

Verify against the repo before relying on details.