Train a masked discrete diffusion language model from scratch on OpenWebText
Distill a slow diffusion teacher into a 16- or 32-step student
Benchmark autoregressive vs diffusion samplers with the centered_gm metric
Reproduce a vibe-coded port of the D-MMD paper on a single A100
Needs a CUDA GPU plus OpenWebText prep, and full runs assume a RunPod A100 or H100 with the bundle/ssh workflow.
This repository is an unofficial PyTorch reimplementation of D-MMD, a research paper by Hoogeboom et al. The paper describes how to distill a masked discrete diffusion language model into a faster student model that needs only a few sampling steps to produce text. The author describes the work as vibe-coded and third-party: an informal personal port of the paper, not the authors' own reference release.

To keep the experiments self-contained, the repo also includes the supporting trainers needed to reproduce the paper's setup end to end on the OpenWebText dataset. Three methods share the same transformer backbone and differ only in the loss and a few internal details. An autoregressive trainer produces a standard GPT-style next-token model. A masked diffusion trainer produces the teacher model. A distillation script then takes that teacher and trains a few-step student using the D-MMD method.

The results section is candid about scale. Runs use a small backbone of about 30 million non-embedding parameters, much smaller than the GPT-2-small-sized backbones in the original papers. The author presents the numbers as evidence that the pipeline works end to end, not as a replication of the paper. Tables compare teacher samples at different top_p settings against student runs with 16 and 32 sampling steps, using a reference-based metric called centered_gm computed with gpt2-large. By step 2,000, early student results sit below the teacher's score, though training was stopped before convergence.

The README documents how to install the package, prepare the OpenWebText data, train a teacher, and run the distillation step, covering single-GPU runs, multi-GPU runs with torchrun, and a RunPod cloud workflow with bundle creation scripts.
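
As a rough illustration of how an autoregressive trainer and a masked diffusion trainer can share one backbone yet differ in objective, the sketch below builds the training pairs each objective would use. It is not taken from the repository: the function names, the MASK_ID value, and the independent per-token masking rate are assumptions made for illustration, and the D-MMD distillation loss itself is not shown.

```python
import random

MASK_ID = -1  # hypothetical id for the mask token; the repo's actual value may differ


def autoregressive_pairs(tokens):
    """Next-token setup: the input at position i is trained to predict token i + 1."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets


def masked_diffusion_pairs(tokens, mask_rate, rng):
    """Masked-diffusion setup: hide a random subset and predict the hidden tokens.

    Each token is replaced by MASK_ID independently with probability mask_rate
    (an assumed corruption scheme). Only the returned positions would
    contribute to the loss; unmasked tokens serve as context.
    """
    corrupted = []
    loss_positions = []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK_ID)
            loss_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, loss_positions


tokens = [11, 42, 7, 99, 5]

# Autoregressive: every position except the last has a next-token target.
ar_inputs, ar_targets = autoregressive_pairs(tokens)
print(ar_inputs, ar_targets)  # [11, 42, 7, 99] [42, 7, 99, 5]

# Masked diffusion: draw a masking rate, corrupt the sequence, score masked slots only.
rng = random.Random(0)
mask_rate = rng.random()
corrupted, loss_positions = masked_diffusion_pairs(tokens, mask_rate, rng)
print(corrupted, loss_positions)
```

In both cases the same network could map a token sequence to per-position logits; what changes is which positions are scored and what the input looks like, which is consistent with the summary's note that the methods differ mainly in the loss.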
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.