Run the LLaDA-8B-Instruct model to experiment with diffusion-based text generation as an alternative to GPT-style autoregressive models.
Launch the Gradio web interface to interact with LLaDA through a browser without writing any code.
Evaluate LLaDA on standard language model benchmarks using the lm-evaluation-harness framework.
Study the masking-based training objective as a blueprint for building your own diffusion language model.
Requires a GPU with enough VRAM to load an 8B model. Inference is slower than with autoregressive models because generation takes multiple passes.
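As a rough guide to the VRAM requirement above, the sketch below estimates the memory needed for the weights alone. It assumes a nominal 8 billion parameters and counts only the weights; activations, framework overhead, and the exact parameter count of the released checkpoints are not included, so treat the numbers as a lower bound. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 8e9  # assumption: "8B" taken as exactly 8 billion parameters

# Bytes per parameter for two common floating-point dtypes.
for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2)]:
    gib = weight_memory_gib(NOMINAL_PARAMS, nbytes)
    print(f"{dtype_name}: about {gib:.1f} GiB for weights only")
```

In half precision the weights alone come to roughly 15 GiB under these assumptions, so a card with more headroom than that is needed in practice.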
LLaDA stands for Large Language Diffusion with mAsking. It is a research project from the Gaoling School of Artificial Intelligence (GSAI) at Renmin University of China that trains a large language model using a diffusion approach rather than the autoregressive method that most popular language models use today. The result is an 8-billion-parameter model that the authors say performs comparably to Meta's LLaMA3 8B on a range of benchmarks.

Most language models generate text by predicting one token at a time, always moving left to right. LLaDA takes a different route. It starts with a response in which every token is masked, or hidden, and then gradually unmasks tokens across multiple steps until the full answer is revealed. The theoretical motivation is that this approach forms a proper generative model with a well-defined probability distribution over text, which the team argues is something BERT-style masked models do not achieve. The training objective is an upper bound on the model's negative log-likelihood, which the authors present as the mathematical grounding needed to scale and generalize.

In practice, you interact with LLaDA much like any other open-weights language model. The pretrained base model and an instruction-tuned variant called LLaDA-8B-Instruct are both available on Hugging Face. Loading them requires the Transformers library. The repo includes scripts for running a chat session in the terminal, launching a Gradio web interface for a visual demo, and evaluating the model on standard benchmarks using the lm-evaluation-harness framework.

The project has grown since the original February 2025 paper. A vision-language version called LLaDA-V has been added, along with LLaDA 1.5, which improves preference alignment. A Mixture-of-Experts variant called LLaDA-MoE uses only about one billion active parameters at inference time while reportedly outperforming the dense 8B model on some tasks.

One known limitation is sampling speed. Because LLaDA generates a fixed-length response in multiple passes rather than streaming tokens one by one, it is currently slower than autoregressive models and cannot use the KV-cache optimizations those models rely on. The authors acknowledge this and point to ongoing work in the broader diffusion model community to close the gap.
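The mask-then-unmask generation loop described above can be illustrated with a toy sketch. This is not the repo's sampler: `toy_predict` is a hypothetical stand-in for the real mask predictor that simply returns the correct token with a random confidence score, and the rule of keeping the most confident predictions at each step is only loosely modeled on confidence-based remasking. The point is the control flow: a fixed-length response starts fully masked, and every pass looks at the whole sequence and reveals a few more positions.

```python
import random

MASK = "<mask>"


def toy_predict(sequence, target, rng):
    """Hypothetical stand-in for the mask predictor.

    Returns {position: (predicted_token, confidence)} for every masked
    position. A real model would score the whole vocabulary instead.
    """
    return {
        i: (target[i], rng.random())
        for i, token in enumerate(sequence)
        if token == MASK
    }


def diffusion_sample(target, steps, seed=0):
    """Unmask a fixed-length response over at most `steps` passes."""
    rng = random.Random(seed)
    sequence = [MASK] * len(target)
    history = [list(sequence)]  # snapshot after each pass, for inspection

    for step in range(steps):
        predictions = toy_predict(sequence, target, rng)
        if not predictions:
            break  # nothing left to unmask

        # Spread the remaining masked positions evenly over the remaining passes.
        remaining_steps = steps - step
        n_reveal = -(-len(predictions) // remaining_steps)  # ceiling division

        # Keep the most confident predictions; the rest stay masked for later passes.
        by_confidence = sorted(
            predictions, key=lambda i: predictions[i][1], reverse=True
        )
        for i in by_confidence[:n_reveal]:
            sequence[i] = predictions[i][0]

        history.append(list(sequence))

    return sequence, history


if __name__ == "__main__":
    words = "diffusion models unmask tokens in parallel".split()
    final, snapshots = diffusion_sample(words, steps=3)
    for snapshot in snapshots:
        print(" ".join(snapshot))
```

Each pass calls the predictor on the full sequence, which is why this style of generation costs several forward passes per response and does not fit the incremental caching that left-to-right decoding uses.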
Verify against the repo before relying on details.