ml-gsai/llada

★ 3,781 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((LLaDA))
    What it does
      Diffusion language model
      Unmask to generate
      8B parameters
    Models
      Base pretrained
      Instruct tuned
      Vision LLaDA-V
      MoE variant
    Usage
      Terminal chat
      Gradio web UI
      Benchmark eval
    Research
      Masking objective
      Proper probability
      Scales like GPT

Things people build with this

USE CASE 1

Run the LLaDA-8B-Instruct model to experiment with diffusion-based text generation as an alternative to GPT-style autoregressive models.

USE CASE 2

Launch the Gradio web interface to interact with LLaDA through a browser without writing any code.

USE CASE 3

Evaluate LLaDA on standard language model benchmarks using the lm-evaluation-harness framework.

USE CASE 4

Study the masking-based training objective as a blueprint for building your own diffusion language model.
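
As a starting point for that kind of study, here is a toy sketch of a masked-diffusion loss in plain Python. It assumes the objective is a cross-entropy over masked positions only, weighted by 1/t for a masking ratio t and normalised by sequence length; that weighting and normalisation should be checked against the paper and the repository. `MASK`, `forward_mask`, `masked_diffusion_loss`, and `uniform_log_prob` are hypothetical names made up for illustration, and the "model" is a stand-in that assigns equal probability to every token.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol for this toy example


def forward_mask(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, log_prob_fn, t):
    """Negative log-probability summed over masked positions only,
    weighted by 1/t and divided by sequence length (assumed form; t > 0)."""
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            total -= log_prob_fn(noisy, i, clean)
    return total / (t * len(tokens))


def uniform_log_prob(noisy, position, target, vocab_size=4):
    """Stand-in 'model': every vocabulary item is equally likely."""
    return -math.log(vocab_size)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["the", "cat", "sat", "down"]
    t = rng.uniform(0.1, 1.0)  # masking ratio, kept away from 0 to avoid dividing by 0
    noisy = forward_mask(tokens, t, rng)
    print(noisy, masked_diffusion_loss(tokens, noisy, uniform_log_prob, t))
```

A real implementation would replace `uniform_log_prob` with a network that reads the whole partially masked sequence and predicts every masked position in one pass.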

Tech stack

Python, PyTorch, Transformers, Hugging Face, Gradio

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a GPU with enough VRAM to load an 8B model. Inference is slower than with autoregressive models because generation takes multiple passes.
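
For a rough sense of what "enough VRAM" means, weight memory is approximately parameter count times bytes per parameter. The sketch below assumes exactly 8 billion parameters and standard dtype sizes. It ignores activations, framework overhead, and anything the sampler allocates, so the figures are lower bounds rather than measurements. `weight_memory_gib` is a hypothetical helper, not part of the repository.

```python
# Rough lower bound on the memory needed just to hold model weights.
# Assumption: exactly 8e9 parameters; real checkpoints differ slightly.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return weight storage in GiB for a parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(8e9, dtype):.1f} GiB")
```

Under these assumptions the weights alone come to roughly 15 GiB in bfloat16, which is why a card with more than 16 GB is a sensible starting point.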

Open-source research code; check the repository for the exact license terms.
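
A first run usually begins by loading the weights through Transformers. The sketch below is one plausible way to do that, not the repository's own script. The Hub ID `GSAI-ML/LLaDA-8B-Instruct` and the need for `trust_remote_code=True` are assumptions to confirm in the README, and `load_llada` is a hypothetical helper name.

```python
MODEL_ID = "GSAI-ML/LLaDA-8B-Instruct"  # assumed Hub ID; confirm in the repo README


def load_llada(model_id: str = MODEL_ID, device: str = "cuda"):
    """Load tokenizer and model. Imports are deferred so this file can be
    imported on a machine without the GPU stack installed."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,      # assumption: the checkpoint ships custom model code
        torch_dtype=torch.bfloat16,  # halves weight memory relative to float32
    )
    return tokenizer, model.to(device).eval()


if __name__ == "__main__":
    tokenizer, model = load_llada()
    print(type(model).__name__)
```

Generation itself goes through the repository's sampling code rather than a plain left-to-right decoding loop, so use the provided chat and demo scripts for actual text output.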

In plain English

LLaDA stands for Large Language Diffusion with mAsking. It is a research project from the Gaoling School of Artificial Intelligence (GSAI) at Renmin University of China that trains a large language model with a diffusion approach rather than the autoregressive method most popular language models use today. The result is an 8-billion-parameter model that the authors say performs comparably to Meta's LLaMA3 8B on a range of benchmarks.

Most language models generate text by predicting one token at a time, always moving left to right. LLaDA takes a different route. It starts with a response in which every token is masked, or hidden, and then gradually unmasks tokens across multiple steps until the full answer is revealed. The theoretical motivation is that this approach forms a proper generative model with a well-defined probability distribution over text, which the team argues BERT-style masked models do not achieve. The training objective is an upper bound on the model's negative log-likelihood, so minimizing it is a principled form of likelihood training; the authors present this as the grounding that lets the approach scale and generalize.

In practice, you interact with LLaDA much like any other open-weights language model. The pretrained base model and an instruction-tuned variant called LLaDA-8B-Instruct are both available on Hugging Face. Loading them requires the Transformers library. The repo includes scripts for running a chat session in the terminal, launching a Gradio web interface for a visual demo, and evaluating the model on standard benchmarks using the lm-evaluation-harness framework.

The project has grown since the original February 2025 paper. A vision-language version called LLaDA-V has been added, along with LLaDA 1.5, which improves preference alignment. A Mixture-of-Experts variant called LLaDA-MoE uses only about one billion active parameters at inference time while reportedly outperforming the dense 8B model on some tasks.

One known limitation is sampling speed. Because LLaDA generates a fixed-length response in multiple passes rather than streaming tokens one by one, it is currently slower than autoregressive models and cannot use the KV-cache optimizations those models rely on. The authors acknowledge this and point to ongoing work in the broader diffusion model community to close the gap.
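
The unmasking procedure described above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the repository's sampler: the "model" is a hypothetical stand-in (`toy_predict`) that returns a fixed token and a random confidence, and the rule of revealing the most confident guesses first while leaving the rest masked is one possible schedule chosen here for illustration.

```python
import math
import random

MASK = None  # placeholder for a hidden token

TARGET = ["diffusion", "models", "unmask", "tokens", "in", "parallel"]


def toy_predict(sequence, position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    return TARGET[position % len(TARGET)], rng.random()


def unmask_generate(length, steps, predict, seed=0):
    """Start fully masked and reveal tokens over a fixed number of passes."""
    rng = random.Random(seed)
    sequence = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(sequence) if tok is MASK]
        if not masked:
            break
        # One pass: guess every masked position given the whole sequence.
        guesses = {i: predict(sequence, i, rng) for i in masked}
        # Reveal an even share of what is left, most confident first.
        quota = math.ceil(len(masked) / (steps - step))
        keep = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:quota]
        for i in keep:
            sequence[i] = guesses[i][0]
        yield list(sequence)


if __name__ == "__main__":
    for state in unmask_generate(length=6, steps=3, predict=toy_predict):
        print(["_" if tok is MASK else tok for tok in state])
```

Each pass touches the whole fixed-length sequence, which is the reason generation cost grows with the number of steps and why token-by-token streaming does not apply directly.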

Copy-paste prompts

Prompt 1

Show me how to load LLaDA-8B-Instruct from Hugging Face and run a chat session with it using the provided terminal script.

Prompt 2

How do I launch the Gradio demo for LLaDA on my local machine so I can test it through a browser interface?

Prompt 3

Walk me through running LLaDA on the standard lm-evaluation-harness benchmarks and comparing the results to LLaMA3 8B.

Prompt 4

Explain the LLaDA masking-based training objective. How does starting from a fully masked response and unmasking tokens differ from autoregressive generation?

Prompt 5

What hardware do I need to run LLaDA-8B inference, and are there any quantised versions available to reduce memory requirements?

Verify against the repo before relying on details.