z-lab/dflash

★ 4,529 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Draft token guessing
      Parallel verification
      Faster generation
    Tech Stack
      Python
      vLLM backend
      SGLang backend
      Hugging Face
    Supported Models
      Qwen3 variants
      LLaMA 3.1
      Gemma 4
    Use Cases
      LLM inference speed
      Server throughput
      Apple Silicon runs

Things people build with this

USE CASE 1

Speed up text generation from a large language model without changing its output quality.

USE CASE 2

Add speculative decoding to an existing vLLM or SGLang inference server with a single config argument.

USE CASE 3

Run faster LLM inference on Apple Silicon Macs using the MLX backend with a DFlash draft model.

Tech stack

Python, vLLM, SGLang, Hugging Face Transformers, MLX

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires a running vLLM, SGLang, MLX, or Transformers environment with a compatible base model and the paired DFlash draft model downloaded from Hugging Face.

License: not mentioned in the explanation; check the repository for details.

In plain English

Large language models generate text one token at a time, and each step depends on the previous one. This is inherently slow when you want many tokens. Speculative decoding speeds this up by running a small, fast "draft" model that guesses several upcoming tokens at once, then letting the large model verify all of those guesses in a single parallel pass. Guesses are accepted up to the first mismatch; the mismatched token and everything after it are discarded and regenerated. The net effect is faster output from the large model without any change to the quality of its responses.

DFlash is a lightweight draft model built specifically for this purpose. It uses a block diffusion approach, meaning it generates a batch of candidate tokens simultaneously rather than one at a time. The repository provides pre-trained DFlash draft models for a range of popular large language models, including several Qwen3 and Qwen3.5 variants, Gemma 4, LLaMA 3.1, Kimi, and MiniMax models. Each draft model is paired to one specific base model and hosted on Hugging Face.

The library works with four serving backends: vLLM, SGLang, Hugging Face Transformers, and MLX for Apple Silicon machines. Each backend has its own install path and example commands in the README. For vLLM and SGLang, you start a server and pass the speculative decoding configuration as a JSON argument pointing at the DFlash draft model. For Transformers and MLX, you load both the main model and the draft model in Python and call a special generation function.

Benchmarking scripts are included and cover standard datasets such as GSM8K math problems, HumanEval coding tasks, and MT-Bench conversation quality. The authors plan to release the training recipe so that users can train DFlash draft models for large language models not yet on the supported list.
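The draft-then-verify loop described above can be illustrated with a toy example. This is a minimal sketch, not DFlash's code or any backend's API: `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the large model and the block draft model, tokens are plain integers, and verification is greedy (real systems that sample use a probabilistic acceptance rule instead). It shows two things: the speculative output is identical to plain greedy decoding, and it needs fewer target-model passes.

```python
from typing import List, Tuple

VOCAB_SIZE = 11  # toy vocabulary; tokens are integers in [0, VOCAB_SIZE)


def toy_target_next(context: List[int]) -> int:
    """Hypothetical 'large model': deterministic greedy next token."""
    return (3 * context[-1] + len(context)) % VOCAB_SIZE


def toy_draft_block(context: List[int], block_size: int) -> List[int]:
    """Hypothetical draft model: proposes a whole block of tokens.

    It imitates the target but is deliberately wrong whenever the
    context length is a multiple of 4, so some guesses get rejected.
    """
    ctx = list(context)
    block = []
    for _ in range(block_size):
        guess = toy_target_next(ctx)
        if len(ctx) % 4 == 0:
            guess = (guess + 1) % VOCAB_SIZE  # injected draft error
        block.append(guess)
        ctx.append(guess)
    return block


def greedy_decode(prompt: List[int], num_new_tokens: int) -> List[int]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(toy_target_next(tokens))
    return tokens


def speculative_decode(
    prompt: List[int], num_new_tokens: int, block_size: int = 4
) -> Tuple[List[int], int]:
    """Draft a block, verify it, keep the longest correct prefix."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new_tokens:
        draft = toy_draft_block(tokens, block_size)
        # One verification pass. A real backend scores every drafted
        # position in a single batched forward pass; the loop below
        # only simulates that result position by position.
        target_passes += 1
        accepted: List[int] = []
        for guess in draft:
            expected = toy_target_next(tokens + accepted)
            if guess != expected:
                accepted.append(expected)  # target's correction
                break
            accepted.append(guess)
        else:
            # Every guess matched: the same pass yields one extra token.
            accepted.append(toy_target_next(tokens + accepted))
        tokens.extend(accepted)
    # The last block may overshoot; trim to the requested length.
    return tokens[: len(prompt) + num_new_tokens], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 20)
    fast, passes = speculative_decode(prompt, 20)
    print("same output:", fast == baseline)
    print("target passes:", passes, "vs baseline passes:", 20)
```

Because every token that survives verification is the target model's own greedy choice, the output cannot differ from the baseline; the speedup comes entirely from how many drafted tokens are accepted per verification pass, which is why each draft model is paired with one specific base model.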

Copy-paste prompts

Prompt 1

Show me how to configure vLLM to use a DFlash draft model for Qwen3 to enable speculative decoding and measure the speedup.

Prompt 2

I'm running an SGLang inference server with LLaMA 3.1. Write the JSON configuration to add DFlash speculative decoding.

Prompt 3

Using Hugging Face Transformers in Python, show me how to load both a LLaMA 3.1 model and its DFlash draft model and run speculative decoding generation.

Prompt 4

Show me how to run the DFlash GSM8K benchmark to measure the token generation speedup compared to standard decoding.

Prompt 5

I have an Apple Silicon Mac and want to run DFlash with MLX. Show me the install steps and a sample generation call.

Verify against the repo before relying on details.