Speed up text generation from a large language model without changing its output quality.
Add speculative decoding to an existing vLLM or SGLang inference server with a single config argument.
Run faster LLM inference on Apple Silicon Macs using the MLX backend with a DFlash draft model.
Requires a running vLLM, SGLang, MLX, or Transformers environment with a compatible base model and the paired DFlash draft model downloaded from Hugging Face.
Large language models generate text one token at a time, and each step depends on the one before it, so producing long outputs is inherently slow. Speculative decoding speeds this up by running a small, fast "draft" model that guesses several upcoming tokens at once, then letting the large model verify all of those guesses in a single parallel pass. Correct guesses are accepted; wrong ones are discarded and regenerated. The net effect is faster output from the large model without any change to the quality of its responses.

DFlash is a lightweight draft model built specifically for this purpose. It uses a block diffusion approach, meaning it generates a block of candidate tokens simultaneously rather than one at a time. The repository provides pre-trained DFlash draft models for a range of popular large language models, including several Qwen3 and Qwen3.5 variants, Gemma4, LLaMA 3.1, Kimi, and MiniMax models. Each draft model is paired with one specific base model and hosted on Hugging Face.

The library works with four serving backends: vLLM, SGLang, Hugging Face Transformers, and MLX for Apple Silicon machines. Each backend has its own install path and example commands in the README. For vLLM and SGLang, you start a server and pass the speculative decoding configuration as a JSON argument that points at the DFlash draft model. For Transformers and MLX, you load both the main model and the draft model in Python and call a special generation function.

Benchmarking scripts are included and cover standard datasets such as GSM8K math problems, HumanEval coding tasks, and MT-Bench conversations. The authors plan to release the training recipe so that users can train DFlash draft models for large language models not yet on the supported list.
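The draft-and-verify loop described above can be illustrated with a small, self-contained Python sketch. This is not DFlash's code or any backend's API: `speculative_generate`, `greedy_generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are toy functions over integer tokens, and the sketch assumes greedy (deterministic) decoding. It also drafts tokens one by one in a loop, whereas a block diffusion drafter would produce the whole block at once; only the accept/reject logic is the point here.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_generate(target_next: NextTokenFn, prompt: List[Token],
                    max_new_tokens: int) -> List[Token]:
    """Baseline: the large model produces one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(target_next: NextTokenFn, draft_next: NextTokenFn,
                         prompt: List[Token], max_new_tokens: int,
                         block_size: int = 4) -> List[Token]:
    """Toy greedy speculative decoding: draft a block, then verify it."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The draft model proposes a block of candidate tokens.
        draft: List[Token] = []
        for _ in range(block_size):
            draft.append(draft_next(tokens + draft))

        # 2. The large model checks each position. A real system scores all
        #    positions in one batched forward pass; here we loop for clarity.
        accepted: List[Token] = []
        for guess in draft:
            expected = target_next(tokens + accepted)
            if guess == expected:
                accepted.append(guess)      # correct guess: keep it
            else:
                accepted.append(expected)   # wrong guess: use the large model's token
                break                       # and discard the rest of the block
        else:
            # Every guess matched, so the verification pass yields one extra token.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:target_len]


def toy_target(context: List[Token]) -> Token:
    """Stand-in for the large model (assumption: any deterministic function works)."""
    return (sum(context) * 31 + len(context)) % 50


def toy_draft(context: List[Token]) -> Token:
    """Stand-in for the draft model: agrees with the target most of the time."""
    guess = toy_target(context)
    return guess if len(context) % 4 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_generate(toy_target, toy_draft, prompt, max_new_tokens=20)
    slow = greedy_generate(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

Because every kept token is either confirmed or supplied by the large model, the speculative output in this sketch matches plain greedy decoding token for token; the draft model only affects how many tokens are gained per verification pass.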
Verify against the repo before relying on details.