Analysis updated 2026-05-18
Speed up Qwen3-8B or Gemma-4-12B text generation on an Apple Silicon Mac by 1.5-2x using DSpark or DFlash without changing the output.
Run a local OpenAI-compatible API server on a Mac with speculative decoding speedup, then connect LM Studio or any OpenAI client to it.
Benchmark DSpark vs DFlash vs baseline decoding on a specific model to find the fastest configuration for your workload.
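A comparison like the one in the last use case comes down to timing each configuration on the same prompt and dividing by the baseline rate. The sketch below is a minimal, library-agnostic harness. The `generate` callables are hypothetical stand-ins that you would wrap around whatever generation call you use; they are not part of mlx-dspark's API.

```python
import time
from typing import Callable, Dict


def tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Time a generation callable and return its best observed tokens/second.

    `generate` is a hypothetical wrapper: it runs one generation and
    returns the number of tokens it produced.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid dividing by zero
        best = max(best, n_tokens / elapsed)
    return best


def compare(configs: Dict[str, Callable[[], int]],
            baseline: str = "baseline") -> Dict[str, float]:
    """Return each configuration's speedup relative to the baseline entry."""
    rates = {name: tokens_per_second(fn) for name, fn in configs.items()}
    return {name: rate / rates[baseline] for name, rate in rates.items()}
```

Taking the best of several runs reduces the effect of one-off stalls such as a first-call model load. Use the same prompt and the same token budget for every configuration so the ratios are comparable.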
| | arahim3/mlx-dspark | 410979729/scope-recall | gongyichuren/tg-watchbot |
|---|---|---|---|
| Stars | 33 | 33 | 33 |
| Language | Python | Python | Python |
| Setup difficulty | easy | moderate | moderate |
| Complexity | 3/5 | 3/5 | 3/5 |
| Audience | developer | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires an Apple Silicon Mac (M1 or later). Model weights download from Hugging Face on first use and can be several gigabytes.
mlx-dspark is a Python library that speeds up text generation from language models on Apple Silicon Macs. It runs a technique called speculative decoding: a smaller "drafter" model proposes several tokens at once, and the main model verifies them all in parallel. Where the drafter's guesses match what the main model would have chosen, they are accepted, so several tokens are produced for the cost of one verification pass. Where a guess is wrong, the main model's own token replaces it and drafting resumes from there. The output is always identical to what the main model would have produced on its own.

The library supports two drafter types. DSpark, published by DeepSeek, proposes a block of seven tokens at once using a small parallel network trained to predict the main model's next moves. DFlash, from z-lab, uses a different approach called block diffusion, denoising a full sixteen-token block in one parallel pass. Both work with the Qwen3 and Gemma-4 model families and produce speedups of roughly 1.5 to 2 times on typical text, with larger gains on structured output like code.

Installation is a single pip command. The library downloads the main model and a matched drafter from Hugging Face on first use. It runs on Apple Silicon via MLX, so no server or external GPU is needed. It can be used from the command line, called from Python code, or run as an OpenAI-compatible API server that tools like LM Studio can connect to.

The built-in API server supports streaming, multi-turn chat, tool calling, and prefix caching. Prefix caching keeps the conversation context in memory between turns so follow-up messages do not re-process the full history each time, which speeds up long conversations significantly.

The library is MIT-licensed. Matched drafters are available for Qwen3 (4B, 8B) and Gemma-4 (12B). Other DSpark or DFlash checkpoints can be used by specifying a drafter path manually.
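The draft-and-verify loop described above can be illustrated with a toy greedy version. This is a sketch under stated assumptions, not mlx-dspark's implementation: the "models" are hypothetical plain functions from a token list to the next token, the block size of 7 simply mirrors the DSpark block size mentioned above, and the verification is written as a Python loop where a real implementation would score all drafted positions in one batched forward pass (and often takes an extra token from that pass as well).

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical model interface


def toy_target(seq: List[Token]) -> Token:
    """Stand-in for the main model: a deterministic next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_drafter(seq: List[Token]) -> Token:
    """Stand-in for the drafter: agrees with the target except at some positions."""
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 5 == 0 else guess


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, drafter: NextToken, prompt: List[Token],
                       max_new: int, block: int = 7) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (new tokens, verification passes)."""
    seq = list(prompt)
    produced = 0
    passes = 0
    while produced < max_new:
        # 1. The drafter proposes up to `block` tokens, feeding on its own guesses.
        k = min(block, max_new - produced)
        ctx = list(seq)
        draft: List[Token] = []
        for _ in range(k):
            token = drafter(ctx)
            draft.append(token)
            ctx.append(token)
        # 2. The target checks every drafted position (counted as one pass).
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(seq + draft[:i])
            accepted.append(expected)  # always the target's own choice
            if expected != draft[i]:
                break  # first mismatch: keep the correction, discard the rest
        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):], passes
```

Because every emitted token is the target's own choice for its position, the result equals `greedy_decode(toy_target, ...)` on the same prompt. The gain comes from needing fewer verification passes than generated tokens whenever the drafter agrees with the target.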
mlx-dspark runs DSpark and DFlash speculative decoding on Apple Silicon Macs via MLX, making Qwen3 and Gemma-4 text generation 1.5-2x faster while producing output identical to normal decoding.
Mainly Python. The stack also includes MLX and Apple Silicon.
Free to use for any purpose including commercial use as long as you keep the copyright notice.
Setup difficulty is rated easy, with roughly 30 minutes to a first successful run.
Mainly developers.
Verify against the repo before relying on details.