explaingit

arahim3/mlx-dspark

Analysis updated 2026-05-18

33PythonAudience · developerComplexity · 3/5LicenseSetup · easy

TLDR

mlx-dspark runs DSpark and DFlash speculative decoding on Apple Silicon Macs via MLX, making Qwen3 and Gemma-4 text generation 1.5-2x faster while producing output identical to normal decoding.

Mindmap

mindmap
  root((mlx-dspark))
    What it does
      Faster LLM on Mac
      Speculative decoding
      Same output as baseline
    Drafters
      DSpark from DeepSeek
      DFlash from z-lab
    Supported Models
      Qwen3 4B and 8B
      Gemma-4 12B
    Usage Modes
      CLI generate command
      Python library
      OpenAI API server
    Features
      Streaming and tool calls
      Prefix caching
      LM Studio compatible
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

What do people build with it?

USE CASE 1

Speed up Qwen3-8B or Gemma-4-12B text generation on an Apple Silicon Mac by 1.5-2x using DSpark or DFlash without changing the output.

USE CASE 2

Run a local OpenAI-compatible API server on a Mac with speculative decoding speedup, then connect LM Studio or any OpenAI client to it.

USE CASE 3

Benchmark DSpark vs DFlash vs baseline decoding on a specific model to find the fastest configuration for your workload.

What is it built with?

PythonMLXApple SiliconHugging Face

How does it compare?

arahim3/mlx-dspark410979729/scope-recallgongyichuren/tg-watchbot
Stars333333
LanguagePythonPythonPython
Setup difficultyeasymoderatemoderate
Complexity3/53/53/5
Audiencedeveloperdeveloperdeveloper

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy Time to first run · 30min

Requires Apple Silicon Mac (M1 or later), model weights download from Hugging Face on first use and can be several gigabytes.

Free to use for any purpose including commercial use as long as you keep the copyright notice.

In plain English

mlx-dspark is a Python library that speeds up text generation from language models on Apple Silicon Macs. It runs a technique called speculative decoding: a smaller "drafter" model proposes several tokens at once, and the main model verifies them all in parallel. When the drafter's guesses are correct, the result is the same as normal decoding but faster. When they are wrong, the main model corrects them. The output is always identical to what the main model would have produced on its own. The library supports two drafter types. DSpark, published by DeepSeek, proposes a block of seven tokens at once using a small parallel network trained to predict the main model's next moves. DFlash, from z-lab, uses a different approach called block diffusion, denoising a full sixteen-token block in one parallel pass. Both work with the Qwen3 and Gemma-4 model families and produce speedups of roughly 1.5 to 2 times on typical text, with larger gains on structured output like code. Installation is a single pip command. The library downloads the main model and a matched drafter from Hugging Face on first use. It runs on Apple Silicon via MLX, no server or external GPU is needed. It can be used from the command line, called from Python code, or run as an OpenAI-compatible API server that tools like LM Studio can connect to. The built-in API server supports streaming, multi-turn chat, tool calling, and prefix caching. Prefix caching keeps the conversation context in memory between turns so follow-up messages do not re-process the full history each time, which speeds up long conversations significantly. The library is MIT-licensed. Matched drafters are available for Qwen3 (4B, 8B) and Gemma-4 (12B), other DSpark or DFlash checkpoints can be used by specifying a drafter path manually.

Copy-paste prompts

Prompt 1
I want to run Qwen3-8B on my M3 MacBook with DSpark speculative decoding. Walk me through installing mlx-dspark, downloading the model, and starting an OpenAI API server.
Prompt 2
Show me how to use mlx-dspark from Python to generate text with DSpark and print the acceptance length and tokens per second.
Prompt 3
What is the difference between DSpark and DFlash in mlx-dspark and which should I use for code generation vs conversational tasks?
Prompt 4
How much RAM do I need to run Gemma-4-12B with DSpark on Apple Silicon, and how do I use a 4-bit quantized version to fit a smaller Mac?
Prompt 5
I want to connect LM Studio to the mlx-dspark API server. What base URL do I use and does it support streaming and tool calling?

Frequently asked questions

What is mlx-dspark?

mlx-dspark runs DSpark and DFlash speculative decoding on Apple Silicon Macs via MLX, making Qwen3 and Gemma-4 text generation 1.5-2x faster while producing output identical to normal decoding.

What language is mlx-dspark written in?

Mainly Python. The stack also includes Python, MLX, Apple Silicon.

What license does mlx-dspark use?

Free to use for any purpose including commercial use as long as you keep the copyright notice.

How hard is mlx-dspark to set up?

Setup difficulty is rated easy, with roughly 30min to a first successful run.

Who is mlx-dspark for?

Mainly developer.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Scan in gitsafehub Deploy in gitdeployhub arahim3 on gitmyhub

Verify against the repo before relying on details.