arahim3/mlx-dspark

Analysis updated 2026-05-18

★ 33
Python
Audience · developer
Complexity · 3/5
License · MIT
Setup · easy

Mindmap

mindmap
  root((mlx-dspark))
    What it does
      Faster LLM on Mac
      Speculative decoding
      Same output as baseline
    Drafters
      DSpark from DeepSeek
      DFlash from z-lab
    Supported Models
      Qwen3 4B and 8B
      Gemma-4 12B
    Usage Modes
      CLI generate command
      Python library
      OpenAI API server
    Features
      Streaming and tool calls
      Prefix caching
      LM Studio compatible


What do people build with it?

USE CASE 1

Speed up Qwen3-8B or Gemma-4-12B text generation on an Apple Silicon Mac by 1.5-2x using DSpark or DFlash without changing the output.

USE CASE 2

Run a local OpenAI-compatible API server on a Mac with speculative decoding speedup, then connect LM Studio or any OpenAI client to it.

USE CASE 3

Benchmark DSpark vs DFlash vs baseline decoding on a specific model to find the fastest configuration for your workload.
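
A comparison like the one in use case 3 comes down to timing the same prompt under each configuration and comparing tokens per second. The sketch below is a generic harness, not part of mlx-dspark: `tokens_per_second`, `fake_baseline`, and `fake_speculative` are hypothetical names, and the two stand-in functions only sleep to imitate a slow and a fast decoder. In practice each would wrap a real generation call and return the generated tokens.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[], List[int]], repeats: int = 3) -> float:
    """Return the best observed throughput over several runs of `generate`."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        produced = len(generate())
        # Guard against a zero interval on very fast calls.
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, produced / elapsed)
    return best


# Stand-ins for real decoders (assumption: each returns the generated tokens).
def fake_baseline() -> List[int]:
    time.sleep(0.04)
    return [0] * 32


def fake_speculative() -> List[int]:
    time.sleep(0.01)
    return [0] * 32


baseline_tps = tokens_per_second(fake_baseline)
spec_tps = tokens_per_second(fake_speculative)
print(f"baseline: {baseline_tps:.0f} tok/s, speculative: {spec_tps:.0f} tok/s")
print(f"speedup: {spec_tps / baseline_tps:.2f}x")
```

Taking the best of several runs reduces noise from warm-up and background load. When benchmarking real decoders, it is also worth checking that both configurations return the same tokens before comparing speed.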

What is it built with?

Python, MLX, Apple Silicon, Hugging Face

How does it compare?

	arahim3/mlx-dspark	410979729/scope-recall	gongyichuren/tg-watchbot
Stars	33	33	33
Language	Python	Python	Python
Setup difficulty	easy	moderate	moderate
Complexity	3/5	3/5	3/5
Audience	developer	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy Time to first run · 30min

Requires an Apple Silicon Mac (M1 or later). Model weights are downloaded from Hugging Face on first use and can be several gigabytes.

Free to use for any purpose including commercial use as long as you keep the copyright notice.

In plain English

mlx-dspark is a Python library that speeds up text generation from language models on Apple Silicon Macs. It uses a technique called speculative decoding: a smaller "drafter" model proposes several tokens at once, and the main model verifies them all in parallel. When the drafter's guesses are correct, the result is the same as normal decoding but faster. When they are wrong, the main model corrects them. The output is always identical to what the main model would have produced on its own.

The library supports two drafter types. DSpark, published by DeepSeek, proposes a block of seven tokens at once using a small parallel network trained to predict the main model's next moves. DFlash, from z-lab, uses a different approach called block diffusion, denoising a full sixteen-token block in one parallel pass. Both work with the Qwen3 and Gemma-4 model families and produce speedups of roughly 1.5 to 2 times on typical text, with larger gains on structured output like code.

Installation is a single pip command. The library downloads the main model and a matched drafter from Hugging Face on first use. It runs on Apple Silicon via MLX, so no server or external GPU is needed. It can be used from the command line, called from Python code, or run as an OpenAI-compatible API server that tools like LM Studio can connect to.

The built-in API server supports streaming, multi-turn chat, tool calling, and prefix caching. Prefix caching keeps the conversation context in memory between turns so follow-up messages do not re-process the full history each time, which speeds up long conversations significantly.

The library is MIT-licensed. Matched drafters are available for Qwen3 (4B, 8B) and Gemma-4 (12B). Other DSpark or DFlash checkpoints can be used by specifying a drafter path manually.
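
The draft-and-verify loop described above can be illustrated with a toy sketch. This is not mlx-dspark's code or API: `toy_target`, `toy_drafter`, `greedy_decode`, and `speculative_decode` are hypothetical names, both "models" are plain functions over integer tokens, decoding is assumed to be greedy, and the drafter here guesses one token at a time rather than producing a block in a single parallel pass. The verification is also written as a loop for clarity, where a real implementation would score the whole block in one forward pass of the main model.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" is any function that maps a token sequence to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[int],
    max_new: int, block: int = 7,
) -> Tuple[List[int], float]:
    """Return (new tokens, mean tokens accepted per verification step)."""
    tokens = list(prompt)
    produced = 0
    steps = 0
    while produced < max_new:
        # 1. The drafter proposes a block of tokens.
        draft: List[int] = []
        for _ in range(block):
            draft.append(drafter(tokens + draft))
        # 2. The target checks each proposal; every kept token is the target's own choice.
        accepted: List[int] = []
        for guess in draft:
            truth = target(tokens + accepted)
            accepted.append(truth)
            if truth != guess:
                break  # first mismatch: keep the correction, drop the rest of the draft
        else:
            # Whole block matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))
        accepted = accepted[: max_new - produced]
        tokens.extend(accepted)
        produced += len(accepted)
        steps += 1
    return tokens[len(prompt):], produced / steps


# Toy models: the drafter agrees with the target except at every fourth position.
def toy_target(tokens: Sequence[int]) -> int:
    return (tokens[-1] * 3 + 1) % 17


def toy_drafter(tokens: Sequence[int]) -> int:
    guess = toy_target(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 17


new_tokens, mean_accepted = speculative_decode(toy_target, toy_drafter, [1, 2, 3], max_new=20)
print(new_tokens == greedy_decode(toy_target, [1, 2, 3], 20))
print(f"mean tokens per verification step: {mean_accepted:.2f}")
```

Because every emitted token is the one the target would have picked for that prefix, the output matches plain greedy decoding regardless of how good the drafter is. The drafter only affects how many tokens each verification step yields, which is where the speedup comes from.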

Copy-paste prompts

Prompt 1

I want to run Qwen3-8B on my M3 MacBook with DSpark speculative decoding. Walk me through installing mlx-dspark, downloading the model, and starting an OpenAI API server.

Prompt 2

Show me how to use mlx-dspark from Python to generate text with DSpark and print the acceptance length and tokens per second.

Prompt 3

What is the difference between DSpark and DFlash in mlx-dspark and which should I use for code generation vs conversational tasks?

Prompt 4

How much RAM do I need to run Gemma-4-12B with DSpark on Apple Silicon, and how do I use a 4-bit quantized version to fit a smaller Mac?

Prompt 5

I want to connect LM Studio to the mlx-dspark API server. What base URL do I use and does it support streaming and tool calling?

Frequently asked questions

What is mlx-dspark?

mlx-dspark runs DSpark and DFlash speculative decoding on Apple Silicon Macs via MLX, making Qwen3 and Gemma-4 text generation 1.5-2x faster while producing output identical to normal decoding.

What language is mlx-dspark written in?

Mainly Python. The stack also includes MLX, Apple Silicon, and Hugging Face.

What license does mlx-dspark use?

Free to use for any purpose including commercial use as long as you keep the copyright notice.

How hard is mlx-dspark to set up?

Setup difficulty is rated easy, with roughly 30 minutes to a first successful run.

Who is mlx-dspark for?

Mainly developers.


Verify against the repo before relying on details.