explaingit

hook12aaa/qwen3-mlx

0C++Audience · developerComplexity · 4/5ActiveLicenseSetup · hard

TLDR

A C++ inference engine for Apple Silicon that runs Qwen3-4B on Apple MLX, exposes an OpenAI compatible HTTP API, and benchmarks at 125 tokens per second on an M3 Max.

Mindmap

mindmap
  root((qwen3-mlx))
    Inputs
      Prompt text
      Hugging Face weights
      HTTP request
    Outputs
      Streaming tokens
      OpenAI style JSON
      Benchmark numbers
    Use Cases
      Local Qwen3 inference on Mac
      Drop in OpenAI API replacement
      Study MLX optimization
      Run a terminal chat REPL
    Tech Stack
      C++
      Apple MLX
      HTTP
      Hugging Face safetensors

Things people build with this

USE CASE 1

Run Qwen3-4B locally on an Apple Silicon Mac at over 100 tokens per second

USE CASE 2

Point the official OpenAI Python SDK at a local MLX powered server with no code changes

USE CASE 3

Read RESEARCH.md to study one successful and five failed MLX optimization attempts

USE CASE 4

Use the terminal chat REPL or benchmark harness to compare against llama-cli and mlx_lm

Tech stack

C++MLXHTTPHugging Face

Getting it running

Difficulty · hard Time to first run · 1h+

Only runs on arm64 Macs with macOS 26 or newer; first run downloads and quantizes weights from Hugging Face before serving.

MIT means anyone can use, copy, modify, and redistribute the code commercially as long as the original copyright notice is included.

In plain English

qwen3-mlx is a small C++ program for Apple Silicon Macs that runs the Qwen3-4B language model and serves answers over an HTTP API. The README frames it as both a working tool and a case study: the author wanted to understand exactly what happens between sending a prompt and getting a token back, so they wrote their own inference engine using Apple's MLX C++ library instead of relying on Python, llama.cpp, or Ollama. On a Mac Studio with an M3 Max chip, the engine reaches 125.77 tokens per second on the Qwen3-4B model, which the README reports is 1.62 times faster than the llama-cli baseline and close to the Python-based mlx_lm tool. A test suite with 49 cases is included, and the project is tagged as version 0.1.0 under the MIT license. It only runs on arm64 Macs with macOS 26 or newer. The README points to a separate file called RESEARCH.md, which it describes as the more interesting part. That file logs six optimisation attempts the author tried, five of which were wrong. Only one worked, which involved quantizing a specific table inside the model so that the last big matrix multiplication reads 50 megabytes per token instead of 778 megabytes. The author records the failed hypotheses too, so that others do not repeat them. The repo builds four binaries: a streaming HTTP server that mimics the OpenAI API (so the official OpenAI Python SDK can talk to it without changes), a terminal chat REPL, a benchmark harness, and the test runner. The server handles one request at a time and returns a 429 response when busy. Weights come from a Hugging Face safetensors download which the engine then quantizes and caches locally on first run.

Copy-paste prompts

Prompt 1
Build qwen3-mlx on my M3 Max and start the streaming HTTP server on the default port
Prompt 2
Point the OpenAI Python SDK at the qwen3-mlx server and run a streaming chat completion
Prompt 3
Walk me through the quantized table optimization in qwen3-mlx that cut per token reads from 778 MB to 50 MB
Prompt 4
Run the qwen3-mlx benchmark harness and compare its tokens per second against mlx_lm on the same prompt
Prompt 5
Add request queuing to qwen3-mlx so it stops returning 429 when a second request arrives mid generation
Open on GitHub → Explain another repo

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.