Run Qwen3-4B locally on an Apple Silicon Mac at over 100 tokens per second
Point the official OpenAI Python SDK at a local MLX powered server with no code changes
Read RESEARCH.md to study one successful and five failed MLX optimization attempts
Use the terminal chat REPL or benchmark harness to compare against llama-cli and mlx_lm
Only runs on arm64 Macs with macOS 26 or newer; first run downloads and quantizes weights from Hugging Face before serving.
qwen3-mlx is a small C++ program for Apple Silicon Macs that runs the Qwen3-4B language model and serves answers over an HTTP API. The README frames it as both a working tool and a case study: the author wanted to understand exactly what happens between sending a prompt and getting a token back, so they wrote their own inference engine using Apple's MLX C++ library instead of relying on Python, llama.cpp, or Ollama. On a Mac Studio with an M3 Max chip, the engine reaches 125.77 tokens per second on the Qwen3-4B model, which the README reports is 1.62 times faster than the llama-cli baseline and close to the Python-based mlx_lm tool. A test suite with 49 cases is included, and the project is tagged as version 0.1.0 under the MIT license. It only runs on arm64 Macs with macOS 26 or newer. The README points to a separate file called RESEARCH.md, which it describes as the more interesting part. That file logs six optimisation attempts the author tried, five of which were wrong. Only one worked, which involved quantizing a specific table inside the model so that the last big matrix multiplication reads 50 megabytes per token instead of 778 megabytes. The author records the failed hypotheses too, so that others do not repeat them. The repo builds four binaries: a streaming HTTP server that mimics the OpenAI API (so the official OpenAI Python SDK can talk to it without changes), a terminal chat REPL, a benchmark harness, and the test runner. The server handles one request at a time and returns a 429 response when busy. Weights come from a Hugging Face safetensors download which the engine then quantizes and caches locally on first run.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.