hook12aaa/qwen3-mlx

Analysis updated 2026-06-24

★ 0C++Audience · developerComplexity · 4/5LicenseSetup · hard

Mindmap

mindmap
  root((qwen3-mlx))
    Inputs
      Prompt text
      Hugging Face weights
      HTTP request
    Outputs
      Streaming tokens
      OpenAI style JSON
      Benchmark numbers
    Use Cases
      Local Qwen3 inference on Mac
      Drop in OpenAI API replacement
      Study MLX optimization
      Run a terminal chat REPL
    Tech Stack
      C++
      Apple MLX
      HTTP
      Hugging Face safetensors

mindmap root((qwen3-mlx)) Inputs Prompt text Hugging Face weights HTTP request Outputs Streaming tokens OpenAI style JSON Benchmark numbers Use Cases Local Qwen3 inference on Mac Drop in OpenAI API replacement Study MLX optimization Run a terminal chat REPL Tech Stack C++ Apple MLX HTTP Hugging Face safetensors

Click or tap to explore — scroll the page freely

What do people build with it?

USE CASE 1

Run Qwen3-4B locally on an Apple Silicon Mac at over 100 tokens per second

USE CASE 2

Point the official OpenAI Python SDK at a local MLX powered server with no code changes

USE CASE 3

Read RESEARCH.md to study one successful and five failed MLX optimization attempts

USE CASE 4

Use the terminal chat REPL or benchmark harness to compare against llama-cli and mlx_lm

What is it built with?

C++MLXHTTPHugging Face

How does it compare?

	hook12aaa/qwen3-mlx	ujjwalkarn/xgboost	wenqijiang/fast-vector-similarity-search-on-fpga
Stars	0	—	—
Language	C++	C++	C++
Last pushed	—	2015-05-02	2021-10-31
Maintenance	—	Dormant	Dormant
Setup difficulty	hard	moderate	hard
Complexity	4/5	3/5	5/5
Audience	developer	data	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard Time to first run · 1h+

Only runs on arm64 Macs with macOS 26 or newer, first run downloads and quantizes weights from Hugging Face before serving.

MIT means anyone can use, copy, modify, and redistribute the code commercially as long as the original copyright notice is included.

In plain English

qwen3-mlx is a small C++ program for Apple Silicon Macs that runs the Qwen3-4B language model and serves answers over an HTTP API. The README frames it as both a working tool and a case study: the author wanted to understand exactly what happens between sending a prompt and getting a token back, so they wrote their own inference engine using Apple's MLX C++ library instead of relying on Python, llama.cpp, or Ollama. On a Mac Studio with an M3 Max chip, the engine reaches 125.77 tokens per second on the Qwen3-4B model, which the README reports is 1.62 times faster than the llama-cli baseline and close to the Python-based mlx_lm tool. A test suite with 49 cases is included, and the project is tagged as version 0.1.0 under the MIT license. It only runs on arm64 Macs with macOS 26 or newer. The README points to a separate file called RESEARCH.md, which it describes as the more interesting part. That file logs six optimisation attempts the author tried, five of which were wrong. Only one worked, which involved quantizing a specific table inside the model so that the last big matrix multiplication reads 50 megabytes per token instead of 778 megabytes. The author records the failed hypotheses too, so that others do not repeat them. The repo builds four binaries: a streaming HTTP server that mimics the OpenAI API (so the official OpenAI Python SDK can talk to it without changes), a terminal chat REPL, a benchmark harness, and the test runner. The server handles one request at a time and returns a 429 response when busy. Weights come from a Hugging Face safetensors download which the engine then quantizes and caches locally on first run.

Copy-paste prompts

Prompt 1

Build qwen3-mlx on my M3 Max and start the streaming HTTP server on the default port

Prompt 2

Point the OpenAI Python SDK at the qwen3-mlx server and run a streaming chat completion

Prompt 3

Walk me through the quantized table optimization in qwen3-mlx that cut per token reads from 778 MB to 50 MB

Prompt 4

Run the qwen3-mlx benchmark harness and compare its tokens per second against mlx_lm on the same prompt

Prompt 5

Add request queuing to qwen3-mlx so it stops returning 429 when a second request arrives mid generation

Frequently asked questions

What is qwen3-mlx?

A C++ inference engine for Apple Silicon that runs Qwen3-4B on Apple MLX, exposes an OpenAI compatible HTTP API, and benchmarks at 125 tokens per second on an M3 Max.

What language is qwen3-mlx written in?

Mainly C++. The stack also includes C++, MLX, HTTP.

What license does qwen3-mlx use?

MIT means anyone can use, copy, modify, and redistribute the code commercially as long as the original copyright notice is included.

How hard is qwen3-mlx to set up?

Setup difficulty is rated hard, with roughly 1h+ to a first successful run.

Who is qwen3-mlx for?

Mainly developer.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Verify against the repo before relying on details.