Run HRM-Text-1B locally on an Apple Silicon Mac at over 50 tokens per second in 4-bit
Generate text from the hrm-mlx CLI given a prompt and a downloaded checkpoint
Stream tokens from the HRMTextGenerator Python API in your own app
Benchmark MLX 4-bit vs PyTorch MPS BF16 on a recurrent reasoning model
Requires an Apple Silicon Mac plus a 740 MB or 2.2 GB weight download from Hugging Face.
HRM-mlx is a Python project that lets a specific small language model called HRM-Text-1B run directly on Apple Silicon Macs, meaning the M-series chips inside recent MacBooks and Mac desktops. The original HRM-Text model was published by a group called Sapient. This repository ports that model to MLX, Apple's own machine learning runtime, so it runs faster on a local Mac than it would on the standard PyTorch path through Apple's MPS backend.

The README reports benchmark numbers on a MacBook Pro M4 Max with a 32-core GPU. PyTorch MPS in BF16 reaches 22 tokens per second, HRM-mlx in BF16 reaches 28.2, and HRM-mlx in a smaller 4-bit format reaches 53.2, about 2.4 times the PyTorch baseline. The test uses 512 input tokens followed by 128 generated tokens, and the README warns that exact numbers depend on the chip.

To use the project, a developer clones the repo, creates a Python virtual environment, installs the package, and then downloads one of two pre-built weight files from Hugging Face: a 740 MB 4-bit version for the fastest local speed, or a 2.2 GB BF16 version as an unquantized baseline. A command-line tool called "hrm-mlx" then generates text from a prompt, and a small Python API exposes an HRMTextGenerator class with both a one-shot generate call and a token-by-token stream.

The "How it works" section explains that HRM-Text is not a normal one-billion-parameter decoder. Each output token runs a recurrent reasoning loop of eight internal passes. The repo keeps that recurrence and rewrites the inference parts in MLX, using packed weight loading, recurrent key-value caches, fast paths for RMSNorm, RoPE, and attention, persisted 4-bit weights, and an optional custom Metal SwiGLU activation.

The notes section adds that HRM-Text-1B is a base reasoning model, not a polished chat assistant, and that the 4-bit checkpoint has not been formally evaluated. The license is Apache-2.0, matching the upstream model.
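
The benchmark figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is not part of HRM-mlx and calls none of its APIs; it only takes the three reported tokens-per-second values and derives the speedup over the PyTorch baseline and a rough time for the 128 generated tokens. The names REPORTED_TPS, speedup, seconds_to_generate, and summarize are illustrative. The timing estimate assumes the reported rate applies to the generated tokens alone and ignores the cost of processing the 512-token prompt, which the summary does not break out.

```python
# Tokens per second as reported in the README for a MacBook Pro M4 Max.
# These are quoted figures, not measurements taken by this script.
REPORTED_TPS = {
    "PyTorch MPS BF16": 22.0,
    "HRM-mlx BF16": 28.2,
    "HRM-mlx 4-bit": 53.2,
}
BASELINE = "PyTorch MPS BF16"
GENERATED_TOKENS = 128  # generated length used in the README benchmark


def speedup(tps: float, baseline_tps: float) -> float:
    """Ratio of a configuration's throughput to the baseline throughput."""
    return tps / baseline_tps


def seconds_to_generate(tokens: int, tps: float) -> float:
    """Rough generation time, assuming a constant rate and no prompt cost."""
    return tokens / tps


def summarize() -> list[tuple[str, float, float]]:
    """Return (name, speedup, seconds) rows, each rounded to two decimals."""
    base = REPORTED_TPS[BASELINE]
    return [
        (
            name,
            round(speedup(tps, base), 2),
            round(seconds_to_generate(GENERATED_TOKENS, tps), 2),
        )
        for name, tps in REPORTED_TPS.items()
    ]


if __name__ == "__main__":
    for name, ratio, secs in summarize():
        print(f"{name:18s} {ratio:5.2f}x baseline  ~{secs:5.2f} s for {GENERATED_TOKENS} tokens")
```

The 4-bit ratio works out to 53.2 / 22, roughly 2.42, which is consistent with the 2.4 times figure in the README. As the README notes, the underlying rates depend on the chip, so the derived times are only indicative.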
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.