explaingit

aryagm/hrm-mlx

3 | Python | Audience · researcher | Complexity · 4/5 | Active | License | Setup · moderate

TLDR

Apache-licensed port of the HRM-Text-1B recurrent reasoning model to Apple MLX so it runs faster on M-series Macs than the PyTorch MPS baseline.

Mindmap

mindmap
  root((HRM-mlx))
    Inputs
      Text prompt
      Hugging Face weights
      Apple Silicon Mac
    Outputs
      Generated tokens
      Streamed text
      Benchmark numbers
    Use Cases
      Local LLM on M4
      Fast 4-bit inference
      Research on HRM
    Tech Stack
      Python
      MLX
      Metal
      Hugging Face

Things people build with this

USE CASE 1

Run HRM-Text-1B locally on an Apple Silicon Mac at over 50 tokens per second in 4-bit

USE CASE 2

Generate text from the hrm-mlx CLI given a prompt and a downloaded checkpoint

USE CASE 3

Stream tokens from the HRMTextGenerator Python API in your own app

USE CASE 4

Benchmark MLX 4-bit vs PyTorch MPS BF16 on a recurrent reasoning model
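A benchmark like the one in use case 4 comes down to counting streamed tokens and dividing by elapsed wall-clock time. The sketch below is a minimal, backend-agnostic timing harness. The names `measure_tokens_per_second` and `fake_stream` are hypothetical and not part of hrm-mlx. It assumes only that the generator under test can be wrapped in a function that takes a prompt and yields one token at a time.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_tokens_per_second(
    stream_fn: Callable[[str], Iterable[str]],
    prompt: str,
    max_tokens: int = 128,
) -> Tuple[int, float]:
    """Consume up to max_tokens from stream_fn and return (count, tokens/sec)."""
    start = time.perf_counter()
    count = 0
    for _ in stream_fn(prompt):
        count += 1
        if count >= max_tokens:
            break
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def fake_stream(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a real token stream; yields prompt words slowly."""
    for word in prompt.split():
        time.sleep(0.001)  # simulate per-token latency
        yield word


count, rate = measure_tokens_per_second(fake_stream, "one two three four")
print(f"{count} tokens at {rate:.1f} tokens/sec")
```

To compare backends, the same harness could be pointed at a wrapper around each real streaming call in place of `fake_stream`. The exact streaming signature of `HRMTextGenerator` is not shown here, so check it against the repo before wiring it in.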

Tech stack

Python · MLX · Metal · Hugging Face

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires an Apple Silicon Mac plus a 740 MB or 2.2 GB weight download from Hugging Face.

Apache 2.0, free to use, modify, and ship commercially with attribution and a notice file.

In plain English

HRM-mlx is a Python project that lets a specific small language model called HRM-Text-1B run directly on Apple Silicon Macs, meaning the M-series chips inside recent MacBooks and Mac desktops. The original HRM-Text model was published by a group called Sapient. This repository ports that model to MLX, Apple's own machine learning runtime, so it runs faster on a local Mac than it would through the standard PyTorch path on Apple's MPS backend.

The README reports benchmark numbers on a MacBook Pro M4 Max with a 32-core GPU. PyTorch MPS in BF16 reaches 22 tokens per second, HRM-mlx in BF16 reaches 28.2, and HRM-mlx in a smaller 4-bit format reaches 53.2, which is about 2.4 times faster than the PyTorch baseline. The test uses 512 input tokens followed by 128 generated tokens, and the README warns that exact numbers depend on the chip.

To use the project, a developer clones the repo, creates a Python virtual environment, installs the package, and then downloads one of two pre-built weight files from Hugging Face: a 740 MB 4-bit version for the fastest local speed, or a 2.2 GB BF16 version as an unquantized baseline. A command-line tool called "hrm-mlx" then generates text from a prompt, and a small Python API exposes an HRMTextGenerator class with both a one-shot generate call and a token-by-token stream.

The "How it works" section explains that HRM-Text is not a normal one-billion-parameter decoder. Each output token runs a recurrent reasoning loop of eight internal passes. The repo keeps that recurrence and rewrites the inference parts in MLX, with packed weight loading, recurrent key-value caches, fast RMSNorm, RoPE, and attention paths, persisted 4-bit weights, and an optional custom Metal SwiGLU activation.

The notes section adds that HRM-Text-1B is a base reasoning model, not a polished chat assistant, and that the 4-bit checkpoint has not been formally evaluated. The license is Apache-2.0, matching the upstream model.
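The speedup figure quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three throughput numbers reported in the summary. The decode-time estimate additionally assumes that the reported rate applies to the 128 generated tokens, which the summary does not state explicitly.

```python
# Throughput figures as reported in the summary above (tokens per second).
REPORTED_TPS = {
    "pytorch_mps_bf16": 22.0,
    "hrm_mlx_bf16": 28.2,
    "hrm_mlx_4bit": 53.2,
}


def speedup(tps, baseline="pytorch_mps_bf16"):
    """Return each configuration's throughput relative to the baseline."""
    base = tps[baseline]
    return {name: value / base for name, value in tps.items()}


def decode_seconds(tokens_per_second, new_tokens=128):
    """Estimate generation time, assuming the rate covers generated tokens only."""
    return new_tokens / tokens_per_second


ratios = speedup(REPORTED_TPS)
for name, ratio in ratios.items():
    seconds = decode_seconds(REPORTED_TPS[name])
    print(f"{name}: {ratio:.2f}x baseline, ~{seconds:.1f} s for 128 tokens")
```

Under these assumptions the 4-bit path works out to roughly 2.4 times the PyTorch MPS baseline, and the BF16 MLX path to roughly 1.3 times.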

Copy-paste prompts

Prompt 1
Walk me through cloning HRM-mlx, making a venv, and downloading the 4-bit weights from Hugging Face
Prompt 2
Use HRMTextGenerator to stream tokens from a prompt with the 4-bit checkpoint on my M4 Max
Prompt 3
Explain the 8-pass recurrent reasoning loop and how MLX caches the recurrent KV state
Prompt 4
Compare PyTorch MPS BF16 vs HRM-mlx 4-bit speeds on my Mac with a 512 prompt and 128 output
Prompt 5
Add a custom sampling temperature option to the hrm-mlx CLI

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.