aryagm/hrm-mlx

Analysis updated 2026-06-24

★ 3 · Python · Audience: researcher · Complexity: 4/5 · License: Apache 2.0 · Setup: moderate

Mindmap

mindmap
  root((HRM-mlx))
    Inputs
      Text prompt
      Hugging Face weights
      Apple Silicon Mac
    Outputs
      Generated tokens
      Streamed text
      Benchmark numbers
    Use Cases
      Local LLM on M4
      Fast 4-bit inference
      Research on HRM
    Tech Stack
      Python
      MLX
      Metal
      Hugging Face

What do people build with it?

USE CASE 1

Run HRM-Text-1B locally on an Apple Silicon Mac, reaching about 53 tokens per second in 4-bit on an M4 Max

USE CASE 2

Generate text from the hrm-mlx CLI given a prompt and a downloaded checkpoint

USE CASE 3

Stream tokens from the HRMTextGenerator Python API in your own app

USE CASE 4

Benchmark MLX 4-bit vs PyTorch MPS BF16 on a recurrent reasoning model
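The speedups behind use case 4 are plain ratios of tokens per second. A minimal Python sketch, using the figures the README reports for an M4 Max (22, 28.2, and 53.2 tokens per second): it reproduces the ratios and times any token iterator end to end. `fake_stream` is a hypothetical stand-in, not part of hrm-mlx; swap in your own generator to measure a real run.

```python
import time
from typing import Callable, Iterable

# Throughput the README reports on a MacBook Pro M4 Max (32-core GPU),
# 512 prompt tokens followed by 128 generated tokens. Yours will differ.
REPORTED_TOKENS_PER_SEC = {
    "PyTorch MPS BF16": 22.0,
    "HRM-mlx BF16": 28.2,
    "HRM-mlx 4-bit": 53.2,
}


def speedups(results: dict[str, float], baseline: str) -> dict[str, float]:
    """Express each throughput as a multiple of the baseline throughput."""
    base = results[baseline]
    return {name: tps / base for name, tps in results.items()}


def measure_tokens_per_sec(stream: Callable[[], Iterable[str]]) -> float:
    """Time one full pass over a token iterator.

    End-to-end measurement: any prompt processing done before the first
    token is yielded is included in the elapsed time.
    """
    start = time.perf_counter()
    count = sum(1 for _ in stream())
    elapsed = time.perf_counter() - start
    return count / elapsed


def fake_stream() -> Iterable[str]:
    """Hypothetical stand-in for a real model's token stream."""
    for _ in range(16):
        time.sleep(0.001)
        yield "tok"


if __name__ == "__main__":
    for name, ratio in speedups(REPORTED_TOKENS_PER_SEC, "PyTorch MPS BF16").items():
        print(f"{name}: {ratio:.2f}x")
    print(f"fake stream: {measure_tokens_per_sec(fake_stream):.0f} tokens/s")
```

The 4-bit ratio comes out at about 2.42x, consistent with the README's "2.4 times". Because the timer wraps the whole stream, separate prefill from decode timing if you want numbers comparable to a specific benchmark setup.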

What is it built with?

Python · MLX · Metal · Hugging Face

How does it compare?

	aryagm/hrm-mlx	0marildo/imago	agentlexi/agent-lexi
Stars	3	3	3
Language	Python	Python	Python
Setup difficulty	moderate	easy	moderate
Complexity	4/5	2/5	4/5
Audience	researcher	general	vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate · Time to first run · about 30 min

Requires an Apple Silicon Mac plus a weight download from Hugging Face: 740 MB for the 4-bit checkpoint or 2.2 GB for BF16.

Apache 2.0, free to use, modify, and ship commercially with attribution and a notice file.
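The gap between the 740 MB and 2.2 GB downloads comes from quantization. MLX's own `mx.quantize` uses group-wise affine quantization (4 bits with groups of 64 by default), storing a scale and a bias per group; whether hrm-mlx uses exactly those settings is an assumption here. The pure-Python sketch below shows the generic technique, not MLX's implementation: each group of weights is mapped onto 16 integer levels, and with a 16-bit scale and offset per 64-weight group the storage cost is 4.5 bits per weight instead of 16, roughly 28% of the BF16 size. Treat that as a rough estimate; real file sizes also depend on which tensors are quantized.

```python
import random


def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to integers in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1  # 15 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo  # lo acts as the per-group offset (bias)


def quantize(weights, group_size=64, bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights as code * scale + offset."""
    out = []
    for codes, scale, offset in groups:
        out.extend(c * scale + offset for c in codes)
    return out


def bits_per_weight(group_size=64, bits=4, meta_bits=16):
    """Storage cost per weight, counting one scale and one offset per group."""
    return bits + 2 * meta_bits / group_size


random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize(w)
w_hat = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"{bits_per_weight():.2f} bits/weight vs 16 for BF16, max error {max_err:.5f}")
```

Smaller groups track the local weight range more closely (lower error) but spend more bits on per-group metadata; `bits_per_weight(32)` returns 5.0, for example.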

In plain English

HRM-mlx is a Python project that lets a specific small language model, HRM-Text-1B, run directly on Apple Silicon Macs, meaning the M-series chips inside recent MacBooks and Mac desktops. The original HRM-Text model was published by a group called Sapient. This repository ports that model onto MLX, Apple's own machine learning framework, so it runs faster on a local Mac than it would on the standard PyTorch path through Apple's MPS backend.

The README reports benchmark numbers from a MacBook Pro M4 Max with a 32-core GPU, using 512 input tokens followed by 128 generated tokens. PyTorch MPS in BF16 reaches 22 tokens per second, HRM-mlx in BF16 reaches 28.2, and HRM-mlx in a smaller 4-bit format reaches 53.2, about 2.4 times the PyTorch baseline. The README warns that exact numbers depend on the chip.

To use the project, a developer clones the repo, creates a Python virtual environment, installs the package, and then downloads one of two pre-built weight files from Hugging Face: a 740 MB 4-bit version for the fastest local speed, or a 2.2 GB BF16 version as an unquantized baseline. A command-line tool called "hrm-mlx" then generates text from a prompt, and a small Python API exposes an HRMTextGenerator class with both a one-shot generate call and a token-by-token stream.

The README's "How it works" section explains that HRM-Text is not a normal one-billion-parameter decoder: each output token runs a recurrent reasoning loop of eight internal passes. The repo keeps that recurrence and rewrites the inference parts in MLX, with packed weight loading, recurrent key-value caches, fast RMSNorm, RoPE, and attention paths, persisted 4-bit weights, and an optional custom Metal SwiGLU activation.

The notes section adds that HRM-Text-1B is a base reasoning model, not a polished chat assistant, and that the 4-bit checkpoint has not been formally evaluated. The license is Apache-2.0, matching the upstream model.
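The main difference from an ordinary decoder is that each generated token costs eight internal passes instead of one. The toy Python sketch below illustrates only that control flow. `toy_block` is a hypothetical scalar stand-in for real transformer layers, and keeping one key-value history per internal pass is an assumption about what a "recurrent key-value cache" could look like, not a description of HRM-mlx's actual code.

```python
from dataclasses import dataclass, field

NUM_PASSES = 8  # eight internal passes per output token, per the README


@dataclass
class RecurrentCache:
    """Hypothetical cache layout: a separate history for each internal pass."""
    per_pass: list = field(default_factory=lambda: [[] for _ in range(NUM_PASSES)])
    block_calls: int = 0  # counts block applications to expose the per-token cost


def toy_block(state: float, history: list) -> float:
    """Scalar stand-in for attention + MLP: mix the state with the mean of its history."""
    context = sum(history) / len(history)
    return 0.5 * state + 0.5 * context + 1.0


def step_token(embedding: float, cache: RecurrentCache) -> float:
    """Run one output token through every internal pass, updating each pass's cache."""
    state = embedding
    for p in range(NUM_PASSES):
        history = cache.per_pass[p]
        history.append(state)  # later tokens at this same pass can attend to this entry
        state = toy_block(state, history)
        cache.block_calls += 1
    return state


cache = RecurrentCache()
outputs = [step_token(x, cache) for x in (0.1, 0.2, 0.3)]
print(outputs, cache.block_calls)
```

Three tokens trigger 24 block applications here, and each pass's history grows by one entry per token. In a real model every pass applies full transformer layers, so per-token work grows with the number of passes, which is why the inference-side optimizations listed above matter for this model.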

Copy-paste prompts

Prompt 1

Walk me through cloning HRM-mlx, making a venv, and downloading the 4-bit weights from Hugging Face

Prompt 2

Use HRMTextGenerator to stream tokens from a prompt with the 4-bit checkpoint on my M4 Max

Prompt 3

Explain the 8-pass recurrent reasoning loop and how MLX caches the recurrent KV state

Prompt 4

Compare PyTorch MPS BF16 vs HRM-mlx 4-bit speeds on my Mac with a 512-token prompt and 128 output tokens

Prompt 5

Add a custom sampling temperature option to the hrm-mlx CLI
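Prompt 5 refers to temperature sampling. Whether the hrm-mlx CLI already exposes such an option is not covered here. The standard technique divides the logits by a temperature T before the softmax: T below 1 sharpens the distribution, T above 1 flattens it, and T = 0 is commonly treated as greedy argmax. A minimal, framework-free Python sketch; `sample_with_temperature` is a hypothetical name, not an hrm-mlx function.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Pick a token index from raw logits after temperature scaling."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtracting the max keeps exp() from overflowing; choices() normalizes the weights.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [1.0, 5.0, 2.0]
print(sample_with_temperature(logits, 0.0))  # → 1
print([sample_with_temperature(logits, 2.0, rng) for _ in range(10)])
```

Wiring this into the CLI would depend on how the repo's generation loop is structured, which this summary does not describe.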

Frequently asked questions

What is hrm-mlx?

Apache-licensed port of the HRM-Text-1B recurrent reasoning model to Apple MLX so it runs faster on M-series Macs than the PyTorch MPS baseline.

What language is hrm-mlx written in?

Mainly Python. The stack also includes MLX, Metal, and Hugging Face.

What license does hrm-mlx use?

Apache 2.0, free to use, modify, and ship commercially with attribution and a notice file.

How hard is hrm-mlx to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is hrm-mlx for?

Mainly researchers.

Verify against the repo before relying on details.