sapientinc/hrm-text

★ 617 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((HRM-Text))
    Inputs
      Tokenized corpus
      Training config
      Multi-node H100s
    Outputs
      Pretrained checkpoint
      HuggingFace export
      Benchmark scores
    Use Cases
      Cheap 1B pretrain
      Architecture research
      Reproduce GSM8k MMLU
    Tech Stack
      PyTorch
      FSDP2
      FlashAttention
      CUDA
      Docker

Things people build with this

USE CASE 1

Pretrain a 600M or 1B language model from scratch on rented H100 nodes

USE CASE 2

Compare a hierarchical recurrent backbone against a same-size transformer baseline

USE CASE 3

Run GSM8k, MATH, MMLU, and ARC evaluation on a freshly trained checkpoint

USE CASE 4

Export an HRM-Text checkpoint to Hugging Face Transformers format for inference

Tech stack

Python, PyTorch, FSDP2, FlashAttention, CUDA, Docker, WandB

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires Hopper-class GPUs such as the H100 for FlashAttention 3, plus a multi-node NCCL setup for the two-node XL run. It is not runnable on consumer hardware.
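The hardware requirement can be checked before renting a full cluster. The sketch below is not part of the repository: the helper names (`parse_gpu_lines`, `all_hopper`, `query_gpus`) are hypothetical, and it assumes a driver recent enough for `nvidia-smi` to support the `compute_cap` query field. It treats compute capability 9.x as Hopper, which is what the H100 reports.

```python
import subprocess

HOPPER_MAJOR = 9  # H100 reports compute capability 9.0


def parse_gpu_lines(text):
    """Parse 'name, compute_cap' CSV lines into (name, major, minor) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas survive.
        name, cap = (part.strip() for part in line.rsplit(",", 1))
        major, minor = cap.split(".")
        gpus.append((name, int(major), int(minor)))
    return gpus


def all_hopper(gpus):
    """True only if at least one GPU was found and every GPU is Hopper-class."""
    return bool(gpus) and all(major == HOPPER_MAJOR for _, major, _ in gpus)


def query_gpus():
    """Ask nvidia-smi for each GPU's name and compute capability."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_lines(out)


if __name__ == "__main__":
    found = query_gpus()
    for name, major, minor in found:
        print(f"{name}: compute capability {major}.{minor}")
    print("Hopper-class node" if all_hopper(found) else "Not a Hopper-class node")
```

Running this on each node only confirms the GPU generation. It says nothing about NCCL connectivity between nodes, which still needs its own check.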

In plain English

HRM-Text is a code release that lets a small team pretrain a 1-billion-parameter language model from scratch for roughly $1000 in GPU rental. The headline claim in the README is that the same approach reaches benchmark numbers comparable to much larger projects while using 130 to 600 times less compute and 150 to 900 times less data. HRM stands for Hierarchical Reasoning Model, a hierarchical recurrent architecture the authors are pushing as an alternative to a standard transformer of the same size.

The repository ships the full pretraining stack: the hierarchical recurrent architecture, a sequence packing trick called PrefixLM, FlashAttention 3 attention kernels, distributed training via PyTorch FSDP2, evaluation scripts for common benchmarks, and a tool to export the trained checkpoint into Hugging Face Transformers format. The README is explicit that the attention path needs Hopper-class GPUs such as the H100, since it relies on FlashAttention 3.

Two reference runs are documented. The L size has 600 million parameters and trains on a single node of 8 H100s in about 50 hours, with reported scores including 77.6% on GSM8k and 56.6% on MMLU. The XL size has 1 billion parameters and trains on two nodes of 8 H100s each in about 46 hours, scoring 84.7% on GSM8k and 60.7% on MMLU. The pricing math assumes $2 per H100 hour.

The workflow walks the user through preparing tokenized data with a companion repo called data_io, running training in a published Docker image, checking NCCL communication for multi-node setups, logging to Weights and Biases, launching with torchrun, evaluating against benchmarks like GSM8k, MATH, MMLU, and ARC, and finally exporting to the Hugging Face format. The README also lists alternative baseline architectures included for comparison, such as a standard transformer, a tiny recursive model, and a universal transformer.
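The pricing math for the two reference runs can be worked through as a back-of-envelope calculation. This is a minimal sketch using only the node counts, hours, and $2 per H100 hour rate quoted in this summary. The `Run` class and its field names are illustrative and do not come from the repository.

```python
from dataclasses import dataclass

H100_USD_PER_HOUR = 2.0  # rental rate assumed in the summary


@dataclass(frozen=True)
class Run:
    name: str
    nodes: int
    gpus_per_node: int
    hours: float  # approximate wall-clock training time

    @property
    def gpu_hours(self) -> float:
        # Total GPU time is the GPU count multiplied by wall-clock hours.
        return self.nodes * self.gpus_per_node * self.hours

    def cost_usd(self, rate: float = H100_USD_PER_HOUR) -> float:
        return self.gpu_hours * rate


RUNS = [
    Run("L (600M)", nodes=1, gpus_per_node=8, hours=50),
    Run("XL (1B)", nodes=2, gpus_per_node=8, hours=46),
]

for run in RUNS:
    print(f"{run.name}: {run.gpu_hours:.0f} GPU-hours, about ${run.cost_usd():,.0f}")
```

With these rounded inputs the L run comes to 400 GPU-hours (about $800) and the XL run to 736 GPU-hours (about $1,472), so the "roughly $1000" headline is best read as an order-of-magnitude figure. The README's exact hours and rate may differ.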

Copy-paste prompts

Prompt 1

Set up the data_io pipeline and tokenize a 100B-token corpus for HRM-Text pretraining

Prompt 2

Launch an 8xH100 single-node training run of the L 600M HRM-Text config with torchrun

Prompt 3

Add a new baseline architecture next to the existing transformer and universal-transformer entries

Prompt 4

Convert an HRM-Text XL checkpoint into HuggingFace format and run a GSM8k eval

Prompt 5

Diagnose NCCL failures on a two-node H100 setup before starting the XL run

Verify against the repo before relying on details.