Analysis updated 2026-06-24
Reproduce the HRM-Text pretraining data pipeline end to end
Build a custom instruction-response pretraining corpus from Hugging Face sources
Tokenize a large cleaned text corpus into numpy arrays with a Rust pipeline
Run stratified sampling to assemble balanced training epochs
| | sapientinc/data_io | 410979729/scope-recall | arahim3/mlx-dspark |
|---|---|---|---|
| Stars | 33 | 33 | 33 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | easy |
| Complexity | 5/5 | 3/5 | 3/5 |
| Audience | researcher | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Cleaning stage needs about 512 GiB of RAM, plus Rust and Cargo for the tokenizer build.
data_io is the data preparation pipeline used to pretrain HRM-Text, a language model from Sapient Intelligence. Pretraining is the long, expensive first phase of building a language model, in which the model is exposed to large amounts of text and learns general patterns before any task-specific tuning. Where most pretraining pipelines feed the model raw web pages, this one builds question-and-answer pairs in an instruction-and-response format, then turns those into the numeric token arrays the trainer actually reads. The pipeline runs in four stages.
First, data cleaning takes raw datasets, most of them downloaded from Hugging Face Hub and a few from Google Drive and Google Cloud Storage, and rewrites each item into a standard shape with three fields: a condition tag string, an instruction, and a response. Sources include FLAN, PleIAs SYNTH, Open Platypus, ARB, scibench, AMPS, the DeepMind mathematics dataset, GSM8K, and others, each with its own cleaning script under pipe/ or pipe_clustered/. The README warns that running the cleaning scripts needs about 512 GiB of RAM, and offers a prebuilt cleaned dataset on Hugging Face so most users can skip ahead.
Second, a BPE tokenizer is trained on the cleaned text, though pretrained tokenizers are already shipped in trained_tokenizers/.
Third, a Rust program tokenizes the cleaned text into numpy arrays of token IDs, with separate arrays recording where each instruction and response starts and how long it is. The tokenizer is incremental: if a source file changes, only that file is retokenized. Building it requires Rust and Cargo.
Fourth, stratified sampling assembles balanced training epochs. This step runs on each training node, writes its output into /dev/shm so it lives in RAM, and is driven by a prefix_config.yaml file that controls how many rows to sample from each source file, whether long contexts are dropped or truncated, and how much to repeat small datasets.
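The sampling stage can be pictured with a small sketch. Everything in it is an assumption made for illustration: the config keys (`rows`, `max_len`, `long_context`, `repeat`), the source names, and the `sample_epoch` helper are hypothetical and are not taken from prefix_config.yaml or from the repo's sample script.

```python
import random

# Hypothetical per-source settings; the real prefix_config.yaml schema may differ.
CONFIG = {
    "flan": {"rows": 4, "max_len": 8, "long_context": "truncate", "repeat": 1},
    "gsm8k": {"rows": 4, "max_len": 8, "long_context": "drop", "repeat": 2},
}


def sample_epoch(sources, config, seed=0):
    """Draw a fixed number of rows per source, handling long rows and small sets."""
    rng = random.Random(seed)
    epoch = []
    for name, rows in sources.items():
        cfg = config[name]
        if cfg["long_context"] == "drop":
            # Discard rows that exceed the length limit.
            pool = [r for r in rows if len(r) <= cfg["max_len"]]
        else:
            # Keep every row but cut it down to the length limit.
            pool = [r[: cfg["max_len"]] for r in rows]
        # Repeat small datasets so they can fill their quota.
        pool = pool * cfg["repeat"]
        k = min(cfg["rows"], len(pool))
        epoch.extend((name, row) for row in rng.sample(pool, k))
    # Mix the sources so the epoch is not grouped by origin.
    rng.shuffle(epoch)
    return epoch


# Toy token-ID rows standing in for tokenized instruction-response pairs.
sources = {
    "flan": [list(range(n)) for n in (3, 5, 12, 20, 6)],
    "gsm8k": [list(range(n)) for n in (4, 9, 7)],
}
epoch = sample_epoch(sources, CONFIG)
```

In this toy setup each source contributes four rows: the larger source is truncated and subsampled, while the smaller one loses its over-length row and is repeated to reach its quota.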
The sample script writes Markdown analytics to standard output with coverage statistics by category and task. The associated paper is on arXiv at 2605.20613 and the resulting model checkpoint is published as sapientinc/HRM-Text-1B on Hugging Face. The contribution guide separates pull requests into optimizations that must keep outputs identical, and data changes that must include validation.
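For the stage-three output described earlier (flat token IDs plus arrays recording where each instruction and response starts and how long it is), a minimal NumPy sketch may help. The array names, dtypes, and the `pack` and `get_pair` helpers are assumptions for illustration, not the format the Rust tokenizer actually writes.

```python
import numpy as np


def pack(pairs):
    """Flatten (instruction_ids, response_ids) pairs into one token array plus offsets."""
    tokens = []
    inst_start, inst_len, resp_start, resp_len = [], [], [], []
    for inst, resp in pairs:
        # Record where the instruction begins in the flat array, then append it.
        inst_start.append(len(tokens))
        inst_len.append(len(inst))
        tokens.extend(inst)
        # The response follows directly after its instruction.
        resp_start.append(len(tokens))
        resp_len.append(len(resp))
        tokens.extend(resp)
    return {
        "tokens": np.asarray(tokens, dtype=np.uint32),
        "inst_start": np.asarray(inst_start, dtype=np.int64),
        "inst_len": np.asarray(inst_len, dtype=np.int64),
        "resp_start": np.asarray(resp_start, dtype=np.int64),
        "resp_len": np.asarray(resp_len, dtype=np.int64),
    }


def get_pair(arrays, i):
    """Recover the i-th instruction and response by slicing the flat token array."""
    s, n = arrays["inst_start"][i], arrays["inst_len"][i]
    t, m = arrays["resp_start"][i], arrays["resp_len"][i]
    return arrays["tokens"][s : s + n], arrays["tokens"][t : t + m]


# Two toy pairs with made-up token IDs.
arrays = pack([([1, 2, 3], [4, 5]), ([6], [7, 8, 9])])
instruction, response = get_pair(arrays, 1)
```

Storing offsets alongside one flat array avoids ragged per-row storage and lets a reader slice out any pair without scanning the rows before it.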
Four-stage data preparation pipeline that cleans Hugging Face datasets into instruction-response pairs, tokenizes them in Rust, and samples balanced epochs for HRM-Text pretraining.
Mainly Python. The stack also includes Rust and Cargo.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.