data_io is the data preparation pipeline used to pretrain HRM-Text, a language model from Sapient Intelligence. Pretraining is the long, expensive first phase of building a language model, where the model is exposed to large amounts of text and learns general patterns before any task-specific tuning. Where most pretraining pipelines feed the model raw web pages, this one is different: it builds question-and-answer pairs in an instruction-and-response format, then turns those into the numeric token arrays the trainer actually reads. The pipeline runs in four stages.

First, data cleaning takes raw datasets, mostly downloaded from Hugging Face Hub plus a few from Google Drive and Google Cloud Storage, and rewrites each item into a standard shape with three fields: a condition tag string, an instruction, and a response. Sources include FLAN, PleIAs SYNTH, Open Platypus, ARB, scibench, AMPS, the DeepMind mathematics dataset, GSM8K, and others, each with its own cleaning script under pipe/ or pipe_clustered/. The README warns that running the cleaning scripts needs about 512 GiB of RAM, and offers a prebuilt cleaned dataset on Hugging Face so most users can skip ahead.

Second, a BPE tokenizer is trained on the cleaned text, though pretrained tokenizers are already shipped in trained_tokenizers/.

Third, a Rust program tokenizes the cleaned text into numpy arrays of token IDs, with separate arrays recording where each instruction and response starts and how long it is. The tokenizer is incremental: if a source file changes, only that file is retokenized. Building it needs Rust and Cargo installed.

Fourth, stratified sampling assembles balanced training epochs. This step runs on each training node, writes its output into /dev/shm so it lives in RAM, and is driven by a prefix_config.yaml file that controls how many rows to sample from each source file, whether long contexts are dropped or truncated, and how much to repeat small datasets.
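
The fourth stage can be pictured with a small Python sketch. It is only an illustration of the three controls named above (rows per source, drop versus truncate for long contexts, and repetition of small datasets), not the repository's sampler. The names sample_epoch, CONFIG, MAX_CONTEXT, and the "rows" and "repeat" keys are all hypothetical, and the real prefix_config.yaml schema may look quite different.

```python
import random

# Hypothetical per-source settings; the real prefix_config.yaml schema may differ.
CONFIG = {
    "gsm8k": {"rows": 4, "repeat": 2},  # small source, repeated twice
    "flan": {"rows": 3, "repeat": 1},
}
MAX_CONTEXT = 8  # hypothetical per-row token limit


def sample_epoch(lengths_by_source, config, max_context, long_policy, seed=0):
    """Return (source, row_index, kept_length) tuples for one epoch.

    lengths_by_source maps a source name to the token length of each row.
    long_policy is "drop" or "truncate" and applies to rows over max_context.
    """
    rng = random.Random(seed)
    epoch = []
    for source in sorted(config):
        settings = config[source]
        candidates = []
        for row, length in enumerate(lengths_by_source[source]):
            if length > max_context:
                if long_policy == "drop":
                    continue  # skip the row entirely
                length = max_context  # keep the row, cut to the limit
            candidates.append((source, row, length))
        # Repeating the candidate list lets a small source fill its quota.
        pool = candidates * settings["repeat"]
        take = min(settings["rows"], len(pool))
        epoch.extend(rng.sample(pool, take))
    rng.shuffle(epoch)  # mix sources so the epoch is not grouped by file
    return epoch


if __name__ == "__main__":
    lengths = {"gsm8k": [5, 12, 3], "flan": [9, 2, 7, 4]}
    for item in sample_epoch(lengths, CONFIG, MAX_CONTEXT, "truncate"):
        print(item)
```

Fixing the seed makes an epoch reproducible, which is one plausible way for every training node to arrive at the same sample independently; whether the actual script does this is not stated here.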
The sample script writes Markdown analytics to standard output with coverage statistics by category and task.

The associated paper is on arXiv at 2605.20613 and the resulting model checkpoint is published as sapientinc/HRM-Text-1B on Hugging Face. The contribution guide separates pull requests into two kinds: optimizations, which must keep outputs identical, and data changes, which must include validation.
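
One way to check the "outputs identical" requirement for an optimization is to hash every output file before and after the change and compare the results. The sketch below assumes the outputs of both runs sit in two directories; the function names digest_tree, changed_files, and outputs_identical are hypothetical, and the contribution guide's own procedure is not described here.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(baseline_dir, candidate_dir):
    """List files that differ in content or exist on only one side."""
    before = digest_tree(baseline_dir)
    after = digest_tree(candidate_dir)
    every_path = before.keys() | after.keys()
    return sorted(p for p in every_path if before.get(p) != after.get(p))


def outputs_identical(baseline_dir, candidate_dir):
    """True when both directories hold the same files with the same bytes."""
    return not changed_files(baseline_dir, candidate_dir)
```

Comparing raw bytes is stricter than comparing array values, since two files can hold equal arrays yet differ in header details. That strictness is arguably a reasonable default for a change that claims to alter nothing.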
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.