explaingit

sapientinc/data_io

Analysis updated 2026-06-24

33 stars · Python · Audience · researcher · Complexity · 5/5 · Setup · hard

TLDR

Four-stage data preparation pipeline that cleans Hugging Face datasets into instruction-response pairs, tokenizes them in Rust, and samples balanced epochs for HRM-Text pretraining.

Mindmap

mindmap
  root((data_io))
    Inputs
      Hugging Face datasets
      Cleaning scripts
      prefix_config.yaml
    Outputs
      Cleaned QA pairs
      Numpy token arrays
      Sampled epochs
    Use Cases
      Pretrain HRM-Text
      Build instruction corpus
      Reproduce paper
    Tech Stack
      Python
      Rust
      Cargo
      Numpy
      BPE

What do people build with it?

USE CASE 1

Reproduce the HRM-Text pretraining data pipeline end to end

USE CASE 2

Build a custom instruction-response pretraining corpus from Hugging Face sources

USE CASE 3

Tokenize a large cleaned text corpus into numpy arrays with a Rust pipeline

USE CASE 4

Run stratified sampling to assemble balanced training epochs

What is it built with?

Python, Rust, Cargo, Numpy, BPE

How does it compare?

Repo              sapientinc/data_io   410979729/scope-recall   arahim3/mlx-dspark
Stars             33                   33                       33
Language          Python               Python                   Python
Setup difficulty  hard                 moderate                 easy
Complexity        5/5                  3/5                      3/5
Audience          researcher           developer                developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard · Time to first run · 1 day+

Cleaning stage needs about 512 GiB of RAM, plus Rust and Cargo for the tokenizer build.

In plain English

data_io is the data preparation pipeline used to pretrain HRM-Text, a language model from Sapient Intelligence. Pretraining is the long, expensive first phase of building a language model, where the model is exposed to large amounts of text and learns general patterns before any task-specific tuning. Where most pretraining pipelines feed the model raw web pages, this one is different: it builds question-and-answer pairs in an instruction-and-response format, then turns those into the numeric token arrays the trainer actually reads. The pipeline runs in four stages.

First, data cleaning takes raw datasets, mostly downloaded from Hugging Face Hub plus a few from Google Drive and Google Cloud Storage, and rewrites each item into a standard shape with three fields: a condition tag string, an instruction, and a response. Sources include FLAN, PleIAs SYNTH, Open Platypus, ARB, scibench, AMPS, the DeepMind mathematics dataset, GSM8K, and others, each with its own cleaning script under pipe/ or pipe_clustered/. The README warns that running the cleaning scripts needs about 512 GiB of RAM, and offers a prebuilt cleaned dataset on Hugging Face so most users can skip ahead.

Second, a BPE tokenizer is trained on the cleaned text, though pretrained tokenizers are already shipped in trained_tokenizers/.

Third, a Rust program tokenizes the cleaned text into numpy arrays of token IDs, with separate arrays recording where each instruction and response starts and how long it is. The tokenizer is incremental: if a source file changes, only that file is retokenized. Building it needs Rust and Cargo installed.

Fourth, stratified sampling assembles balanced training epochs. This step runs on each training node, writes its output into /dev/shm so it lives in RAM, and is driven by a prefix_config.yaml file that controls how many rows to sample from each source file, whether long contexts are dropped or truncated, and how much to repeat small datasets.
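To make the data shapes in the cleaning and tokenization stages more concrete, here is a minimal Python sketch of a three-field record and a flat token layout with start and length arrays. It is an illustration under stated assumptions, not the repo's code: the names `CleanedRecord`, `toy_tokenize`, `pack`, and `get_response` are hypothetical, a byte-level stand-in replaces the BPE tokenizer, plain Python lists stand in for the numpy arrays, and the real Rust tokenizer's file layout may differ.

```python
from dataclasses import dataclass


@dataclass
class CleanedRecord:
    """Hypothetical stand-in for one cleaned item with the three fields."""
    condition: str    # condition tag string
    instruction: str
    response: str


def toy_tokenize(text):
    # Stand-in for the BPE tokenizer: one token ID per UTF-8 byte.
    return list(text.encode("utf-8"))


def pack(records):
    """Concatenate all token IDs into one flat list and record, per record,
    where the instruction and the response start and how long each is.
    The condition tag is ignored here to keep the sketch short."""
    tokens = []
    inst_start, inst_len, resp_start, resp_len = [], [], [], []
    for rec in records:
        inst_ids = toy_tokenize(rec.instruction)
        resp_ids = toy_tokenize(rec.response)
        inst_start.append(len(tokens))
        inst_len.append(len(inst_ids))
        tokens.extend(inst_ids)
        resp_start.append(len(tokens))
        resp_len.append(len(resp_ids))
        tokens.extend(resp_ids)
    return {
        "tokens": tokens,
        "inst_start": inst_start,
        "inst_len": inst_len,
        "resp_start": resp_start,
        "resp_len": resp_len,
    }


def get_response(packed, i):
    # Slice record i's response tokens out of the flat token list.
    start = packed["resp_start"][i]
    return packed["tokens"][start:start + packed["resp_len"][i]]


records = [
    CleanedRecord("math", "2+2?", "4"),
    CleanedRecord("chat", "Hi", "Hello"),
]
packed = pack(records)
print(bytes(get_response(packed, 1)).decode("utf-8"))
```

Storing offsets and lengths beside one flat array lets a reader slice any instruction or response without scanning the whole token stream, which is one plausible reason for keeping them as separate arrays.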
The sample script writes Markdown analytics to standard output with coverage statistics by category and task. The associated paper is on arXiv at 2605.20613 and the resulting model checkpoint is published as sapientinc/HRM-Text-1B on Hugging Face. The contribution guide separates pull requests into optimizations that must keep outputs identical, and data changes that must include validation.
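The sampling stage described above can be sketched in a few lines of Python. This is a simplified illustration, not the repo's sample script: the config keys `rows`, `long_context`, and `repeat` are hypothetical stand-ins for whatever prefix_config.yaml actually defines, the function names are invented, and the report covers only per-source counts rather than the category and task statistics the real script prints.

```python
import random


def sample_epoch(sources, config, max_len, seed=0):
    """Draw a per-source quota of rows (one stratum per source file).

    sources: dict mapping source name -> list of token-ID lists
    config:  dict mapping source name -> hypothetical settings:
             rows (quota), long_context ("drop" or "truncate"),
             repeat (how many copies of a small source to sample from)
    """
    rng = random.Random(seed)
    epoch, stats = [], {}
    for name, rows in sources.items():
        cfg = config[name]
        kept = []
        for row in rows:
            if len(row) > max_len:
                if cfg["long_context"] == "drop":
                    continue          # discard over-long rows
                row = row[:max_len]   # or cut them to max_len
            kept.append(row)
        pool = kept * cfg.get("repeat", 1)  # repeat small datasets
        n = min(cfg["rows"], len(pool))
        epoch.extend(rng.sample(pool, n))
        stats[name] = (n, len(rows))
    rng.shuffle(epoch)  # mix the strata into one epoch
    return epoch, stats


def markdown_report(stats):
    # Render per-source coverage as a Markdown table.
    lines = ["| source | sampled | available |", "| --- | --- | --- |"]
    for name, (n, total) in sorted(stats.items()):
        lines.append(f"| {name} | {n} | {total} |")
    return "\n".join(lines)


sources = {
    "big": [[1] * 3 for _ in range(100)],
    "small": [[2] * 10, [3] * 2],
}
config = {
    "big": {"rows": 10, "long_context": "drop"},
    "small": {"rows": 4, "long_context": "truncate", "repeat": 3},
}
epoch, stats = sample_epoch(sources, config, max_len=4)
print(markdown_report(stats))
```

Fixing a quota per source keeps a large dataset from crowding out small ones, while the repeat factor lets a small source contribute more rows than it physically contains.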

Copy-paste prompts

Prompt 1
Walk me through the four stages of the sapientinc/data_io pipeline from raw datasets to sampled epochs
Prompt 2
Set up the data_io tokenizer build with Rust and Cargo and explain the incremental retokenization
Prompt 3
Write a prefix_config.yaml that downsamples large sources and repeats small ones for a custom run
Prompt 4
Skip the cleaning stage by pulling the prebuilt cleaned dataset from Hugging Face

Frequently asked questions

What is data_io?

Four-stage data preparation pipeline that cleans Hugging Face datasets into instruction-response pairs, tokenizes them in Rust, and samples balanced epochs for HRM-Text pretraining.

What language is data_io written in?

Mainly Python. The stack also includes Rust and Cargo.

How hard is data_io to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is data_io for?

Mainly researchers.

Verify against the repo before relying on details.