Analysis updated 2026-05-18
Replace slow Python data loaders in PyTorch image training pipelines with a Rust-backed loader that decodes in parallel without extra worker processes.
Store a mixed image-and-metadata dataset in a format that supports SQL-style filtering before loading any image data.
Stream a training dataset from cloud object storage at high throughput without caching everything locally.
Add computed columns or new modalities to an existing dataset without rewriting the original data files.
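The second use case above, filtering on metadata before any image data is loaded, can be sketched with the Python standard library alone. This is a minimal illustration of the idea, not Ferroload's API or on-disk layout: an in-memory SQLite table stands in for the Parquet index, and the helper names (`build_shard`, `build_index`, `load_filtered`) and the sample data are hypothetical.

```python
import io
import sqlite3
import tarfile


def build_shard(samples):
    """Pack (name, label, payload) samples into an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, _label, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def build_index(shard, samples):
    """Record each member's label, payload offset, and size in a SQL table."""
    labels = {name: label for name, label, _payload in samples}
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE idx (name TEXT, label TEXT, offset INTEGER, size INTEGER)"
    )
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tf:
        for member in tf.getmembers():
            db.execute(
                "INSERT INTO idx VALUES (?, ?, ?, ?)",
                (member.name, labels[member.name], member.offset_data, member.size),
            )
    return db


def load_filtered(shard, db, where, params=()):
    """Filter on the index only, then slice out just the matching payloads.

    `where` is trusted SQL in this sketch; values go through `params`.
    """
    rows = db.execute(
        f"SELECT name, offset, size FROM idx WHERE {where} ORDER BY name", params
    )
    return {name: shard[off:off + size] for name, off, size in rows}


# Hypothetical sample data: three fake "images" with labels.
samples = [
    ("a.jpg", "cat", b"cat-bytes-1"),
    ("b.jpg", "dog", b"dog-bytes"),
    ("c.jpg", "cat", b"cat-bytes-2"),
]
shard = build_shard(samples)
db = build_index(shard, samples)
cats = load_filtered(shard, db, "label = ?", ("cat",))
```

The query touches only the index table. The shard is read only at the byte ranges the query returns, so samples that fail the filter are never read or decoded.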
| | midhunharikumar/ferroload | callmealphabet/fastcp | codingstark-dev/decant |
|---|---|---|---|
| Stars | 2 | 2 | 2 |
| Language | Rust | Rust | Rust |
| Setup difficulty | moderate | easy | easy |
| Complexity | 3/5 | 1/5 | 3/5 |
| Audience | data | ops devops | developer |
Figures from each repo's GitHub metadata at analysis time.
pip install ferroload works on most platforms; building from source requires a Rust toolchain and the maturin build tool.
When you train a machine learning model on images, audio, or other mixed data, one common bottleneck is how fast you can load that data from disk or cloud storage into your training code. Ferroload is a data format and loading library built in Rust to address that bottleneck, with a Python interface for the people actually writing training code.
The format stores data in sharded tar archives (essentially bundled file collections) and pairs them with an index file in a column-oriented format called Parquet. The Parquet index lets you query your dataset using SQL-style conditions, for example filtering to samples with a specific label or caption, without reading any of the actual image or audio data. That index can also be queried directly with DuckDB, without installing Ferroload at all.
Because the core decoding logic runs in Rust, Ferroload can decode images in parallel inside a single process, without the multiple worker processes that most Python data loaders need. Benchmarks in the repository compare it against two popular alternatives on standard image collections. The results show it loading small images roughly 1.8 times faster than Hugging Face datasets and about 3.9 times faster than WebDataset under equal conditions. For datasets stored in cloud object storage like Google Cloud Storage, the gap is larger because Ferroload coalesces remote reads rather than fetching small chunks one at a time.
The Python package installs from PyPI with a single command and provides classes for writing datasets, reading them, filtering by query, and connecting to PyTorch training loops. Support for cloud storage, including Amazon S3, Google Cloud Storage, and Azure, is included in the prebuilt package. You can extend an existing dataset by adding computed columns or new modalities as additive layers without rewriting the original data, and those operations are designed to be resumable if interrupted.
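To make the read-coalescing idea concrete, here is a minimal sketch of the general technique: nearby byte ranges are merged so that one larger request replaces several small ones. It is an illustration under assumptions, not Ferroload's implementation. The function name `coalesce_ranges`, the `max_gap` parameter, and the example ranges are all hypothetical.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes.

    Returns a sorted list of (offset, length) tuples covering every input range.
    """
    merged = []  # list of [start, end) pairs, kept sorted by start
    for start, length in sorted(ranges):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


# Hypothetical sample reads against one remote shard.
requests = [(0, 100), (100, 50), (4096, 10), (160, 40)]
strict = coalesce_ranges(requests)               # merge only touching ranges
relaxed = coalesce_ranges(requests, max_gap=16)  # also bridge gaps up to 16 bytes
```

A larger `max_gap` trades a few wasted bytes for fewer round trips, which usually pays off on object storage where per-request latency dominates.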
Building from source requires a Rust toolchain and a tool called maturin that compiles the Rust code into Python extension modules.
A Rust-powered dataset format and loader for machine learning that streams image and audio data from disk or cloud storage with fast parallel decoding and SQL-style filtering.
Mainly Rust. The stack also includes Python and PyTorch.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly data.
Verify against the repo before relying on details.