midhunharikumar/ferroload

Analysis updated 2026-05-18

★ 2 | Rust | Audience · data | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((ferroload))
    Format
      Sharded tar archives
      Parquet columnar index
      DuckDB queryable
    Performance
      Parallel Rust decode
      Cloud read coalescing
      No extra worker processes
    Python API
      Write and read datasets
      SQL-style filtering
      PyTorch loader integration
    Storage backends
      Local disk
      S3, GCS, Azure


What do people build with it?

USE CASE 1

Replace slow Python data loaders in PyTorch image training pipelines with a Rust-backed loader that decodes in parallel without extra worker processes.

USE CASE 2

Store a mixed image-and-metadata dataset in a format that supports SQL-style filtering before loading any image data.

USE CASE 3

Stream a training dataset from cloud object storage at high throughput without caching everything locally.

USE CASE 4

Add computed columns or new modalities to an existing dataset without rewriting the original data files.
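The additive, resumable pattern in use case 4 can be pictured with a small standard-library sketch. It does not use ferroload's API or its on-disk layer format, neither of which is shown on this page. The `add_layer` helper and the JSON Lines sidecar file are hypothetical stand-ins chosen only to illustrate the idea: new values go into a separate file keyed by sample, and a rerun skips keys that are already present.

```python
import json
import tempfile
from pathlib import Path


def add_layer(keys, compute, layer_path):
    """Append one JSON line per sample, skipping keys already in the layer."""
    layer_path = Path(layer_path)
    done = set()
    if layer_path.exists():
        with layer_path.open() as f:
            done = {json.loads(line)["key"] for line in f if line.strip()}
    written = 0
    with layer_path.open("a") as f:
        for key in keys:
            if key in done:
                continue  # already computed by an earlier run
            f.write(json.dumps({"key": key, "value": compute(key)}) + "\n")
            written += 1
    return written


# Hypothetical captions standing in for the original, untouched dataset.
captions = {"a": "a cat", "b": "two dogs", "c": "a red car", "d": "rain"}


def caption_length(key):
    return len(captions[key])


with tempfile.TemporaryDirectory() as tmp:
    layer = Path(tmp) / "caption_length.jsonl"
    # First call simulates a run that stopped after two samples.
    first = add_layer(["a", "b"], caption_length, layer)
    # Second call resumes: only the missing keys are computed and appended.
    second = add_layer(["a", "b", "c", "d"], caption_length, layer)
    rows = [json.loads(line) for line in layer.read_text().splitlines()]
```

Because the original data is only read and the layer file is only appended to, an interrupted run leaves both in a usable state.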

What is it built with?

Rust, Python, PyTorch, Parquet, DuckDB, PyO3, maturin

How does it compare?

	midhunharikumar/ferroload	callmealphabet/fastcp	codingstark-dev/decant
Stars	2	2	2
Language	Rust	Rust	Rust
Setup difficulty	moderate	easy	easy
Complexity	3/5	1/5	3/5
Audience	data	ops devops	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

pip install ferroload works on most platforms; building from source requires a Rust toolchain and the maturin build tool.

In plain English

When you train a machine learning model on images, audio, or other mixed data, one of the bottlenecks is how fast you can load that data from disk or cloud storage into your training code. Ferroload is a data format and loading library built in Rust to address that bottleneck, with a Python interface for the people actually writing training code.

The format stores data in sharded tar archives (essentially bundled file collections) and pairs them with an index file in a column-oriented format called Parquet. The Parquet index lets you query your dataset using SQL-style conditions, for example filtering to samples with a specific label or caption, without reading any of the actual image or audio data. That index can also be queried directly with a tool called DuckDB without installing Ferroload at all.

Because the core decoding logic runs in Rust, Ferroload can do parallel image decoding inside a single process without needing multiple worker processes the way most Python data loaders do. Benchmarks in the repository compare it against two popular alternatives on standard image collections. Results show it loading small images roughly 1.8 times faster than Hugging Face datasets and about 3.9 times faster than WebDataset under equal conditions. For datasets stored in cloud object storage like Google Cloud Storage, the gap is larger because Ferroload coalesces remote reads rather than fetching small chunks one at a time.

The Python package installs from PyPI with a single command and provides classes for writing datasets, reading them, filtering by query, and connecting to PyTorch training loops. Cloud storage including Amazon S3, Google Cloud Storage, and Azure is included in the prebuilt package. You can extend an existing dataset by adding computed columns or new modalities as additive layers without rewriting the original data, and those operations are designed to be resumable if interrupted.
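The idea of filtering on an index before touching any payload can be sketched with the standard library alone. The example below is a conceptual illustration, not ferroload's format or API: a plain list of dictionaries stands in for the Parquet index, and `write_shards`, `load_matching`, the shard naming, and the `.bin` member suffix are all hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=2):
    """Pack payloads into numbered tar shards and return a metadata-only index."""
    index = []
    for start in range(0, len(samples), shard_size):
        shard_name = f"shard-{start // shard_size:05d}.tar"
        with tarfile.open(Path(out_dir) / shard_name, "w") as tar:
            for key, label, payload in samples[start:start + shard_size]:
                info = tarfile.TarInfo(name=f"{key}.bin")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
                # The index row records where the sample lives, not its bytes.
                index.append({"key": key, "label": label, "shard": shard_name})
    return index


def load_matching(index, out_dir, predicate):
    """Filter on the index first, then open only the shards that are needed."""
    wanted = {}
    for row in index:
        if predicate(row):
            wanted.setdefault(row["shard"], []).append(row["key"])
    results = {}
    for shard_name, keys in wanted.items():
        with tarfile.open(Path(out_dir) / shard_name, "r") as tar:
            for key in keys:
                results[key] = tar.extractfile(f"{key}.bin").read()
    return results


# Tiny made-up dataset: (key, label, payload bytes).
samples = [
    ("a", 3, b"img-a"),
    ("b", 1, b"img-b"),
    ("c", 3, b"img-c"),
    ("d", 2, b"img-d"),
]
with tempfile.TemporaryDirectory() as tmp:
    index = write_shards(samples, tmp)
    picked = load_matching(index, tmp, lambda row: row["label"] == 3)
```

The predicate runs against index rows only, so samples that do not match are never read from their tar shard. A columnar index such as Parquet plays the same role at much larger scale.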
Building from source requires a Rust toolchain and a tool called maturin that compiles the Rust code into Python extension modules.
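Read coalescing, mentioned above as the reason for the larger gap on cloud storage, can be illustrated generically. This page does not describe how ferroload actually plans its requests, so the following is only a sketch of the general technique: nearby byte ranges are merged into fewer, larger requests. The `coalesce_ranges` function and its `max_gap` threshold are hypothetical.

```python
def coalesce_ranges(ranges, max_gap=4096):
    """Merge (offset, length) byte ranges whose gaps are at most max_gap bytes."""
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Five small reads, two pairs of which sit close together in the object.
requests = [(0, 100), (150, 50), (10_000, 200), (10_150, 100), (50_000, 10)]
plan = coalesce_ranges(requests, max_gap=1024)
```

Here five small reads become three requests. The trade-off is that a merged request may fetch some bytes nobody asked for, which is usually cheaper than paying a round trip per small read against object storage.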

Copy-paste prompts

Prompt 1

Using ferroload, show me how to write an image dataset with labels and captions, then open it and filter samples where label equals 3.

Prompt 2

Walk me through connecting a ferroload dataset to a PyTorch training loop using make_loader, including GPU tensor transfer.

Prompt 3

How do I query a ferroload Parquet index directly using DuckDB without installing the ferroload Python package?

Prompt 4

Show me how to add a computed column to an existing ferroload dataset using Dataset.map() and make it resumable.

Prompt 5

How do I install ferroload with S3 and libjpeg-turbo support, and what are the build steps using maturin?

Frequently asked questions

What is ferroload?

A Rust-powered dataset format and loader for machine learning that streams image and audio data from disk or cloud storage with fast parallel decoding and SQL-style filtering.

What language is ferroload written in?

Mainly Rust. The stack also includes Python, PyTorch, Parquet, DuckDB, PyO3, and maturin.

How hard is ferroload to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is ferroload for?

Mainly data-focused users, such as people writing machine learning training code.


Verify against the repo before relying on details.