Analysis updated 2026-05-18
Pre-process a legal or regulatory PDF into clean, section-aware chunks for use in a RAG document question-answering system.
Build a Rust service that accepts PDF bytes from an HTTP upload or S3 and returns structured text chunks with metadata.
Write custom regex profiles to define how headings, definitions, and page-number noise are handled in your specific PDF format.
Embed PDF chunks with section and heading metadata into a vector database for more accurate semantic search results.
| | matthiasnordwig/pdf-struct-chunker | 2arons/agent-git | aursen-labs/spume |
|---|---|---|---|
| Stars | 11 | 11 | 11 |
| Language | Rust | Rust | Rust |
| Setup difficulty | easy | easy | moderate |
| Complexity | 2/5 | 3/5 | 3/5 |
| Audience | developer | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
pdf-struct-chunker is a Rust library and command-line tool for splitting PDF documents into meaningful text chunks based on the document's actual structure rather than a fixed character count. It is aimed at developers building systems that answer questions over documents using AI, a pattern commonly called RAG (Retrieval-Augmented Generation). The problem it addresses is that most tools split PDFs by a fixed number of characters or tokens, which can cut headings away from their content or break sentences in the middle.
This tool reads the position and font size of text on each PDF page to understand where headings start, where sections end, and how the document is organized. Each chunk of text it produces includes the section name, heading, and page number as structured data alongside the text itself, so downstream tools know exactly where each chunk came from. No AI model or internet connection is needed for the chunking process. The tool runs entirely on the CPU and processes a 100-page PDF in under one second on a standard laptop.
You can use it as a command-line tool to produce JSONL or JSON output, or as a Rust library where you pass raw PDF bytes and get structured chunks back. Custom rules for how to detect headings, definitions, and lines to ignore can be written as JSON configuration files using regular expressions. The built-in defaults are tuned for legal and regulatory documents that use numbered sections. The library is available on crates.io and can be added to a Rust project with one command. The license is MIT.
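To make the idea of structure-based chunking concrete, here is a minimal sketch in plain Rust. It is not the crate's actual API: the `Line` and `Chunk` types, the `chunk_by_headings` function, and the 1.15 font-size ratio are all hypothetical, chosen only to illustrate how font size could separate headings from body text and how heading and page metadata could travel with each chunk.

```rust
/// One extracted line of text with layout info (hypothetical type,
/// standing in for whatever a PDF text extractor would report).
#[derive(Debug, Clone)]
pub struct Line {
    pub text: String,
    pub font_size: f32,
    pub page: u32,
}

/// A chunk carrying its heading and starting page as metadata
/// (hypothetical; not the real output type of pdf-struct-chunker).
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub heading: Option<String>,
    pub page: u32,
    pub text: String,
}

/// Assumed heuristic: a line is a heading when its font is at least
/// 15% larger than the body font. The threshold is illustrative only.
fn is_heading(line: &Line, body_size: f32) -> bool {
    line.font_size > body_size * 1.15
}

/// Group body lines under the most recent heading, so a heading is
/// never separated from the text that follows it.
pub fn chunk_by_headings(lines: &[Line], body_size: f32) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut heading: Option<String> = None;
    let mut start_page: Option<u32> = None;
    let mut body: Vec<&str> = Vec::new();

    for line in lines {
        let text = line.text.trim();
        if text.is_empty() {
            continue; // skip blank lines
        }
        if is_heading(line, body_size) {
            // A new heading closes the chunk collected so far.
            if !body.is_empty() {
                chunks.push(Chunk {
                    heading: heading.clone(),
                    page: start_page.unwrap_or(line.page),
                    text: body.join(" "),
                });
                body.clear();
            }
            heading = Some(text.to_string());
            start_page = Some(line.page);
        } else {
            // Body text before any heading starts on its own page.
            if start_page.is_none() {
                start_page = Some(line.page);
            }
            body.push(text);
        }
    }
    // Flush the final chunk, if any body text remains.
    if !body.is_empty() {
        chunks.push(Chunk {
            heading,
            page: start_page.unwrap_or(0),
            text: body.join(" "),
        });
    }
    chunks
}
```

Calling `chunk_by_headings(&lines, 10.0)` on lines where headings are set in 14 pt and body text in 10 pt would yield one chunk per heading, each tagged with the heading text and the page it started on. A real implementation would also need to handle a heading with no body text (this sketch drops it), multi-level sections, and noise such as page numbers, which the tool reportedly handles through its regex-based profiles.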
A Rust library and CLI tool that splits PDFs into semantically meaningful chunks using layout analysis, producing section and heading metadata ready for RAG pipelines.
Mainly Rust. The stack also includes pdf-oxide.
Use freely for any purpose, including commercial use, as long as you keep the copyright notice.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Aimed mainly at developers.
Verify against the repo before relying on details.