Extract all text from a folder of PDFs and pass it to an AI pipeline using the TOON format to cut token usage by 30-50%.
Build a document search system that indexes content from Word docs, PDFs, and emails using Kreuzberg as the extraction backend.
Parse a codebase to extract functions, classes, and docstrings from 306 programming languages using the tree-sitter integration.
Run OCR on scanned documents with PaddleOCR or EasyOCR without a GPU, then process the output in your Python code.
Install from PyPI, npm, or Maven Central using your language's standard package manager; no GPU is required for most features.
Kreuzberg is a document processing library built in Rust that pulls text, metadata, and structured information out of over 91 file formats. PDFs, Office documents, images, HTML files, emails, archives, and academic formats are all supported. The library does this work at high speed without requiring a GPU.

The Rust core is wrapped with native language bindings for over a dozen programming languages: Python, Ruby, PHP, Elixir, R, Dart, Go, Java, Kotlin, C#, TypeScript for Node.js, WebAssembly for browsers and Cloudflare Workers, and C through an FFI interface. Each language gets its own package published to the relevant package repository (PyPI, npm, Maven Central, and so on), so installation follows the usual conventions for your language.

Beyond plain text extraction, Kreuzberg can parse code files and pull out functions, classes, imports, and docstrings from 306 programming languages using tree-sitter, a widely used parsing library. It also supports OCR (reading text from scanned documents or images) through several backends including Tesseract, PaddleOCR, EasyOCR, and vision-capable AI models from providers like OpenAI, Anthropic, and Google. For AI pipelines, it includes a wire format called TOON that produces 30-50% fewer tokens than JSON when passing extracted content to language models.

The library can be used in four ways: as a code library you call directly, as a command-line tool, as a REST API server, or as an MCP server (a protocol for connecting tools to AI assistants). It uses streaming parsers to handle very large files without loading them entirely into memory.

The project is licensed under the Elastic License 2.0. Documentation is at kreuzberg.dev and a live demo is available online.
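To give a feel for why a tabular, token-oriented format like TOON can be more compact than JSON, here is a minimal sketch in Python. It is not Kreuzberg's encoder and it does not call any Kreuzberg API: the function names (`encode_tabular`, `format_value`) and the sample `pages` data are hypothetical, and the output only imitates the general header-plus-rows shape of TOON for a list of flat records. Character count is used as a rough stand-in for token count; real savings depend on the tokenizer and the data.

```python
import json


def format_value(value):
    """Render one cell; quote strings that would be ambiguous in a comma-separated row."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    needs_quotes = (
        text == ""
        or text != text.strip()
        or any(ch in text for ch in ',":\n')
    )
    return json.dumps(text) if needs_quotes else text


def encode_tabular(name, rows):
    """Encode a list of flat dicts with identical keys as one header line plus one line per row.

    Simplified, TOON-like layout (an assumption for illustration, not the full format):
        name[count]{field1,field2}:
          value1,value2
    """
    if not rows:
        return f"{name}[0]:"
    fields = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("all rows must share the same keys in the same order")
    field_list = ",".join(fields)
    lines = [f"{name}[{len(rows)}]{{{field_list}}}:"]
    for row in rows:
        lines.append("  " + ",".join(format_value(row[f]) for f in fields))
    return "\n".join(lines)


# Hypothetical per-page extraction results (made-up sample data).
pages = [
    {"page": 1, "words": 312, "title": "Introduction"},
    {"page": 2, "words": 540, "title": "Methods, part 1"},
    {"page": 3, "words": 498, "title": "Results"},
]

as_json = json.dumps({"pages": pages})
toon_like = encode_tabular("pages", pages)

print(toon_like)
print(f"JSON characters: {len(as_json)}, tabular characters: {len(toon_like)}")
```

The saving comes from stating the field names once in the header instead of repeating them in every record, which is why the effect is strongest on long, uniform lists such as per-page or per-chunk extraction results.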
Verify against the repo before relying on details.