explaingit

kreuzberg-dev/kreuzberg

8,301 · Rust · Audience · developer · Complexity · 3/5 · License · Setup · easy

TLDR

A fast Rust-powered library that extracts text and data from 91+ file formats (PDFs, Office docs, images, emails, and more), with bindings for Python, Node.js, Go, Java, and 10+ other languages, plus OCR and AI pipeline support.

Mindmap

mindmap
  root((kreuzberg))
    Formats supported
      PDFs and Office
      Images and email
      Archives
      91 plus formats
    Language bindings
      Python and Node
      Go and Java
      Kotlin and C#
    Features
      OCR support
      Code parsing
      TOON format
    Usage modes
      Library
      REST API
      MCP server
      CLI tool

Things people build with this

USE CASE 1

Extract all text from a folder of PDFs and pass it to an AI pipeline using the TOON format to cut token usage by 30-50%.

USE CASE 2

Build a document search system that indexes content from Word docs, PDFs, and emails using Kreuzberg as the extraction backend.

USE CASE 3

Parse a codebase to extract functions, classes, and docstrings from 306 programming languages using the tree-sitter integration.

USE CASE 4

Run OCR on scanned documents with PaddleOCR or EasyOCR without a GPU, then process the output in your Python code.
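Use case 1 above mentions TOON's token savings. The sketch below is a toy illustration of the general idea behind tabular, token-oriented formats: field names are stated once in a header instead of being repeated for every record. It is not Kreuzberg's encoder and does not implement the full TOON specification. The `encode_table` helper and the sample field names are hypothetical, chosen only for this example.

```python
import json


def encode_table(name, rows):
    """Encode a list of flat dicts with identical keys as a TOON-style table.

    Simplified for illustration: there is no quoting or escaping, so values
    must not contain commas or newlines.
    """
    if not rows:
        return name + "[0]:"
    fields = list(rows[0])
    for row in rows:
        if list(row) != fields:
            raise ValueError("all rows must share the same keys in the same order")
    # Header: array name, row count, and the field names stated once.
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    # Body: one indented, comma-separated line per record.
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + lines)


# Hypothetical extraction results; these field names are made up.
pages = [
    {"file": "report.pdf", "page": 1, "words": 412},
    {"file": "report.pdf", "page": 2, "words": 388},
    {"file": "notes.docx", "page": 1, "words": 97},
]

toon_text = encode_table("pages", pages)
json_text = json.dumps({"pages": pages})

print(toon_text)
print(len(toon_text), "characters as a table,", len(json_text), "as JSON")
```

Character counts are only a rough proxy for tokens. To check the real saving for a given model, run both strings through that model's tokenizer.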

Tech stack

Rust · Python · TypeScript · WebAssembly · tree-sitter · Go · Java

Getting it running

Difficulty · easy · Time to first run · 30 min

Install from PyPI, npm, or Maven Central using your language's standard package manager. Most features do not require a GPU.

Free for internal and development use, but the Elastic License 2.0 prohibits offering it as a managed service or hosted SaaS product.

In plain English

Kreuzberg is a document processing library built in Rust that pulls text, metadata, and structured information out of more than 91 file formats, including PDFs, Office documents, images, HTML files, emails, archives, and academic formats. It does this work at high speed without requiring a GPU.

The Rust core is wrapped with native language bindings for over a dozen programming languages: Python, Ruby, PHP, Elixir, R, Dart, Go, Java, Kotlin, C#, TypeScript for Node.js, WebAssembly for browsers and Cloudflare Workers, and C through an FFI interface. Each language gets its own package published to the relevant package repository (PyPI, npm, Maven Central, and so on), so installation follows the usual conventions for your language.

Beyond plain text extraction, Kreuzberg can parse code files and pull out functions, classes, imports, and docstrings from 306 programming languages using tree-sitter, a widely used parsing library. It also supports OCR (reading text from scanned documents or images) through several backends, including Tesseract, PaddleOCR, EasyOCR, and vision-capable AI models from providers like OpenAI, Anthropic, and Google. For AI pipelines, it includes a wire format called TOON that produces 30-50% fewer tokens than JSON when passing extracted content to language models.

The library can be used in four ways: as a code library you call directly, as a command-line tool, as a REST API server, or as an MCP server (MCP is a protocol for connecting tools to AI assistants). It uses streaming parsers to handle very large files without loading them entirely into memory.

The project is licensed under the Elastic License 2.0. Documentation is at kreuzberg.dev and a live demo is available online.
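To make the code-parsing feature described above more concrete, here is a minimal sketch of the kind of output such a pass produces: functions, classes, imports, and docstrings. It deliberately swaps in Python's standard-library `ast` module, which handles Python source only, in place of tree-sitter. It does not use or reproduce Kreuzberg's API. The `summarize_source` helper and the sample source are hypothetical.

```python
import ast


def summarize_source(source):
    """Return (kind, name, first docstring line) tuples for Python source text."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            doc = ast.get_docstring(node) or ""
            first_line = doc.splitlines()[0] if doc else ""
            found.append((kind, node.name, first_line))
        elif isinstance(node, ast.Import):
            # "import a, b" yields one entry per imported module.
            for alias in node.names:
                found.append(("import", alias.name, ""))
        elif isinstance(node, ast.ImportFrom):
            # "from x import y" is recorded by its source module.
            found.append(("import", node.module or "", ""))
    return found


# Made-up sample input for the sketch.
SAMPLE = '''
import os


class Parser:
    """Parse documents."""

    def run(self, path):
        """Run the parser on one file."""
        return path


def helper():
    pass
'''

symbols = summarize_source(SAMPLE)
for kind, name, doc in sorted(symbols):
    print(kind, name, doc)
```

A multi-language tool cannot rely on a per-language module like `ast`, which is why a grammar-driven parser such as tree-sitter is the natural fit for that job.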

Copy-paste prompts

Prompt 1
Using Kreuzberg's Python bindings, show me how to extract text from a batch of PDF files and output it in TOON format for an LLM pipeline.
Prompt 2
How do I start Kreuzberg's REST API server and send a POST request to extract text from an uploaded Word document?
Prompt 3
Set up Kreuzberg as an MCP server and connect it to Claude so I can ask questions about documents I upload.
Prompt 4
Using Kreuzberg's tree-sitter integration, extract all function names and docstrings from every Python file in a directory.
Prompt 5
How do I use Kreuzberg's streaming parser to process a very large PDF without loading the whole file into memory?
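Prompt 5 relies on the idea of streaming, which a small generic sketch can illustrate: read a fixed-size chunk, process it, discard it, and repeat, so memory use stays bounded regardless of file size. This is not Kreuzberg's streaming parser and it does not parse PDFs. The `iter_chunks` and `count_lines` helpers are hypothetical, and counting newline bytes is only a placeholder for real per-chunk work.

```python
import io


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield successive byte chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_lines(stream, chunk_size=64 * 1024):
    """Count newline bytes while holding only one chunk in memory at a time."""
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        # A single-byte match cannot straddle a chunk boundary,
        # so per-chunk counts can simply be summed.
        total += chunk.count(b"\n")
    return total


# An in-memory stream stands in for a large file opened with open(path, "rb").
sample = io.BytesIO(b"first line\nsecond line\nthird line\n")
line_count = count_lines(sample, chunk_size=8)
print(line_count)
```

Real document parsers have to carry state across chunk boundaries, for example a token split between two reads. That is the hard part a streaming parser handles for you.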

Verify against the repo before relying on details.