Analysis updated 2026-06-20
Turn a library of PDFs into AI-readable text for a chatbot or search tool, preserving tables and structure.
Build a document Q&A app that ingests Word, PowerPoint, and Excel files without losing formatting context.
Process sensitive internal documents locally without sending data to any external service.
Plug document parsing into an existing LangChain or LlamaIndex pipeline with minimal setup.
| | docling-project/docling | 666ghj/mirofish | meta-llama/llama |
|---|---|---|---|
| Stars | 59,251 | 59,373 | 59,389 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | hard | hard |
| Complexity | 3/5 | 4/5 | 3/5 |
| Audience | developer | PM / founder | developer |
Figures from each repo's GitHub metadata at analysis time.
Install via pip. First run downloads layout detection models automatically. GPU speeds up PDF processing but is not required. Works fully offline after initial model download.
Docling is a Python library and command-line tool that converts documents from many file formats into structured, AI-friendly output. Before a document can serve as context for an AI system, its text and structure have to be extracted in a clean, organized form. That is especially hard for PDFs, which were designed for printing rather than machine reading: they often contain tables, multi-column layouts, headers, footnotes, charts, and mathematical formulas that simple text extraction tools mangle or miss entirely. Docling addresses this with purpose-built handling of page layout, reading order, table structure, and image content.
The library accepts PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX), HTML, images such as PNG and JPEG, LaTeX, and audio files (through speech recognition). It converts all of these into a unified internal document representation and then exports to Markdown, HTML, or JSON, preserving the structural information that makes the content useful for AI processing. For PDFs, it uses a layout detection model called Heron that identifies the different regions of each page.
The tool can run entirely locally, which matters when handling sensitive documents or working in environments without internet access. It integrates directly with AI application frameworks such as LangChain, LlamaIndex, and Haystack, so it can be plugged into an existing retrieval-augmented generation pipeline. The project was started by IBM Research Zurich and is now hosted under the Linux Foundation AI and Data initiative. Docling is a good fit when building an AI application that needs to ingest and understand a variety of real-world document formats.
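As a rough illustration of the convert-then-export flow described above, the sketch below walks a folder and writes one Markdown file per supported document. The helper names (`find_documents`, `convert_library`) and the `SUPPORTED_SUFFIXES` set are hypothetical and exist only for this example. The `DocumentConverter` calls follow Docling's published quickstart, but they are an assumption here and should be checked against the installed version.

```python
from pathlib import Path

# Hypothetical subset of extensions, based on the formats named in the
# description; the formats an installed Docling version accepts may differ.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def find_documents(root):
    """Return files under root whose extension is in SUPPORTED_SUFFIXES."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def convert_library(root, out_dir):
    """Convert every supported file under root to Markdown in out_dir."""
    # Imported lazily so find_documents works without Docling installed.
    # Assumes `pip install docling`; the first run downloads layout models.
    from docling.document_converter import DocumentConverter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()  # reuse one converter for all files

    written = []
    for doc in find_documents(root):
        result = converter.convert(str(doc))
        # Files sharing a stem (a.pdf, a.docx) would overwrite each other;
        # a real pipeline would need a collision-safe naming scheme.
        target = out_dir / f"{doc.stem}.md"
        target.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_library("docs/", "markdown/")` would then produce one `.md` file per input, ready to be chunked and indexed by a search or Q&A tool. Creating the converter once and reusing it avoids reloading the models for every file.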
Docling converts PDFs, Word docs, PowerPoint, Excel, images, and more into clean, structured text that AI tools can actually read and understand, handling tricky layouts, tables, and multi-column pages automatically.
Mainly Python. The stack also includes LangChain and LlamaIndex.
No license was mentioned in the explanation.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly developers.
Verify against the repo before relying on details.