explaingit

docling-project/docling

🔥 Hot | 59,937 | Python | Audience · developer | Complexity · 3/5 | Active | License | Setup · easy

TLDR

Python library that converts PDFs, Word docs, images, and other formats into clean, structured text and JSON for AI systems to understand.

Mindmap

mindmap
  root((Docling))
    What it does
      Converts documents to AI-friendly format
      Extracts text and structure
      Handles tables and layouts
    Input formats
      PDF with layout detection
      Word DOCX PowerPoint
      Images and LaTeX
    Output formats
      Markdown
      HTML
      JSON
    Key features
      Runs locally offline
      Integrates with LangChain
      Preserves document structure
    Use cases
      Build RAG pipelines
      Process sensitive docs
      Feed AI systems documents

Things people build with this

USE CASE 1

Build a retrieval-augmented generation (RAG) system that ingests PDFs and Word documents to answer questions about them.

USE CASE 2

Extract structured data from financial reports, contracts, or research papers to feed into an AI analysis pipeline.

USE CASE 3

Process scanned images and handwritten documents locally without sending them to external services.

USE CASE 4

Convert a folder of mixed document types (PDFs, slides, spreadsheets) into a unified JSON format for indexing.
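The folder-to-JSON use case can be sketched as follows. This is a minimal sketch, assuming the `docling` package is installed and that `DocumentConverter.convert` and `export_to_dict` behave as shown in the project's README; verify against the repo before relying on it. The helper names `find_documents` and `convert_folder` and the `SUPPORTED` extension set are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path

# Extensions this sketch treats as convertible.
# This set is an assumption for illustration, not Docling's full list.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def find_documents(folder: str) -> list[Path]:
    """Return supported files directly inside `folder`, in sorted order."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def convert_folder(folder: str, out_dir: str) -> list[Path]:
    """Convert every supported file in `folder` to one JSON file each."""
    # Imported lazily so find_documents works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # one converter reused for all files
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for path in find_documents(folder):
        try:
            result = converter.convert(path)
        except Exception as exc:  # keep going if a single file fails
            print(f"skipped {path.name}: {exc}")
            continue
        # Keep the original extension in the name so report.pdf and
        # report.docx do not overwrite each other.
        target = out / (path.name + ".json")
        payload = result.document.export_to_dict()
        target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder("inbox", "converted")` would write one `.json` file per input document. The scan is deliberately non-recursive so that output names stay unique; a recursive version would need to encode the subfolder in each output name.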

Tech stack

Python · LangChain · LlamaIndex · Haystack · Heron

Getting it running

Difficulty · easy | Time to first run · 5 min
MIT License: use freely for any purpose, including commercial use, as long as you include the original copyright notice.

In plain English

Docling is a Python library and command-line tool for converting documents from many different file formats into structured, AI-friendly output. The problem it solves is that before you can use a document as context for an AI system, you need to extract its text and structure in a clean, organized form, which is especially difficult for PDFs because they were designed for printing rather than machine reading. PDFs often contain tables, multi-column layouts, headers, footnotes, charts, and mathematical formulas that simple text extraction tools mangle or miss entirely. Docling handles these challenges with purpose-built understanding of page layout, reading order, table structure, and image content.

The library accepts a wide range of input formats, including PDF, Word documents (DOCX), PowerPoint (PPTX), Excel (XLSX), HTML, images in formats like PNG and JPEG, LaTeX, and audio files through speech recognition. It converts all of these into a unified internal document representation and then exports to Markdown, HTML, or JSON, preserving the structural information that makes the content useful for AI processing. For PDFs, it uses a layout detection model called Heron that identifies different regions of each page.

The tool can run entirely locally, which matters when handling sensitive documents in environments without internet access. It integrates directly with popular AI application frameworks like LangChain, LlamaIndex, and Haystack, so you can plug it into an existing retrieval-augmented generation pipeline.

The project was started by IBM Research Zurich and is now hosted under the Linux Foundation AI and Data initiative. You would use Docling when building an AI application that needs to ingest and understand a variety of real-world document formats.
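The conversion flow described above (one input, a unified document representation, then export to Markdown, HTML, or JSON) can be sketched in a few lines. This is a minimal sketch, assuming the `docling` package is installed (`pip install docling`) and that `DocumentConverter`, `export_to_markdown`, `export_to_html`, and `export_to_dict` behave as in the project's README; verify against the repo before relying on it. The helper names `output_path` and `convert_document`, the `SUFFIXES` table, and the `out` directory are hypothetical.

```python
import json
from pathlib import Path

# Map each export format named above to a file suffix (illustrative choice).
SUFFIXES = {"markdown": ".md", "html": ".html", "json": ".json"}


def output_path(source: str, fmt: str, out_dir: str = "out") -> Path:
    """Return where the export of `source` in format `fmt` will be written."""
    if fmt not in SUFFIXES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Path(out_dir) / (Path(source).stem + SUFFIXES[fmt])


def convert_document(source: str, fmt: str = "markdown", out_dir: str = "out") -> Path:
    """Convert one document with Docling and write the chosen export to disk."""
    target = output_path(source, fmt, out_dir)  # validates fmt before any work

    # Imported lazily so output_path works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)  # local path or URL
    doc = result.document  # the unified document representation

    if fmt == "markdown":
        text = doc.export_to_markdown()
    elif fmt == "html":
        text = doc.export_to_html()
    else:
        text = json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2)

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

For example, `convert_document("report.pdf", "markdown")` would write `out/report.md`, where `report.pdf` stands in for any local file. The first conversion may download model weights, so a fully offline setup likely needs those fetched in advance.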

Copy-paste prompts

Prompt 1
Show me how to use Docling to convert a PDF into Markdown that I can feed into a language model.
Prompt 2
How do I integrate Docling with LangChain to build a document Q&A chatbot?
Prompt 3
Write a Python script that uses Docling to batch-convert all PDFs in a folder to JSON and extract tables.
Prompt 4
How does Docling's layout detection work for multi-column PDFs, and how do I access the structured output?
Prompt 5
Can I use Docling to process images and extract text, and what formats does it support?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.