explaingit

clovaai/donut

6,864 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

Donut is an AI model from Naver Clova that reads document images such as receipts, forms, and IDs and extracts structured data directly. A single end-to-end neural network replaces the traditional OCR step entirely.

Mindmap

mindmap
  root((donut))
    What it does
      Document understanding
      No OCR needed
      Image to structured data
    Tasks
      Classification
      Info extraction
      Question answering
    Tools
      Hugging Face models
      SynthDoG generator
      Colab demos
    Audience
      AI researchers
      Document AI devs

Things people build with this

USE CASE 1

Extract structured fields like items and totals from receipt images without building or maintaining an OCR pipeline.

USE CASE 2

Classify scanned document images by type using a pre-trained Donut model loaded from Hugging Face.

USE CASE 3

Fine-tune Donut on your own document types using the included training scripts and the SynthDoG synthetic data generator.

USE CASE 4

Answer questions about the content of a document image, such as the payment due date on a contract.
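
For the receipt use case above, it can help to see what "structured fields" might look like in practice. The sketch below assumes the model emits a tagged text sequence of the form `<s_key>value</s_key>`, with nesting for grouped fields. The function name `tagged_sequence_to_dict` and the sample field names (`menu`, `nm`, `price`, `total`, `total_price`) are illustrative assumptions, and this is a simplified stand-in, not the repository's own converter.

```python
import re

# Matches an opening tag such as <s_menu> and captures the key ("menu").
_OPEN_TAG = re.compile(r"<s_([^>]+)>")


def tagged_sequence_to_dict(seq: str) -> dict:
    """Turn '<s_key>value</s_key>' style text into a nested dict.

    Simplified illustration: it does not handle a key nested inside
    itself, and unclosed tags are skipped.
    """
    result = {}
    pos = 0
    while True:
        start = _OPEN_TAG.search(seq, pos)
        if start is None:
            break
        key = start.group(1)
        close = f"</s_{key}>"
        end = seq.find(close, start.end())
        if end == -1:
            # No matching close tag: skip this opening tag and continue.
            pos = start.end()
            continue
        inner = seq[start.end():end].strip()
        # Recurse when the content itself contains tagged fields.
        value = tagged_sequence_to_dict(inner) if "<s_" in inner else inner
        if key in result:
            # A repeated key becomes a list of values.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
        pos = end + len(close)
    return result


# Hypothetical model output for a one-item receipt.
sample = (
    "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>"
    "<s_total><s_total_price>4.50</s_total_price></s_total>"
)
fields = tagged_sequence_to_dict(sample)
```

Here `fields["menu"]["nm"]` holds the item name and `fields["total"]["total_price"]` holds the total, both as strings. Real model output should be converted with the tooling that ships with the model rather than this sketch.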

Tech stack

Python, PyTorch, Hugging Face Transformers, Gradio

Getting it running

Difficulty · hard | Time to first run · 30min

Local inference requires a capable GPU. To try the pre-trained models without local hardware, use the free Colab demo or Hugging Face Spaces.

In plain English

Donut is a research project from Naver Clova AI that reads and understands document images without needing a traditional OCR step. OCR, or optical character recognition, is the process of extracting text from an image before a computer can work with it. Most document AI systems do OCR first and then analyze the text. Donut skips that step: it takes a document image as input and produces structured output directly, using a single neural network trained end-to-end. The paper was presented at ECCV 2022.

The tasks Donut handles include document classification (deciding what type of document an image is), information extraction (pulling out specific fields like items and totals from a receipt), and document question answering (answering a question about the contents of a document image). Pre-trained models are available on Hugging Face for each of these tasks, and interactive demos are available via Gradio web interfaces and Google Colab notebooks.

Alongside Donut, the project includes SynthDoG, a synthetic document generator that creates realistic-looking document images for training data. SynthDoG was used to generate training sets in English, Chinese, Japanese, and Korean, each with 500,000 images. This synthetic data helps the model handle documents in multiple languages and visual styles without needing a large amount of manually labeled real data.

The software is installable as a Python package via pip. It is also integrated into the Hugging Face Transformers library, which means users familiar with that ecosystem can load Donut models through the standard Transformers API. Fine-tuning scripts for adapting the model to new document types are included in the repository, along with instructions for training on custom datasets.

The base Donut model was trained on 64 A100 GPUs. Inference is far less demanding than training but still benefits from a reasonably capable GPU, and the Colab demos provide a free way to try the pre-trained models without local hardware.
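
The Transformers loading path described above can be sketched roughly as follows. This is a minimal sketch, not the repository's own script: it assumes the `DonutProcessor` and `VisionEncoderDecoderModel` classes from Transformers, and the checkpoint names and task prompts in `TASKS` follow the Transformers documentation as best recalled, so verify them on the Hugging Face Hub before use. `TASKS`, `build_task_prompt`, and `run_donut` are hypothetical helper names introduced here.

```python
import re

# Assumed checkpoint names and task start prompts; verify on the Hub.
TASKS = {
    "receipt": ("naver-clova-ix/donut-base-finetuned-cord-v2", "<s_cord-v2>"),
    "classify": ("naver-clova-ix/donut-base-finetuned-rvlcdip", "<s_rvlcdip>"),
    "docvqa": (
        "naver-clova-ix/donut-base-finetuned-docvqa",
        "<s_docvqa><s_question>{question}</s_question><s_answer>",
    ),
}


def build_task_prompt(task: str, question: str = "") -> str:
    """Return the decoder prompt that selects the task for generation."""
    _, template = TASKS[task]
    return template.format(question=question)


def run_donut(image_path: str, task: str, question: str = "") -> dict:
    """Run one image through a fine-tuned Donut checkpoint."""
    # Heavy dependencies are imported lazily so the helpers above
    # can be used without torch or transformers installed.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    checkpoint, _ = TASKS[task]
    processor = DonutProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
    prompt_ids = processor.tokenizer(
        build_task_prompt(task, question),
        add_special_tokens=False,
        return_tensors="pt",
    ).input_ids.to(device)

    with torch.no_grad():
        output_ids = model.generate(
            pixel_values,
            decoder_input_ids=prompt_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
        )

    text = processor.batch_decode(output_ids)[0]
    text = text.replace(processor.tokenizer.eos_token, "")
    text = text.replace(processor.tokenizer.pad_token, "")
    # Drop the leading task start token before converting to a dict.
    text = re.sub(r"<.*?>", "", text, count=1).strip()
    return processor.token2json(text)
```

A call such as `run_donut("receipt.png", "receipt")` would download the checkpoint on first use and return a dict of extracted fields; `receipt.png` is a placeholder path. The same function covers classification and question answering because the task is chosen by the checkpoint and its prompt rather than by separate code paths.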

Copy-paste prompts

Prompt 1
Show me how to load the pre-trained Donut receipt extraction model from Hugging Face and run it on a photo of a receipt in Python.
Prompt 2
Write a Python script using Donut and the Hugging Face Transformers API to classify a folder of scanned documents as invoices, receipts, or forms.
Prompt 3
Walk me through fine-tuning Donut on my own custom document layout using SynthDoG-generated training images.
Prompt 4
How do I use the Donut document question answering model to ask questions about the contents of a contract image?
Prompt 5
Set up a Donut batch inference pipeline using the Hugging Face Transformers API to process multiple document images at once.

Verify against the repo before relying on details.