Extract structured fields like items and totals from receipt images without building or maintaining an OCR pipeline.
Classify scanned document images by type using a pre-trained Donut model loaded from Hugging Face.
Fine-tune Donut on your own document types using the included training scripts and the SynthDoG synthetic data generator.
Answer questions about the content of a document image, such as the payment due date on a contract.
Local inference requires a capable GPU; to try the pre-trained models without local hardware, use the free Colab demo or Hugging Face Spaces.
Donut is a research project from Naver Clova AI that reads and understands document images without a traditional OCR step. OCR, or optical character recognition, extracts text from an image so that a computer can work with it. Most document AI systems run OCR first and then analyze the resulting text. Donut skips that step: it takes a document image as input and produces structured output directly, using a single neural network trained end-to-end. The paper was presented at ECCV 2022.
Donut handles three tasks: document classification (deciding what type of document an image shows), information extraction (pulling out specific fields such as items and totals from a receipt), and document question answering (answering a question about the contents of a document image). Pre-trained models for each task are available on Hugging Face, and interactive demos are available as Gradio web interfaces and Google Colab notebooks.
Alongside Donut, the project includes SynthDoG, a synthetic document generator that creates realistic-looking document images for use as training data. SynthDoG was used to generate training sets in English, Chinese, Japanese, and Korean, each with 500,000 images. This synthetic data helps the model handle documents in multiple languages and visual styles without a large amount of manually labeled real data.
The software can be installed as a Python package via pip. It is also integrated into the Hugging Face Transformers library, so users familiar with that ecosystem can load Donut models through the standard Transformers API. The repository includes fine-tuning scripts for adapting the model to new document types, along with instructions for training on custom datasets.
The base Donut model was trained on 64 A100 GPUs. Running inference locally calls for a reasonably capable GPU, though the Colab demos provide a free way to try the pre-trained models without local hardware.
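
As a rough illustration of loading Donut through the Transformers API, the sketch below runs one image through a pre-trained checkpoint. It assumes the torch, transformers, and Pillow packages are installed and that the checkpoint names and task prompts match those published under the naver-clova-ix organization on the Hugging Face Hub. The names TASKS, build_prompt, strip_special_tokens, and run_donut are hypothetical helpers introduced here, not part of the repository.

```python
import re

# Assumed checkpoint names and task prompts (naver-clova-ix on the Hugging Face
# Hub); verify them against the model cards before relying on this sketch.
TASKS = {
    "extraction": ("naver-clova-ix/donut-base-finetuned-cord-v2", "<s_cord-v2>"),
    "classification": ("naver-clova-ix/donut-base-finetuned-rvlcdip", "<s_rvlcdip>"),
    "docvqa": (
        "naver-clova-ix/donut-base-finetuned-docvqa",
        "<s_docvqa><s_question>{question}</s_question><s_answer>",
    ),
}


def build_prompt(task, question=""):
    """Return (checkpoint, prompt) for a task; `question` only matters for docvqa."""
    checkpoint, template = TASKS[task]
    return checkpoint, template.replace("{question}", question)


def strip_special_tokens(sequence, eos_token, pad_token):
    """Drop EOS/PAD tokens and the leading task-start token from decoded output."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


def run_donut(image_path, task="extraction", question=""):
    """Run one document image through a Donut checkpoint and return a dict."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    checkpoint, prompt = build_prompt(task, question)
    processor = DonutProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()

    # The image goes straight to the vision encoder; there is no OCR step.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
    decoder_input_ids = processor.tokenizer(
        prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(device)

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = strip_special_tokens(
        sequence, processor.tokenizer.eos_token, processor.tokenizer.pad_token
    )
    # token2json converts the tagged output sequence into a nested dict.
    return processor.token2json(sequence)
```

Calling run_donut("receipt.png") on a receipt image of your own (the filename is a placeholder) would download the extraction checkpoint on first use. Passing task="docvqa" with a question such as "When is the payment due?" switches to the question-answering checkpoint. The prompt string is what selects the task for a given checkpoint, which is why it is built separately from the model call.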
Verify against the repo before relying on details.