explaingit

bytedance/dolphin

8,934 · Python · Audience · developer · Complexity · 4/5 · Setup · hard

TLDR

Dolphin is an AI model that reads photos or scans of documents and converts them into structured text, tables, formulas, and code, handling 21 types of document elements with JSON or Markdown output.

Mindmap

mindmap
  root((repo))
    What It Does
      Reads document images
      Classifies document type
      Extracts 21 element types
      Outputs JSON or Markdown
    Tech Stack
      Python
      Hugging Face
      vLLM
      TensorRT-LLM
    Use Cases
      Invoice processing
      Academic paper parsing
      Contract extraction
    Setup
      Clone repo
      Install dependencies
      Download model weights

Things people build with this

USE CASE 1

Extract structured data from scanned invoices, contracts, or academic papers into JSON automatically.

USE CASE 2

Build a document processing pipeline that converts photographed pages into clean Markdown for further analysis.

USE CASE 3

Parse mixed documents containing text, tables, and math formulas into machine-readable format.

USE CASE 4

Run fast parallel processing on large batches of digital PDFs to extract their layout and content.
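As a rough illustration of the JSON-to-Markdown use cases above, the sketch below renders a list of parsed elements in reading order. The element schema (`reading_order`, `label`, `text`), the label names, and the sample values are assumptions made for this example; Dolphin's actual output fields may differ, so check the repo before relying on them.

```python
import json

# Hypothetical element schema and sample data; the real output fields may differ.
SAMPLE_JSON = """
[
  {"reading_order": 1, "label": "para", "text": "Total due: 42.00 EUR"},
  {"reading_order": 0, "label": "title", "text": "Invoice 0001"},
  {"reading_order": 2, "label": "formula", "text": "a^2 + b^2 = c^2"}
]
"""


def elements_to_markdown(elements):
    """Render parsed elements as Markdown, following their reading order."""
    parts = []
    for el in sorted(elements, key=lambda e: e["reading_order"]):
        label, text = el["label"], el["text"].strip()
        if label == "title":
            parts.append(f"# {text}")          # titles become a top-level heading
        elif label == "formula":
            parts.append(f"$$\n{text}\n$$")    # formulas become display math
        else:
            parts.append(text)                 # everything else is plain text
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(elements_to_markdown(json.loads(SAMPLE_JSON)))
```

Sorting on an explicit reading-order field keeps the output stable even if elements arrive out of order, for example after parallel parsing.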

Tech stack

Python · PyTorch · Hugging Face · vLLM · TensorRT-LLM

Getting it running

Difficulty · hard
Time to first run · 1h+

Requires a GPU and downloading pre-trained model weights from Hugging Face before running inference.
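Before a first run it can help to check these prerequisites programmatically. The following is a minimal sketch using only the standard library; the `preflight` helper and the idea of a local weights directory are assumptions for illustration, not part of the Dolphin repo, and the `nvidia-smi` lookup is only a rough proxy for a usable GPU.

```python
import importlib.util
import shutil
from pathlib import Path


def preflight(weights_dir):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # find_spec checks whether PyTorch is installed without importing it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    # nvidia-smi on PATH is a rough proxy for an NVIDIA driver being present.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found (no NVIDIA GPU driver detected)")
    # The weights must already have been downloaded from Hugging Face.
    path = Path(weights_dir)
    if not path.is_dir() or not any(path.iterdir()):
        problems.append(f"no model weights found in {path}")
    return problems


if __name__ == "__main__":
    # "./hf_model" is a hypothetical location; use wherever you saved the weights.
    for problem in preflight("./hf_model"):
        print("missing:", problem)
```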

In plain English

Dolphin is an AI model from ByteDance that reads document images and converts them into structured, machine-readable output. If you have a PDF, a scanned page, or a photo of a document, Dolphin can analyze it and produce a clean representation of the text, tables, formulas, code blocks, and layout, including the correct reading order.

The problem it solves is that documents come in many forms: some are digital files where the text is already embedded, while others are photographs or scans where the content exists only as pixels. Previous tools often handled only one of these well. Dolphin-v2 (the current version) first classifies what kind of document it is looking at, then applies a parsing strategy suited to that type. Photographed documents are processed as a whole, while digital documents are broken into elements that are parsed in parallel, which is faster.

The model can identify up to 21 types of document elements, extract attribute fields, handle mathematical formulas and code, and output results as JSON or Markdown. The accompanying paper was accepted at ACL 2025, a major natural language processing research conference.

For developers who want to run it, setup involves cloning the repository, installing the Python dependencies, and downloading the pre-trained model weights from Hugging Face. Inference can be run on single images, entire directories, or PDF files. Faster inference is also supported through vLLM and TensorRT-LLM, which are tools for accelerating model serving.

This is a research model and developer tool, not a finished consumer product. It is most useful for teams building document processing pipelines, such as extracting structured data from invoices, academic papers, contracts, or scanned records.
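The classify-then-dispatch flow described above can be sketched in a few lines. Every function here is a hypothetical stand-in written for this example, not Dolphin's real API, and the thread pool only illustrates the idea of parsing elements independently; the actual model presumably batches this work on the GPU.

```python
from concurrent.futures import ThreadPoolExecutor

# All functions below are hypothetical stand-ins, not Dolphin's real API.


def classify_document(page):
    """Stand-in for the document-type classification stage."""
    return "photographed" if page.get("is_photo") else "digital"


def parse_whole_page(page):
    """Stand-in for parsing a photographed page as a whole."""
    return [{"label": "page", "text": page["content"]}]


def detect_elements(page):
    """Stand-in for layout analysis: split a digital page into elements."""
    return [line for line in page["content"].split("\n") if line.strip()]


def parse_element(element):
    """Stand-in for parsing a single element."""
    return {"label": "para", "text": element.strip()}


def parse_document(page, max_workers=4):
    """Route a page to the strategy that matches its document type."""
    if classify_document(page) == "photographed":
        return parse_whole_page(page)
    elements = detect_elements(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so reading order survives.
        return list(pool.map(parse_element, elements))


if __name__ == "__main__":
    print(parse_document({"is_photo": False, "content": "Title\n\nBody text"}))
    print(parse_document({"is_photo": True, "content": "Title\nBody"}))
```

Keeping the classifier separate from the two parsing paths means either path can be swapped or tuned without touching the other.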

Copy-paste prompts

Prompt 1
Show me how to set up Dolphin-v2 locally and run it on a folder of invoice images to extract structured JSON data.
Prompt 2
Write a Python script using Dolphin to parse a PDF and output the result as Markdown, handling multi-column layouts.
Prompt 3
How do I run Dolphin with vLLM for faster batch inference on a set of scanned academic papers?
Prompt 4
Help me extract tables from photographed documents using Dolphin and convert the results to CSV format.

Verify against the repo before relying on details.