amap-cvlab/abot-ocr

★ 17 · Python · Audience: researcher · Complexity: 3/5 · Setup: hard

Mindmap

mindmap
  root((abot-ocr))
    What it does
      Image to Markdown
      Tables to HTML
      Math to LaTeX
    Input types
      Scanned PDFs
      Document photos
      Academic papers
    Setup
      Hugging Face weights
      vLLM inference script
      4GB GPU needed
    Audience
      Researchers
      Data teams


Things people build with this

USE CASE 1

Convert scanned academic papers into editable Markdown files with tables and math formulas accurately preserved.

USE CASE 2

Process a folder of document images in batch, outputting one Markdown file per image, and resume interrupted runs automatically.

USE CASE 3

Extract structured text from photographed forms or reports into a searchable, editable format.

Tech stack

Python · vLLM

Getting it running

Difficulty: hard · Time to first run: 1h+

Requires a GPU with around 4 GB of video memory (more for large images or bigger batches), plus model weights that must be downloaded separately from Hugging Face before running.

In plain English

ABot-OCR is an AI model that reads images of document pages and converts them into structured Markdown text. OCR stands for optical character recognition, which is the technology that turns images of text into actual readable text. This particular model goes further than basic OCR by also recognizing mathematical formulas, tables, and the overall layout of the document, then outputting everything in a format that preserves that structure.

The practical use case is converting scanned PDFs or photographs of documents, academic papers, or forms into text that can be edited, searched, or processed further. Instead of outputting plain unformatted text, the model produces Markdown where tables are encoded as HTML, math formulas are written in LaTeX notation, and the document structure is retained as much as possible.

To use it, you download the model weights from Hugging Face (the files are not included in this repository due to their size) and run a Python inference script. The script uses a library called vLLM to load and run the model efficiently on a GPU. You point it at a folder of images, and it writes one Markdown file per image to an output directory. Images that already have a corresponding output file are skipped, so interrupted runs can be resumed. A GPU with around 4 GB of video memory is needed, though actual requirements depend on image size and how many images you process at once.

The README is relatively sparse and still contains placeholder notes where benchmark details and training background are intended to be filled in. The benchmark figure references a dataset called OmniDocBench. The project is from a computer vision lab and cites several earlier open-source OCR projects as influences.
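The skip-and-resume behaviour described above can be illustrated with a minimal sketch. This is not the repository's actual script: the function name `pending_images`, the list of image extensions, and the assumption that each output file is named after the image stem with a `.md` suffix are all hypothetical choices made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; the real script may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pending_images(image_dir, output_dir):
    """Return images in image_dir that have no matching .md file in output_dir.

    Assumes (hypothetically) that the output for `page.png` is `page.md`.
    """
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    pending = []
    for image in sorted(image_dir.iterdir()):
        # Ignore subdirectories and files that are not images.
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # An existing Markdown file means this image was already processed.
        if (output_dir / f"{image.stem}.md").exists():
            continue
        pending.append(image)
    return pending
```

Calling `pending_images("images", "output")` before a run would list only the images still to be converted, which is also a quick way to check how far an interrupted batch got. Check the repository's inference script for the naming scheme it really uses.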

Copy-paste prompts

Prompt 1

I have a folder of scanned document images. Walk me through downloading the ABot-OCR model weights from Hugging Face and running the vLLM inference script to convert them all to Markdown.

Prompt 2

How do I set up the Python environment and vLLM to run ABot-OCR on a machine with a 4GB GPU, and what GPU memory settings should I use for a batch of 100 images?

Prompt 3

I need to convert academic paper images that contain LaTeX equations and HTML tables into Markdown. Show me the exact commands to run ABot-OCR and what the output files look like.

Prompt 4

After running ABot-OCR on my images, how do I check which files were already processed so I can safely resume the batch without reprocessing them?


Verify against the repo before relying on details.