allenai/olmocr

★ 17,320PythonAudience · researcherComplexity · 4/5Setup · hard

Mindmap

mindmap
  root((olmocr))
    What it does
      PDF to plain text
      Image OCR
      Markdown output
    Tech Stack
      Python
      Vision model 7B
      vLLM server
      poppler-utils
    Use Cases
      LLM training data
      Table extraction
      Document search
    Audience
      Researchers
      Data engineers
      AI teams

mindmap root((olmocr)) What it does PDF to plain text Image OCR Markdown output Tech Stack Python Vision model 7B vLLM server poppler-utils Use Cases LLM training data Table extraction Document search Audience Researchers Data engineers AI teams

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Convert a folder of scanned research papers into Markdown so they can be searched or fed into an LLM pipeline

USE CASE 2

Extract tables and equations from PDF reports without losing their structure

USE CASE 3

Pre-process millions of documents as LLM training data at under $200 per million pages

USE CASE 4

Make historical or handwritten scanned records machine-readable for downstream analysis

Tech stack

PythonvLLMpoppler-utils

Getting it running

Difficulty · hard Time to first run · 1h+

Requires an NVIDIA GPU with at least 12 GB VRAM, smaller machines must use a remote vLLM server instead.

In plain English

olmOCR is a toolkit from Allen AI for converting PDFs and other image-based document formats, PNG and JPEG scans included, into clean, readable plain text or Markdown. Its stated purpose is to linearize PDFs so the text inside them can be used as training data for large language models, but it works just as well as a general-purpose OCR system for anyone who needs a high-quality text version of a document. The way it works is that the toolkit drives a 7-billion-parameter vision-language model, a neural network that looks at the rendered image of a page and writes out the text it sees. Because it is a vision model rather than a traditional OCR engine, it handles things classical OCR struggles with: equations, tables, handwriting, multi-column layouts, figures with captions, and insets. It detects and strips out repeating headers and footers and tries to emit text in a natural reading order. The project ships its own benchmark, olmOCR-Bench, with over 7,000 test cases across 1,400 documents so you can compare its accuracy against alternatives. Because the model is large, it needs a recent NVIDIA GPU with at least 12 GB of memory, smaller setups can call out to a remote vLLM server instead. Someone would use this if they have a pile of scanned reports, research papers, or old documents and want machine-readable text out of them, either to feed into an LLM pipeline, to make them searchable, or to extract tables and equations cleanly. It is written in Python, depends on poppler-utils for PDF rendering, and the team quotes a cost of under $200 per million pages converted. An online demo lives at olmocr.allenai.org. The full README is longer than what was provided.

Copy-paste prompts

Prompt 1

Using olmOCR, convert a batch of 50 scanned academic PDFs into Markdown, handling multi-column layouts and embedded equations correctly.

Prompt 2

Set up olmOCR with a remote vLLM server so I can run batch PDF extraction on a machine without a local GPU.

Prompt 3

Using olmOCR, extract all tables from a set of annual report PDFs and format them as structured Markdown tables.

Prompt 4

Run olmOCR on a folder of JPEG scans of handwritten lab notebooks and save the extracted plain-text output per page.

Prompt 5

Benchmark olmOCR against Tesseract on 100 scanned documents using the olmOCR-Bench evaluation scripts and report accuracy.

Open on GitHub → Explain another repo

← allenai on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.