Convert a folder of scanned research papers into Markdown so they can be searched or fed into an LLM pipeline
Extract tables and equations from PDF reports without losing their structure
Pre-process millions of documents as LLM training data at under $200 per million pages
Make historical or handwritten scanned records machine-readable for downstream analysis
Requires an NVIDIA GPU with at least 12 GB of VRAM; smaller machines must use a remote vLLM server instead.
olmOCR is a toolkit from Allen AI for converting PDFs and other image-based document formats, PNG and JPEG scans included, into clean, readable plain text or Markdown. Its stated purpose is to linearize PDFs so the text inside them can be used as training data for large language models, but it works just as well as a general-purpose OCR system for anyone who needs a high-quality text version of a document.

The toolkit drives a 7-billion-parameter vision-language model, a neural network that looks at the rendered image of a page and writes out the text it sees. Because it is a vision model rather than a traditional OCR engine, it handles things classical OCR struggles with: equations, tables, handwriting, multi-column layouts, figures with captions, and insets. It detects and strips out repeating headers and footers and tries to emit text in a natural reading order. The project ships its own benchmark, olmOCR-Bench, with over 7,000 test cases across 1,400 documents, so you can compare its accuracy against alternatives.

Because the model is large, it needs a recent NVIDIA GPU with at least 12 GB of memory; smaller setups can call out to a remote vLLM server instead. It is written in Python, depends on poppler-utils for PDF rendering, and the team quotes a cost of under $200 per million pages converted. It suits anyone who has a pile of scanned reports, research papers, or old documents and wants machine-readable text out of them, whether to feed into an LLM pipeline, to make them searchable, or to extract tables and equations cleanly. An online demo lives at olmocr.allenai.org. The full README is longer than what was provided.
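To put the quoted cost figure in concrete terms, here is a minimal budgeting sketch. It assumes the cost scales linearly with page count and treats the quoted $200 per million pages as an upper bound; the function name `estimate_cost_usd` and the sample page counts are illustrative and not part of olmOCR.

```python
# Upper bound quoted in the summary: under $200 per million pages converted.
# Assumption: cost scales linearly with page count (not confirmed by the source).
COST_PER_MILLION_PAGES_USD = 200.0


def estimate_cost_usd(num_pages: int,
                      rate_per_million: float = COST_PER_MILLION_PAGES_USD) -> float:
    """Return an upper-bound cost estimate in USD for converting num_pages pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    # Multiply before dividing so round page counts give exact results.
    return num_pages * rate_per_million / 1_000_000


if __name__ == "__main__":
    # Hypothetical collection sizes, chosen only for illustration.
    for pages in (5_000, 50_000, 1_000_000):
        print(f"{pages:>9,} pages: at most ${estimate_cost_usd(pages):,.2f}")
```

The actual cost depends on the hardware and how it is priced, so the result is best read as a ceiling for planning, not a quote.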
Verify against the repo before relying on details.