Extract text from scanned PDFs and document images in 90+ languages without relying on cloud APIs.
Analyze the layout and reading order of complex multi-column documents like academic papers or newspapers.
Automatically detect and extract structured data from tables within documents.
Recognize mathematical formulas and equations in scientific papers using LaTeX OCR.
PyTorch installation and OCR model downloads can be slow on first run.
Surya is a Python toolkit for extracting text and understanding the structure of documents. Optical character recognition (OCR) converts images of text, such as scanned pages, photos of documents, and PDFs, into machine-readable text. Surya does this across more than 90 languages and benchmarks competitively against commercial cloud OCR services.
Beyond basic text extraction, Surya offers several complementary capabilities. Layout analysis identifies the structural regions of a page: headers, body text, tables, images, and other zones. Reading order detection determines the logical sequence in which those regions should be read, which matters for multi-column layouts and complex documents like scientific papers. Table recognition locates rows and columns within tables so structured data can be extracted accurately. Surya also supports LaTeX OCR for recognizing mathematical formulas and equations.
The tool works on a variety of real-world document types, including scanned forms, academic papers, newspaper pages, textbooks, and presentations, in languages such as Japanese, Chinese, Arabic, and Hindi.
Installation is via pip (pip install surya-ocr), and the model weights download automatically the first time you run it. Surya includes an interactive graphical app built with Streamlit for trying it on images or PDFs without writing code. The library is written in Python and uses PyTorch as its deep learning backend. The model weights are free for personal, research, and early-stage startup use; broader commercial use requires a license from Datalab, the company behind the project.
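To make the reading-order problem described above concrete, here is a minimal sketch of a naive geometric heuristic for a two-column page. It is not Surya's method and does not use Surya's API: Surya uses trained models, and the Region class, the naive_reading_order function, and the sample coordinates below are all hypothetical names and values invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical layout region: a label plus a bounding box in pixels."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_reading_order(regions, page_width, n_columns=2):
    """Order regions column by column, top to bottom within each column.

    Illustrative heuristic only (an assumption, not Surya's algorithm).
    Regions wider than one column, such as a page-wide title, are placed
    first and sorted top to bottom; this simplification would misplace a
    page-wide footer.
    """
    column_width = page_width / n_columns
    spanning = [r for r in regions if (r.x1 - r.x0) > column_width]
    in_columns = [r for r in regions if (r.x1 - r.x0) <= column_width]

    def column_index(region):
        # Assign a region to the column that contains its horizontal center.
        center = (region.x0 + region.x1) / 2
        return min(int(center // column_width), n_columns - 1)

    spanning.sort(key=lambda r: r.y0)
    in_columns.sort(key=lambda r: (column_index(r), r.y0))
    return spanning + in_columns


# Made-up regions on a 600-pixel-wide page, deliberately out of order.
regions = [
    Region("right-top", 320, 100, 580, 300),
    Region("left-bottom", 20, 320, 280, 600),
    Region("title", 20, 20, 580, 80),
    Region("left-top", 20, 100, 280, 300),
    Region("right-bottom", 320, 320, 580, 600),
]
ordered = [r.label for r in naive_reading_order(regions, page_width=600)]
print(ordered)
```

A plain top-to-bottom sort would interleave the left and right columns here, which is why reading order is treated as a separate step from layout analysis. Fixed geometric rules like this one break down on irregular layouts (sidebars, figures spanning columns, footers), which is the kind of case a learned reading-order model is meant to handle.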
Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.