explaingit

datalab-to/surya

19,753 | Python | Audience · developer | Complexity · 2/5 | Active | License | Setup · moderate

TLDR

Python toolkit that extracts text from documents and images using OCR across 90+ languages, plus layout analysis, reading order detection, and table recognition.

Mindmap

mindmap
  root((Surya))
    What it does
      OCR text extraction
      Layout analysis
      Reading order detection
      Table recognition
    Document types
      Scanned forms
      Academic papers
      Newspaper pages
      Presentations
    Languages supported
      90+ languages
      Japanese, Chinese
      Arabic, Hindi
    Tech stack
      Python
      PyTorch
      Streamlit app
    Use cases
      Extract text from PDFs
      Digitize scanned documents
      Analyze document structure
    Getting started
      pip install
      Auto-download models
      Interactive web app

Things people build with this

USE CASE 1

Extract text from scanned PDFs and document images in 90+ languages without relying on cloud APIs.

USE CASE 2

Analyze the layout and reading order of complex multi-column documents like academic papers or newspapers.

USE CASE 3

Automatically detect and extract structured data from tables within documents.

USE CASE 4

Recognize mathematical formulas and equations in scientific papers using LaTeX OCR.

Tech stack

Python, PyTorch, Streamlit, OCR

Getting it running

Difficulty · moderate | Time to first run · 30 min

PyTorch installation and OCR model downloads can be slow on first run.

Free for personal, research, and early-stage startup use; commercial use requires a license from Datalab.

In plain English

Surya is a Python toolkit for extracting text and understanding the structure of documents. Optical character recognition (OCR) converts images of text (scanned pages, photos of documents, PDFs) into machine-readable text. Surya does this across more than 90 languages and benchmarks competitively against commercial cloud OCR services.

Beyond basic text extraction, Surya offers several complementary capabilities. Layout analysis identifies the structural regions of a page: headers, body text, tables, images, and other zones. Reading order detection determines the logical sequence in which those regions should be read, which matters for multi-column layouts and complex documents such as scientific papers. Table recognition locates the rows and columns within a table so that structured data can be extracted accurately. Surya also supports LaTeX OCR for recognizing mathematical formulas and equations.

The tool works on a variety of real-world document types, including scanned forms, academic papers, newspaper pages, textbooks, and presentations, in languages such as Japanese, Chinese, Arabic, and Hindi.

Installation is via pip (pip install surya-ocr), and the model weights download automatically the first time you run it. An interactive app built with Streamlit lets you try Surya on images or PDFs without writing code. The library is written in Python and uses PyTorch as its deep learning backend.

The model weights are free for personal, research, and early-stage startup use; broader commercial use requires a license from Datalab, the company behind the project.
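To make the reading order idea concrete, here is a minimal sketch of why it matters on a two-column page. It is a hand-written geometric heuristic, not Surya's method (Surya relies on PyTorch models for this) and not its API. The `Region` class, the `reading_order` function, and the `full_width_ratio` threshold are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical layout region: a label plus a bounding box in pixels."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(regions, page_width, full_width_ratio=0.6):
    """Order regions on a simple two-column page (illustrative heuristic).

    Regions at least `full_width_ratio` of the page wide are treated as
    spanning both columns. Between two such regions, the left column is
    read top to bottom, then the right column.
    """
    mid = page_width / 2
    ordered, pending = [], []

    def flush():
        # Emit buffered column regions: left column first, then right.
        left = [r for r in pending if (r.x0 + r.x1) / 2 < mid]
        right = [r for r in pending if (r.x0 + r.x1) / 2 >= mid]
        ordered.extend(sorted(left, key=lambda r: r.y0))
        ordered.extend(sorted(right, key=lambda r: r.y0))
        pending.clear()

    for region in sorted(regions, key=lambda r: r.y0):
        if region.x1 - region.x0 >= full_width_ratio * page_width:
            flush()                 # finish the columns above this region
            ordered.append(region)  # then emit the full-width region
        else:
            pending.append(region)
    flush()
    return ordered


# A made-up page, listed out of order on purpose.
page = [
    Region("right-1", 520, 120, 960, 500),
    Region("title", 40, 20, 960, 90),
    Region("left-2", 40, 520, 480, 900),
    Region("left-1", 40, 120, 480, 500),
    Region("right-2", 520, 520, 960, 900),
    Region("footer", 40, 920, 960, 980),
]
order = [r.label for r in reading_order(page, page_width=1000)]
print(order)
```

Sorting purely by vertical position would interleave the two columns (left-1, right-1, left-2, right-2). Real pages with figures, sidebars, or more than two columns break a fixed rule like this one quickly, which is the case a dedicated reading order model is meant to handle.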

Copy-paste prompts

Prompt 1
How do I use Surya to extract text from a PDF file and get the reading order of regions?
Prompt 2
Show me how to set up Surya's layout analysis to identify headers, body text, and tables in a scanned document.
Prompt 3
I have a folder of images with text in multiple languages. How do I batch process them with Surya OCR?
Prompt 4
Can you help me integrate Surya into a Python script to extract table data from document images?
Prompt 5
How do I use Surya's Streamlit app to test OCR on my own documents without writing code?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.