explaingit

datalab-to/surya

Analysis updated 2026-06-21

Stars · 19,734   Language · Python   Audience · developer   Complexity · 3/5   Setup · moderate

TLDR

A Python toolkit that converts scanned documents and images into machine-readable text across 90-plus languages, and also extracts tables, page structure, reading order, and math formulas.

Mindmap

mindmap
  root((repo))
    What It Does
      OCR text extraction
      Layout analysis
      Table recognition
      Math formula OCR
    Tech Stack
      Python
      PyTorch
      Streamlit UI
    Use Cases
      Scanned PDF parsing
      Document pipelines
      Academic paper parsing
    Languages
      90 plus languages
      Arabic and Hindi
      Japanese and Chinese
    Audience
      Developers
      Data engineers

What do people build with it?

USE CASE 1

Convert a folder of scanned PDF invoices into searchable text and extract table data from each page.

USE CASE 2

Build a document processing pipeline that identifies headers, body text, and tables in academic papers and returns them in correct reading order.

USE CASE 3

Recognize and extract LaTeX math equations from images of textbook pages or scientific papers.

USE CASE 4

Process multi-column layouts or documents in Japanese, Chinese, Arabic, or Hindi and get correctly ordered machine-readable text.
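
For use case 1, once table cells have been recognized, saving them as CSV is mostly a matter of placing each cell in a grid. The sketch below assumes a hypothetical `Cell` record with zero-based `row` and `col` indices and a `text` field. These names are illustrative only and are not Surya's actual result types, so check the repo and map the library's real output onto them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical recognized table cell (not a Surya type)."""
    row: int   # zero-based row index
    col: int   # zero-based column index
    text: str  # recognized cell text


def cells_to_csv(cells: list[Cell]) -> str:
    """Place cells into a dense grid and serialize the grid as CSV text."""
    if not cells:
        return ""
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    # Positions with no recognized cell stay as empty strings.
    grid = [[""] * n_cols for _ in range(n_rows)]
    for c in cells:
        grid[c.row][c.col] = c.text
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


if __name__ == "__main__":
    demo = [Cell(0, 0, "item"), Cell(0, 1, "qty"),
            Cell(1, 0, "widget"), Cell(1, 1, "2")]
    print(cells_to_csv(demo), end="")
```

Building the full grid first means cells can arrive in any order and gaps in the table still produce correctly aligned columns.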

What is it built with?

Python · PyTorch · Streamlit

How does it compare?

                  datalab-to/surya   google-research/timesfm   quantopian/zipline
Stars             19,734             19,755                    19,764
Language          Python             Python                    Python
Setup difficulty  moderate           easy                      moderate
Complexity        3/5                2/5                       3/5
Audience          developer          developer                 data

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate   Time to first run · 30 min

PyTorch model weights download automatically on first run.

Free for personal, research, and early-stage startup use; broader commercial use requires a paid license from Datalab.

In plain English

Surya is a Python toolkit for extracting text and understanding the structure of documents. Optical character recognition (OCR) converts images of text (scanned pages, photos of documents, PDFs) into machine-readable text. Surya does this across more than 90 languages and benchmarks competitively against commercial cloud OCR services.

Beyond basic text extraction, Surya offers several complementary capabilities. Layout analysis identifies the structural regions of a page: headers, body text, tables, images, and other zones. Reading order detection determines the logical sequence in which those regions should be read, which matters for multi-column layouts and complex documents such as scientific papers. Table recognition locates rows and columns within tables so structured data can be extracted accurately. Surya also supports LaTeX OCR for recognizing mathematical formulas and equations.

The tool works on a variety of real-world document types, including scanned forms, academic papers, newspaper pages, textbooks, and presentations, in languages such as Japanese, Chinese, Arabic, and Hindi.

Installation is via pip (pip install surya-ocr), and the model weights download automatically the first time you run it. Surya includes an interactive graphical app built with Streamlit for trying it on images or PDFs without writing code. The library is written in Python and uses PyTorch as its deep learning backend.

The model weights are free for personal, research, and early-stage startup use; broader commercial use requires a license from Datalab, the company behind the project.
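
To make the reading-order problem concrete, here is a deliberately naive geometric sketch for a two-column page. It is not how Surya determines reading order, and the `Region` type and `naive_reading_order` function are hypothetical names introduced for illustration. It assumes regions wider than 60% of the page (titles, for example) come first, followed by the left column and then the right column, each sorted top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical layout region (not a Surya type)."""
    label: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def naive_reading_order(regions: list[Region], page_width: float) -> list[Region]:
    """Order regions as: page-spanning, then left column, then right column."""
    mid = page_width / 2

    def sort_key(region: Region) -> tuple[int, float]:
        x0, y0, x1, _ = region.bbox
        if (x1 - x0) > 0.6 * page_width:
            group = 0  # spans the page, e.g. a title (assumed to sit at the top)
        elif (x0 + x1) / 2 < mid:
            group = 1  # left column
        else:
            group = 2  # right column
        return (group, y0)  # within a group, read top to bottom

    return sorted(regions, key=sort_key)


if __name__ == "__main__":
    page = [
        Region("right-1", (320, 100, 580, 200)),
        Region("left-2", (20, 220, 280, 400)),
        Region("title", (20, 20, 580, 80)),
        Region("left-1", (20, 100, 280, 200)),
    ]
    print([r.label for r in naive_reading_order(page, page_width=600)])
```

A heuristic like this breaks as soon as a full-width figure sits mid-page or the column count varies, which is the kind of case a dedicated reading-order step is meant to handle.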

Copy-paste prompts

Prompt 1
Using Surya, write Python code to OCR a PDF and extract all text in correct reading order for a two-column academic paper.
Prompt 2
I have 500 scanned invoice images. Write a Surya script that extracts tables from each image and saves row and column data as CSV files.
Prompt 3
Show me how to use Surya layout analysis to identify and crop just the table regions from a page image, then extract the structured data.
Prompt 4
Help me run the Surya Streamlit demo app locally on my Mac to test OCR on a sample document before integrating it into my pipeline.
Prompt 5
I need to OCR Arabic and Hindi documents. Show me how to configure Surya for right-to-left and complex-script languages and process a batch of images.

Frequently asked questions

What is surya?

A Python toolkit that converts scanned documents and images into machine-readable text across 90-plus languages, and also extracts tables, page structure, reading order, and math formulas.

What language is surya written in?

Mainly Python. The stack also includes PyTorch and Streamlit.

What license does surya use?

Free for personal, research, and early-stage startup use; broader commercial use requires a paid license from Datalab.

How hard is surya to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is surya for?

Mainly developers.

Verify against the repo before relying on details.