datalab-to/chandra

★ 10,572 | Python | Audience · developer | Complexity · 3/5 | License | Setup · moderate

Mindmap

mindmap
  root((Chandra OCR))
    What it does
      Text from images
      PDF extraction
      Layout preserved
    Output formats
      Markdown
      HTML
      JSON
    Content types
      Tables and forms
      Handwriting
      Math formulas
    Tech stack
      Python
      PyTorch
      HuggingFace


Things people build with this

USE CASE 1

Extract structured text from scanned PDF invoices, forms, or contracts and output them as Markdown or JSON for further processing.

USE CASE 2

Process a batch of multilingual documents through Chandra's command-line tool to get searchable, structured output from image files.

USE CASE 3

Convert research papers containing mathematical formulas, figures, and tables into clean Markdown files for editing or archiving.

USE CASE 4

Build a document processing pipeline that extracts filled form data including checkboxes from uploaded PDFs and stores the results as structured JSON.
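
The batch-oriented use cases above could be driven from Python roughly as follows. This is a minimal sketch, not the project's documented interface: the `chandra` command name, the `--method` flag with `hf`/`vllm` values, the list of supported file suffixes, and the helper names (`find_documents`, `build_command`, `run_batch`) are all assumptions to check against the repo's README before use.

```python
import subprocess
from pathlib import Path

# Assumed file types; the page only mentions PDFs and image files.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def find_documents(folder):
    """Return supported files in `folder` (non-recursive), sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def build_command(input_path, output_dir, method="vllm"):
    """Build an argument list for the CLI.

    The command name and flag are assumptions, not confirmed by this page.
    """
    return ["chandra", str(input_path), str(output_dir), "--method", method]


def run_batch(folder, output_dir, method="vllm", dry_run=True):
    """Build one command per document; execute them only if dry_run is False."""
    commands = [build_command(p, output_dir, method) for p in find_documents(folder)]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the tool exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` (the default here), `run_batch` only returns the commands it would run, which makes it easy to inspect them before processing a large folder.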

Tech stack

Python | PyTorch | HuggingFace | vLLM

Getting it running

Difficulty · moderate | Time to first run · 30 min

The HuggingFace backend requires PyTorch to be installed locally.

Code is Apache 2.0 and free for commercial use, model weights use OpenRAIL-M, and commercial self-hosting requires a separate license from Datalab.

In plain English

Chandra is an OCR model, meaning it reads text from images and PDF files and converts that content into structured digital formats. OCR stands for optical character recognition, the technology that lets a computer extract the words from a scanned document or photograph. Chandra goes beyond basic text extraction by preserving the layout of the original document and outputting the result as Markdown, HTML, or JSON.

What sets it apart is its handling of difficult content types. It accurately processes complex tables, filled-in forms including checkboxes, handwritten text, mathematical formulas, charts, and documents in over 90 languages. The README includes side-by-side benchmark comparisons showing its accuracy against other publicly available OCR tools on multilingual documents.

To use it, you install the Python package with pip, then run a command-line tool, pointing it at a file or folder. It supports two ways of running the underlying AI model. One uses HuggingFace, a popular AI model platform, and requires the PyTorch library to be installed locally. The other uses vLLM, a server-based approach that is lighter to set up. A browser-based demo app is also included for trying it out on single pages.

For each processed document, Chandra produces a folder of output files: a Markdown version, an HTML version, a JSON metadata file, and any images extracted from the document. You can control which page range to process, how many pages to handle in parallel, and whether to include page headers and footers in the output.

The code is released under the Apache 2.0 license. The underlying model weights use the OpenRAIL-M license. A managed cloud version with higher accuracy, batch processing at scale, and SOC 2 Type 2 compliance is available from Datalab, the company behind the project. Commercial self-hosting requires a separate license.
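
The per-document output folder described above lends itself to a small loader. The sketch below is an illustration under assumptions: it takes one folder per document and identifies files purely by extension (`.md`, `.html`, `.json`, and a few common image types), because the actual file names Chandra writes are not stated here. `load_document_outputs` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}  # assumed image types


def load_document_outputs(doc_dir):
    """Group the files in one document's output folder by kind.

    If several files share an extension, the last one in sorted order wins
    for the markdown, html, and metadata slots.
    """
    result = {"markdown": None, "html": None, "metadata": None, "images": []}
    for path in sorted(Path(doc_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".md":
            result["markdown"] = path.read_text(encoding="utf-8")
        elif suffix == ".html":
            result["html"] = path.read_text(encoding="utf-8")
        elif suffix == ".json":
            result["metadata"] = json.loads(path.read_text(encoding="utf-8"))
        elif suffix in IMAGE_SUFFIXES:
            result["images"].append(path.name)  # keep names only, not bytes
    return result
```

Returning a plain dictionary keeps the result easy to serialize again, for example when storing extracted form data as JSON.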

Copy-paste prompts

Prompt 1

Using Chandra, write a Python script that processes all PDFs in a folder and saves each one as a Markdown file, including tables and handling multi-page documents.

Prompt 2

I want to extract data from filled-in paper forms using Chandra. Walk me through installing it with pip and processing a single scanned form image into JSON output.

Prompt 3

Set up Chandra with vLLM as the backend instead of HuggingFace to reduce local memory usage, then run it on a batch of 50 scanned invoice PDFs.

Prompt 4

Chandra extracted a table from my PDF but the Markdown formatting looks wrong. How do I improve table extraction accuracy or clean up the output automatically?


Verify against the repo before relying on details.