Extract structured text from scanned PDF invoices, forms, or contracts and output it as Markdown or JSON for further processing.
Process a batch of multilingual documents through Chandra's command-line tool to get searchable, structured output from image files.
Convert research papers containing mathematical formulas, figures, and tables into clean Markdown files for editing or archiving.
Build a document processing pipeline that extracts filled form data including checkboxes from uploaded PDFs and stores the results as structured JSON.
Requires PyTorch installed locally for the HuggingFace backend; commercial self-hosting requires a separate license from Datalab.
Chandra is an OCR model, meaning it reads text from images and PDF files and converts that content into structured digital formats. OCR stands for optical character recognition, the technology that lets a computer extract the words from a scanned document or photograph. Chandra goes beyond basic text extraction by preserving the layout of the original document and outputting the result as Markdown, HTML, or JSON.

What sets it apart is its handling of difficult content types. It processes complex tables, filled-in forms including checkboxes, handwritten text, mathematical formulas, charts, and documents in over 90 languages. The README includes side-by-side benchmark comparisons showing its accuracy against other publicly available OCR tools on multilingual documents.

To use it, you install the Python package with pip, then run a command-line tool and point it at a file or folder. It supports two ways of running the underlying AI model: one uses HuggingFace, a popular AI model platform, and requires the PyTorch library to be installed locally; the other uses vLLM, a server-based approach that is lighter to set up. A browser-based demo app is also included for trying it out on single pages.

For each processed document, Chandra produces a folder of output files: a Markdown version, an HTML version, a JSON metadata file, and any images extracted from the document. You can control which page range to process, how many pages to handle in parallel, and whether to include page headers and footers in the output.

The code is released under the Apache 2.0 license. The underlying model weights use the OpenRAIL-M license. A managed cloud version with higher accuracy, batch processing at scale, and SOC 2 Type 2 compliance is available from Datalab, the company behind the project. Commercial self-hosting requires a separate license.
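The per-document output folder described above can be read back into a pipeline with a short script. This is a minimal sketch, not part of Chandra's own API: the function name `collect_outputs`, the idea of classifying files by extension, and the list of image extensions are all assumptions made for illustration, and the actual file names Chandra writes should be checked against the repo.

```python
import json
from pathlib import Path

# Assumption: extracted images use one of these common extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def collect_outputs(doc_dir):
    """Group the files in one per-document output folder by kind.

    Hypothetical helper: it classifies files purely by extension and
    does not depend on any particular file naming scheme.
    """
    doc_dir = Path(doc_dir)
    result = {"markdown": None, "html": None, "metadata": None, "images": []}
    for path in sorted(doc_dir.iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".md":
            result["markdown"] = path.read_text(encoding="utf-8")
        elif suffix == ".html":
            result["html"] = path.read_text(encoding="utf-8")
        elif suffix == ".json":
            result["metadata"] = json.loads(path.read_text(encoding="utf-8"))
        elif suffix in IMAGE_SUFFIXES:
            # Keep only the file name; callers can join it with doc_dir.
            result["images"].append(path.name)
    return result
```

Calling `collect_outputs("output/invoice")` on a folder like the one described returns a dictionary with the Markdown and HTML text, the parsed JSON metadata, and the names of any extracted images, which can then be stored or indexed as needed.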
Verify against the repo before relying on details.