Analysis updated 2026-07-03
Extract all tables from a government PDF report and export them to CSV for analysis in Excel or Google Sheets.
Convert financial statements stored as PDFs into pandas DataFrames so you can run calculations on the numbers.
Use the command-line interface to batch-extract tables from multiple PDF files without writing any Python code.
Filter out low-quality extractions automatically using Camelot's built-in accuracy and whitespace scores.
| | camelot-dev/camelot | purpleailab/decepticon | openai/glide-text2im |
|---|---|---|---|
| Stars | 3,691 | 3,691 | 3,690 |
| Language | Python | Python | Python |
| Setup difficulty | easy | moderate | easy |
| Complexity | 2/5 | 4/5 | 3/5 |
| Audience | data | ops devops | researcher |
Figures from each repo's GitHub metadata at analysis time.
Only works with text-based PDFs; scanned image PDFs are not supported.
Camelot is a Python library for pulling tables out of PDF files and turning them into structured data you can actually work with. PDFs are notoriously difficult to extract data from because the format is designed for display, not data exchange. Camelot solves that problem for text-based PDFs: the kind where you can click and drag to select text in a PDF viewer.
A few lines of Python code are all you need to get started. You point the library at a PDF file, it finds the tables, and it returns them as pandas DataFrames (a standard Python format for tabular data). From there you can export the results to CSV, JSON, Excel, HTML, Markdown, or SQLite. Each extracted table also comes with quality metrics, including an accuracy score and a whitespace score, so you can filter out poorly extracted tables without checking each one by hand.
The library also includes a command-line interface, so you do not need to write any Python if you just want to run a quick extraction. It can be installed through pip (the standard Python package installer) or through conda for Anaconda users.
One important limitation: Camelot only works with text-based PDFs. Scanned documents (PDFs that are essentially images of pages) are not supported. If you cannot select and copy text from a table in your PDF viewer, Camelot will not be able to extract it.
The project is released under the MIT license. Community-maintained wrappers exist for PHP, along with a separate C# implementation.
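The extract-then-filter workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's recommended recipe: it assumes Camelot's `read_pdf` function and the per-table `parsing_report` dictionary (with `accuracy` and `whitespace` keys), while the helper names `is_reliable` and `extract_reliable_tables` and the threshold values 90 and 30 are hypothetical choices made for this example.

```python
from typing import Any, Dict, List


def is_reliable(
    report: Dict[str, Any],
    min_accuracy: float = 90.0,   # hypothetical threshold, tune for your PDFs
    max_whitespace: float = 30.0,  # hypothetical threshold, tune for your PDFs
) -> bool:
    """Return True when a table's parsing report passes both thresholds."""
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


def extract_reliable_tables(pdf_path: str, pages: str = "1") -> List[Any]:
    """Read tables from a text-based PDF and keep those that pass is_reliable."""
    # Imported lazily so is_reliable can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [table for table in tables if is_reliable(table.parsing_report)]
```

With a hypothetical file named `report.pdf`, calling `extract_reliable_tables("report.pdf", pages="1-3")` would return the tables that pass both checks. Each returned table exposes its data as a pandas DataFrame through `.df` and can be written out with `.to_csv(path)`.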
A Python library that pulls tables out of PDF files and converts them into spreadsheet-ready data in just a few lines of code.
Mainly Python. The stack also includes pandas.
MIT license: use freely for any purpose, including commercial use, as long as you keep the copyright notice.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Mainly aimed at people who work with data.
Verify against the repo before relying on details.