explaingit

tabulapdf/tabula

7,401 · CSS
Audience · general
Complexity · 2/5
Setup · easy

TLDR

A desktop application that extracts data tables from text-based PDF files and saves them as CSV or spreadsheet-ready data, running locally in your browser so your files never leave your machine.

Mindmap

mindmap
  root((tabula))
    What it does
      Extract tables
      PDF to CSV
      Local browser UI
    Input Types
      Text-based PDFs
      Not scanned images
    Install Options
      Windows app
      macOS app
      Docker Compose
      Java JAR file
    Use Cases
      Data extraction
      Report conversion
      Spreadsheet import
    Limitations
      No OCR support
      Volunteer project

Things people build with this

USE CASE 1

Extract a table of numbers from a government or financial PDF report into a CSV file for analysis

USE CASE 2

Convert a multi-page PDF data export into a spreadsheet without copy-paste errors

USE CASE 3

Run Tabula via Docker to process PDFs in a repeatable automated workflow

Tech stack

Java · CSS · Docker

Getting it running

Difficulty · easy
Time to first run · 30 min

Only works on text-based PDFs; scanned images need a separate OCR step first. Requires Java 7 or newer.
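Since the JAR route depends on Java 7 or newer, a quick check of the installed version can save a failed first launch. The following is a minimal sketch, not part of Tabula: the helper names `parse_java_major` and `installed_java_major` are hypothetical, and it assumes `java` is on the PATH and prints the usual `version "..."` banner.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.7.0_80" means Java 7. Modern scheme: "17.0.2" means Java 17.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and parse it; returns None if Java cannot be started."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except OSError:
        return None
    # The JVM prints its version banner to stderr rather than stdout.
    return parse_java_major(result.stderr or result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("Java not found")
    elif major < 7:
        print(f"Java {major} is too old; Tabula needs 7 or newer")
    else:
        print(f"Java {major} found")
```

The packaged Windows and macOS apps may not need this check; it is mainly useful before running the plain JAR file.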

In plain English

Tabula is a desktop application that extracts data tables from PDF files and converts them into spreadsheet-friendly formats like CSV. If you have ever received a PDF containing a table of numbers or a data report and needed that information in a spreadsheet, but found that copying it out was impossible or produced garbled results, Tabula addresses exactly that problem. You upload the PDF, draw a selection box around the table you want, and Tabula pulls out the rows and columns as structured data you can open in Excel or import into a database.

The application runs locally on your machine and works through a browser interface. After launching it, a web page opens at a local address (127.0.0.1:8080) where you do all the work. Your files never leave your computer, which matters when working with confidential documents. The README does note two small exceptions: the app makes a request to check for newer versions and sends a usage count to a statistics counter, both of which can be disabled with command-line flags if needed.

Tabula only works with text-based PDFs, not scanned images. A quick test is whether you can click and drag to select text in the PDF using a standard PDF viewer. If you can, Tabula should be able to read it. Scanned pages that contain pictures of text require a separate optical character recognition step before Tabula can help.

Tabula is available as a packaged app for Windows and macOS, a snap package for Linux, a plain JAR file runnable with Java on any platform, or via Docker Compose. Java 7 or newer is required. A separate command-line library called tabula-java handles the underlying extraction logic and continues to receive occasional updates from the community. The README opens with a note that Tabula is a volunteer project with no active paid development at this time, and the end-user application here is unlikely to see near-term updates.
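For repeatable batch work, the tabula-java command-line tool can be driven from a script instead of the browser UI. The sketch below is an illustration under stated assumptions, not the project's own tooling: the jar path `tabula-java.jar` is hypothetical (the released file name carries a version number), the helper names are made up for this example, and it relies on tabula-java's `--pages`, `--format`, and `--outfile` options, which are worth confirming against the tool's `--help` output for your version.

```python
from pathlib import Path
import subprocess

# Hypothetical jar location; the real release file name includes a version number.
TABULA_JAR = Path("tabula-java.jar")


def build_command(pdf, out_dir, jar=TABULA_JAR, pages="all"):
    """Build the argument list for one tabula-java run that writes <name>.csv."""
    csv_path = Path(out_dir) / (Path(pdf).stem + ".csv")
    return [
        "java", "-jar", str(jar),
        "--pages", pages,          # "all", or a range such as "1-3"
        "--format", "CSV",
        "--outfile", str(csv_path),
        str(pdf),
    ]


def extract_directory(pdf_dir, out_dir, jar=TABULA_JAR):
    """Run tabula-java once per PDF in pdf_dir and return the CSV paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = build_command(pdf, out_dir, jar)
        # check=True raises CalledProcessError if extraction fails for a file.
        subprocess.run(cmd, check=True)
        written.append(out_dir / (pdf.stem + ".csv"))
    return written
```

Calling `extract_directory("reports", "csv_out")` would process every `*.pdf` in `reports`. Passing `--pages all` matters for multi-page exports, and the text-based-PDF limitation described above still applies.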

Copy-paste prompts

Prompt 1
I have a PDF with a financial data table. Walk me through using Tabula to extract it as a CSV I can open in Excel, step by step.
Prompt 2
How do I install Tabula on Windows and process a folder of PDF reports to get CSV files out?
Prompt 3
Write a shell script that uses the tabula-java command-line library to extract tables from every PDF in a given directory.
Prompt 4
What are the limitations of Tabula, when will it fail, and what tool should I use for scanned PDFs instead?

Verify against the repo before relying on details.