explaingit

pdfminer/pdfminer.six

6,972 | Python | Audience · developer | Complexity · 2/5 | Setup · easy

TLDR

pdfminer.six is a Python library that extracts text, images, form data, and precise layout information directly from PDF files, giving you the font, color, and exact position of every character on the page.

Mindmap

mindmap
  root((repo))
    What it does
      Extract PDF text
      Get layout positions
      Pull embedded images
      Read form fields
    Supported content
      Text with fonts
      Images JPG PNG TIFF
      AcroForms
      Table of contents
    Output formats
      Plain text
      HTML
      hOCR
    Use cases
      Search indexing
      Document pipeline
      Form data extraction
    Audience
      Python developers
      Data engineers

Things people build with this

USE CASE 1

Extract all text from a PDF document for search indexing or downstream text processing.

USE CASE 2

Pull out interactive form field values from AcroForm PDFs without rendering the file.

USE CASE 3

Retrieve images embedded in a PDF in their original JPG, PNG, or TIFF format.

USE CASE 4

Convert a PDF to HTML while preserving layout for use in a document processing pipeline.
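For the HTML conversion in use case 4, a minimal sketch might look like the following. It assumes pdfminer.six is installed and uses the high-level `extract_text_to_fp` function with `output_type="html"`; the function name `pdf_to_html` and the file name in the usage note are hypothetical.

```python
from io import StringIO
from pathlib import Path


def pdf_to_html(pdf_path: str) -> str:
    """Return an HTML rendering of the PDF at pdf_path (hypothetical helper)."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Lazy import keeps the pdfminer.six dependency local to this helper.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    output = StringIO()
    with open(pdf_path, "rb") as pdf_file:
        # codec=None because the output is written to a text buffer, not bytes.
        extract_text_to_fp(
            pdf_file,
            output,
            laparams=LAParams(),
            output_type="html",
            codec=None,
        )
    return output.getvalue()
```

Calling `pdf_to_html("example.pdf")` would return the markup as a string, which a pipeline could write to disk or pass to the next stage.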

Tech stack

Python

Getting it running

Difficulty · easy | Time to first run · 5 min

Install with pip. A basic two-line extraction then works out of the box with no configuration.
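As a rough sketch of that first run, assuming the package has been installed with `pip install pdfminer.six`: the two essential lines are the import of `extract_text` and the call to it. The wrapper name `pdf_to_text` and the path check are illustrative additions, not part of the library.

```python
from pathlib import Path


def pdf_to_text(pdf_path: str) -> str:
    """Return all text in the PDF at pdf_path as a single string."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # The two lines that matter: import the high-level helper, then call it.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)
```

For example, `print(pdf_to_text("example.pdf"))` would print the text of a file named `example.pdf` (a placeholder path).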

In plain English

pdfminer.six is a Python library for extracting text and other content from PDF files. It works by reading the PDF source directly rather than rendering the page visually, which means it can pull out not just the text itself but also the precise position, font, and color of each piece of text on a page. This makes it more useful than tools that simply render a PDF to an image and then try to recognize characters.

Beyond plain text, the library can extract images embedded in PDFs (in formats like JPG, PNG, and TIFF), pull out interactive form data (AcroForms), retrieve the table of contents, and output content as HTML or hOCR (a format used in document processing workflows). It handles encrypted PDFs using RC4 and AES, and it supports CJK (Chinese, Japanese, Korean) languages as well as vertical text layouts, which are common in those scripts.

The library is designed to be modular. Each part of the extraction pipeline can be replaced with a custom implementation, so developers building specialized document processing tools can slot in their own components while reusing the rest of the library.

Installation is a single pip command. A basic use case, extracting all text from a PDF, takes two lines of Python code. A command-line tool called pdf2txt.py is also included for quick extraction without writing any code.

This project is a community-maintained fork of the original PDFMiner, which is no longer actively developed. The README notes that the maintainers have limited availability, so the most reliable way to get a bug fixed is to submit a pull request yourself rather than waiting for a maintainer to handle it.
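The position and font details described above can be read from the layout objects that `extract_pages` yields. The sketch below walks one page down to individual `LTChar` objects and records each character's text, font name, size, and bounding box. `CharInfo`, `iter_chars`, and `font_usage` are hypothetical names introduced for illustration, and color is left out to keep the example short.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Tuple


class CharInfo(NamedTuple):
    """One character plus the layout details reported for it."""
    text: str
    fontname: str
    size: float
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def iter_chars(pdf_path: str, page_index: int = 0) -> Iterator[CharInfo]:
    """Yield a CharInfo for every character on one page (zero-based index)."""
    # Lazy import keeps the pdfminer.six dependency local to this helper.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path, page_numbers=[page_index]):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, rectangles, and images
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTAnno objects (inserted spaces/newlines) carry no position.
                    if isinstance(obj, LTChar):
                        yield CharInfo(obj.get_text(), obj.fontname, obj.size, obj.bbox)


def font_usage(chars: Iterable[CharInfo]) -> Counter:
    """Count how many characters use each font name."""
    return Counter(char.fontname for char in chars)
```

A call such as `font_usage(iter_chars("example.pdf"))` would then summarize which fonts appear on the first page of a placeholder file named `example.pdf`.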

Copy-paste prompts

Prompt 1
Show me the minimal Python code to extract all text from a multi-page PDF using pdfminer.six, printing each page's content separately.
Prompt 2
I need to extract form field values from an AcroForm PDF using pdfminer.six. Give me the Python code to read all the fields and their values.
Prompt 3
How do I use pdfminer.six to get the bounding box coordinates, font name, and font size for every piece of text on a specific page?
Prompt 4
Use the pdf2txt.py command-line tool from pdfminer.six to extract text from a PDF to a text file without writing any Python code.

Verify against the repo before relying on details.