Extract all text from a PDF document for search indexing or downstream text processing.
Pull out interactive form field values from AcroForm PDFs without rendering the file.
Retrieve images embedded in a PDF in their original JPG, PNG, or TIFF format.
Convert a PDF to HTML while preserving layout for use in a document processing pipeline.
Install with pip; a basic two-line extraction works out of the box with no configuration.
pdfminer.six is a Python library for extracting text and other content from PDF files. It reads the PDF source directly rather than rendering the page visually, so it can recover not just the text itself but also the precise position, font, and color of each piece of text on a page. Tools that render a PDF to an image and then run character recognition on it cannot recover this information as reliably.

Beyond plain text, the library can extract images embedded in PDFs (in formats like JPG, PNG, and TIFF), pull out interactive form data (AcroForms), retrieve the table of contents, and output content as HTML or hOCR (a format used in document processing workflows). It handles PDFs encrypted with RC4 or AES, and it supports CJK (Chinese, Japanese, Korean) languages as well as vertical text layouts, which are common in those scripts.

The library is designed to be modular. Each part of the extraction pipeline can be replaced with a custom implementation, so developers building specialized document processing tools can slot in their own components while reusing the rest of the library.

Installation is a single pip command. A basic use case, extracting all text from a PDF, takes two lines of Python code. A command-line tool called pdf2txt.py is also included for quick extraction without writing any code.

This project is a community-maintained fork of the original PDFMiner, which is no longer actively developed. The README notes that the maintainers have limited availability, so the most reliable way to get a bug fixed is to submit a pull request yourself rather than waiting for a maintainer to handle it.
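As a rough illustration of the two-line extraction and the position information described above, here is a minimal sketch. It assumes pdfminer.six is installed (pip install pdfminer.six) and uses its high-level helpers extract_text and extract_pages. The function names extract_plain_text and iter_text_boxes, and the file name example.pdf in the usage note, are hypothetical and chosen only for this example.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points, as reported by pdfminer.six layout objects.
BBox = Tuple[float, float, float, float]


def extract_plain_text(pdf_path: str) -> str:
    """Return all text in the PDF as a single string."""
    # Imported inside the function so this sketch can be loaded even where
    # pdfminer.six is not installed; normally the import sits at module top.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def iter_text_boxes(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bounding_box, text) for each text block found."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text()
```

Calling extract_plain_text("example.pdf") would return the document text for indexing, while iterating over iter_text_boxes("example.pdf") gives each text block together with its page number and coordinates. Font and color details live on the individual character objects nested inside each block, which this sketch does not descend into.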
Verify against the repo before relying on details.