Extract readable text from a folder of PDF, Word, and spreadsheet files to build a search index
Pre-process uploaded documents in a web app so their content can be stored and searched
Pull text from a mixed-format document collection to feed into an NLP or classification pipeline
Convert a legacy archive of office files into plain text for data mining
Requires system-level tools such as antiword and pdftotext in addition to the Python package.
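A quick way to check that requirement before running any extraction is to look for the tools on `PATH`. The sketch below uses only the Python standard library. The two tool names come from the note above; the helper name `find_missing_tools` is hypothetical, and the full set of tools a given textract version needs is an assumption to confirm against its documentation.

```python
import shutil

# Tool names taken from the note above; the complete list for a given
# textract version is an assumption to check against its documentation.
REQUIRED_TOOLS = ("antiword", "pdftotext")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All listed system tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.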
textract is a Python library for extracting plain text from documents. The project tagline is "extract text from any document, no muss, no fuss," which signals that the goal is a straightforward interface regardless of what file format you hand it. The topics listed on the repository include text mining, data mining, and natural language processing, pointing to use cases where a developer needs readable text as input to a further analysis or processing step, such as searching, summarizing, or classifying a collection of files.

The README is sparse. It gives the project name, the one-line description, and a link to the full documentation hosted on Read the Docs at textract.readthedocs.org. The README does not say which file formats are supported, how to install the library, or how to call it in code; those details have to be looked up on the external documentation site.

The repository has accumulated over 4,500 stars on GitHub, which suggests it has been widely used or referenced in the Python data-processing community over the years. The version and download badges shown in the README indicate that the project is available on PyPI, the standard Python package registry. Beyond those signals, the README does not describe licensing terms, the project's current maintenance status, or contribution guidelines.
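As a rough illustration of the folder-extraction use case listed at the top, the sketch below walks a directory and collects text per file. The README summarized here shows no usage, so the call `textract.process(path)`, which returns bytes, is taken from the library's external documentation and should be confirmed there. The helper names `textract_extractor` and `extract_folder` are hypothetical.

```python
from pathlib import Path


def textract_extractor(path):
    """Extract text from one file with textract (imported lazily)."""
    import textract  # third-party package; assumed installed from PyPI

    # Assumption from the external docs: process() returns bytes.
    return textract.process(str(path)).decode("utf-8", errors="replace")


def extract_folder(folder, extractor=textract_extractor):
    """Map each file under `folder` to its extracted text.

    Files the extractor cannot handle are recorded in `failed`
    instead of stopping the run.
    """
    texts, failed = {}, {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        try:
            texts[str(path)] = extractor(path)
        except Exception as exc:  # unsupported format, missing tool, etc.
            failed[str(path)] = repr(exc)
    return texts, failed
```

Calling `extract_folder("docs/")` would return two dictionaries keyed by file path: one with extracted text and one with error descriptions. Keeping the extractor as a parameter means the directory walk can be tested without textract or its system tools installed.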
Verify against the repo before relying on details.