unstructured is an open source Python library for turning messy documents into clean, structured data that language models can consume. Its core job is to take inputs such as PDFs, HTML pages, Microsoft Word files, images, and many other formats, and produce a consistent output in which text is broken into labelled pieces such as titles, paragraphs, lists, and tables. The README describes this as pre-processing for language model workflows, sometimes called an ETL step (extract, transform, load) for AI.
The README explains that the library is built out of modular functions and connectors that fit together into one system. Partitioning is the central feature: each supported file type has its own partitioner that knows how to extract structured elements from that format. The documentation page linked in the README lists the full set of supported file types. Because every format is mapped to the same output shape, downstream code can ingest data without a custom parser for each source.
Alongside the open source library, the company sells a hosted product called Unstructured Platform. The README pitches it as the path to production, with extra features such as chunking (splitting text into pieces sized for a model), embedding (turning text into numerical vectors), and enrichment for images and tables. It is offered as a low-code interface or an API, and the README points readers to a sales contact form for a demo.
The quick start section gives several ways to install the library. You can run it inside a Docker container by pulling a prebuilt image from the project's image repository and opening a shell in it. You can install the Python package from PyPI for use in your own code. You can also clone the repository and set it up for local development. There is a note about installing with conda on Windows, and a mention that the published images support both x86_64 and Apple silicon machines.
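To make the partitioning idea described earlier concrete, here is a minimal sketch of the pattern: one partitioner per format, all returning the same kind of labelled element. It uses only the Python standard library and does not call unstructured itself. Every name in it (`Element`, `partition_plain_text`, `partition_html`, `partition_document`, and the category labels) is hypothetical and chosen for illustration, and the classification rules are deliberately crude.

```python
import os
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Element:
    """Hypothetical uniform output record: a category label plus its text."""
    category: str
    text: str


def partition_plain_text(raw: str) -> list[Element]:
    """Toy text partitioner: blocks separated by blank lines become elements."""
    elements = []
    for block in raw.split("\n\n"):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        if all(line.startswith("- ") for line in lines):
            # Every line is a bullet, so emit one list item per line.
            elements.extend(Element("ListItem", line[2:].strip()) for line in lines)
        elif len(lines) == 1 and len(lines[0]) < 60 and not lines[0].endswith((".", "!", "?")):
            # A short single line without end punctuation is treated as a title.
            elements.append(Element("Title", lines[0]))
        else:
            elements.append(Element("NarrativeText", " ".join(lines)))
    return elements


class _HtmlCollector(HTMLParser):
    """Collects text from a few known tags and labels it by tag."""
    TAG_TO_CATEGORY = {"h1": "Title", "h2": "Title", "p": "NarrativeText", "li": "ListItem"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._category = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_TO_CATEGORY:
            self._category = self.TAG_TO_CATEGORY[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._category is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.TAG_TO_CATEGORY and self._category is not None:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.elements.append(Element(self._category, text))
            self._category = None


def partition_html(raw: str) -> list[Element]:
    """Toy HTML partitioner built on the standard-library parser."""
    collector = _HtmlCollector()
    collector.feed(raw)
    collector.close()
    return collector.elements


# Registry mapping a file extension to the partitioner for that format.
PARTITIONERS = {".txt": partition_plain_text, ".html": partition_html}


def partition_document(name: str, raw: str) -> list[Element]:
    """Dispatch on the file extension and return format-independent elements."""
    suffix = os.path.splitext(name)[1].lower()
    if suffix not in PARTITIONERS:
        raise ValueError(f"no partitioner registered for {suffix!r}")
    return PARTITIONERS[suffix](raw)
```

With this shape, a caller can loop over the result of `partition_document("page.html", html)` and `partition_document("notes.txt", text)` with the same code, reading only `category` and `text`. The real library covers far more formats and uses much more careful detection; consult its documentation for the actual entry points and element types.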
The top of the README links to a Slack community, a LinkedIn page, and a Contributor Covenant code of conduct. Download counters from pepy.tech show that the package has substantial pip download volume. The license badge in the header indicates an open source license, with the full license text in LICENSE.md inside the repository.
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.