explaingit

unstructured-io/unstructured

Analysis updated 2026-06-24 · repo last pushed 2026-05-18

Stars · 14,696 | Language · HTML | Audience · data | Complexity · 3/5 | Maintained | Setup · moderate

TLDR

Python library that turns PDFs, Word docs, HTML, and images into labelled text elements like titles, paragraphs, lists, and tables, ready to feed into language model pipelines.

Mindmap

mindmap
  root((unstructured))
    Inputs
      PDF
      DOCX
      HTML
      Images
    Outputs
      Labelled elements
      Tables
      Titles and lists
    Use Cases
      RAG ingestion
      Document ETL
      Search index prep
      Training data prep
    Tech Stack
      Python
      Docker
      PyPI
What do people build with it?

USE CASE 1

Pre-process a folder of PDFs into clean chunks for a RAG pipeline

USE CASE 2

Extract tables and titles from Word documents to load into a vector database

USE CASE 3

Build a unified ingest job that handles PDF, HTML, and DOCX with one library

USE CASE 4

Strip boilerplate and parse images out of email attachments before indexing

What is it built with?

Python, Docker

How does it compare?

Repo: unstructured-io/unstructured | shengqiangzhang/examples-of-web-crawlers | terkelg/awesome-creative-coding
Stars: 14,696 | 14,629 | 14,797
Language: HTML | HTML | HTML
Last pushed: 2026-05-18 | 2026-04-01
Maintenance: Maintained | Maintained
Setup difficulty: moderate | moderate | easy
Complexity: 3/5 | 2/5 | 1/5
Audience: data | developer | designer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Some partitioners pull in heavy system dependencies like tesseract and poppler, which is why the Docker image is the easiest path.
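If you install outside Docker, a quick pre-flight check can tell you whether those system tools are visible before you try a partitioner that needs them. The sketch below is only an illustration: the executable names are assumptions (`tesseract` for the Tesseract CLI, `pdftoppm` as one of the utilities that ship with poppler), and `find_missing_tools` is a hypothetical helper, not part of unstructured.

```python
import shutil

# Executable names are assumptions: "tesseract" is the Tesseract OCR CLI and
# "pdftoppm" is one of the command-line utilities shipped with poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All listed system tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.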

In plain English

unstructured is an open source Python library for turning messy documents into clean, structured data that language models can consume. Its core job is to take inputs like PDFs, HTML pages, Microsoft Word files, images, and many other formats, and produce a consistent output where text is broken into labelled pieces such as titles, paragraphs, lists, and tables. The README describes this as pre-processing for language model workflows, sometimes called an ETL step (extract, transform, load) for AI.

The README explains that the library is built out of modular functions and connectors that fit together into one system. Partitioning is the central feature: each supported file type has its own partitioner that knows how to extract structured elements from that format. The documentation page linked in the README lists the full set of supported file types. Because every format yields the same output shape, downstream code can ingest data without custom parsers for every source.

Alongside the open source library, the company sells a hosted product called Unstructured Platform. The README pitches it as the path to production, with extra features like chunking (splitting text into pieces sized for a model), embedding (turning text into numerical vectors), and enrichment for images and tables. It is offered as a low-code interface or an API, and the README points readers to a sales contact form for a demo.

The quick start section gives several ways to install the library. You can run it inside a Docker container, pulling a prebuilt image from the project's image repository and shelling in. You can install the Python package from PyPI for use in your own code. You can also clone the repository and set it up for local development. There is a note about installing with conda on Windows, and a mention that the published images support both x86_64 and Apple silicon machines.

The top of the README links to a Slack community, a LinkedIn page, and a contributor covenant code of conduct. Download counters from pepy.tech show that the package has substantial pip download volume. The license badge in the header indicates an open source license, with the full license text in LICENSE.md inside the repository. The full README is longer than what was shown.
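To make the partitioning idea concrete, here is a minimal sketch that partitions one file and counts how many elements carry each label. It assumes the package is installed and that `partition` from `unstructured.partition.auto` returns elements exposing a `category` attribute; check the project's documentation for your version. `load_elements` and `count_by_label` are hypothetical helper names used only for illustration.

```python
from collections import Counter


def load_elements(path):
    """Partition one file into labelled elements (needs `unstructured` installed)."""
    # Imported lazily so count_by_label stays usable without the library.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def count_by_label(elements):
    """Count elements per label, assuming each one exposes `.category`."""
    return Counter(el.category for el in elements)
```

A typical call would be `count_by_label(load_elements("report.pdf"))`, where `report.pdf` is a placeholder path. Because the counting step only reads `.category`, the same function works regardless of which file format produced the elements.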

Copy-paste prompts

Prompt 1
Use unstructured to partition a PDF and write each labelled element to a JSON lines file
Prompt 2
Build a small RAG pipeline that uses unstructured for parsing and Chroma for storage
Prompt 3
Pull the unstructured Docker image and process a folder of mixed format docs from the host
Prompt 4
Compare unstructured table extraction quality against pdfplumber on the same PDF
Prompt 5
Stream a DOCX through unstructured and split the output into 1000 token chunks for embedding

Frequently asked questions

What is unstructured?

unstructured is a Python library that turns PDFs, Word docs, HTML, and images into labelled text elements like titles, paragraphs, lists, and tables, ready to feed into language model pipelines.

What language is unstructured written in?

GitHub's metadata lists HTML as the repo's main language, although the library itself is Python. The stack also includes Docker.

Is unstructured actively maintained?

Yes. It is rated Maintained, meaning a commit in the last 6 months (last push 2026-05-18).

How hard is unstructured to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is unstructured for?

Mainly a data audience.

Verify against the repo before relying on details.