Analysis updated 2026-07-03
Build a pipeline that reads hundreds of research papers and extracts key findings from each into a structured table using an AI model.
Use the DocWrangler browser playground to prototype and refine prompts for a document classification task before running it at scale.
Extract topic clusters from a large set of YouTube transcripts by wiring a topic-extraction step into a DocETL config pipeline.
Set up a local DocETL instance with Docker to process sensitive documents without sending them to a hosted service.
| | ucbepic/docetl | cuemacro/finmarketpy | jcrist/msgspec |
|---|---|---|---|
| Stars | 3,751 | 3,752 | 3,752 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | easy |
| Complexity | 3/5 | 3/5 | 2/5 |
| Audience | researcher | data | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires an API key for OpenAI or a compatible provider. Running DocWrangler locally also requires Docker for the two-service setup.
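The two prerequisites above can be checked before a first run. The following is a minimal sketch, not part of DocETL: the `preflight` helper is hypothetical, and the `OPENAI_API_KEY` variable name is an assumption based on the common OpenAI convention, so confirm the exact name in the repo's setup instructions.

```python
import os
import shutil


def preflight(env: dict, need_docker: bool = False) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Assumed variable name (common OpenAI convention), not confirmed by the source.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # Docker is only needed when running DocWrangler locally.
    if need_docker and shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight(dict(os.environ), need_docker=True):
        print("missing:", problem)
```

Passing the environment in as a dictionary keeps the check easy to test without touching the real shell environment.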
DocETL is a tool for building data processing pipelines that use AI language models to work through large collections of documents. Instead of writing custom code to send documents to an AI and collect the results, you define a pipeline as a configuration, and DocETL runs it, handles errors, and manages the flow of data between steps. It comes from the EPIC research lab at UC Berkeley and is accompanied by a published paper.

There are two ways to use it. The first is DocWrangler, an interactive browser-based playground where you can experiment with different prompts and pipeline steps and immediately see what the output looks like. DocWrangler is available as a hosted version at docetl.org, and can also be run locally using Docker. The second option is a Python package installed via pip, which lets you run finalized pipelines from code or the command line in a production context.

Running DocETL requires an API key for an AI language model. The default configuration expects an OpenAI key, but the system also supports other providers including AWS Bedrock. When running DocWrangler locally there are two separate configuration files: one for the Python backend that executes the pipeline, and one for the frontend UI. The README includes detailed instructions for both Docker and manual setup paths.

The project supports a range of community-built examples, including tools for generating conversations from documents, converting text to speech, and extracting topics from YouTube transcripts. Educational resources from the team cover topics like how to improve output quality using a technique called gleaning and how the entity resolution operator works. To get help writing a pipeline definition, the README suggests using an AI coding assistant with the project's documentation as context. A basic test suite is included for contributors who want to verify their local setup, and the cost of running it is noted as under one cent with OpenAI.
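The idea of describing a pipeline as configuration, with a runner that passes data between steps and isolates failures, can be illustrated in a few lines of plain Python. This is a toy sketch of the general pattern only, not DocETL's implementation or API: the `PIPELINE` layout, `run_pipeline`, and `stub_model` are all hypothetical names, and the stub stands in for a real language-model call.

```python
from typing import Callable

# Hypothetical pipeline description: plain data, no custom control flow.
PIPELINE = {
    "steps": [
        {"name": "summarize", "prompt": "Summarize: {text}", "output_key": "summary"},
        {"name": "classify", "prompt": "Give a topic for: {summary}", "output_key": "topic"},
    ]
}


def run_pipeline(
    config: dict, docs: list[dict], model: Callable[[str], str]
) -> tuple[list[dict], list[dict]]:
    """Run every step on every document; collect failures instead of aborting."""
    results, errors = [], []
    for doc in docs:
        record = dict(doc)  # copy so the input documents are not mutated
        try:
            for step in config["steps"]:
                # Each step can read fields written by earlier steps.
                prompt = step["prompt"].format(**record)
                record[step["output_key"]] = model(prompt)
        except Exception as exc:  # one bad document should not stop the run
            errors.append({"doc": doc, "error": repr(exc)})
            continue
        results.append(record)
    return results, errors


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: echoes the first three words after the colon."""
    body = prompt.split(":", 1)[1]
    return " ".join(body.split()[:3])


docs = [
    {"text": "Large language models can process documents at scale."},
    {"title": "This document is missing the text field"},
]
results, errors = run_pipeline(PIPELINE, docs, stub_model)
print(len(results), "succeeded,", len(errors), "failed")
```

Because the steps are data, changing a prompt or adding a step means editing the configuration and leaving the runner untouched, which is the property the description above attributes to DocETL's config-driven approach.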
DocETL is a UC Berkeley tool for building AI-powered document processing pipelines using a config file rather than custom code, with a browser playground for testing prompts and a Python package for production runs.
Mainly Python. The stack also includes Docker and the OpenAI API.
No specific license terms were mentioned in the explanation.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.