explaingit

ucbepic/docetl

Analysis updated 2026-07-03

3,751 stars · Python · Audience: researcher · Complexity: 3/5 · Setup: moderate

TLDR

DocETL is a UC Berkeley tool for building AI-powered document processing pipelines using a config file rather than custom code, with a browser playground for testing prompts and a Python package for production runs.

Mindmap

mindmap
  root((DocETL))
    What it does
      AI document pipelines
      Config-based steps
      Error handling
    Interfaces
      DocWrangler browser UI
      Python package
      Command line
    Setup Options
      OpenAI API key
      Docker local run
      AWS Bedrock option
    Use Cases
      Document extraction
      Topic analysis
      Text conversion


What do people build with it?

USE CASE 1

Build a pipeline that reads hundreds of research papers and extracts key findings from each into a structured table using an AI model.

USE CASE 2

Use the DocWrangler browser playground to prototype and refine prompts for a document classification task before running it at scale.

USE CASE 3

Extract topic clusters from a large set of YouTube transcripts by wiring a topic-extraction step into a DocETL config pipeline.

USE CASE 4

Set up a local DocETL instance with Docker to process sensitive documents without sending them to a hosted service.

What is it built with?

Python, Docker, OpenAI API

How does it compare?

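| | ucbepic/docetl | cuemacro/finmarketpy | jcrist/msgspec |
| --- | --- | --- | --- |
| Stars | 3,751 | 3,752 | 3,752 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | easy |
| Complexity | 3/5 | 3/5 | 2/5 |
| Audience | researcher | data | developer |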

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty: moderate · Time to first run: 30 min

Requires an API key for OpenAI or a compatible provider. Running DocWrangler locally also requires Docker for the two-service setup.
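A quick preflight check for these two requirements might look like the sketch below. It assumes the key is supplied through the conventional `OPENAI_API_KEY` environment variable and that Docker is reachable as `docker` on the PATH. Both are assumptions for illustration, and the function name `missing_prerequisites` is hypothetical, so the README's configuration files remain the authority.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which, local_ui=True):
    """Return the names of prerequisites that appear to be missing.

    env      -- mapping of environment variables (defaults to os.environ)
    which    -- lookup function for executables (defaults to shutil.which)
    local_ui -- True when DocWrangler will be run locally, which needs Docker
    """
    env = os.environ if env is None else env
    missing = []
    # Assumption: the OpenAI key is read from OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")
    # Docker only matters for the local DocWrangler setup.
    if local_ui and which("docker") is None:
        missing.append("docker")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("ready" if not problems else f"missing: {', '.join(problems)}")
```

Passing `local_ui=False` skips the Docker check, which fits the case where only the Python package is used.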

No specific license terms were mentioned in the explanation.

In plain English

DocETL is a tool for building data processing pipelines that use AI language models to work through large collections of documents. Instead of writing custom code to send documents to an AI and collect the results, you define a pipeline as a configuration, and DocETL runs it, handles errors, and manages the flow of data between steps. It comes from the EPIC research lab at UC Berkeley and is accompanied by a published paper.

There are two ways to use it. The first is DocWrangler, an interactive browser-based playground where you can experiment with different prompts and pipeline steps and immediately see what the output looks like. DocWrangler is available as a hosted version at docetl.org, and can also be run locally using Docker. The second option is a Python package installed via pip, which lets you run finalized pipelines from code or the command line in a production context.

Running DocETL requires an API key for an AI language model. The default configuration expects an OpenAI key, but the system also supports other providers including AWS Bedrock. When running DocWrangler locally there are two separate configuration files: one for the Python backend that executes the pipeline, and one for the frontend UI. The README includes detailed instructions for both Docker and manual setup paths.

The project supports a range of community-built examples, including tools for generating conversations from documents, converting text to speech, and extracting topics from YouTube transcripts. Educational resources from the team cover topics like how to improve output quality using a technique called gleaning and how the entity resolution operator works.

To get help writing a pipeline definition, the README suggests using an AI coding assistant with the project's documentation as context. A basic test suite is included for contributors who want to verify their local setup, and the cost of running it is noted as under one cent with OpenAI.
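To make the config-driven idea concrete, here is a toy sketch in plain Python of a pipeline described as data and a small runner that passes each document through the steps and records per-document errors. It does not use or imitate DocETL's actual API or config schema. The names `PIPELINE`, `run_pipeline`, and `fake_model` are hypothetical, and `fake_model` is a stand-in for a real language-model call.

```python
def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a language-model call (just upper-cases)."""
    return prompt.upper()


# The pipeline is plain data: an ordered list of named steps, each with a
# prompt template. Changing behavior means editing this config, not the runner.
PIPELINE = {
    "steps": [
        {"name": "summarize", "prompt": "summary of: {text}"},
        {"name": "tag", "prompt": "tags for: {text}"},
    ]
}


def run_pipeline(config, docs, model=fake_model):
    """Run every document through all steps; collect failures separately."""
    results, errors = [], []
    for doc_id, text in docs.items():
        current = text
        try:
            for step in config["steps"]:
                if not current:
                    raise ValueError(f"empty input at step {step['name']}")
                # The output of one step becomes the input of the next.
                current = model(step["prompt"].format(text=current))
            results.append({"id": doc_id, "output": current})
        except Exception as exc:
            # One bad document does not stop the rest of the batch.
            errors.append({"id": doc_id, "error": str(exc)})
    return results, errors


if __name__ == "__main__":
    ok, failed = run_pipeline(PIPELINE, {"a": "hello", "b": ""})
    print(ok)
    print(failed)
```

The point of the design is that the runner stays generic while the list of steps carries all the task-specific detail, which is the division of labor the description above attributes to DocETL.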

Copy-paste prompts

Prompt 1
I have 500 research abstracts I want to process with DocETL to extract the main hypothesis and methodology from each. Write a minimal pipeline config that does this using an OpenAI key.
Prompt 2
Show me how to run DocWrangler locally with Docker, connect it to my OpenAI API key, and test a prompt that summarizes customer feedback documents.
Prompt 3
I want to use DocETL's entity resolution operator to deduplicate company names across a large set of documents. Give me a sample pipeline config and explain how the operator works.
Prompt 4
Write a DocETL pipeline config that reads a folder of text files, runs each through an AI model to extract named entities, and saves the results as a JSON file.

Frequently asked questions

What is docetl?

DocETL is a UC Berkeley tool for building AI-powered document processing pipelines using a config file rather than custom code, with a browser playground for testing prompts and a Python package for production runs.

What language is docetl written in?

Mainly Python. The stack also includes Docker and the OpenAI API.

What license does docetl use?

No specific license terms were mentioned in the explanation.

How hard is docetl to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is docetl for?

Mainly researchers.


Verify against the repo before relying on details.