explaingit

google/langextract

📈 Trending | 36,487 | Python | Audience · developer | Complexity · 2/5 | Active | License | Setup · moderate

TLDR

Python library that uses LLMs to extract structured data from unstructured text documents, with source grounding and interactive visualization.

Mindmap

mindmap
  root((LangExtract))
    What it does
      Extract entities from text
      Map to source locations
      Visualize results
    How it works
      Plain English prompts
      Few-shot examples
      LLM processing
    Features
      Chunk long documents
      Parallel processing
      Multi-pass extraction
    Supported models
      Google Gemini
      OpenAI models
      Ollama local models
    Use cases
      Clinical notes
      Legal contracts
      Literary analysis

Things people build with this

USE CASE 1

Extract medications and dosages from clinical notes and automatically populate a database.

USE CASE 2

Pull key terms, obligations, and parties from legal contracts for compliance review.

USE CASE 3

Identify characters, relationships, and plot events from literary texts for analysis.

USE CASE 4

Extract structured data from domain-specific documents without training a custom model.

Tech stack

Python, Google Gemini API, OpenAI API, Ollama

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires an API key for Google Gemini or OpenAI, or a local Ollama setup.
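
A minimal sketch of checking that requirement before a run. The environment variable names (`LANGEXTRACT_API_KEY`, `OPENAI_API_KEY`) and the helper `find_api_key` are assumptions for illustration, not confirmed by this page; check the repo's README for the names the library actually reads.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; verify them against the repo before relying on them.
KEY_VARS = {
    "gemini": "LANGEXTRACT_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def find_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the API key for a hosted provider, or None for local Ollama."""
    env = os.environ if env is None else env
    if provider == "ollama":
        # Local models need a running Ollama server rather than a key.
        return None
    var = KEY_VARS[provider]
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"Set {var} before using the {provider} backend")
    return key
```

Failing early with a named variable tends to be easier to debug than an authentication error surfacing in the middle of an extraction run.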

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license also includes a patent grant.

In plain English

LangExtract is a Python library from Google that uses large language models (LLMs) to pull structured information out of unstructured text documents. "Structured information" means organized, categorized data, like a table of named entities, a list of medications with dosages, or characters and their relationships, drawn from a free-form document like a clinical note, a legal contract, or a literary text. This bridges the gap between the unstructured world (documents written in natural language) and the structured world (databases, spreadsheets, analytics pipelines) that applications need.

The library works by letting you describe your extraction task using plain English instructions and a few hand-crafted examples that show the model what you want. You provide a text document, your prompt description, and your examples, and LangExtract sends everything to an LLM and returns the extracted entities as structured Python objects.

A key feature is precise source grounding: every extracted item is mapped back to its exact character position in the original text. This lets the library generate an interactive HTML visualization where you can see each extraction highlighted in context, making it easy to verify that the model stayed faithful to the source rather than hallucinating.

For long documents, the library splits the text into manageable chunks, processes the chunks in parallel, and runs multiple passes to improve recall (the proportion of relevant items actually found).

It supports Google Gemini models by default, including Gemini Flash and Gemini Pro, but also works with OpenAI models and local open-source models running via Ollama. You would use LangExtract when you need to extract entities from clinical notes, radiology reports, contracts, or any domain-specific text without fine-tuning a model. The library is written in Python, primarily targets the Gemini API, and is distributed as a pip package.
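
To make the idea of source grounding concrete, here is a small standard-library sketch. It is not LangExtract's implementation or API: the names `Grounded`, `ground`, and `highlight` are hypothetical, and the matching is a plain exact-string search, which is an assumption made to keep the example short. It takes (label, text) pairs of the kind a model might return, attaches character offsets, and flags anything that cannot be found in the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Grounded:
    label: str                        # extraction class, e.g. "medication"
    text: str                         # the string the model returned
    span: Optional[Tuple[int, int]]   # (start, end) offsets in the source, or None


def ground(source: str, items: List[Tuple[str, str]]) -> List[Grounded]:
    """Attach character offsets to (label, text) pairs by exact string search."""
    cursor = 0
    out = []
    for label, text in items:
        # Search forward first so repeated mentions map to successive positions.
        start = source.find(text, cursor)
        if start == -1:
            start = source.find(text)
        if start == -1:
            # Not present in the source: a candidate hallucination.
            out.append(Grounded(label, text, None))
            continue
        end = start + len(text)
        out.append(Grounded(label, text, (start, end)))
        cursor = end
    return out


def highlight(source: str, grounded: List[Grounded]) -> str:
    """Wrap each grounded span in [label: ...] markers for a quick visual check."""
    spans = sorted((g.span, g.label) for g in grounded if g.span is not None)
    parts, pos = [], 0
    for (start, end), label in spans:
        if start < pos:
            continue  # skip spans that overlap one already emitted
        parts.append(source[pos:start])
        parts.append(f"[{label}: {source[start:end]}]")
        pos = end
    parts.append(source[pos:])
    return "".join(parts)


note = "Patient was given 250 mg amoxicillin, then 400 mg ibuprofen."
items = [
    ("dosage", "250 mg"),
    ("medication", "amoxicillin"),
    ("medication", "ibuprofen"),
    ("medication", "aspirin"),  # never mentioned in the note
]
grounded = ground(note, items)
print(highlight(note, grounded))
print([g.text for g in grounded if g.span is None])
```

Because every grounded item carries offsets, slicing the source with those offsets must reproduce the extracted text; an item with no span is exactly the kind of output a reviewer would want to inspect.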

Copy-paste prompts

Prompt 1
Use LangExtract to extract all medication names and dosages from this clinical note, with source locations highlighted in an interactive HTML view.
Prompt 2
Show me how to set up LangExtract with Google Gemini to extract named entities (people, organizations, locations) from a batch of news articles.
Prompt 3
How do I use LangExtract to extract contract terms and obligations from a PDF, then verify the extractions are grounded in the original text?
Prompt 4
Extract character relationships and plot events from a novel chapter using LangExtract with few-shot examples, then visualize the results.
Prompt 5
Configure LangExtract to process a long document by chunking it, running parallel extraction, and merging results to improve recall.
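
The workflow Prompt 5 asks for (chunk, extract in parallel, repeat, merge) can be sketched with the standard library alone. This is an illustration of the general technique under stated assumptions, not LangExtract's code: `chunk`, `extract_document`, and `toy_extractor` are hypothetical names, the chunker simply breaks at sentence ends, and `toy_extractor` is a regex stand-in for an LLM call that deliberately misses items on its first pass.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, text); offsets are relative to the input


def chunk(text: str, max_chars: int) -> List[Tuple[int, str]]:
    """Split text into (offset, piece) pairs, breaking after a sentence when possible."""
    pieces, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(". ", start, end)
            if cut > start:
                end = cut + 2
        pieces.append((start, text[start:end]))
        start = end
    return pieces


def extract_document(
    text: str,
    extractor: Callable[[str, int], List[Span]],
    max_chars: int = 50,
    passes: int = 2,
    workers: int = 4,
) -> List[Span]:
    """Run `extractor` over every chunk in parallel, `passes` times, and merge by span."""
    pieces = chunk(text, max_chars)
    merged: Set[Span] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pass_no in range(passes):
            results = pool.map(lambda p, n=pass_no: (p[0], extractor(p[1], n)), pieces)
            for offset, local_spans in results:
                for start, end, found in local_spans:
                    # Shift chunk-local offsets back into whole-document offsets.
                    merged.add((start + offset, end + offset, found))
    return sorted(merged)


def toy_extractor(piece: str, pass_no: int) -> List[Span]:
    """Stand-in for an LLM call: finds dosages like '250 mg', dropping some on pass 0."""
    found = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+ mg", piece)]
    return found[:1] if pass_no == 0 else found


doc = ("Take 250 mg amoxicillin and 400 mg ibuprofen. "
       "Later add 81 mg aspirin. Stop after 5 days.")
one_pass = extract_document(doc, toy_extractor, passes=1)
two_pass = extract_document(doc, toy_extractor, passes=2)
print(len(one_pass), len(two_pass))
```

Merging on (start, end, text) means an item found in several passes is kept once, while an item any single pass misses can still be recovered by another, which is how extra passes raise recall in this sketch.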

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.