google/langextract

Analysis updated 2026-05-18

★ 36,390 | Python | Audience · developer | Complexity · 2/5 | License | Setup · moderate

Mindmap

mindmap
  root((LangExtract))
    What it does
      Extract entities from text
      Map to source locations
      Visualize results
    How it works
      Plain English prompts
      Few-shot examples
      LLM processing
    Features
      Chunk long documents
      Parallel processing
      Multi-pass extraction
    Supported models
      Google Gemini
      OpenAI models
      Ollama local models
    Use cases
      Clinical notes
      Legal contracts
      Literary analysis

What do people build with it?

USE CASE 1

Extract medications and dosages from clinical notes and automatically populate a database.

USE CASE 2

Pull key terms, obligations, and parties from legal contracts for compliance review.

USE CASE 3

Identify characters, relationships, and plot events from literary texts for analysis.

USE CASE 4

Extract structured data from domain-specific documents without training a custom model.

What is it built with?

Python, Google Gemini API, OpenAI API, Ollama

How does it compare?

	google/langextract	myshell-ai/openvoice	hankcs/hanlp
Stars	36,390	36,463	36,296
Language	Python	Python	Python
Setup difficulty	moderate	hard	moderate
Complexity	2/5	4/5	3/5
Audience	developer	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Requires an API key for Google Gemini or OpenAI, or a local Ollama setup.

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license also includes a patent grant.

In plain English

LangExtract is a Python library from Google that uses large language models (LLMs) to pull structured information out of unstructured text. "Structured information" means organized, categorized data, such as a table of named entities, a list of medications with dosages, or characters and their relationships, drawn from a free-form document like a clinical note, a legal contract, or a literary text. This bridges the gap between unstructured documents written in natural language and the structured formats (databases, spreadsheets, analytics pipelines) that applications need.

You describe the extraction task in plain English instructions and supply a few hand-crafted examples that show the model what you want. Given a text document, the prompt description, and the examples, LangExtract sends everything to an LLM and returns the extracted entities as structured Python objects.

A key feature is precise source grounding: every extracted item is mapped back to its exact character position in the original text. The library uses this to generate an interactive HTML visualization that highlights each extraction in context, making it easy to verify that the model stayed faithful to the source rather than hallucinating.

For long documents, the library splits the text into manageable chunks, processes the chunks in parallel, and can run multiple passes to improve recall (the proportion of relevant items actually found). It uses Google Gemini models by default, including Gemini Flash and Gemini Pro, and also works with OpenAI models and local open-source models running via Ollama.

You would use LangExtract when you need to extract entities from clinical notes, radiology reports, contracts, or other domain-specific text without fine-tuning a model. It is written in Python, primarily targets the Gemini API, and is distributed as a pip package.
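The workflow described above (a prompt, a few examples, an LLM call, and grounded results) can be sketched roughly as follows. This is a minimal sketch based on the usage shown in the project's README at the time of writing, not a definitive reference: the model id `gemini-2.5-flash`, the example note, and the helper names `run_extraction` and `describe` are assumptions chosen for illustration, and the call is expected to need a Gemini API key in the environment. Check names and arguments against the repo before relying on them.

```python
import textwrap

# Plain-English task description (illustrative wording).
PROMPT = textwrap.dedent("""\
    Extract medications and their dosages in order of appearance.
    Use the exact text from the note; do not paraphrase.""")

# One hand-crafted example: source text plus the spans we expect back.
EXAMPLE_TEXT = "Patient was given 250 mg amoxicillin twice daily."
EXAMPLE_SPANS = [("dosage", "250 mg"), ("medication", "amoxicillin")]


def run_extraction(note):
    """Send the note, prompt, and example to an LLM via LangExtract."""
    # Imported lazily so this module loads without the package installed.
    import langextract as lx  # pip install langextract

    examples = [
        lx.data.ExampleData(
            text=EXAMPLE_TEXT,
            extractions=[
                lx.data.Extraction(extraction_class=cls, extraction_text=txt)
                for cls, txt in EXAMPLE_SPANS
            ],
        )
    ]
    return lx.extract(
        text_or_documents=note,
        prompt_description=PROMPT,
        examples=examples,
        model_id="gemini-2.5-flash",  # assumed model id; may change over time
    )


def describe(result):
    """Format each extraction with its character span in the source text."""
    lines = []
    for item in result.extractions:
        span = item.char_interval  # may be None if the text was not aligned
        if span is not None and span.start_pos is not None:
            where = f"[{span.start_pos}:{span.end_pos}]"
        else:
            where = "[unaligned]"
        lines.append(f"{item.extraction_class}: {item.extraction_text!r} {where}")
    return lines
```

Calling `describe(run_extraction(note))` would list each extracted item next to its character offsets, which is the information the HTML visualization uses to highlight extractions in context.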

Copy-paste prompts

Prompt 1

Use LangExtract to extract all medication names and dosages from this clinical note, with source locations highlighted in an interactive HTML view.

Prompt 2

Show me how to set up LangExtract with Google Gemini to extract named entities (people, organizations, locations) from a batch of news articles.

Prompt 3

How do I use LangExtract to extract contract terms and obligations from a PDF, then verify the extractions are grounded in the original text?

Prompt 4

Extract character relationships and plot events from a novel chapter using LangExtract with few-shot examples, then visualize the results.

Prompt 5

Configure LangExtract to process a long document by chunking it, running parallel extraction, and merging results to improve recall.
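
The long-document workflow named in Prompt 5 (chunk, extract in parallel, merge across passes) can be illustrated without any API access. The sketch below is plain Python and is not LangExtract's implementation: `PASS_PATTERNS`, `chunk_text`, `extract_chunk`, and `extract_document` are hypothetical stand-ins that use regular expressions instead of an LLM, and each "pass" deliberately finds a different kind of item to mimic run-to-run variation.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an LLM: each pass uses a different pattern, so different
# passes surface different items (an assumption made for illustration).
PASS_PATTERNS = [
    ("dosage", re.compile(r"\d+ mg")),
    ("medication", re.compile(r"aspirin|metformin")),
]


def chunk_text(text):
    """Split into (offset, sentence) chunks; offsets map spans back to the source."""
    return [(m.start(), m.group()) for m in re.finditer(r"[^.]+\.?", text)]


def extract_chunk(job):
    """Hypothetical extractor returning (label, start, end) in document coordinates."""
    offset, chunk, pass_index = job
    label, pattern = PASS_PATTERNS[pass_index % len(PASS_PATTERNS)]
    return [(label, offset + m.start(), offset + m.end())
            for m in pattern.finditer(chunk)]


def extract_document(text, passes=2, workers=4):
    """Run every pass over every chunk in parallel and merge the results."""
    found = set()
    chunks = chunk_text(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pass_index in range(passes):
            jobs = [(offset, chunk, pass_index) for offset, chunk in chunks]
            for spans in pool.map(extract_chunk, jobs):
                found.update(spans)  # set union drops repeated finds
    return sorted(found, key=lambda span: span[1])


note = "Start aspirin 81 mg daily. Continue metformin 500 mg twice daily."
for label, start, end in extract_document(note):
    print(label, start, end, repr(note[start:end]))
```

Because every span is stored in document coordinates, results from different chunks and passes can be merged and still point at the right place in the original text. In the library itself, the README shows comparable knobs as the `max_char_buffer`, `max_workers`, and `extraction_passes` arguments to `lx.extract`; verify those names against the repo.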

Frequently asked questions

What is langextract?

A Python library that uses LLMs to extract structured data from unstructured text documents, with source grounding and interactive visualization.

What language is langextract written in?

Mainly Python. The stack also includes the Google Gemini API, the OpenAI API, and Ollama.

What license does langextract use?

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license also includes a patent grant.

How hard is langextract to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is langextract for?

Mainly developers.

Verify against the repo before relying on details.