Extract medications and dosages from clinical notes and automatically populate a database.
Pull key terms, obligations, and parties from legal contracts for compliance review.
Identify characters, relationships, and plot events from literary texts for analysis.
Extract structured data from domain-specific documents without training a custom model.
Requires an API key for Google Gemini or OpenAI, or a local Ollama setup.
LangExtract is a Python library from Google that uses large language models (LLMs) to pull structured information out of unstructured text documents. "Structured information" means organized, categorized data, such as a table of named entities, a list of medications with dosages, or characters and their relationships, drawn from a free-form document like a clinical note, a legal contract, or a literary text. This bridges the gap between the unstructured world (documents written in natural language) and the structured world (databases, spreadsheets, analytics pipelines) that applications need.

You describe the extraction task with plain English instructions and a few hand-crafted examples that show the model what you want. You provide a text document, your prompt description, and your examples; LangExtract sends everything to an LLM and returns the extracted entities as structured Python objects. A key feature is precise source grounding: every extracted item is mapped back to its exact character position in the original text. This lets the library generate an interactive HTML visualization in which each extraction is highlighted in context, making it easy to verify that the model stayed faithful to the source rather than hallucinating.

For long documents, the library splits the text into manageable chunks, processes the chunks in parallel, and can run multiple passes to improve recall (the proportion of relevant items actually found). It uses Google Gemini models by default, including Gemini Flash and Gemini Pro, and also works with OpenAI models and local open-source models running via Ollama.

You would use LangExtract when you need to extract entities from clinical notes, radiology reports, contracts, or any other domain-specific text without fine-tuning a model. The library is written in Python, primarily targets the Gemini API, and is distributed as a pip package.
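The chunking and source-grounding ideas described above can be illustrated with a small standard-library sketch. This is not LangExtract's implementation or API: `fake_extract`, `KNOWN_TERMS`, `GroundedExtraction`, `chunk_lines`, `ground`, and `extract_grounded` are hypothetical names invented for illustration, and a keyword lookup stands in for the LLM call. The sketch only shows how items found in separate chunks can be mapped back to character offsets in the full document, and how an item that does not appear in the source can be dropped.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedExtraction:
    label: str  # category of the item, e.g. "medication"
    text: str   # exact span copied from the source
    start: int  # inclusive character offset in the full document
    end: int    # exclusive character offset in the full document


# Hypothetical vocabulary standing in for what a model might return.
KNOWN_TERMS = {
    "aspirin": "medication",
    "81 mg": "dosage",
    "metformin": "medication",
    "500 mg": "dosage",
}


def chunk_lines(text):
    """Return (offset, chunk) pairs, treating each line as one chunk."""
    chunks, offset = [], 0
    for line in text.splitlines(keepends=True):
        chunks.append((offset, line))
        offset += len(line)
    return chunks


def fake_extract(chunk):
    """Stand-in for the LLM call: return (label, text) pairs for one chunk."""
    return [(label, term) for term, label in KNOWN_TERMS.items() if term in chunk]


def ground(offset, chunk, items):
    """Map each extracted string back to absolute character positions."""
    grounded = []
    for label, item_text in items:
        pos = chunk.find(item_text)
        if pos == -1:
            continue  # not present in the source chunk, so it is discarded
        start = offset + pos
        grounded.append(GroundedExtraction(label, item_text, start, start + len(item_text)))
    return grounded


def extract_grounded(text, max_workers=4):
    """Chunk the text, 'extract' from chunks in parallel, then ground results."""
    chunks = chunk_lines(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(lambda c: fake_extract(c[1]), chunks))
    results = []
    for (offset, chunk), items in zip(chunks, per_chunk):
        results.extend(ground(offset, chunk, items))
    return sorted(results, key=lambda e: e.start)


note = "Patient takes aspirin 81 mg daily.\nStarted metformin 500 mg twice a day.\n"
extractions = extract_grounded(note)
for e in extractions:
    # Slicing the original text with the stored offsets recovers the span.
    print(e.label, repr(note[e.start:e.end]), e.start, e.end)
```

Because every result carries offsets into the original string, a reviewer (or a highlighting tool) can check each item against the source text directly. A real chunker would likely respect sentence boundaries and size limits rather than splitting on newlines.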
Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.