adbar/trafilatura

★ 5,933 | Python | Audience · researcher | Complexity · 2/5 | License · Apache 2.0 | Setup · easy

Mindmap

mindmap
  root((trafilatura))
    What it does
      Web text extraction
      Noise removal
      Metadata extraction
      Web crawling
    Tech stack
      Python
      pip
      CLI
    Use cases
      Content scraping
      Dataset building
      Research corpora
    Audience
      Researchers
      Data engineers
      NLP practitioners

Things people build with this

USE CASE 1

Extract clean article text from any web page URL and save it as plain text or Markdown for further processing

USE CASE 2

Build a text dataset by crawling an entire website's sitemap or RSS feed and extracting the article content from each page

USE CASE 3

Pull article metadata (title, author, date, tags) along with the text in a single function call for each URL

USE CASE 4

Feed extracted web text into an NLP pipeline or language model training process without manual HTML cleaning
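A minimal sketch of the first and third use cases, assuming trafilatura is installed (`pip install trafilatura`) and recent enough to accept `output_format="markdown"`. `fetch_url` and `extract` are the library's own functions; `slug_for`, `extract_page`, and `save_page` are hypothetical helpers written for this illustration, and the exact set of metadata keys in the JSON output may vary by version.

```python
import json
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Build a filesystem-friendly name from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    cleaned = "".join(c if c.isalnum() else "-" for c in raw)
    return cleaned.strip("-") or "page"


def extract_page(url: str) -> Optional[dict]:
    """Download one page; return metadata and text as a dict, or None."""
    # Imported lazily so this module still loads where trafilatura is absent.
    from trafilatura import extract, fetch_url

    downloaded = fetch_url(url)  # None when the download fails
    if downloaded is None:
        return None
    # JSON output carries the text together with fields such as title,
    # author, and date when with_metadata=True.
    as_json = extract(downloaded, url=url, output_format="json", with_metadata=True)
    if as_json is None:  # nothing worth extracting was found
        return None
    record = json.loads(as_json)
    # Assumption: the installed version supports Markdown output.
    record["markdown"] = extract(downloaded, url=url, output_format="markdown")
    return record


def save_page(url: str, out_dir: str = "corpus") -> Optional[Path]:
    """Write the Markdown version of one page to out_dir and return its path."""
    record = extract_page(url)
    if record is None:
        return None
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    md_path = target / f"{slug_for(url)}.md"
    md_path.write_text(record.get("markdown") or "", encoding="utf-8")
    return md_path
```

Calling `save_page("https://example.org/some-article")` would write one `.md` file under `corpus/`; the URL here is a placeholder. Parsing the JSON output into a dict keeps the sketch independent of the return type of the library's lower-level extraction functions.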

Tech stack

Python · pip

Getting it running

Difficulty · easy | Time to first run · 5 min

Use freely for any purpose, including commercial use, as long as you include the license and copyright notice.

In plain English

Trafilatura is a Python library and command-line tool for pulling text out of web pages. When you fetch a web page, the raw HTML contains a lot of noise: navigation menus, headers, footers, ads, cookie banners, and other repeating elements that are not the actual article or content. Trafilatura strips that noise away and gives you the main text, along with metadata like the title, author, publication date, site name, and tags.

You can point it at a single URL, a list of URLs, or feed it previously downloaded HTML files. It also supports web crawling: give it a sitemap (the XML file that lists all pages on a website) or a news feed in RSS or Atom format, and it will work through the list automatically. It filters out duplicate URLs and paces its downloads so that it does not overload the servers it visits.

Output can be written in several formats depending on what you need: plain text, Markdown, CSV, JSON, standard HTML, or XML. The XML-TEI format is included for researchers who work with text corpora in academic settings. Language detection for the extracted content is available as an optional add-on.

The project has been cited in academic research and ranked at the top of several independent benchmarks comparing open-source text extraction tools. Organizations including HuggingFace, IBM, and Microsoft Research have integrated it into their own projects. The library was originally built for linguistics research at a Berlin academy and has grown into a general-purpose web scraping tool. It is installed through pip (Python's package installer) and can also be used from the command line without writing any code. The license is Apache 2.0.
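The sitemap and feed workflow described above can be sketched roughly as follows. `sitemap_search`, `find_feed_urls`, `fetch_url`, and `extract` are trafilatura functions; `unique_urls`, `discover_links`, `extract_records`, `to_jsonl_lines`, and the `delay` parameter are assumptions made for this illustration. The fixed pause between requests is a simple stand-in and not the library's own download scheduling.

```python
import json
import time
from typing import Iterable, Iterator, List


def unique_urls(urls: Iterable[str]) -> List[str]:
    """Drop blank and repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(u.strip() for u in urls if u.strip()))


def discover_links(homepage: str) -> List[str]:
    """Collect candidate article URLs from a site's sitemap and feeds."""
    # Imported lazily so this module still loads where trafilatura is absent.
    from trafilatura.feeds import find_feed_urls
    from trafilatura.sitemaps import sitemap_search

    return unique_urls([*sitemap_search(homepage), *find_feed_urls(homepage)])


def extract_records(urls: Iterable[str], delay: float = 1.0) -> Iterator[dict]:
    """Yield one dict per page that downloads and extracts successfully."""
    from trafilatura import extract, fetch_url

    for url in urls:
        downloaded = fetch_url(url)  # None when the download fails
        if downloaded is not None:
            as_json = extract(
                downloaded, url=url, output_format="json", with_metadata=True
            )
            if as_json:
                yield json.loads(as_json)
        time.sleep(delay)  # crude politeness pause (assumption, tune as needed)


def to_jsonl_lines(records: Iterable[dict]) -> Iterator[str]:
    """Serialize records as JSON Lines, one object per line."""
    for record in records:
        yield json.dumps(record, ensure_ascii=False) + "\n"


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Stream records to a .jsonl file without holding them all in memory."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(to_jsonl_lines(records))
```

A typical call would be `write_jsonl(extract_records(discover_links("https://example.org")), "corpus.jsonl")`, where the URL is a placeholder. Generators keep memory use flat for large sites, since each page is written out as soon as it is extracted.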

Copy-paste prompts

Prompt 1

Use trafilatura to download and extract the main article text from a URL, then save it as a Markdown file. Show me the Python code.

Prompt 2

I have a list of 500 URLs in a text file. Write a Python script using trafilatura to extract the text from each one and save each article as a separate JSON file with title, date, and content.

Prompt 3

Show me how to use trafilatura from the command line to crawl a website sitemap and extract text from every page listed in it.

Prompt 4

Extract article text with trafilatura and also get the metadata (author, publication date, and tags), returned as a Python dictionary.

Prompt 5

I want to build a small text corpus for NLP research. Use trafilatura to scrape articles from an RSS feed, deduplicate URLs, and save the results as a JSONL file.

Verify against the repo before relying on details.