Extract clean article text from any web page URL and save it as plain text or Markdown for further processing
Build a text dataset by crawling an entire website's sitemap or RSS feed and extracting the article content from each page
Pull article metadata (title, author, date, tags) along with the text in a single function call for each URL
Feed extracted web text into an NLP pipeline or language model training process without manual HTML cleaning
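The first and third use cases above can be sketched with the `trafilatura` Python package described below. This is a minimal sketch, not the project's recommended recipe: `save_article` is a hypothetical helper, and the `fetch` and `extract` parameters exist only so the function can be exercised offline with stand-ins. It assumes a recent release in which `trafilatura.extract` accepts `output_format="markdown"` and `with_metadata=True`.

```python
from pathlib import Path
from typing import Callable, Optional


def save_article(
    url: str,
    out_path: Path,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[..., Optional[str]]] = None,
) -> bool:
    """Download one page, extract its main text as Markdown, write it to disk.

    `fetch` and `extract` default to trafilatura.fetch_url and
    trafilatura.extract. Returns False when nothing could be saved.
    """
    if fetch is None or extract is None:
        # Imported lazily so that stand-ins work without the package installed.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    html = fetch(url)
    if html is None:  # the download failed
        return False

    # with_metadata=True asks for title, author, date, etc. alongside the text.
    text = extract(html, url=url, output_format="markdown", with_metadata=True)
    if text is None:  # no main content was identified
        return False

    out_path.write_text(text, encoding="utf-8")
    return True
```

Calling `save_article("https://example.org/post", Path("post.md"))` with the package installed would use the real download and extraction functions; the URL here is a placeholder.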
Trafilatura is a Python library and command-line tool for pulling text out of web pages. The raw HTML of a fetched page contains a lot of noise: navigation menus, headers, footers, ads, cookie banners, and other repeating elements that are not part of the actual article. Trafilatura strips that noise away and returns the main text, along with metadata such as the title, author, publication date, site name, and tags.

You can point it at a single URL, a list of URLs, or previously downloaded HTML files. It also supports web crawling: give it a sitemap (the XML file that lists the pages on a website) or a news feed in RSS or Atom format, and it will work through the list automatically. It filters out duplicate URLs and paces its downloads so that it does not overload the servers it visits.

Output can be written as plain text, Markdown, CSV, JSON, HTML, or XML. The XML-TEI format is included for researchers who work with text corpora in academic settings. Language detection for the extracted content is available as an optional add-on.

The project has been cited in academic research and has ranked at the top of several independent benchmarks comparing open-source text extraction tools. Organizations including HuggingFace, IBM, and Microsoft Research have integrated it into their own projects. The library was originally built for linguistics research at the Berlin-Brandenburg Academy of Sciences and has grown into a general-purpose web scraping tool.

It is installable through pip (Python's package installer) and can be used from the command line without writing any code. The license is Apache 2.0.
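The sitemap and feed workflow described above could look roughly like the following. This is a sketch under stated assumptions: `discover_urls` and `build_dataset` are hypothetical helpers, the fixed `delay` between downloads is a simple stand-in for the library's own throttling rather than a description of it, and the `fetch` and `extract` parameters exist only so the loop can be run offline. The library calls used are `sitemap_search`, `find_feed_urls`, `fetch_url`, and `extract`.

```python
import json
import time
from typing import Callable, Iterable, Iterator, Optional


def discover_urls(homepage: str) -> list[str]:
    """Collect candidate page URLs from a site's sitemap and feeds."""
    from trafilatura.feeds import find_feed_urls
    from trafilatura.sitemaps import sitemap_search

    urls = sitemap_search(homepage) + find_feed_urls(homepage)
    return list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order


def build_dataset(
    urls: Iterable[str],
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[..., Optional[str]]] = None,
    delay: float = 2.0,
) -> Iterator[dict]:
    """Yield one record (text plus metadata) per page that could be extracted."""
    if fetch is None or extract is None:
        # Imported lazily so that stand-ins work without the package installed.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    for index, url in enumerate(dict.fromkeys(urls)):  # skip repeated URLs
        if index:
            time.sleep(delay)  # assumed politeness pause between requests
        html = fetch(url)
        if html is None:  # download failed, move on
            continue
        record = extract(html, url=url, output_format="json", with_metadata=True)
        if record is None:  # no main content was identified
            continue
        yield json.loads(record)
```

With the package installed, `for row in build_dataset(discover_urls("https://example.org")): ...` would stream records that can be written out as JSON Lines; the homepage URL is a placeholder.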
Verify against the repo before relying on details.