explaingit

codelucas/newspaper

15,043 · Python

TLDR

Newspaper3k is a Python library for pulling news articles off websites and turning them into clean, structured data.

In plain English

Newspaper3k is a Python library for pulling news articles off websites and turning them into clean, structured data. Given the URL of a news story, it downloads the page, extracts the article body from the surrounding clutter of menus and ads, and exposes useful fields: the author list, the publish date, the main text, the top image, embedded videos, and so on. There is also a build mode where you point it at a whole news site such as cnn.com, and it discovers all the article URLs and category pages on that site.

The library is written in pure Python 3, using the requests library as inspiration for its simple API and the lxml library for fast HTML parsing. An older Python 2 version exists on a separate branch, but the README labels it deprecated and buggy.

A typical workflow shown in the README is short: create an Article object from a URL, call download then parse, and read the fields. There is also a separate nlp method that produces a list of keywords and a short text summary from the article body.

Newspaper3k is multi-lingual. The README lists more than 30 input languages, including Arabic, Chinese, English, French, German, Greek, Hebrew, Hindi, Japanese, Korean, Russian, Spanish, and Swahili. If no language is given, the library will try to detect one automatically. The same API works for non-English sources, for example building a paper from the Chinese site sina.com.cn with language set to zh.

The README also lists the main features in one place: a multi-threaded article download framework, URL identification for news pages, text and image extraction, keyword and summary extraction, author detection, and a helper for fetching Google trending terms. There is a short helper function called fulltext that takes raw HTML and returns just the article text, useful when the HTML is already in hand.

One section of the README is a sponsored note from the author about routing scrapers through a paid residential proxy service called Swiftproxy, with a referral link and a discount code, presented as a way to avoid 403 responses, captchas, and rate limits when scraping at scale. Treat that section as a paid recommendation rather than core project documentation.

Full guides and the list of supported languages live in the project docs on Read the Docs.
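
The single-article workflow and the build mode described above can be sketched roughly as follows. This is a minimal illustration, assuming the package is installed with pip install newspaper3k; the wrapper names extract_article and discover_site and the returned dictionary layouts are made up for this example and are not part of the library. The nlp step additionally assumes the NLTK tokenizer data is available locally.

```python
def extract_article(url, run_nlp=False):
    """Download one news page and return its main fields as a plain dict.

    Hypothetical helper for illustration; only the Article calls come
    from the library itself.
    """
    # Imported lazily so this file can be loaded without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the raw HTML over the network
    article.parse()     # separate the article body from menus, ads, etc.

    fields = {
        "title": article.title,
        "authors": article.authors,            # list of author names
        "publish_date": article.publish_date,  # datetime, or None if not found
        "text": article.text,
        "top_image": article.top_image,        # URL of the main image
        "movies": article.movies,              # URLs of embedded videos
    }

    if run_nlp:
        # nlp() must be called after parse(); it needs NLTK tokenizer data.
        article.nlp()
        fields["keywords"] = article.keywords
        fields["summary"] = article.summary

    return fields


def discover_site(site_url, language=None):
    """List the article and category URLs found on a whole news site.

    Hypothetical helper for illustration, wrapping newspaper.build().
    """
    import newspaper

    # Only pass a language when one is given, so detection stays automatic.
    options = {"language": language} if language else {}
    paper = newspaper.build(site_url, **options)

    return {
        "article_urls": [a.url for a in paper.articles],
        "category_urls": paper.category_urls(),
    }
```

A call such as extract_article("https://example.com/some-story", run_nlp=True) (a placeholder URL) would return the parsed fields plus keywords and a summary, while discover_site("http://www.sina.com.cn", language="zh") mirrors the Chinese-language example mentioned above. Both calls hit the network, so results depend on the live site.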


Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.