Analysis updated 2026-06-24
Pull the clean article body and metadata out of a news URL without writing custom selectors
Discover all article and category URLs for a site like cnn.com using the build mode
Generate a list of keywords and a short summary from a downloaded article using the nlp method
Extract article text in 30+ languages including Chinese, Arabic, and Russian
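As a rough illustration of the build mode and language option listed above, the sketch below wraps `newspaper.build` and gathers the discovered URLs into a plain dict. The helper names `collect_urls` and `discover` are hypothetical, chosen for this example. The library import is deferred so that the URL-collecting helper can be tried on any object with the same shape.

```python
def collect_urls(source, limit=None):
    """Gather article and category URLs from a built source object.

    Expects `source.articles` (items with a `.url` attribute) and a
    `source.category_urls()` method, as the README's build example uses.
    """
    article_urls = [article.url for article in source.articles]
    if limit is not None:
        article_urls = article_urls[:limit]
    return {
        "articles": article_urls,
        "categories": list(source.category_urls()),
    }


def discover(site_url, language=None, limit=None):
    """Build a source for a whole site (needs network and newspaper3k)."""
    # Imported lazily so collect_urls stays usable without the library.
    import newspaper

    if language:
        # For example language="zh" for a Chinese site.
        source = newspaper.build(site_url, language=language)
    else:
        source = newspaper.build(site_url)
    return collect_urls(source, limit=limit)


# Example calls (not run here because they hit the network):
# discover("http://cnn.com", limit=10)
# discover("http://www.sina.com.cn/", language="zh", limit=10)
```

Building a large site can take a while, so capping the result with `limit` is a convenience for first experiments rather than anything the library requires.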
| | codelucas/newspaper | bloomberg/memray | andkret/cookbook |
|---|---|---|---|
| Stars | 15,043 | 15,019 | 15,082 |
| Language | Python | Python | Python |
| Last pushed | — | 2026-05-21 | — |
| Maintenance | — | Maintained | — |
| Setup difficulty | easy | moderate | easy |
| Complexity | 2/5 | 3/5 | 2/5 |
| Audience | data | developer | data |
Figures from each repo's GitHub metadata at analysis time.
Install with pip install newspaper3k. The Python 2 branch is deprecated, and the README contains a sponsored proxy section that is not core documentation.
Newspaper3k is a Python library for pulling news articles off websites and turning them into clean, structured data. Given the URL of a news story, it downloads the page, extracts the article body from the surrounding clutter of menus and ads, and exposes useful fields: the author list, the publish date, the main text, the top image, embedded videos, and so on. There is also a build mode where you point it at a whole news site such as cnn.com, and it discovers all the article URLs and category pages on that site.
The library is written in pure Python 3, using the requests library as inspiration for its simple API and the lxml library for fast HTML parsing. An older Python 2 version exists on a separate branch, but the README labels it deprecated and buggy.
A typical workflow shown in the README is short: create an Article object from a URL, call download then parse, and read the fields. There is also a separate nlp method that produces a list of keywords and a short text summary from the article body.
Newspaper3k is multi-lingual. The README lists more than 30 input languages, including Arabic, Chinese, English, French, German, Greek, Hebrew, Hindi, Japanese, Korean, Russian, Spanish, and Swahili. If no language is given, the library will try to detect one automatically. The same API works for non-English sources, for example building a paper from the Chinese site sina.com.cn with language set to zh.
The README also lists the main features in one place: a multi-threaded article download framework, URL identification for news pages, text and image extraction, keyword and summary extraction, author detection, and a helper for fetching Google trending terms. There is a short helper function called fulltext that takes raw HTML and returns just the article text, useful when the HTML is already in hand.
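The download, parse, and nlp workflow described above, together with the fulltext helper, might look roughly like the sketch below. The function names `article_to_record`, `fetch_article`, and `text_from_html` are hypothetical wrappers written for this example, and the stub values in the demo are made up. The newspaper imports are deferred so the record-building helper can be exercised offline.

```python
from types import SimpleNamespace


def article_to_record(article):
    """Copy the commonly used article fields into a plain dict.

    Works on any object exposing these attributes, so it can be tried
    without network access or the library installed.
    """
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
        # keywords and summary are only filled in after nlp() has run.
        "keywords": list(getattr(article, "keywords", [])),
        "summary": getattr(article, "summary", ""),
    }


def fetch_article(url, with_nlp=False):
    """Download and parse one article (needs network and newspaper3k)."""
    from newspaper import Article  # lazy import, see lead-in

    article = Article(url)
    article.download()
    article.parse()
    if with_nlp:
        # nlp() depends on NLTK tokenizer data being installed beforehand.
        article.nlp()
    return article_to_record(article)


def text_from_html(html):
    """Extract article text from HTML that is already in hand."""
    from newspaper import fulltext  # lazy import, see lead-in

    return fulltext(html)


if __name__ == "__main__":
    # Offline demonstration with a stand-in object (hypothetical values).
    stub = SimpleNamespace(
        title="Example headline",
        authors=["Jane Doe"],
        publish_date=None,
        text="Body text of the story.",
        top_image="http://example.com/lead.jpg",
    )
    print(article_to_record(stub))
```

Returning a plain dict keeps downstream code independent of the library's Article class, which is a design choice of this sketch and not something the README prescribes.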
One section of the README is a sponsored note from the author about routing scrapers through a paid residential proxy service called Swiftproxy, with a referral link and a discount code, presented as a way to avoid 403 responses, captchas, and rate limits when scraping at scale. Treat that section as a paid recommendation rather than core project documentation. Full guides and the list of supported languages live in the project docs on Read the Docs.
Newspaper3k is a Python 3 library that downloads news articles by URL and extracts the title, author, date, body text, top image, keywords, and summary.
Mainly Python. The stack also includes lxml and requests.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Audience: mainly data-focused users.
Verify against the repo before relying on details.