explaingit

codelucas/newspaper

Analysis updated 2026-06-24

15,043 stars | Python | Audience · data | Complexity · 2/5 | Setup · easy

TLDR

Newspaper3k is a Python 3 library that downloads news articles by URL and extracts the title, author, date, body text, top image, keywords, and summary.

Mindmap

mindmap
  root((newspaper))
    Inputs
      Article URL
      News site URL
      Raw HTML
    Outputs
      Article text
      Authors
      Publish date
      Keywords summary
    Use Cases
      Scrape news articles
      Build news dataset
      Multi language extraction
    Tech Stack
      Python
      lxml
      requests
      NLTK

What do people build with it?

USE CASE 1

Pull the clean article body and metadata out of a news URL without writing custom selectors
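A minimal sketch of this use case, assuming the README workflow of creating an Article, calling download, then parse. The helper names `article_to_record` and `fetch_article` are hypothetical, chosen only for illustration; the attribute names are the fields the library exposes after parsing.

```python
from types import SimpleNamespace


def article_to_record(article):
    """Copy the parsed fields of an Article-like object into a plain dict."""
    published = article.publish_date
    return {
        "title": article.title,
        "authors": list(article.authors),
        # publish_date is a datetime when one was detected, otherwise None
        "published": published.isoformat() if published else None,
        "text": article.text,
        "top_image": article.top_image,
    }


def fetch_article(url):
    """Download and parse one article (needs network access and newspaper3k)."""
    # The package is installed as newspaper3k but imported as newspaper.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, date, text, images
    return article_to_record(article)


# Offline stand-in for a parsed article, used here instead of a live download.
stub = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    publish_date=None,
    text="Body text.",
    top_image="",
)
record = article_to_record(stub)
```

Calling `fetch_article` with a real article URL would return the same kind of dict. Keeping the field mapping separate from the download makes it easy to check without hitting the network.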

USE CASE 2

Discover all article and category URLs for a site like cnn.com using the build mode
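Build mode could look roughly like the sketch below. The functions `summarize_source` and `discover` are hypothetical helpers; `newspaper.build`, `category_urls()`, `size()`, and the `articles` list come from the library's documented quickstart.

```python
from types import SimpleNamespace


def summarize_source(paper, limit=5):
    """Collect category URLs and the first few article URLs from a built source."""
    return {
        "categories": list(paper.category_urls()),
        "article_count": paper.size(),
        "sample_articles": [article.url for article in paper.articles[:limit]],
    }


def discover(site_url, limit=5):
    """Build a source for a whole site (needs network access and newspaper3k)."""
    import newspaper

    # By default the library remembers articles from earlier runs and returns
    # only new ones; memoize_articles=False asks for everything it finds.
    paper = newspaper.build(site_url, memoize_articles=False)
    return summarize_source(paper, limit)
```

For example, `discover("http://cnn.com")` would return the category URLs, the number of discovered articles, and a small sample of article URLs. Large sites can yield thousands of URLs, so a limit on the sample keeps the output readable.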

USE CASE 3

Generate a list of keywords and a short summary from a downloaded article using the nlp method
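A hedged sketch of the keyword and summary step. The helper names are hypothetical. The NLTK download line is an assumption about the local setup: nlp depends on NLTK tokenizer data, and the exact resource name can differ between NLTK releases.

```python
def keywords_and_summary(article):
    """Run nlp() on an already parsed article and return (keywords, summary)."""
    article.nlp()  # fills in .keywords and .summary; parse() must run first
    return list(article.keywords), article.summary


def analyze(url):
    """Download, parse, and analyze one article (needs network, newspaper3k, NLTK)."""
    import nltk
    from newspaper import Article

    # Assumption: the "punkt" tokenizer data is what this NLTK version expects.
    nltk.download("punkt", quiet=True)

    article = Article(url)
    article.download()
    article.parse()
    return keywords_and_summary(article)
```

The order matters: nlp works on the parsed body text, so calling it before download and parse will fail.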

USE CASE 4

Extract article text in 30+ languages including Chinese, Arabic, and Russian
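As a sketch, the language can be passed as a keyword argument when building a source, or left out to rely on automatic detection. The helpers `language_kwargs` and `build_localized` are hypothetical names introduced here; the `language` parameter itself is the one the README uses.

```python
def language_kwargs(code=None):
    """Keyword arguments for Article(...) or newspaper.build(...)."""
    # An empty dict leaves the language unset, so the library falls back
    # to its own detection.
    return {"language": code} if code else {}


def build_localized(site_url, code=None):
    """Build a source with an optional language code (needs network, newspaper3k)."""
    import newspaper

    return newspaper.build(site_url, **language_kwargs(code))
```

For instance, `build_localized("http://www.sina.com.cn", "zh")` mirrors the Chinese example mentioned in the README.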

What is it built with?

Python, lxml, requests, NLTK

How does it compare?

                    codelucas/newspaper   bloomberg/memray   andkret/cookbook
Stars               15,043                15,019             15,082
Language            Python                Python             Python
Setup difficulty    easy                  moderate           easy
Complexity          2/5                   3/5                2/5
Audience            data                  developer          data

Last pushed · 2026-05-21
Maintenance · Maintained

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

Install with pip install newspaper3k. The Python 2 branch is deprecated, and the README contains a sponsored proxy section that is not part of the core documentation.

In plain English

Newspaper3k is a Python library for pulling news articles off websites and turning them into clean, structured data. Given the URL of a news story, it downloads the page, extracts the article body from the surrounding clutter of menus and ads, and exposes useful fields: the author list, the publish date, the main text, the top image, embedded videos, and so on. There is also a build mode where you point it at a whole news site such as cnn.com, and it discovers all the article URLs and category pages on that site.
The library is written in pure Python 3, using the requests library as inspiration for its simple API and the lxml library for fast HTML parsing. An older Python 2 version exists on a separate branch, but the README labels it deprecated and buggy.
A typical workflow shown in the README is short: create an Article object from a URL, call download then parse, and read the fields. There is also a separate nlp method that produces a list of keywords and a short text summary from the article body.
Newspaper3k is multi-lingual. The README lists more than 30 input languages, including Arabic, Chinese, English, French, German, Greek, Hebrew, Hindi, Japanese, Korean, Russian, Spanish, and Swahili. If no language is given, the library will try to detect one automatically. The same API works for non-English sources, for example building a paper from the Chinese site sina.com.cn with language set to zh.
The README also lists the main features in one place: a multi-threaded article download framework, URL identification for news pages, text and image extraction, keyword and summary extraction, author detection, and a helper for fetching Google trending terms. There is a short helper function called fulltext that takes raw HTML and returns just the article text, useful when the HTML is already in hand.
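The fulltext helper can be wrapped as in the sketch below. The wrapper `extract_text` and its `extractor` parameter are hypothetical, added so the logic can be exercised without the library installed; `fulltext` itself is imported from the `newspaper` module.

```python
def extract_text(html, language="en", extractor=None):
    """Return the article text for an HTML string that is already in hand."""
    # Skip empty input so the extractor is never called on a blank page.
    if not html.strip():
        return ""
    if extractor is None:
        # Lazy import: only needed when no custom extractor is supplied.
        from newspaper import fulltext as extractor
    return extractor(html, language=language)
```

With newspaper3k installed, `extract_text(html)` passes the string to `fulltext` and returns only the body text, with no download step involved.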
One section of the README is a sponsored note from the author about routing scrapers through a paid residential proxy service called Swiftproxy, with a referral link and a discount code, presented as a way to avoid 403 responses, captchas, and rate limits when scraping at scale. Treat that section as a paid recommendation rather than core project documentation. Full guides and the list of supported languages live in the project docs on Read the Docs.

Copy-paste prompts

Prompt 1
Write a Python script using newspaper3k that downloads every article from cnn.com and saves title, author, date, and body to CSV
Prompt 2
Show me how to use newspaper3k on a Chinese news site with language set to zh
Prompt 3
Use the fulltext function from newspaper3k (imported as newspaper) to extract clean text from an HTML string I already have
Prompt 4
Build a multi-threaded news scraper with newspaper3k that politely rate limits requests

Frequently asked questions

What is newspaper?

Newspaper3k is a Python 3 library that downloads news articles by URL and extracts the title, author, date, body text, top image, keywords, and summary.

What language is newspaper written in?

Mainly Python. The stack also includes lxml, requests, and NLTK.

How hard is newspaper to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is newspaper for?

Mainly data-focused users, such as people scraping news articles or building news datasets.

Verify against the repo before relying on details.