explaingit

remitchell/python-scraping

4,708 | Jupyter Notebook | Audience · developer | Complexity · 2/5 | Setup · moderate

TLDR

Code samples from the O'Reilly book 'Web Scraping with Python, 2nd Edition', teaching you how to write Python programs that automatically collect data from websites at scale.

Mindmap

mindmap
  root((python-scraping))
    What it is
      Book companion code
      Jupyter notebooks
      Chapter-by-chapter
    Techniques covered
      Basic HTML parsing
      Dynamic page scraping
      Large-scale crawling
    Tools used
      Python
      Jupyter Notebook
      BeautifulSoup
      Scrapy
    Use cases
      Price monitoring
      Data collection
      News aggregation

Things people build with this

USE CASE 1

Learn how to extract product prices from an e-commerce website using Python step by step.

USE CASE 2

Build a news headline collector that automatically pulls articles from multiple sites.

USE CASE 3

Practice scraping JavaScript-rendered pages that don't work with basic HTML parsing.

Tech stack

Python, Jupyter Notebook, BeautifulSoup, Scrapy, Selenium

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires installing Python and libraries such as BeautifulSoup and Scrapy. The companion book provides the full explanations.

In plain English

This repository contains the code samples that accompany the book "Web Scraping with Python, 2nd Edition", published by O'Reilly. The book teaches readers how to write Python programs that automatically collect information from websites, a technique called web scraping.

Web scraping is the practice of writing code that visits a web page, reads its content, and extracts specific pieces of information, such as product prices, news headlines, or data tables. Instead of copying and pasting information by hand, a scraping program can do the same thing automatically, at scale, across many pages.

The code in this repository is organized into Jupyter notebooks. Jupyter is a tool that lets you run Python code in a browser-based document alongside text explanations and output. Each notebook corresponds to a chapter or concept from the book. The author recommends cloning the repository and running the notebooks locally rather than reading them directly on GitHub, because some formatting may not display correctly in the browser. The repository also includes a separate folder with code from the first edition of the book, for readers working from the older version.

Because websites change their structure over time and Python libraries receive updates, some code samples may become outdated after publication. The author acknowledges this and invites readers to submit corrections through GitHub pull requests.

The README for this repository is brief and points to the book itself for full context. The repository is primarily a companion resource rather than a standalone project.
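
To make the "read the content, extract specific pieces" idea concrete, here is a minimal sketch that is not taken from the book or the repository. It swaps in Python's standard-library `html.parser` in place of BeautifulSoup so it runs with no installs, and it parses an inline HTML string instead of fetching a live page. The markup, the `name` and `price` class names, and the `ProductParser` and `extract_products` names are all hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content (an assumption for this sketch). A real scraper
# would download the HTML first, for example with urllib.request.urlopen().
SAMPLE_HTML = """
<html><body>
  <h1>Example Store</h1>
  <ul>
    <li class="product"><span class="name">Widget</span> <span class="price">$4.99</span></li>
    <li class="product"><span class="name">Gadget</span> <span class="price">$12.50</span></li>
  </ul>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished (name, price) tuples
        self._field = None   # which span we are currently inside, if any
        self._name = None    # a name waiting for its matching price

    def handle_starttag(self, tag, attrs):
        # Only spans carrying one of the two hypothetical classes are tracked.
        if tag == "span":
            css_class = dict(attrs).get("class")
            if css_class in ("name", "price"):
                self._field = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return  # whitespace, or text outside the spans we care about
        if self._field == "name":
            self._name = text
        elif self._field == "price" and self._name is not None:
            self.products.append((self._name, text))
            self._name = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Parse an HTML string and return a list of (name, price) tuples."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


for name, price in extract_products(SAMPLE_HTML):
    print(f"{name}: {price}")
```

On a real site the selectors would need to match that site's actual markup, which is the part most likely to break as pages change over time.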

Copy-paste prompts

Prompt 1
Using techniques from Web Scraping with Python, write a Python script that collects all product names and prices from an e-commerce site and saves them to a CSV.
Prompt 2
Show me how to use BeautifulSoup to extract all links and their anchor text from a Wikipedia article.
Prompt 3
How do I scrape a website that loads content dynamically with JavaScript using Selenium in Python?
Prompt 4
Write a Scrapy spider that crawls an entire blog site and saves each post title and URL to a JSON file.

Verify against the repo before relying on details.