Learn from scratch how to collect data automatically from websites and apps using Python.
Build a web scraper that handles logins, CAPTCHAs, and anti-scraping protections.
Set up a distributed scraping system that runs across multiple servers to gather large amounts of data.
Understand how to inspect network traffic and parse web page content programmatically.
This repository is a Chinese-language tutorial series that teaches Python web scraping from scratch. Web scraping means writing a program that automatically visits web pages or mobile apps and extracts data from them. The series is presented as a curated reading list: the README is essentially a table of contents, with each entry linking out to a full article hosted on WeChat or a separate blog. Its value lies in the structured curriculum, which walks through the topic in order.

The curriculum starts with how to inspect the traffic a browser or mobile app sends and receives, using packet-capture tools such as Fiddler and mitmproxy. It then introduces the Python libraries used to fetch pages and extract information from them, including urllib, requests, BeautifulSoup, and Selenium, and shows how to use Selenium with PhantomJS to drive a browser. Later articles cover handling login pages, recognising image-based verification codes, defeating anti-scraping measures such as CSS-based font encryption and JavaScript obfuscation, scraping mobile apps with Appium, running scrapers across multiple threads and processes, using IP proxy pools to avoid being blocked, saving results to CSV files or to MySQL and MongoDB databases, visualising scraped data, building a Scrapy-based crawler, and finally running a distributed scraper across several servers.

The repository suits readers of Chinese who want a step-by-step path into Python scraping without prior experience and who are prepared to work through the linked articles in order. This summary covers only part of the README; the full file is longer.
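The fetch, parse, multi-thread, and save-to-CSV steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the tutorial articles: it uses only the standard library (urllib for fetching, and html.parser standing in for BeautifulSoup) so that it runs without extra installs, and every name in it (LinkExtractor, fetch_page, extract_links, scrape, rows_to_csv, the User-Agent string) is hypothetical.

```python
# Illustrative sketch only: all names are hypothetical, not taken from the tutorials.
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect (text, absolute URL) pairs for every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def fetch_page(url, timeout=10):
    """Download a page with urllib; the User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "tutorial-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Parse an HTML string and return its links as (text, url) pairs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def scrape(urls, fetch=fetch_page, workers=4):
    """Fetch pages on a thread pool and return rows of (page, text, link)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, urls))  # map preserves input order
    rows = []
    for url, html in zip(urls, pages):
        rows.extend((url, text, link) for text, link in extract_links(html, url))
    return rows


def rows_to_csv(rows):
    """Serialise the scraped rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["page", "text", "link"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the fetch function as a parameter is a design choice made for this sketch: it lets the parsing and CSV steps be exercised offline with canned HTML, while the default fetch_page performs a real download. A real scraper would also need the error handling, rate limiting, and proxy rotation that the later articles cover.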
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.