Learn to fetch web pages and extract structured data using Python requests and JSON parsing by following step-by-step lessons.
Build a movie data scraper for Douban that collects titles, ratings, and reviews into a structured format.
Set up Scrapy for larger crawling projects and store results automatically in a MongoDB database.
This repository is a Chinese-language tutorial series for learning how to write web scrapers in Python, starting from the very basics. A web scraper is a program that automatically visits websites and collects information from them, such as product listings, article titles, or user reviews.

The tutorial is structured as a series of numbered lessons, written as Markdown documents linked from the README. Early chapters cover foundational topics: what a scraper is, how web requests work, how to use Python's requests library to fetch pages, and how to pull specific pieces of data out of a page using tools like regular expressions and JSON parsing.

Alongside the core lessons, the repository includes practical worked examples targeting real Chinese websites, including a movie site (Douban) and Baidu's discussion forums. These examples walk through building an actual scraper step by step. More advanced topics mentioned in the project description include reversing JavaScript code to bypass protections, using Selenium to control a browser programmatically, reading text from images using OCR, storing results in a MongoDB database, and using the Scrapy framework for larger crawling projects.

The README is written in Chinese, and the external references it links to are a mix of Wikipedia, Chinese tech documentation, and developer tutorials. The project is aimed at Chinese-speaking beginners who want a structured path into web scraping with Python. The README is relatively short, and the bulk of the learning content lives in the linked Markdown files rather than in the README itself.
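The fetch-and-extract workflow that the early lessons describe can be sketched roughly as follows. This is a minimal illustration, not code taken from the tutorial: the sample HTML, the JSON payload, the field names (`title`, `rating`, `rate`, `subjects`), and the helper names (`fetch`, `parse_html`, `parse_json`) are all assumptions made for the example, and real pages such as Douban's will have different markup. The `fetch` helper assumes the third-party requests package is installed; the parsing helpers use only the standard library and run on inline sample data.

```python
import json
import re


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its body as text (hypothetical helper)."""
    # Imported lazily so the parsing helpers below work without requests installed.
    import requests

    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},  # many sites reject the default agent
        timeout=timeout,
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Hypothetical markup standing in for a fetched page; real pages will differ.
SAMPLE_HTML = """
<div class="item"><span class="title">Movie A</span><span class="rating">9.1</span></div>
<div class="item"><span class="title">Movie B</span><span class="rating">8.4</span></div>
"""

# Non-greedy title capture followed by a numeric rating capture.
ITEM_PATTERN = re.compile(
    r'<span class="title">(?P<title>.*?)</span>\s*'
    r'<span class="rating">(?P<rating>[\d.]+)</span>'
)


def parse_html(html: str) -> list[dict]:
    """Extract title/rating records from HTML with a regular expression."""
    return [
        {"title": m.group("title"), "rating": float(m.group("rating"))}
        for m in ITEM_PATTERN.finditer(html)
    ]


# Hypothetical JSON payload, as an API-style endpoint might return.
SAMPLE_JSON = '{"subjects": [{"title": "Movie C", "rate": "7.9"}]}'


def parse_json(payload: str) -> list[dict]:
    """Extract the same record shape from a JSON response body."""
    data = json.loads(payload)
    return [
        {"title": s["title"], "rating": float(s["rate"])}
        for s in data.get("subjects", [])
    ]


if __name__ == "__main__":
    records = parse_html(SAMPLE_HTML) + parse_json(SAMPLE_JSON)
    print(json.dumps(records, ensure_ascii=False, indent=2))
```

Both parsers return the same list-of-dicts shape, so the results could later be written to a file or a database without caring which source they came from. Regular expressions are brittle against markup changes, which is one reason larger projects tend to move to a dedicated HTML parser or a framework.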
Verify against the repo before relying on details.