Collect Weibo user profiles and post histories for academic or market research
Gather comments and repost graphs for social network analysis or NLP datasets
Monitor keyword topics on Weibo by scheduling periodic crawls across multiple workers
Build a Weibo dataset for Chinese-language natural language processing projects
Requires MySQL and Redis instances, a valid Weibo account, and a configured YAML file before starting workers. Weibo invalidates login cookies every 24 hours, so a Celery beat process must also be running to refresh them.
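Before starting workers, a short pre-flight check can catch a missing setting early. The sketch below is not taken from the project: the section names (`db`, `redis`, `weibo_account`, `email`) and the keys inside them are hypothetical stand-ins for whatever the project's YAML file actually uses, and the settings are shown as an already-parsed dictionary rather than read from disk.

```python
# Hypothetical section and key names; check the project's own YAML file
# for the real ones. The dict stands in for the parsed YAML contents.
REQUIRED_KEYS = {
    "db": ("host", "port", "user", "password", "db_name"),
    "redis": ("host", "port"),
    "weibo_account": ("name", "password"),
    "email": ("server", "from", "to"),
}


def find_missing_settings(config):
    """Return a sorted list of 'section.key' entries absent or empty in config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        # A missing or empty section is treated as having no keys set.
        values = config.get(section) or {}
        for key in keys:
            if values.get(key) in (None, ""):
                missing.append(f"{section}.{key}")
    return sorted(missing)


if __name__ == "__main__":
    example = {
        "db": {"host": "127.0.0.1", "port": 3306, "user": "crawler",
               "password": "secret", "db_name": "weibo"},
        "redis": {"host": "127.0.0.1", "port": 6379},
        "weibo_account": {"name": "", "password": ""},
    }
    for item in find_missing_settings(example):
        print("missing:", item)
```

Running a check like this before launching any worker means a misconfigured machine fails immediately instead of partway through a crawl.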
Weibospider is a distributed data-collection tool for Weibo, the large Chinese social media platform. It gathers public information including user profiles, original posts from a specific account's homepage, comments on posts, repost relationships, and posts matching a given keyword search. The README is written in Chinese, and the project targets researchers and developers working with Weibo data for analysis or natural language processing.
The system is built on top of two popular Python libraries: Celery, which handles task scheduling and distribution across multiple machines, and Requests, which handles the underlying HTTP communication. Data is stored in a MySQL database, and Redis is used to coordinate the Celery workers. The project explicitly avoids browser automation for login, relying instead on manually analyzed network requests, which the authors say makes the scraper more stable over long runs.
Setting the system up requires configuring a YAML file with your MySQL and Redis connection details, Weibo account credentials, and notification email settings. You then create the database tables, optionally start a small Django-based web interface for managing crawl targets, and launch one or more Celery workers. A separate Celery beat process handles periodic tasks such as refreshing login cookies, which Weibo invalidates every 24 hours.
Because it runs as separate workers, you can spread the load across multiple machines simply by installing the dependencies on each machine and pointing them at the same Redis and MySQL instances. The project includes rate-limiting controls in its configuration file, and the authors ask users to keep crawl frequency reasonable to avoid disrupting the Weibo platform.
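The periodic cookie refresh described above can be pictured as a Celery beat schedule entry. The following is a minimal sketch, not the project's actual configuration: the task name `tasks.login.refresh_cookies` is hypothetical, and refreshing 4 hours ahead of the 24-hour expiry is an assumed safety margin rather than a documented value.

```python
from datetime import timedelta

# Weibo is described as invalidating login cookies after 24 hours.
COOKIE_LIFETIME = timedelta(hours=24)


def build_beat_schedule(safety_margin=timedelta(hours=4)):
    """Build a Celery-style beat schedule that refreshes cookies before expiry.

    The task name is hypothetical and the default margin is an assumption.
    """
    if not timedelta(0) <= safety_margin < COOKIE_LIFETIME:
        raise ValueError("safety_margin must be non-negative and shorter "
                         "than the cookie lifetime")
    return {
        "refresh-login-cookies": {
            "task": "tasks.login.refresh_cookies",  # hypothetical task name
            # Run often enough that cookies never reach the 24-hour limit.
            "schedule": COOKIE_LIFETIME - safety_margin,
        }
    }


if __name__ == "__main__":
    schedule = build_beat_schedule()
    print(schedule["refresh-login-cookies"]["schedule"])
```

In Celery 4 and later, a dictionary of this shape can be assigned to `app.conf.beat_schedule` on a `Celery` application object, and the beat process then enqueues the task at the given interval for any worker to pick up. Refreshing somewhat earlier than the expiry leaves room for a failed attempt to be retried before the old cookies stop working.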
Verify against the repo before relying on details.