spiderclub/weibospider

★ 4,790 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((Weibospider))
    Data Collection
      User Profiles
      Posts and Comments
      Repost Relationships
      Keyword Search
    Core Libraries
      Celery Workers
      Requests HTTP
    Storage Layer
      MySQL Database
      Redis Coordinator
    Management
      Django Web UI
      Celery Beat Scheduler
    Deployment
      Multi Machine Scale
      YAML Configuration

Things people build with this

USE CASE 1

Collect Weibo user profiles and post histories for academic or market research

USE CASE 2

Gather comments and repost graphs for social network analysis or NLP datasets

USE CASE 3

Monitor keyword topics on Weibo by scheduling periodic crawls across multiple workers

USE CASE 4

Build a Weibo dataset for Chinese-language natural language processing projects

Tech stack

Python, Celery, Requests, MySQL, Redis, Django, YAML

Getting it running

Difficulty · moderate | Time to first run · 1h+

Requires running MySQL and Redis instances, a valid Weibo account, and a configured YAML file before any workers are started. Weibo login cookies expire after 24 hours, so the Celery beat process that refreshes them must be kept running.
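As a rough illustration of the 24-hour cookie constraint, the sketch below decides when a stored login cookie is due for renewal. It is not taken from the Weibospider codebase: the function name `needs_refresh` and the 4-hour safety margin are assumptions made for illustration. The project itself handles renewal through a periodic Celery beat task.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Cookie lifetime, taken from the 24-hour figure stated above.
COOKIE_LIFETIME = timedelta(hours=24)
# Hypothetical safety margin so renewal happens before the cookie lapses.
REFRESH_MARGIN = timedelta(hours=4)


def needs_refresh(obtained_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when a cookie obtained at `obtained_at` should be renewed."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - obtained_at
    return age >= COOKIE_LIFETIME - REFRESH_MARGIN


if __name__ == "__main__":
    logged_in_at = datetime.now(timezone.utc) - timedelta(hours=21)
    print(needs_refresh(logged_in_at))
```

Refreshing somewhat ahead of the deadline means a worker is less likely to pick up a cookie that expires partway through a crawl.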

License not mentioned in the explanation.

In plain English

Weibospider is a distributed data-collection tool for Weibo, the large Chinese social media platform. It gathers public information including user profiles, original posts from a specific account's homepage, comments on posts, repost relationships, and posts matching a given keyword search. The README is written in Chinese, and the project targets researchers and developers working with Weibo data for analysis or natural language processing.

The system is built on top of two popular Python libraries: Celery, which handles task scheduling and distribution across multiple machines, and Requests, which handles the underlying HTTP communication. Data is stored in a MySQL database, and Redis is used to coordinate the Celery workers. The project explicitly avoids browser automation for login, relying instead on manually analyzed network requests, which the authors say makes the scraper more stable over long runs.

Setting the system up requires configuring a YAML file with your MySQL and Redis connection details, Weibo account credentials, and notification email settings. You then create the database tables, optionally start a small Django-based web interface for managing crawl targets, and launch one or more Celery workers. A separate Celery beat process handles periodic tasks such as refreshing login cookies, which Weibo invalidates every 24 hours.

Because it runs as separate workers, you can spread the load across multiple machines simply by installing the dependencies on each machine and pointing them at the same Redis and MySQL instances. The project includes rate-limiting controls in its configuration file, and the authors ask users to keep crawl frequency reasonable to avoid disrupting the Weibo platform.
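The rate-limiting idea can be sketched as a small throttle that spaces out requests. This is a generic illustration, not Weibospider's own mechanism: the `Throttle` class and its `max_per_minute` parameter are hypothetical, and the project's actual controls are settings in its configuration file.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        # Seconds that must pass between two requests.
        self.min_interval = 60.0 / max_per_minute
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        # Record the moment at which this request is considered sent.
        self._last = now + delay
        return delay


if __name__ == "__main__":
    throttle = Throttle(max_per_minute=600)  # at most one request every 0.1 s
    for _ in range(3):
        throttle.wait()
        print("request allowed")
```

Calling `throttle.wait()` before each HTTP request keeps a single worker under the chosen rate. With several workers sharing one account, each worker would need its own share of the overall budget.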

Copy-paste prompts

Prompt 1

I have the Weibospider repo cloned and workers running. Write a Python script that queries my MySQL database to count the total posts collected per user and export the top 20 most-collected users to a CSV file.

Prompt 2

Using the Weibospider codebase, show me how to add a new Celery task that crawls follower lists for a given Weibo user ID and stores each follower's UID and username in a new MySQL table.

Prompt 3

I want to analyze sentiment on Weibo posts collected by Weibospider. Write a Python script that reads the posts table from MySQL and runs a simple Chinese-language sentiment classifier using snownlp on each post body.

Prompt 4

Help me configure Weibospider's YAML file to crawl the keyword "AI" every 6 hours across 3 worker machines sharing the same Redis and MySQL instances, with a rate limit of 10 requests per minute.

Prompt 5

Explain the Celery beat schedule in Weibospider and show me how to change the cookie-refresh interval from 24 hours to 12 hours in the configuration.

Verify against the repo before relying on details.