explaingit

dataabc/weibo-crawler

4,480 · Python · Audience: data · Complexity: 3/5 · Setup: moderate

TLDR

A Python tool that collects posts and profile data from Weibo accounts and saves everything to local files or a database, with support for image and video download and scheduled automatic runs.

Mindmap

mindmap
  root((weibo-crawler))
    Data collected
      Post text and dates
      Media images and video
      Profile info
      Likes and comments
    Output formats
      CSV files
      JSON files
      MySQL MongoDB SQLite
    Features
      Scheduled runs
      Media download
      Cookie auth
    Setup
      Config file
      Python or Docker

Things people build with this

USE CASE 1

Archive all posts from a set of Weibo accounts, including text, timestamps, likes, and media, to a local database.

USE CASE 2

Build a dataset of Weibo posts from specific users for research or social media trend analysis.

USE CASE 3

Download all images and videos attached to Weibo posts to a local folder.

USE CASE 4

Schedule automatic daily runs to keep a Weibo user's post history continuously up to date.

Tech stack

Python · MySQL · MongoDB · SQLite · Docker · CSV

Getting it running

Difficulty: moderate · Time to first run: 30 min

Requires editing a config file to list the Weibo user IDs you want to collect. Providing a browser cookie is optional and gives access to login-gated content.
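As a rough illustration of the kind of settings involved, the sketch below builds a configuration as a Python dictionary and serializes it to JSON. Every key name, the user ID, and the date format are assumptions chosen to mirror the options described on this page; they are not confirmed against the repository, so check the project's own config file for the real names and accepted values.

```python
import json


def build_config(user_ids, since_date, write_modes, cookie=""):
    """Return a settings dictionary for a hypothetical crawler config file.

    All key names here are illustrative assumptions, not the project's
    confirmed schema.
    """
    return {
        "user_id_list": list(user_ids),    # accounts to collect (assumed key)
        "since_date": since_date,          # earliest post date (assumed key)
        "only_crawl_original": 1,          # 1 = skip reposts (assumed key)
        "write_mode": list(write_modes),   # output formats (assumed key)
        "original_pic_download": 1,        # images from original posts (assumed key)
        "retweet_pic_download": 0,         # images from reposts (assumed key)
        "cookie": cookie,                  # optional browser cookie (assumed key)
    }


# Placeholder user ID and date; replace with real values.
config = build_config(["1234567890"], "2024-01-01", ["csv", "sqlite"])
config_text = json.dumps(config, indent=2, ensure_ascii=False)
print(config_text)
```

Leaving `cookie` empty matches the description above: the cookie is optional, and only needed for content that requires a logged-in session.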

License terms are not described in the explanation; check the repository directly.

In plain English

This is a Python tool for collecting posts and profile data from Weibo, a large social media platform in China similar in style to Twitter. Given one or more Weibo user IDs, the tool fetches everything those accounts have posted and saves the results to files on your computer. The README and most of the documentation are written in Chinese.

For each user, the tool collects two categories of data. The first is profile information: the user's display name, follower and following counts, number of posts, location, verified status, and similar account details. The second is post data: the text of each post, when it was published, how many likes and comments it received, what device it was posted from, any hashtags or mentions it contains, and the original post if the item is a repost rather than new content.

Beyond text, the tool can also download images and videos attached to posts. You can configure separately whether to download media from original posts and from reposts. Collected data can be written to CSV files, JSON files, or stored in a MySQL, MongoDB, or SQLite database, depending on what you configure.

Setup involves editing a configuration file to specify the user IDs you want to collect, the date range, whether to include only original posts or reposts as well, and your output format preferences. Providing a browser cookie is optional but allows access to data that would otherwise require being logged in to Weibo.

The tool also supports scheduled automatic runs, so you can set it up to check back every few days and download only new posts since the last run. A Docker image is available if you prefer not to install Python dependencies directly on your machine. An optional API service mode is also mentioned, though details are in the full README.
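The "download only new posts since the last run" behavior can be pictured with a small, generic sketch. This is not the project's actual code: the function name `select_new_posts`, the `created_at` field, and the sample posts are all hypothetical, and the real tool may track its progress differently.

```python
from datetime import datetime


def select_new_posts(posts, last_run):
    """Keep posts published strictly after last_run, oldest first."""
    fresh = [p for p in posts if p["created_at"] > last_run]
    return sorted(fresh, key=lambda p: p["created_at"])


# Hypothetical sample data standing in for fetched posts.
posts = [
    {"id": "a1", "created_at": datetime(2024, 5, 1, 9, 0)},
    {"id": "a2", "created_at": datetime(2024, 5, 3, 18, 30)},
    {"id": "a3", "created_at": datetime(2024, 5, 6, 7, 15)},
]
last_run = datetime(2024, 5, 2)

new_posts = select_new_posts(posts, last_run)
new_ids = [p["id"] for p in new_posts]

# The next run would start from the newest post seen so far;
# if nothing new arrived, the checkpoint stays where it was.
next_last_run = max((p["created_at"] for p in new_posts), default=last_run)
print(new_ids, next_last_run)
```

Storing the checkpoint as the timestamp of the newest saved post, rather than the wall-clock time of the run, avoids skipping posts published while a run is in progress.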

Copy-paste prompts

Prompt 1
How do I configure weibo-crawler to collect posts from a specific Weibo user ID and save them to a SQLite database? Show me the config file setup.
Prompt 2
I want to download all images from a Weibo account using weibo-crawler. What settings do I need in the config file to enable media download?
Prompt 3
Show me how to run weibo-crawler using Docker so I don't need to install Python dependencies directly on my machine.
Prompt 4
How do I provide a Weibo browser cookie to weibo-crawler so it can access content that requires being logged in?
Prompt 5
How do I set up weibo-crawler to run on a schedule and only download posts that are newer than the last run?
Verify against the repo before relying on details.