explaingit

crawlab-team/crawlab

12,199 Go Audience · developer Complexity · 4/5 Setup · moderate

TLDR

A web-based platform for running, scheduling, and monitoring web scrapers written in any language across multiple machines. Upload your crawler code and manage everything from one dashboard.

Mindmap

mindmap
  root((repo))
    What it does
      Crawler management
      Job scheduling
      Result monitoring
    Tech stack
      Go master node
      MongoDB storage
      gRPC messaging
      SeaweedFS files
    Use cases
      Data collection
      Distributed scraping
      Multi-language crawlers
    Audience
      Data engineers
      Backend developers
    Setup
      Docker Compose
      Master and workers

Things people build with this

USE CASE 1

Run dozens of web scrapers in parallel across multiple servers and monitor logs and results from a single dashboard.

USE CASE 2

Schedule a nightly data-collection job that distributes work across a pool of worker machines.

USE CASE 3

Manage Scrapy, Puppeteer, or Selenium crawlers without writing your own job-queue or task-scheduling infrastructure.
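The nightly job in Use Case 2 comes down to a "run at midnight every day" rule. The sketch below shows that rule with only the Python standard library. The function name `next_midnight` is hypothetical, and the cron string in the comment is the conventional five-field form; whether Crawlab's schedule form expects exactly that form is an assumption to check against its documentation.

```python
from datetime import datetime, timedelta

# Conventional five-field cron expression for "every day at 00:00".
# Assumption: shown for illustration only, not taken from Crawlab's docs.
NIGHTLY_CRON = "0 0 * * *"


def next_midnight(now: datetime) -> datetime:
    """Return the first midnight strictly after `now`."""
    start_of_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_today + timedelta(days=1)


if __name__ == "__main__":
    now = datetime.now()
    print("cron expression:", NIGHTLY_CRON)
    print("next nightly run:", next_midnight(now).isoformat())
```

A scheduler that honors the cron expression would trigger the job at the time this function returns, then again every 24 hours.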

Tech stack

Go · Docker · MongoDB · gRPC · SeaweedFS

Getting it running

Difficulty · moderate Time to first run · 30 min

Requires Docker and a YAML config file. The quick start runs a master node and two workers on one machine.

In plain English

Web crawlers are programs that automatically browse websites and collect data from them. Managing many crawlers at once across multiple computers is complex, and that is what Crawlab addresses. It is a platform for running, scheduling, and monitoring web crawlers built in any programming language, all from a central web dashboard.

You install Crawlab using Docker, a tool that packages software for easy setup. One computer acts as the main control point (called the master node), and any number of additional computers can serve as worker nodes that run the actual crawling jobs. This setup lets you spread heavy crawling workloads across many machines and scale up by simply adding more workers. Through the web interface, you can upload crawler code, assign tasks to specific nodes, schedule jobs to run on a timed basis, and view results and logs for each run.

The platform works with crawlers written in Python, Node.js, Go, Java, and PHP, as well as popular crawling tools such as Scrapy, Puppeteer, and Selenium. It does not care what technology your crawler uses internally, as long as it can run on the worker nodes.

Internally, the master and worker nodes talk to each other using gRPC, a framework for sending structured messages between programs across a network. Crawler files are synchronized across nodes using SeaweedFS, a distributed file system. Task data, scheduling information, and logs are stored in MongoDB, a database suited for this kind of unstructured data.

The quick start requires only Docker and a short configuration file to get a working local setup with a master node and two workers running together. Documentation is available in both English and Chinese.
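To make the master and worker split concrete, here is a toy model of a master handing tasks to worker nodes in round-robin order. It is a sketch only: the function `assign_tasks` and the task and worker names are hypothetical, and round-robin stands in for whatever assignment strategy Crawlab actually uses, which this summary does not describe.

```python
from itertools import cycle


def assign_tasks(tasks, workers):
    """Assign each task to a worker in round-robin order (toy model)."""
    if not workers:
        raise ValueError("at least one worker node is required")
    # Start every worker with an empty queue so idle workers still appear.
    assignment = {worker: [] for worker in workers}
    # cycle() repeats the worker list for as long as there are tasks left.
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment


if __name__ == "__main__":
    # Hypothetical names mirroring the quick start: one master, two workers.
    plan = assign_tasks(
        ["spider-a", "spider-b", "spider-c"],
        ["worker01", "worker02"],
    )
    for worker, queue in plan.items():
        print(worker, queue)
```

Adding a third name to the worker list spreads the same tasks more thinly, which is the scaling behaviour described above: more workers, less load per worker.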

Copy-paste prompts

Prompt 1
Give me a docker-compose.yml to run a Crawlab master node and two worker nodes on my local machine for testing.
Prompt 2
Write a Python Scrapy spider that collects product names and prices from an e-commerce site, packaged to upload to Crawlab.
Prompt 3
Configure Crawlab to run a web scraping task every night at midnight and store the output in MongoDB.

Verify against the repo before relying on details.