Run dozens of web scrapers in parallel across multiple servers and monitor logs and results from a single dashboard.
Schedule a nightly data-collection job that distributes work across a pool of worker machines.
Manage Scrapy, Puppeteer, or Selenium crawlers without writing your own job-queue or task-scheduling infrastructure.
Requires Docker and a YAML config file; the quick start runs a master node and two workers on one machine.
Web crawlers are programs that automatically browse websites and collect data from them. Managing many crawlers at once across multiple computers is complex, and that is the problem Crawlab addresses. It is a platform for running, scheduling, and monitoring web crawlers written in any programming language, all from a central web dashboard.
You install Crawlab using Docker, a tool that packages software for easy setup. One computer acts as the main control point (the master node), and any number of additional computers can serve as worker nodes that run the actual crawling jobs. This setup lets you spread heavy crawling workloads across many machines and scale up by adding more workers. Through the web interface, you can upload crawler code, assign tasks to specific nodes, schedule jobs to run at set times, and view results and logs for each run.
The platform works with crawlers written in Python, Node.js, Go, Java, and PHP, as well as popular crawling tools such as Scrapy, Puppeteer, and Selenium. It does not care what technology your crawler uses internally, as long as it can run on the worker nodes.
Internally, the master and worker nodes communicate over gRPC, a framework for sending structured messages between programs across a network. Crawler files are synchronized across nodes using SeaweedFS, a distributed file system. Task data, scheduling information, and logs are stored in MongoDB, a database suited for this kind of unstructured data.
The quick start requires only Docker and a short configuration file to get a working local setup with a master node and two workers running together. Documentation is available in both English and Chinese.
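The master/worker split described above can be illustrated with a small Python sketch. This is a toy model, not Crawlab's code or API: the `Worker` and `Master` classes, the node names, and the round-robin policy are all assumptions made for illustration, and everything runs in one process instead of over gRPC. It shows two ideas from the description: a task can be pinned to a specific node or handed to the next available worker, and a crawler is just a command, so its language does not matter as long as the worker can run it.

```python
import itertools
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a worker node (not a Crawlab class)."""
    name: str
    logs: list = field(default_factory=list)

    def run(self, command):
        # The crawler is an ordinary command, so any language works
        # as long as the command can execute on this node.
        result = subprocess.run(command, capture_output=True, text=True)
        self.logs.append(result.stdout.strip())
        return result.returncode


class Master:
    """Hypothetical stand-in for the master node."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}
        self._rotation = itertools.cycle(workers)  # assumed round-robin policy

    def dispatch(self, command, node=None):
        # Use the named node if one is given, otherwise the next worker in turn.
        worker = self.workers[node] if node else next(self._rotation)
        return worker.name, worker.run(command)


def demo():
    workers = [Worker("worker-1"), Worker("worker-2")]
    master = Master(workers)
    # Placeholder "crawler": a one-line Python program that prints a result.
    crawler = [sys.executable, "-c", "print('crawled 3 pages')"]
    assignments = [master.dispatch(crawler)[0] for _ in range(3)]
    pinned, _ = master.dispatch(crawler, node="worker-2")
    return assignments, pinned, workers
```

Calling `demo()` sends three unpinned tasks through the rotation and one task pinned to `worker-2`; each worker keeps the captured output of its runs in `logs`, loosely mirroring the per-run logs shown in the dashboard. Replacing the placeholder command with, say, a Node.js or Go crawler would not change the dispatch logic.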
Verify against the repo before relying on details.