explaingit

code4craft/webmagic

11,681 Java Audience · developer Complexity · 2/5 License Setup · easy

TLDR

WebMagic is a Java library for building web crawlers that handles page fetching, link following, multi-threading, and data extraction so you only write the rules for what to collect.

Mindmap

mindmap
  root((WebMagic))
    What it does
      Web crawling
      Page fetching
      Data extraction
    Extraction methods
      XPath selectors
      CSS selectors
      Regex patterns
    Setup
      Maven dependency
      Java project
      No extra infra
    Audience
      Java developers
      Data collectors
      Web scrapers

Things people build with this

USE CASE 1

Scrape product listings or article content from a website into a structured Java object using annotation-based field extraction.

USE CASE 2

Build a multi-threaded crawler that follows pagination links across many pages and stores results automatically.

USE CASE 3

Extract specific HTML elements from pages using XPath or CSS selectors without writing custom HTML-parsing code.

Tech stack

Java · Maven · XPath · CSS selectors

Getting it running

Difficulty · easy Time to first run · 30min

Add via Maven, no external infrastructure needed. A basic crawler works with just a few lines of Java.

Use freely for any purpose, including commercial use, with attribution (Apache 2.0).

In plain English

WebMagic is a Java library for building web crawlers. A web crawler is a program that automatically visits web pages and collects information from them. WebMagic handles the repetitive parts of the job so developers can focus on what data they want to extract.

The framework covers the full crawl cycle: it fetches web pages, manages which URLs to visit next, extracts specific pieces of content from each page, and saves the results. It runs multiple threads so it can process many pages in parallel without the developer having to manage that concurrency by hand.

Developers interact with WebMagic in two main ways. The first is to write a class that implements a provided interface, where you specify which links to follow and what data to pull from each page. The second is an annotation-based approach, where you define a plain Java object and mark its fields with annotations that describe how to extract each value from the page's HTML. Both styles are shown in the README with example code that crawls GitHub repository pages.

The extraction tools in WebMagic support XPath selectors (a standard way to pick specific elements from HTML), regular expressions, and CSS selectors. The library's overall architecture was inspired by Scrapy, a Python crawling framework.

WebMagic is intended to be easy to integrate into existing Java projects. It is added as a dependency through Maven, a widely used Java build tool. The project is licensed under Apache 2.0, which allows free use in both personal and commercial projects. Documentation and additional examples are available on the project's website.
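As a rough illustration of the crawl cycle described above (fetch a page, queue newly found links, extract a value, store the result), here is a minimal single-threaded sketch in plain Java. It does not use WebMagic's API. The class name `CrawlCycleSketch`, the in-memory `PAGES` map that stands in for real HTTP fetching, and the two regex patterns are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only: this is NOT WebMagic code.
class CrawlCycleSketch {
    // Hypothetical in-memory "website" so the sketch runs without network access.
    static final Map<String, String> PAGES = new HashMap<>();
    static {
        PAGES.put("/index", "<title>Home</title><a href=\"/a\">A</a><a href=\"/b\">B</a>");
        PAGES.put("/a", "<title>Page A</title><a href=\"/b\">B</a>");
        PAGES.put("/b", "<title>Page B</title><a href=\"/index\">Home</a>");
    }

    // Illustrative patterns: one finds links to follow, one extracts the data we want.
    static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Crawls from startUrl and returns url -> extracted title, in visit order. */
    static Map<String, String> crawl(String startUrl) {
        Deque<String> queue = new ArrayDeque<>();              // URLs waiting to be visited
        Set<String> seen = new HashSet<>();                    // prevents visiting a URL twice
        Map<String, String> results = new LinkedHashMap<>();   // the "saved" output

        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = PAGES.get(url);                      // stand-in for fetching the page
            if (html == null) {
                continue;                                      // unknown page: nothing to extract
            }
            Matcher title = TITLE.matcher(html);
            if (title.find()) {
                results.put(url, title.group(1));              // extract and store
            }
            Matcher link = LINK.matcher(html);
            while (link.find()) {
                String next = link.group(1);
                if (seen.add(next)) {                          // add() is false if already seen
                    queue.add(next);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        crawl("/index").forEach((url, title) -> System.out.println(url + " -> " + title));
    }
}
```

Running `main` prints one line per visited page. A crawler built on WebMagic would swap these hand-rolled pieces for the library's own fetching, URL management, and selector support, and would spread the work across several threads.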

Copy-paste prompts

Prompt 1
Show me a complete WebMagic Spider that visits GitHub repository pages and extracts the repo name, star count, and description using the annotation-based model.
Prompt 2
Write a WebMagic Spider that crawls all pages of a blog starting from the homepage, extracts article titles and publish dates, and prints them to the console.
Prompt 3
How do I configure WebMagic to use 5 download threads and add a 1-second delay between requests to avoid getting blocked?
Prompt 4
Show me how to add WebMagic to a Maven project and write the minimal Spider to scrape a single page and extract a value by XPath selector.

Verify against the repo before relying on details.