explaingit

yasserg/crawler4j

4,624 · Java · Audience: developer · Complexity: 2/5 · Setup: easy

TLDR

A Java library for building multi-threaded web crawlers that lets you define which URLs to follow and what to do with each downloaded page by overriding two simple methods.

Mindmap

mindmap
  root((crawler4j))
    What it does
      Follows links automatically
      Downloads pages
      Multi-threaded crawling
    Programming model
      Extend WebCrawler
      shouldVisit method
      visit method
    Configuration
      Max depth setting
      Page limit
      robots.txt support
      Politeness delay
    Examples included
      Basic domain crawler
      Image downloader
      Dual crawler
      PostgreSQL integration

Things people build with this

USE CASE 1

Crawl a website and extract all text content by overriding the visit method in your WebCrawler subclass

USE CASE 2

Download all images from a domain into a local folder using the built-in image downloader example

USE CASE 3

Index crawled web pages into a PostgreSQL database using the provided database integration example

USE CASE 4

Run two separate crawlers with different URL policies at the same time to scrape multiple sites in parallel
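
A "URL policy" in the use cases above comes down to a yes/no decision per URL. The following is a minimal, library-free sketch of such a decision, assuming a policy that stays under one site prefix and skips common static-file extensions. The class name `UrlPolicySketch`, the extension list, and the example URLs are hypothetical and not taken from the repository; in crawler4j this kind of logic would live inside your shouldVisit override, whose exact signature should be checked against the repo.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class UrlPolicySketch {

    // Illustrative list of extensions to skip (an assumption, not the library's default).
    private static final Pattern SKIPPED_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz)$");

    /**
     * Returns true when the URL starts with the allowed prefix and does not
     * end in one of the skipped extensions. Comparison is case-insensitive.
     */
    public static boolean shouldVisit(String url, String allowedPrefix) {
        String href = url.toLowerCase(Locale.ROOT);
        return href.startsWith(allowedPrefix.toLowerCase(Locale.ROOT))
                && !SKIPPED_EXTENSIONS.matcher(href).matches();
    }

    public static void main(String[] args) {
        String site = "https://www.example.com/";
        System.out.println(shouldVisit("https://www.example.com/docs/index.html", site)); // true
        System.out.println(shouldVisit("https://www.example.com/logo.png", site));        // false
        System.out.println(shouldVisit("https://other.example.org/page.html", site));     // false
    }
}
```

Running two crawlers with different policies then amounts to giving each one its own prefix and extension pattern. Note that this sketch only looks at the end of the URL string, so a link such as `logo.png?v=2` would not be filtered.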

Tech stack

Java · Maven · Gradle

Getting it running

Difficulty: easy · Time to first run: 30 min

Add the dependency from Maven Central to your pom.xml or build.gradle. No external services are required for basic crawling.

License not described in the explanation.

In plain English

crawler4j is a Java library for building web crawlers. A web crawler is a program that starts at one or more seed URLs, downloads those pages, finds all the links on them, then follows those links and repeats the process across an entire website (or the whole web, if you choose). crawler4j gives Java developers a simple API to do this without writing the link-following and HTTP-fetching logic from scratch.

The programming model is straightforward. You create a class that extends WebCrawler and override two methods. shouldVisit decides whether a given URL should be fetched at all; for example, you might skip image and stylesheet files, or restrict crawling to a single domain. visit runs every time a page is downloaded and receives the page's text, HTML, and the list of outgoing links. You also write a short controller class that sets the seed URLs, the folder where crawl data is stored between runs, and the number of threads to run in parallel.

Configuration options let you set a maximum crawl depth (how many links away from the seed the crawler will follow), a maximum number of pages to download, whether to follow HTTPS links, how long to wait between requests, and whether to honor a site's robots.txt file (which specifies pages that should not be crawled).

The repository includes several example programs: a basic domain crawler, an image downloader that saves files to a folder, an example of running two separate crawlers with different policies at the same time, a graceful shutdown example, and an integration example that saves crawled content into a PostgreSQL database. The library is available through Maven Central and works with both Maven and Gradle projects.
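
To make the crawl loop and the depth and page limits concrete, here is a minimal single-threaded sketch that walks an in-memory link graph instead of the real web. It is not crawler4j code and does not use its API: the class `CrawlLoopSketch`, the `crawl` method, and the toy site map are all hypothetical, and the sketch assumes the seed is always visited without being filtered. It only illustrates how a shouldVisit-style filter, a maximum depth, and a maximum page count interact.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class CrawlLoopSketch {

    /**
     * Breadth-first walk over linkGraph (page -> outgoing links).
     * Returns the pages "visited", in visit order.
     */
    public static List<String> crawl(Map<String, List<String>> linkGraph,
                                     String seed,
                                     int maxDepth,
                                     int maxPages,
                                     Predicate<String> shouldVisit) {
        List<String> visited = new ArrayList<>();
        // Link distance from the seed; also serves as the "already scheduled" set.
        Map<String, Integer> depthOf = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();

        depthOf.put(seed, 0); // assumption: the seed is visited unconditionally
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url); // stands in for "download the page and call visit"

            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // at the depth limit: keep the page, ignore its links
            }
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (shouldVisit.test(link) && !depthOf.containsKey(link)) {
                    depthOf.put(link, depth + 1);
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Toy site: "/" links to two pages and an image, and so on.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b", "/logo.png"),
                "/a", List.of("/c", "/"),
                "/b", List.of("/c"),
                "/c", List.of("/d"));

        // Depth limit 2, page limit 10, skip PNG files.
        System.out.println(crawl(site, "/", 2, 10, url -> !url.endsWith(".png")));
        // prints [/, /a, /b, /c]
    }
}
```

In this run `/d` is never reached because it sits three links from the seed, and `/logo.png` is rejected by the filter. A real crawler adds HTTP fetching, link extraction, a politeness delay, robots.txt checks, and multiple worker threads around the same basic loop.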

Copy-paste prompts

Prompt 1
Using crawler4j, write a Java WebCrawler subclass that crawls a news website, extracts the article title and body text from each page, and saves them to a CSV file.
Prompt 2
I want to use crawler4j to download all PDF files linked from a university website. Show me the shouldVisit and visit methods that filter for PDF URLs and save each file to disk.
Prompt 3
Set up a crawler4j controller that seeds from a homepage, limits crawl depth to 3 levels, respects robots.txt, and waits 500ms between requests to avoid overloading the server.
Prompt 4
Using the crawler4j PostgreSQL example as a starting point, show me how to save the crawled URL, page title, and outgoing links into a database table for later analysis.

Verify against the repo before relying on details.