Crawl a website and extract all text content by overriding the visit method in your WebCrawler subclass
Download all images from a domain into a local folder using the built-in image downloader example
Index crawled web pages into a PostgreSQL database using the provided database integration example
Run two separate crawlers with different URL policies at the same time to scrape multiple sites in parallel
Add the dependency from Maven Central to your pom.xml or build.gradle; no external services are required for basic crawling.
crawler4j is a Java library for building web crawlers. A web crawler is a program that starts at one or more seed URLs, downloads those pages, finds all the links on them, then follows those links and repeats the process across an entire website (or the whole web, if you choose). crawler4j gives Java developers a simple API to do this without writing all the link-following and HTTP fetching logic from scratch.
The programming model is straightforward. You create a class that extends WebCrawler and implement two methods: shouldVisit, which decides whether a given URL should be fetched at all (for example, you might skip image and stylesheet files, or restrict crawling to a single domain), and visit, which runs every time a page is downloaded and receives the page's text, HTML, and the list of outgoing links. You also write a short controller class that sets the seed URLs, the folder where crawl data is stored between runs, and the number of threads to run in parallel.
Configuration options let you set a maximum crawl depth (how many links away from the seed you will follow), a maximum number of pages to download, whether to follow HTTPS links, how long to wait between requests, and whether to honor the politeness rules described in a site's robots.txt file (which specifies pages that should not be crawled).
The repository includes several example programs: a basic domain crawler, an image downloader that saves files to a folder, an example of running two separate crawlers with different policies at the same time, a graceful shutdown example, and an integration example that saves crawled content into a PostgreSQL database. The library is available through Maven Central and works with both Maven and Gradle projects.
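The crawl loop described above (start at seeds, filter each discovered link, visit each fetched page, stop at a depth or page limit) can be illustrated with a small plain-Java model. This is a sketch for intuition only and is not crawler4j code: the class name CrawlSketch, the crawl method, and the in-memory map standing in for HTTP fetching are all hypothetical, and the real library adds threading, politeness delays, robots.txt handling, and persistent storage on top of the same idea.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

/** Toy breadth-first crawl over an in-memory link graph (hypothetical, no HTTP). */
final class CrawlSketch {

    /**
     * @param web         stand-in for the network: URL -> outgoing links on that page
     * @param seeds       URLs to start from (depth 0)
     * @param shouldVisit filter applied to every discovered link before it is queued
     * @param maxDepth    how many links away from a seed the crawl may go
     * @param maxPages    upper bound on the number of pages visited
     * @return the visited URLs, in the order they were visited
     */
    static List<String> crawl(Map<String, List<String>> web, List<String> seeds,
                              Predicate<String> shouldVisit, int maxDepth, int maxPages) {
        List<String> visited = new ArrayList<>();
        // Depth of every URL ever queued; also serves as the "already seen" set.
        Map<String, Integer> depthOf = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();

        for (String seed : seeds) {
            if (depthOf.putIfAbsent(seed, 0) == null) {
                frontier.add(seed);
            }
        }

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            List<String> links = web.get(url);
            if (links == null) {
                continue; // the simulated fetch failed: no such page
            }
            visited.add(url); // this is where a visit callback would run

            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // page is at the depth limit, so its links are not followed
            }
            for (String link : links) {
                // Queue a link only if the filter accepts it and it is new.
                if (shouldVisit.test(link) && depthOf.putIfAbsent(link, depth + 1) == null) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

To try it, build a map from each page URL to its list of links and call CrawlSketch.crawl with a predicate such as u -> u.startsWith("https://example.com/") && !u.endsWith(".png"), which mirrors the single-domain, no-images policy mentioned above. The filter is applied when a link is discovered rather than when it is fetched, so rejected URLs never enter the queue.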
Verify against the repo before relying on details.