explaingit

yujiosaka/headless-chrome-crawler

5,639 JavaScript Audience · developer Complexity · 2/5 Setup · easy

TLDR

A Node.js library that crawls websites, including JavaScript-heavy single-page apps, by running a real Chromium browser in the background, automatically following links and saving extracted data to CSV or JSON files.

Mindmap

mindmap
  root((repo))
    What it does
      Website crawling
      JS page rendering
      Data extraction
    Tech stack
      Node.js
      Puppeteer
      Chromium
      Redis cache
    Features
      Parallel browsers
      Crawl depth control
      robots.txt support
    Outputs
      CSV files
      JSON Lines
      Page screenshots


Things people build with this

USE CASE 1

Scrape product listings or prices from JavaScript-rendered e-commerce sites that traditional scrapers cannot read

USE CASE 2

Build a site map by automatically following all links from a starting URL and saving the results to JSON

USE CASE 3

Extract structured data from dynamic web apps using custom jQuery-based extraction logic you define

USE CASE 4

Crawl a large website in parallel with multiple browser instances while using Redis to skip already-visited URLs

Tech stack

JavaScript · Node.js · Puppeteer · Chromium · Redis

Getting it running

Difficulty · easy Time to first run · 30min

Chromium is downloaded automatically when you run npm install. Redis is optional; it is only needed to persist the visited-URL cache across multiple crawl runs.

License not specified in the explanation.

In plain English

headless-chrome-crawler is a Node.js library for automatically visiting and collecting data from websites, including ones built with modern JavaScript frameworks where the page content is generated dynamically in the browser rather than being present in the raw HTML. Traditional crawlers often fail on these sites because they only read the initial HTML without executing any scripts. This library solves that by running a real Chromium browser in the background (without a visible window) to load pages the same way a human visitor's browser would.

You give it a list of URLs to start from, and it follows links outward from there, collecting whatever data you tell it to extract from each page. You can write custom extraction logic using jQuery, which the library injects into each page automatically. Results can be saved to CSV or JSON Lines files.

The library supports running multiple browser instances in parallel for speed, configuring how deep or wide the crawl goes, adding delays and retries between requests, and using Redis as a cache so it skips URLs it has already visited. It also respects robots.txt files (which websites use to indicate which pages crawlers should avoid) and can follow sitemap.xml files to discover pages. Screenshots of visited pages can be saved as evidence.

Installation is done through npm or yarn. When installed, it automatically downloads a compatible version of Chromium, so no separate browser setup is needed. The project is built on top of Puppeteer, which is a lower-level library for controlling Chromium. This library adds the crawling layer on top: link following, deduplication, queuing, and output formatting.
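To make the "crawling layer" idea concrete, here is a minimal sketch of link following, deduplication, queuing, and JSON Lines output in plain JavaScript. It is not the library's code and does not call its API. The SITE object, fetchLinks, crawl, and toJsonLines are hypothetical names invented for illustration, the "website" is an in-memory map rather than pages rendered in Chromium, and the depth counting (start URLs at depth 1) is an assumption of this sketch. It runs synchronously to stay short, whereas real page loads are asynchronous.

```javascript
// Hypothetical in-memory "website": each URL maps to the links found on that page.
// A real crawler would render the page in a browser and read links from the DOM.
const SITE = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/b', 'https://example.com/c'],
  'https://example.com/b': ['https://example.com/'],
  'https://example.com/c': ['https://example.com/d'],
  'https://example.com/d': [],
};

// Stand-in for loading a page; returns that page's outgoing links.
function fetchLinks(url) {
  return SITE[url] || [];
}

// Breadth-first crawl with a visited set (deduplication) and a depth limit.
// Assumption of this sketch: start URLs have depth 1, their links depth 2, etc.
function crawl(startUrls, { maxDepth = 1, getLinks = fetchLinks } = {}) {
  const visited = new Set();
  const queue = startUrls.map((url) => ({ url, depth: 1 }));
  const results = [];

  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue; // skip URLs that were already crawled
    visited.add(url);

    const links = getLinks(url);
    results.push({ url, depth, links });

    // Only follow links while the depth limit allows it.
    if (depth < maxDepth) {
      for (const link of links) {
        if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return results;
}

// JSON Lines output: one JSON object per line.
function toJsonLines(records) {
  return records.map((record) => JSON.stringify(record)).join('\n');
}

const pages = crawl(['https://example.com/'], { maxDepth: 2 });
console.log(toJsonLines(pages));
```

With maxDepth set to 2, the sketch visits the start page and the two pages it links to, and each is written as one line of JSON. Raising maxDepth to 3 also reaches https://example.com/c, while the second queued copy of https://example.com/b is skipped by the visited set.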

Copy-paste prompts

Prompt 1
How do I use headless-chrome-crawler to scrape all product names and prices from a single-page app built with React? Show me a minimal working example with custom jQuery extraction.
Prompt 2
I want to crawl a website but limit the crawl to 3 levels deep and only follow links on the same domain. How do I configure maxDepth and the URL filter in headless-chrome-crawler?
Prompt 3
How do I set up Redis as a cache with headless-chrome-crawler so that if I restart the crawl it skips URLs it has already visited?
Prompt 4
I need to take a screenshot of each page visited during a crawl. How do I enable screenshot saving in headless-chrome-crawler and where are the files saved?
Prompt 5
How do I configure headless-chrome-crawler to respect robots.txt and use a sitemap.xml file to discover pages instead of only following links?

Verify against the repo before relying on details.