Scrape product listings or prices from JavaScript-rendered e-commerce sites that traditional scrapers cannot read
Build a site map by automatically following all links from a starting URL and saving the results to a JSON Lines file
Extract structured data from dynamic web apps using custom jQuery-based extraction logic you define
Crawl a large website in parallel with multiple browser instances while using Redis to skip already-visited URLs
Chromium is downloaded automatically when you run npm install. Redis is optional, but it is needed to persist the visited-URL cache across multiple crawl runs.
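As a rough illustration of the use cases above, the sketch below launches the crawler, extracts each page's title with jQuery, and follows links one level out from the start page. It follows the usage pattern in the project's README as best recalled (HCCrawler.launch, evaluatePage, onSuccess, queue, onIdle, close). The start URL, the option values, and the names crawlOptions and main are placeholders chosen for this example, so check the repo for the current API before relying on it.

```javascript
// Options are kept in a plain object so they can be read without launching a browser.
// All values here are illustrative assumptions, not recommended settings.
const crawlOptions = {
  // Upper bound on pages loaded at the same time.
  maxConcurrency: 2,
  // Runs inside the page, where the injected jQuery is available as $.
  evaluatePage: () => ({ title: $('title').text() }),
  // Runs in Node.js; the value returned by evaluatePage arrives as result.result.
  onSuccess: (result) => {
    console.log(result.response.url, result.result.title);
  },
};

async function main() {
  // Loaded here rather than at the top so a missing package is reported by the catch below.
  const HCCrawler = require('headless-chrome-crawler');
  const crawler = await HCCrawler.launch(crawlOptions);
  // maxDepth: 2 is meant to cover the start page plus the pages it links to.
  await crawler.queue({ url: 'https://example.com/', maxDepth: 2 });
  await crawler.onIdle(); // Resolves once the queue is empty.
  await crawler.close();
}

main().catch((err) => console.error('Crawl failed:', err.message));
```

If the package is not installed, the script only prints the failure message, which makes it easy to try before setting anything else up.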
headless-chrome-crawler is a Node.js library for automatically visiting websites and collecting data from them, including sites built with modern JavaScript frameworks, where the content is generated in the browser rather than present in the raw HTML. Traditional crawlers often fail on such sites because they read only the initial HTML and never execute scripts. This library runs a real Chromium browser in the background, without a visible window, and loads each page the way a visitor's browser would.

You give it a list of starting URLs, and it follows links outward from there, collecting whatever data you tell it to extract from each page. Extraction logic can be written with jQuery, which the library injects into each page automatically. Results can be saved to CSV or JSON Lines files.

The library can run multiple browser instances in parallel for speed, limit how deep or wide the crawl goes, add delays between requests, retry failed requests, and use Redis as a cache so that it skips URLs it has already visited. It also respects robots.txt files (which websites use to tell crawlers which pages to avoid) and can follow sitemap.xml files to discover pages. Screenshots of visited pages can be saved as evidence of the crawl.

Installation is through npm or yarn. Installing the package automatically downloads a compatible version of Chromium, so no separate browser setup is needed. The project is built on Puppeteer, a lower-level library for controlling Chromium, and adds the crawling layer on top: link following, deduplication, queuing, and output formatting.
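To make the idea of a crawling layer more concrete, here is a minimal sketch of link following, deduplication, queuing, and a depth limit. It is not the library's implementation: the site object is a hypothetical in-memory stand-in for real pages, crawl and getLinks are names invented for this example, and counting the start page as depth 1 is an assumption of the sketch.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
// A real crawl would load each page in a browser instead of reading this object.
const site = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/b', 'https://example.com/c'],
  'https://example.com/b': ['https://example.com/'],
  'https://example.com/c': [],
};

// Breadth-first crawl with a visited set (deduplication) and a depth limit.
function crawl(startUrl, maxDepth, getLinks) {
  const visited = new Set([startUrl]);
  const queue = [{ url: startUrl, depth: 1 }];
  const results = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    const links = getLinks(url);
    results.push({ url, depth, linkCount: links.length });
    // Pages at the depth limit are recorded, but their links are not followed.
    if (depth >= maxDepth) continue;
    for (const link of links) {
      if (visited.has(link)) continue; // Skip URLs that were already queued.
      visited.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return results;
}

const pages = crawl('https://example.com/', 2, (url) => site[url] || []);
// One JSON object per line, in the spirit of a JSON Lines file.
const jsonLines = pages.map((page) => JSON.stringify(page)).join('\n');
console.log(jsonLines);
```

With a depth limit of 2, the sketch records the start page and the two pages it links to. Raising the limit to 3 also reaches https://example.com/c, while the link back to the start page is skipped because that URL is already in the visited set.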
Verify against the repo before relying on details.