explaingit

bda-research/node-crawler

6,787 · TypeScript · Audience · developer · Complexity · 2/5 · Setup · easy

TLDR

Node-crawler is a Node.js web scraping library that fetches pages and lets you pull data from HTML using familiar jQuery-style selectors, with built-in rate limiting, priority queues, and duplicate detection.

Mindmap

mindmap
  root((node-crawler))
    What it does
      Page fetching
      HTML parsing
      Queue management
    Tech stack
      TypeScript
      Node.js
      Cheerio
      got HTTP client
    Key features
      Rate limiting
      Priority queue
      Duplicate detection
      preRequest hooks
    Use cases
      Web scraping
      Data extraction
      Binary file download

Things people build with this

USE CASE 1

Scrape product prices or article titles from a website and save the results as structured data

USE CASE 2

Crawl a multi-page site while controlling request rate and concurrency to avoid getting blocked

USE CASE 3

Download binary files such as images or PDFs by disabling HTML parsing and working with raw response bodies

USE CASE 4

Pass state between chained requests using custom parameters attached to each queued URL
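The queue behaviour behind use case 2, together with the priority and duplicate-detection features named in the TLDR, can be illustrated with a small conceptual sketch. This is not node-crawler's implementation and does not call the library. The names planCrawl, Task, and Scheduled are hypothetical, and "lower number runs first" is a convention chosen for this sketch only.

```typescript
// Conceptual sketch only: plans a crawl in memory, performs no HTTP requests.
type Task = { url: string; priority: number };
type Scheduled = Task & { startAtMs: number };

// Drop repeated URLs, order by priority (lower number first in this sketch),
// then space the start times by a fixed minimum gap.
function planCrawl(tasks: Task[], minGapMs: number): Scheduled[] {
  const seen = new Set<string>();
  const unique = tasks.filter((t) => {
    if (seen.has(t.url)) return false; // duplicate detection
    seen.add(t.url);
    return true;
  });
  // Array.prototype.sort is stable, so equal priorities keep queue order.
  const ordered = [...unique].sort((a, b) => a.priority - b.priority);
  // Rate limiting: each task starts minGapMs after the previous one.
  return ordered.map((t, i) => ({ ...t, startAtMs: i * minGapMs }));
}

const plan = planCrawl(
  [
    { url: "https://example.com/page/2", priority: 5 },
    { url: "https://example.com/", priority: 1 },
    { url: "https://example.com/page/2", priority: 5 }, // duplicate, dropped
    { url: "https://example.com/page/3", priority: 5 },
  ],
  2000,
);
console.log(plan.map((p) => `${p.startAtMs}ms ${p.url}`));
```

The library handles this scheduling internally. The sketch only shows why a rate limit plus duplicate detection reduces both the number and the pace of requests sent to one site.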

Tech stack

TypeScript · Node.js · Cheerio · got

Getting it running

Difficulty · easy · Time to first run · 30 min

Use Node.js 18 on Linux; higher versions have known stability issues in the project's own test suite.

In plain English

Node-crawler is a web scraping library for Node.js. Its job is to fetch web pages and let you extract data from their HTML content in JavaScript code. The library pairs HTTP requests with Cheerio, a server-side tool that understands HTML the same way jQuery does in a browser, so you can select and read page elements using familiar CSS-style selectors.

The library manages a queue of URLs to visit and provides several controls for how aggressively it fetches them. You can set a maximum number of simultaneous connections, apply a rate limit to enforce a minimum gap between requests to the same site, and assign priority levels so some requests are handled before others. Duplicate URL detection can be turned on to avoid fetching the same page twice.

Callbacks receive each fetched page along with a Cheerio-loaded document, so extracting text, links, or attributes from the page is straightforward. For pages where you do not need HTML parsing (such as when downloading images or PDF files), you can disable Cheerio and work with the raw response body directly. Custom parameters can be passed in with each queued URL and accessed later in the callback, which is useful when you need to carry context from one request to the next.

A preRequest hook lets you run code before each request, either synchronously or asynchronously, which can be used for adding authentication headers or logging. Direct one-off requests are also supported via a send method that returns a Promise.

Version 2 is a TypeScript rewrite of the original library. It uses the got HTTP library internally and is distributed as an ES module, meaning it no longer works with the older CommonJS require style of imports. The README notes that Node.js 18 is the recommended version on Linux due to stability issues observed with higher versions in the project's test suite.
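The callback and queue options described above might be wired together roughly as follows. This is a minimal sketch under stated assumptions, not code taken from the repository. The local types are stand-ins for what the library passes to a callback. The option names (maxConnections, rateLimit, skipDuplicates, userParams), the add method, and the "crawler" package name are recalled from the project's README and should be verified there. So that the block runs on its own, the library calls appear only in comments.

```typescript
// Local stand-in types (assumptions for illustration, not the package's typings).
type Selection = { text(): string };
type PageResponse = {
  $: (selector: string) => Selection; // Cheerio-style selector function
  options: { url?: string; userParams?: Record<string, unknown> };
};
type Done = () => void;

// Collected results; a real script might write these to a file instead.
const titles: { url: string; title: string; section: unknown }[] = [];

// Callback run for each fetched page: read the <title> and the custom
// parameters queued with the URL, then signal completion with done().
function handlePage(error: Error | null, res: PageResponse, done: Done): void {
  if (error) {
    console.error(error.message);
  } else {
    titles.push({
      url: res.options.url ?? "",
      title: res.$("title").text().trim(),
      section: res.options.userParams?.section,
    });
  }
  done();
}

// Option names mirror the features described above (assumed spellings).
const crawlerOptions = {
  maxConnections: 3, // upper bound on simultaneous requests
  rateLimit: 2000, // minimum gap in milliseconds between requests
  skipDuplicates: true, // do not fetch the same URL twice
  callback: handlePage,
};

// With the package installed, usage would look roughly like this (assumed API):
//   import Crawler from "crawler";
//   const crawler = new Crawler(crawlerOptions);
//   crawler.add({ url: "https://example.com/", userParams: { section: "home" } });
```

How rateLimit and maxConnections interact when both are set is worth checking in the README before relying on this combination.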

Copy-paste prompts

Prompt 1
Using node-crawler, scrape the title, price, and rating of every product on this e-commerce page and output the results as a JSON array.
Prompt 2
Set up node-crawler to crawl a paginated blog, follow next-page links automatically, and collect the title and publication date of each post.
Prompt 3
Add a preRequest hook to my node-crawler setup that injects an Authorization header before each request to scrape a login-protected API.
Prompt 4
Configure node-crawler with a rate limit of one request every 2 seconds and a max of 3 simultaneous connections to stay under a site's bot detection threshold.
Prompt 5
Use node-crawler to download all PDF files linked on a page, disabling Cheerio and saving each raw response body to disk.

Verify against the repo before relying on details.