Scrape product prices or article titles from a website and save the results as structured data
Crawl a multi-page site while controlling request rate and concurrency to avoid getting blocked
Download binary files such as images or PDFs by disabling HTML parsing and working with raw response bodies
Pass state between chained requests using custom parameters attached to each queued URL
Use Node.js 18 on Linux; higher versions have known stability issues in the project's own test suite.
Node-crawler is a web scraping library for Node.js. Its job is to fetch web pages and let you extract data from their HTML content in JavaScript code. The library pairs HTTP requests with Cheerio, a server-side tool that understands HTML the same way jQuery does in a browser, so you can select and read page elements using familiar CSS-style selectors.

The library manages a queue of URLs to visit and provides several controls for how aggressively it fetches them. You can set a maximum number of simultaneous connections, apply a rate limit to enforce a minimum gap between requests to the same site, and assign priority levels so some requests are handled before others. Duplicate URL detection can be turned on to avoid fetching the same page twice.

Callbacks receive each fetched page along with a Cheerio-loaded document, so extracting text, links, or attributes from the page is straightforward. For pages where you do not need HTML parsing (such as when downloading images or PDF files), you can disable Cheerio and work with the raw response body directly. Custom parameters can be passed in with each queued URL and accessed later in the callback, which is useful when you need to carry context from one request to the next.

A preRequest hook lets you run code before each request, either synchronously or asynchronously, which can be used for adding authentication headers or logging. Direct one-off requests are also supported via a send method that returns a Promise.

Version 2 is a TypeScript rewrite of the original library. It uses the got HTTP library internally and is distributed as an ES module, meaning it no longer works with the older CommonJS require style of imports. The README notes that Node.js 18 is the recommended version on Linux due to stability issues observed with higher versions in the project's test suite.
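The queueing behaviour described above (a cap on simultaneous connections, priority ordering, duplicate skipping, and custom parameters carried into the callback) can be illustrated with a small toy queue. This is a sketch of the concepts only, not node-crawler's implementation or API: `TinyCrawlQueue`, `fetchPage`, `peakActive`, and `drain` are hypothetical names, the option names are chosen for illustration and may not match the library's, and rate limiting is left out for brevity. The fetcher is injected so the example runs without network access.

```javascript
// Toy crawl queue: an illustration of the ideas, NOT node-crawler's code or API.
class TinyCrawlQueue {
  constructor({ maxConnections = 2, skipDuplicates = true, fetchPage }) {
    this.maxConnections = maxConnections; // cap on requests in flight
    this.skipDuplicates = skipDuplicates; // refuse URLs that were already queued
    this.fetchPage = fetchPage;           // hypothetical fetcher: (url) => Promise<body>
    this.pending = [];                    // tasks waiting for a free slot
    this.seen = new Set();                // every URL accepted so far
    this.active = 0;                      // requests currently in flight
    this.peakActive = 0;                  // highest concurrency observed
    this.idleResolvers = [];              // resolvers for pending drain() promises
  }

  // Queue a URL. Lower priority numbers run first (an assumption of this sketch).
  // Returns false when the URL is skipped as a duplicate.
  add({ url, priority = 5, userParams = {}, callback }) {
    if (this.skipDuplicates && this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push({ url, priority, userParams, callback });
    this.pending.sort((a, b) => a.priority - b.priority); // stable sort keeps insertion order on ties
    this._next();
    return true;
  }

  // Start waiting tasks while a connection slot is free.
  _next() {
    while (this.active < this.maxConnections && this.pending.length > 0) {
      const task = this.pending.shift();
      this.active += 1;
      this.peakActive = Math.max(this.peakActive, this.active);
      this._run(task);
    }
  }

  // Fetch one page, hand the result to the callback, then release the slot.
  // The callback is assumed not to throw.
  async _run(task) {
    const res = { url: task.url, userParams: task.userParams };
    let error = null;
    try {
      res.body = await this.fetchPage(task.url);
    } catch (err) {
      error = err;
    }
    try {
      task.callback(error, res);
    } finally {
      this.active -= 1;
      this._next();
      if (this.active === 0 && this.pending.length === 0) {
        this.idleResolvers.splice(0).forEach((resolve) => resolve());
      }
    }
  }

  // Resolves once nothing is queued or in flight.
  drain() {
    if (this.active === 0 && this.pending.length === 0) return Promise.resolve();
    return new Promise((resolve) => this.idleResolvers.push(resolve));
  }
}

// Usage with a fake fetcher, so the sketch runs offline.
const fakePages = new Map([
  ["https://example.com/", "<title>Home</title>"],
  ["https://example.com/about", "<title>About</title>"],
]);
const demoQueue = new TinyCrawlQueue({
  maxConnections: 1,
  fetchPage: async (url) => fakePages.get(url),
});
const collected = [];
const collect = (error, res) => {
  if (!error) collected.push(`${res.userParams.label}: ${res.body}`);
};
demoQueue.add({ url: "https://example.com/", userParams: { label: "home" }, callback: collect });
demoQueue.add({ url: "https://example.com/about", userParams: { label: "about" }, callback: collect });
demoQueue.add({ url: "https://example.com/", userParams: { label: "dup" }, callback: collect }); // skipped as a duplicate
demoQueue.drain().then(() => console.log(collected));
```

Injecting the fetcher keeps the scheduling logic separate from HTTP and HTML parsing, which is also what makes it easy to test. In a real crawl the library handles these concerns for you, so check its README for the actual option and method names.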
Verify against the repo before relying on details.