Scrape a list of articles from a blog, following 'next page' links automatically until you have all entries as a JSON file.
Extract product names, prices, and links from an e-commerce site by describing the data shape as a JavaScript object with CSS selectors.
Crawl multiple pages of search results with rate limiting so you don't overload the server, streaming results to a file as they arrive.
Scrape a JavaScript-rendered page by swapping in a PhantomJS driver so the page fully loads before extraction begins.
x-ray is a JavaScript library for Node.js that makes it straightforward to extract data from web pages. You give it a URL and a set of selectors (descriptions of which parts of a page to read), and it returns the matching text or attributes as structured data. The selector syntax builds on standard CSS class and tag patterns, with an extension for reading HTML attributes like link href values.

The library is designed to handle the common challenges of web scraping. You can paginate through multiple pages automatically by pointing it at the "next page" link, and you can set limits on how many pages to visit. To avoid overloading websites, it supports adding delays between requests, throttling to a certain number of requests per second, setting timeouts, and controlling how many pages are fetched concurrently.

You describe the data shape you want as a JavaScript object, and x-ray fills it in from the page. Lists of items, nested objects, and attributes are all supported. When working across multiple pages or sites, you can compose x-ray instances so one request can follow a link and scrape a different page as part of the same operation. Filters can be applied to values after they are scraped, for tasks like trimming whitespace or normalizing strings.

Results can be streamed directly to a JSON file, piped into an HTTP response, or returned via a Promise. The library supports pluggable drivers, so you can swap in a PhantomJS driver to handle pages that require JavaScript execution before content appears.

The project is written in JavaScript and installable via npm. The README includes code examples showing single-page scraping, multi-page crawling, and nested data extraction.
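The workflow described above might look roughly like the following sketch. It is an illustration, not code taken from the project: the selectors (`.post`, `h2`, `.next a@href`), the delay, concurrency, timeout, and page-limit values, and the function name `crawlArticles` are all hypothetical. The x-ray factory is passed in as a parameter so the snippet makes no assumption about how the package is loaded, and the method names used (`delay`, `concurrency`, `timeout`, `paginate`, `limit`, `write`) should be checked against the README of the installed version.

```javascript
// Data shape: each key maps to a selector string. "a@href" reads the href
// attribute, and "| trim" applies the filter registered under the name "trim".
const articleShape = {
  title: 'h2 | trim',
  link: 'a@href',
};

// Filters are plain functions, registered by name when the instance is created.
const filters = {
  trim: (value) => (typeof value === 'string' ? value.trim() : value),
};

// Crawl a paginated listing and write every entry to a JSON file.
// `Xray` is assumed to be the factory exported by the x-ray package.
// All selectors and numeric settings below are hypothetical placeholders.
function crawlArticles(Xray, startUrl, outFile) {
  const x = Xray({ filters });
  x.delay(500);       // pause 500 ms between requests
  x.concurrency(2);   // at most two requests in flight
  x.timeout(10000);   // abandon a request after 10 s

  return x(startUrl, '.post', [articleShape]) // one object per ".post" element
    .paginate('.next a@href')                 // follow the "next page" link
    .limit(5)                                 // visit at most five pages
    .write(outFile);                          // stream results to disk as JSON
}
```

Assuming the package has been installed from npm, a call could look like `crawlArticles(require('x-ray'), 'https://example.com/blog', 'articles.json')`, where the URL and file name are placeholders.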
← matthewmueller on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.