matthewmueller/x-ray

★ 5,904 | JavaScript | Audience · developer | Complexity · 2/5 | Setup · easy

Mindmap

mindmap
  root((x-ray))
    What it does
      Extract data from pages
      CSS selector based
      Returns structured data
    Features
      Automatic pagination
      Throttle and delay
      Concurrent requests
      Attribute extraction
    Output Options
      JSON file streaming
      HTTP response pipe
      Promise-based
    Audience
      Node.js developers
      Data collectors


Things people build with this

USE CASE 1

Scrape a list of articles from a blog, following 'next page' links automatically until you have all entries as a JSON file.

USE CASE 2

Extract product names, prices, and links from an e-commerce site by describing the data shape as a JavaScript object with CSS selectors.

USE CASE 3

Crawl multiple pages of search results with rate limiting so you don't overload the server, streaming results to a file as they arrive.

USE CASE 4

Scrape a JavaScript-rendered page by swapping in a PhantomJS driver so the page fully loads before extraction begins.
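The first two use cases can be sketched roughly as follows. This is a minimal illustration, not code taken from the repo: the selectors (`article`, `h2 a`, `a.next@href`) and the names `articleShape` and `scrapeArticles` are hypothetical and would need to match the real markup of the target site. The `x-ray` package is required inside the function so the data shape can be inspected without the package installed.

```javascript
// Data shape for one article: keys are output fields, values are CSS selectors.
// "selector@attr" reads an attribute instead of the element text.
// All selectors here are hypothetical placeholders.
const articleShape = {
  title: 'h2 a',
  link: 'h2 a@href'
};

// Scrape every element matching 'article' on the listing page, follow the
// "next page" link, stop after 5 pages, and write the results to a JSON file.
function scrapeArticles(listingUrl, outputPath) {
  const Xray = require('x-ray'); // assumes `npm install x-ray` has been run
  const x = Xray();

  return x(listingUrl, 'article', [articleShape])
    .paginate('a.next@href') // hypothetical selector for the next-page link
    .limit(5)                // visit at most 5 pages
    .write(outputPath);      // stream the results to a file as JSON
}
```

A call such as `scrapeArticles('https://example.com/blog', 'results.json')` would start the crawl; the URL is a placeholder.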

Tech stack

JavaScript · Node.js · npm · PhantomJS

Getting it running

Difficulty · easy | Time to first run · 5 min

No license information was mentioned in the explanation.

In plain English

x-ray is a JavaScript library for Node.js that makes it straightforward to extract data from web pages. You give it a URL and a set of selectors (descriptions of which parts of a page to read), and it returns the matching text or attributes as structured data. The selector syntax builds on standard CSS class and tag patterns, with an extension for reading HTML attributes like link href values.

The library is designed to handle the common challenges of web scraping. You can paginate through multiple pages automatically by pointing it at the "next page" link, and you can set limits on how many pages to visit. To avoid overloading websites, it supports adding delays between requests, throttling to a certain number of requests per second, setting timeouts, and controlling how many pages are fetched concurrently.

You describe the data shape you want as a JavaScript object, and x-ray fills it in from the page. Lists of items, nested objects, and attributes are all supported. When working across multiple pages or sites, you can compose x-ray instances so one request can follow a link and scrape a different page as part of the same operation. Filters can be applied to values after they are scraped, for tasks like trimming whitespace or normalizing strings.

Results can be streamed directly to a JSON file, piped into an HTTP response, or returned via a Promise. The library supports pluggable drivers, so you can swap in a PhantomJS driver to handle pages that require JavaScript execution before content appears.

The project is written in JavaScript and installable via npm. The README includes code examples showing single-page scraping, multi-page crawling, and nested data extraction.
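As a rough sketch of how rate limiting, filters, and Promise-based results might fit together: the filter names (`trim`, `lower`), the helper names (`createScraper`, `scrapeProducts`), the selectors, and the specific numbers below are all assumptions chosen for illustration, and the exact behavior of each option should be checked against the x-ray README.

```javascript
// Filters are plain functions applied to a value after it is scraped.
// These two are illustrative; non-string values pass through unchanged.
const filters = {
  trim: (value) => (typeof value === 'string' ? value.trim() : value),
  lower: (value) => (typeof value === 'string' ? value.toLowerCase() : value)
};

// Build an x-ray instance with conservative request settings.
function createScraper() {
  const Xray = require('x-ray'); // assumes `npm install x-ray` has been run
  const x = Xray({ filters });

  x.throttle(2, 1000); // at most 2 requests per 1000 ms
  x.delay(500);        // wait 500 ms before the next request
  x.concurrency(1);    // fetch one page at a time
  x.timeout(10000);    // give up on a request after 10 seconds
  return x;
}

// Scrape a list of products and resolve with an array of plain objects.
// Selectors are hypothetical; "| trim" applies the filter defined above.
function scrapeProducts(url) {
  const x = createScraper();
  return x(url, '.product', [{
    name: '.name | trim',
    price: '.price | trim',
    link: 'a@href'
  }]).then((items) => items);
}
```

Keeping the filters as ordinary functions means they can be unit-tested on their own, without making any network requests.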

Copy-paste prompts

Prompt 1

Using x-ray, write a Node.js script that scrapes all article titles and URLs from a blog's listing page, then follows the next-page link to collect titles from the next 5 pages.

Prompt 2

I want to use x-ray to extract product names, prices, and image URLs from an e-commerce category page. Show me how to define the data shape as a JavaScript object with CSS selectors.

Prompt 3

How do I use x-ray with throttling so my scraper only makes 2 requests per second and waits 500ms between pages to avoid getting blocked?

Prompt 4

Set up x-ray to scrape a site that requires JavaScript to render content by using the PhantomJS driver. Show the driver setup and a basic scrape example.

Verify against the repo before relying on details.