Scrape product prices, headlines, or article text from websites using CSS selectors or XPath queries.
Fetch data from JavaScript-rendered pages by triggering headless Chromium to execute the page's scripts before parsing.
Scrape multiple URLs simultaneously using async requests instead of waiting for each page one at a time.
Maintain cookies and session state across multiple requests to scrape pages that require login.
JavaScript rendering downloads headless Chromium on first use, which requires additional disk space and time.
Requests-HTML is a Python library for fetching web pages and pulling specific data out of them. It extends the popular requests HTTP library with the ability to parse the HTML that comes back from a web request, which is useful for scraping information from websites.

The library handles several things that make web scraping tricky. It automatically follows redirects, maintains cookies between requests, pools connections for efficiency, and sends a browser-like user-agent header so servers treat the requests as though they came from a real web browser. You get these behaviors without any extra configuration.

For extracting data from a page, the library supports two query styles. The first is CSS selectors, which work similarly to jQuery and let you find elements by tag name, class, ID, or combinations. The second is XPath, an older path-based query language that is more verbose but also more precise. Once you find an element, you can read its text, access its attributes, or pull sub-elements from it.

One notable feature is JavaScript rendering. Many modern websites load their content dynamically via JavaScript after the initial HTML arrives. Requests-HTML can run JavaScript by launching a headless Chromium browser in the background, waiting for it to finish executing, and then parsing the resulting page. This is an optional step you call explicitly when needed.

The library also supports async requests, meaning you can fetch several pages at the same time rather than waiting for each one to finish before starting the next. This speeds things up considerably when you need to scrape many URLs.

Requests-HTML is part of the Python Software Foundation's GitHub organization and was created by the author of the requests library. It is available via pip and targets Python 3.
Verify against the repo before relying on details.