explaingit

apify/crawlee-python

9,042 | Python | Audience · developer | Complexity · 2/5 | Setup · easy

TLDR

A Python library for building web scrapers that visit websites and collect structured data, with built-in support for JavaScript-heavy pages via a real browser, proxy rotation, and bot-detection evasion.

Mindmap

mindmap
  root((crawlee-python))
    What It Does
      Web scraping
      Data collection
    Crawl Modes
      HTTP plus BeautifulSoup
      Playwright browser
      Parsel or raw HTTP
    Key Features
      Bot evasion
      Proxy rotation
      Local data storage
    Use Cases
      AI training data
      Research datasets
      Product data

Things people build with this

USE CASE 1

Scrape product names and prices from an e-commerce site that loads its content with JavaScript using a headless browser.

USE CASE 2

Collect article titles and dates from a news archive using the fast HTTP crawler with BeautifulSoup parsing.

USE CASE 3

Build a dataset of web pages at scale with proxy rotation to avoid being blocked, for use as AI training data.

USE CASE 4

Generate a ready-to-run crawling project from a CLI template, then customize it to collect specific data fields.
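
As a rough illustration of the extraction step behind use case 2, here is a minimal sketch that pulls article titles and dates out of static HTML. It deliberately uses only Python's standard-library `html.parser` rather than Crawlee or BeautifulSoup, so it is not the library's API. The sample markup, the `ArticleParser` class, and the `extract_articles` helper are all hypothetical names made up for this example, and the markup layout (`<article>`, `<h2>`, `<time datetime=...>`) is an assumption about how such an archive page might look.

```python
from html.parser import HTMLParser
from typing import Optional

# Hypothetical sample markup standing in for a fetched archive page.
SAMPLE_HTML = """
<html><body>
  <article><h2>First story</h2><time datetime="2024-01-05">Jan 5</time></article>
  <article><h2>Second story</h2><time datetime="2024-02-10">Feb 10</time></article>
</body></html>
"""


class ArticleParser(HTMLParser):
    """Collect one {'title', 'date'} record per <article> element."""

    def __init__(self) -> None:
        super().__init__()
        self.articles: list = []
        self._current: Optional[dict] = None  # record being built, if any
        self._in_title = False                # True while inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._current = {"title": "", "date": ""}
        elif self._current is not None and tag == "h2":
            self._in_title = True
        elif self._current is not None and tag == "time":
            # Read the machine-readable date from the datetime attribute.
            self._current["date"] = dict(attrs).get("datetime") or ""

    def handle_data(self, data):
        if self._in_title and self._current is not None:
            self._current["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False
        elif tag == "article" and self._current is not None:
            self.articles.append(self._current)
            self._current = None


def extract_articles(html: str) -> list:
    """Parse an HTML string and return the article records found in it."""
    parser = ArticleParser()
    parser.feed(html)
    return parser.articles


if __name__ == "__main__":
    for record in extract_articles(SAMPLE_HTML):
        print(record)
```

In a real crawler the library would handle fetching, queueing, and retries around a step like this; the sketch only covers turning already-downloaded HTML into structured records.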

Tech stack

Python, BeautifulSoup, Playwright, Parsel

Getting it running

Difficulty · easy | Time to first run · 30 min

Browser-based crawling requires the optional Playwright extra, plus a separate step to download Playwright's browser binaries.

License not specified in the explanation.

In plain English

Crawlee for Python is a library that lets you build programs to automatically visit websites, collect information from them, and save that information in a structured format. If you have ever wanted to pull data from a website without doing it by hand, this is the kind of tool that handles that work for you.

The library gives you two main ways to crawl. The first uses a simple HTTP approach paired with a parser called BeautifulSoup, which is fast and works well for pages where the content is already present in the HTML source. The second uses a real browser running in the background, controlled through a tool called Playwright, which is better for pages that build their content using JavaScript after the page loads. You can also use Parsel or raw HTTP if your project has different needs.

A key feature is that Crawlee tries to make your crawlers look like regular human visitors rather than automated bots, which helps them work reliably against sites that normally block automated requests. It also handles proxy rotation, meaning it can send requests through different network addresses to further reduce the chance of being blocked.

Setting up is straightforward. You install the package from PyPI, choose which extras you need (for example, adding Playwright support), and write a short script that tells Crawlee which URLs to start from and what data to collect. There is also a command-line tool that generates a starter project for you from a template, which can speed things up if you are new to the library. The data you collect gets saved automatically to a local storage folder in a format you can open and process further.

Common uses include gathering training data for AI models, building datasets for language model applications, and pulling product or research information from the web at scale. There is also a TypeScript version of the same library available separately for projects not using Python.
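
To make the proxy rotation idea above more concrete, here is a minimal sketch of the simplest possible strategy, round-robin assignment, written with the standard library only. It is an illustration of the concept and not Crawlee's implementation; how the library actually picks a proxy for each request is not described here and may differ. The proxy URLs, target URLs, and the `assign_proxies` helper are hypothetical.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with real ones in practice.
PROXIES = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
    "http://proxy-3.example.com:8000",
]


def assign_proxies(urls: list, proxies: list) -> list:
    """Pair each URL with the next proxy in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    rotation = cycle(proxies)  # endlessly repeats the proxy list in order
    return [(url, next(rotation)) for url in urls]


if __name__ == "__main__":
    # Hypothetical target pages.
    pages = [f"https://shop.example.com/page/{n}" for n in range(1, 5)]
    for url, proxy in assign_proxies(pages, PROXIES):
        print(f"{url} via {proxy}")
```

With three proxies and four pages, the fourth request wraps around to the first proxy again, so consecutive requests never leave from the same address unless the pool has only one entry.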

Copy-paste prompts

Prompt 1
Using crawlee-python with Playwright, write a scraper that visits a product listing page that loads via JavaScript and saves product names, prices, and URLs to a JSON file.
Prompt 2
Using crawlee-python with BeautifulSoup, write a crawler that collects all article titles and publication dates from a news website's archive pages and saves them as CSV.
Prompt 3
Set up crawlee-python with proxy rotation enabled to scrape 1000 product descriptions from an e-commerce site without triggering bot detection.
Prompt 4
Use the crawlee-python CLI to generate a starter project template, then modify it to crawl a documentation site and save each page's text content to a local folder.

Verify against the repo before relying on details.