rmax/scrapy-redis

★ 5,634 | Python | Audience · developer | Complexity · 3/5 | License · MIT | Setup · moderate

Mindmap

mindmap
  root((scrapy-redis))
    What it does
      Distributed web crawling
      Shared URL queue
      Deduplication filter
      Item pipeline
    Components
      Redis scheduler
      Dupe filter
      Item pipeline
    Tech Stack
      Python
      Redis
      Scrapy
    Use Cases
      Multi-machine crawling
      Data collection pipelines
      Post-processing workflows


Things people build with this

USE CASE 1

Run multiple Scrapy crawlers in parallel across different machines, all drawing from the same shared Redis URL queue.

USE CASE 2

Track which URLs have already been crawled using Redis so your distributed scrapers never visit the same page twice.

USE CASE 3

Push scraped items into a Redis queue so separate post-processing scripts can consume and handle them asynchronously.

USE CASE 4

Pass structured JSON data with URL, metadata, and form data through the Redis queue to crawlers that need rich context per request.
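As a rough illustration of use case 4, the sketch below builds and decodes one JSON queue entry. The "url" and "meta" fields follow the description above; the "form_data" key name, the helper names build_payload and parse_payload, and the sample values are assumptions made for this example, so check the repo for the exact fields the spider reads.

```python
import json


def build_payload(url, meta=None, form_data=None):
    """Serialize one queue entry as a JSON string."""
    entry = {"url": url}
    if meta:
        entry["meta"] = meta
    if form_data:
        # "form_data" is a hypothetical key name used for illustration only.
        entry["form_data"] = form_data
    return json.dumps(entry)


def parse_payload(raw):
    """Decode a queue entry; accept bytes because Redis clients often return bytes."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    entry = json.loads(raw)
    return entry["url"], entry.get("meta", {}), entry.get("form_data")


# Round-trip one entry to show the shape of the data.
payload = build_payload("https://example.com/page", meta={"job_id": "123"})
url, meta, form = parse_payload(payload)
```

With the redis-py client, a string like payload could then be pushed onto the list the spider reads from, for example with lpush on a key such as myspider:start_urls. That key name is an assumption here and depends on how the spider is configured.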

Tech stack

Python · Redis · Scrapy

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires a running Redis 5.0+ server and an existing Scrapy project. After installing with pip, configure it through your Scrapy settings.

Released under the MIT license: use it freely for any purpose, including commercial use, as long as you keep the copyright notice.
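A minimal settings sketch, assuming Redis is reachable on localhost at the default port and the package was installed with pip install scrapy-redis. The class paths are the ones scrapy-redis documents for its scheduler, duplicate filter, and item pipeline; the Redis URL and the pipeline priority of 300 are placeholder values to adjust for your project.

```python
# settings.py (fragment) for an existing Scrapy project

# Store the request queue in Redis instead of in process memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the "already seen" record across all crawler processes.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items into Redis for separate post-processing.
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}

# Assumed local Redis instance; point this at your own server.
REDIS_URL = "redis://localhost:6379"

# Keep the queue and seen-set in Redis between runs so a crawl can resume.
SCHEDULER_PERSIST = True
```

Every machine that runs the spider with these settings and the same Redis URL draws from the same queue.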

In plain English

Scrapy-Redis adds Redis-based components to Scrapy, a Python framework for crawling websites and extracting data from them. Redis is an in-memory data store commonly used to share information quickly between running processes. By connecting the two, Scrapy-Redis lets you run several crawlers at the same time, all drawing from one shared queue of URLs to visit, which is useful when you need to collect data faster than a single process can manage.

The library provides three main pieces: a scheduler that stores the crawl queue in Redis instead of in memory, a duplicate filter that records which URLs have already been visited so they are not crawled twice, and an item pipeline that pushes scraped results into a Redis queue so separate post-processing scripts can pick them up. These components are described as plug-and-play: you enable them in your Scrapy settings without rewriting your spiders.

This particular fork also supports passing structured JSON data through the Redis queue. Each entry can include a URL, metadata, and optional form data, which the spider reads when making requests. This extends basic distributed crawling to workflows that need to pass richer context alongside each URL.

The library requires Python 3.7 or newer, Redis 5.0 or newer, and Scrapy 2.0 or newer. It is installed via pip and released under the MIT license.
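The post-processing hand-off described above can be sketched as a small consumer that pops JSON items from a Redis list. This is an illustration under stated assumptions, not the library's own code: drain_items and InMemoryQueue are hypothetical names, the key myspider:items is an assumed key name, and the in-memory stand-in exists only so the sketch runs without a Redis server.

```python
import json


def drain_items(client, key, handle, max_items=None):
    """Pop JSON items from a list until it is empty or max_items is reached."""
    processed = 0
    while max_items is None or processed < max_items:
        raw = client.lpop(key)
        if raw is None:
            break  # the list is empty
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        handle(json.loads(raw))
        processed += 1
    return processed


class InMemoryQueue:
    """Stand-in exposing only rpush/lpop, so the sketch runs without Redis."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, value):
        self._lists.setdefault(key, []).append(value)

    def lpop(self, key):
        items = self._lists.get(key)
        return items.pop(0) if items else None


# Simulate the pipeline pushing one item, then consume it.
queue = InMemoryQueue()
queue.rpush("myspider:items", json.dumps({"title": "Example"}))
seen = []
count = drain_items(queue, "myspider:items", seen.append)
```

In a real deployment the client argument would be a redis-py connection, which offers lpop and rpush with the same shape, and a long-running consumer would more likely block on new items than exit when the list is empty.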

Copy-paste prompts

Prompt 1

I have a Scrapy spider and want to scale it to run on 5 machines at once. Show me how to set up scrapy-redis so they all share the same Redis URL queue.

Prompt 2

How do I configure scrapy-redis's DupeFilter so my distributed crawlers never visit the same URL twice across different machines?

Prompt 3

Show me how to push scraped items from scrapy-redis into a Redis queue and then read them in a separate Python script for post-processing.

Prompt 4

I want to pass a URL plus extra metadata through the scrapy-redis queue. How do I structure the JSON payload and read it in my spider?


Verify against the repo before relying on details.