Build a web scraper that automatically rotates through working proxy IPs to avoid being blocked by target websites.
Run a self-hosted proxy pool that continuously crawls public proxy lists and checks which ones are still alive.
Route any HTTP-aware tool through a Squid server that automatically pulls live proxies from the pool in the background.
Monitor proxy pool health over time using built-in Prometheus and Grafana metrics.
Requires Redis, running separately or via Docker Compose. Some proxy sources may be blocked in certain network environments.
Haipproxy is a tool for building and running your own pool of working proxy IP addresses. A proxy IP is a middleman address you route internet requests through, which web scrapers commonly use to avoid being blocked by websites. This project collects proxy IPs from public sources on the internet, tests them to make sure they actually work, and keeps them organized in a database so your scraper can pull a fresh, working address whenever it needs one.

The system is built around two main components. Scrapy handles the crawling side: it fetches and filters IP addresses from various public proxy listing sites. Redis acts as the shared store that all the pieces of the system read from and write to, holding both the raw IP lists and the validation results. You run separate processes for crawling (collecting new proxy IPs), validating (checking whether those IPs are still alive), and scheduling (deciding when to re-run those checks on a timer).

Clients connect to the pool in two ways. There is a Python library you import directly into your scraper code, calling a simple function to get one working IP or a list of them. There is also a Squid integration, where Squid acts as a local proxy server that automatically pulls addresses from the pool in the background, so any tool that supports HTTP proxies can use haipproxy without code changes.

Deployment can be done on a single machine by installing Python, Redis, and the project dependencies, then starting the crawler, validator, and scheduler processes individually. There is also a Docker Compose configuration that starts all components together in containers, including Squid.

For monitoring, the project supports Sentry for tracking errors and unexpected crashes, and Prometheus combined with Grafana for watching metrics about how many proxies are available and how healthy the system is over time.
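The crawl, validate, and fetch flow described above can be illustrated with a small, self-contained sketch. This is not haipproxy's actual API: `ProxyPool`, `fake_check`, and the example addresses are hypothetical names, and a plain in-memory dictionary stands in for Redis. It only shows the general shape of storing raw candidates, keeping the ones that pass a health check, and rotating through the survivors.

```python
import itertools
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class ProxyPool:
    """In-memory stand-in for a Redis-backed pool (hypothetical, not haipproxy's API)."""

    check: Callable[[str], bool]  # returns True if the proxy is considered alive
    candidates: List[str] = field(default_factory=list)
    alive: Dict[str, float] = field(default_factory=dict)  # proxy -> last validation time

    def add_candidates(self, proxies: List[str]) -> None:
        """Crawler step: store raw, unverified addresses, ignoring duplicates."""
        for proxy in proxies:
            if proxy not in self.candidates:
                self.candidates.append(proxy)

    def validate(self) -> None:
        """Validator step: keep proxies that pass the check, drop those that fail."""
        for proxy in self.candidates:
            if self.check(proxy):
                self.alive[proxy] = time.time()
            else:
                self.alive.pop(proxy, None)

    def rotate(self) -> Iterator[str]:
        """Client step: cycle endlessly through the proxies that are alive right now."""
        if not self.alive:
            raise LookupError("no working proxies in the pool")
        return itertools.cycle(list(self.alive))


def fake_check(proxy: str) -> bool:
    # Placeholder health check. A real validator would send a request through the proxy.
    return not proxy.endswith(":0")


pool = ProxyPool(check=fake_check)
pool.add_candidates(["http://10.0.0.1:8080", "http://10.0.0.2:0", "http://10.0.0.3:3128"])
pool.validate()

rotation = pool.rotate()
picked = [next(rotation) for _ in range(4)]
print(picked)
```

In a real deployment the scheduler would call the validation step on a timer, and the rotation would be rebuilt after each pass so that newly dead proxies drop out.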
The README notes that some proxy sources may be blocked in certain network environments, and there is a configuration flag to disable crawling those particular sources if needed.
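For the Squid route, any HTTP client that accepts a proxy setting can be pointed at the local Squid server. The sketch below uses only the Python standard library. It assumes Squid is listening on `127.0.0.1:3128` (Squid's default port); the address your deployment exposes may differ, and `build_squid_opener` is a hypothetical helper name.

```python
import urllib.request

# Assumption: Squid listens locally on its default port, 3128. Adjust to your setup.
SQUID_URL = "http://127.0.0.1:3128"


def build_squid_opener(proxy_url: str = SQUID_URL) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_squid_opener()

# Uncomment once Squid is running. The target URL is only an example.
# with opener.open("http://example.com", timeout=10) as response:
#     print(response.status)
```

Because the scraper only ever sees one fixed proxy address, rotation happens on the Squid side and the client code does not need to change when the pool's contents do.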
Verify against the repo before relying on details.