explaingit

scrapinghub/portia

9,495 | Python | Audience · general | Complexity · 3/5 | Setup · moderate

TLDR

Portia is a visual, no-code web scraping tool: you click on page elements to teach it what data to extract, and it then crawls similar pages automatically. It runs locally via Docker and is built on Scrapy.

Mindmap

mindmap
  root((portia))
    What It Does
      Visual web scraping
      No coding required
      Point and click setup
    How It Works
      Annotate page elements
      Pattern detection
      Crawl similar pages
    Tech Underneath
      Python
      Scrapy crawler
      Browser-based UI
    Setup
      Docker one command
      Docker Compose option
      Port 9001
    Use Cases
      Product data extraction
      Article scraping
      Price monitoring


Things people build with this

USE CASE 1

Scrape structured data from websites by clicking on examples instead of writing any code.

USE CASE 2

Extract product listings, prices, or article data from pages that follow a consistent layout.

USE CASE 3

Run a self-hosted visual web scraper locally via Docker without setting up a Python coding environment.

USE CASE 4

Build a web crawler on top of Scrapy using a visual interface to define what fields to collect.

Tech stack

Python · Scrapy · Docker · JavaScript

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires Docker. One command pulls the image and starts the server on port 9001.
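As a rough illustration of that one command, the sketch below assembles a `docker run` invocation in Python and prints it. The image name `scrapinghub/portia` and the container path `/app/data/projects` are assumptions based on the project's usual README instructions, and the helper `portia_docker_command` is hypothetical; check the repo before running the result.

```python
import shlex
from pathlib import Path


def portia_docker_command(projects_dir: Path, port: int = 9001) -> list[str]:
    """Build the argv for a `docker run` that starts Portia.

    Assumptions (verify against the repo README):
    - the image is published as "scrapinghub/portia"
    - projects live in /app/data/projects inside the container
    - the server listens on port 9001 inside the container
    """
    return [
        "docker", "run",
        # Mount a host folder so projects survive container restarts.
        "-v", f"{projects_dir}:/app/data/projects:rw",
        # Map the chosen host port to the container's port 9001.
        "-p", f"{port}:9001",
        "scrapinghub/portia",
    ]


if __name__ == "__main__":
    cmd = portia_docker_command(Path.home() / "portia_projects")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

If the printed command matches the README, run it in a terminal and open `http://localhost:9001` in a browser.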

In plain English

Portia is a visual web scraping tool that lets you pull data from websites without writing any code. You point it at a web page, click on the pieces of information you want to collect, and Portia figures out from those annotations how to extract the same kind of data from other pages that follow a similar structure. It is built on top of Scrapy, a Python-based web crawling library, but Portia is intended for people who do not want to write Python or deal with the technical details of a crawler.

The tool runs as a local web application that you access through a browser. The quickest way to get it running is via Docker: one command pulls the official image and starts the server on port 9001. You can also use Docker Compose by cloning the repository and running a single command from the project root. The documentation, hosted on Read the Docs, covers those steps in detail and describes alternatives for setups without Docker.

The README is brief and focuses almost entirely on getting the server started. It does not describe the full set of features, explain how annotation works in depth, or mention pricing. The project was created by Scrapinghub, a company that builds web scraping products and infrastructure, and Portia appears to be the open-source self-hosted version of a visual scraping product they also offered as a hosted cloud service. The README does not indicate whether the open-source version is still actively maintained or when it last received updates.
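To make the "annotate once, extract from similar pages" idea concrete, here is a minimal sketch using only the Python standard library. It is not Portia's actual extraction algorithm, which the README does not describe. It simply assumes that each annotation can be reduced to a tag name plus a CSS class, and the names `ClassTextExtractor`, `extract_item`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element matching one (tag, class) annotation."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the element being read
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # nested element with the same tag name
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_item(html, annotations):
    """Apply every field annotation to a page and return the matches per field."""
    item = {}
    for field, (tag, css_class) in annotations.items():
        parser = ClassTextExtractor(tag, css_class)
        parser.feed(html)
        parser.close()
        item[field] = parser.results
    return item


if __name__ == "__main__":
    # Annotations "learned" from one example page (hypothetical).
    annotations = {"name": ("h2", "title"), "price": ("span", "price")}
    # A different page that follows the same layout.
    other_page = '<div><h2 class="title">Blue Mug</h2><span class="price">$8.50</span></div>'
    print(extract_item(other_page, annotations))
```

The point of the sketch is the division of labour: the annotations are recorded once from an example page, and the same rules are then applied unchanged to every page that shares the layout. Pages whose structure differs will produce empty or wrong fields, which is the main limitation of any template-based approach.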

Copy-paste prompts

Prompt 1
I want to use Portia to scrape a product listing page without writing code. Walk me through starting Portia with Docker and clicking to annotate product names and prices.
Prompt 2
Using Portia running on localhost:9001, how do I set up a spider to crawl multiple pages on the same website that all follow the same template?
Prompt 3
I've annotated a page in Portia and want to export the scraped data. What output formats does it support and how do I trigger a crawl from the interface or command line?
Prompt 4
Help me understand when to use Portia's visual annotation approach versus writing a Scrapy spider directly. What kinds of sites work well with Portia, and where does it fall short?

Verify against the repo before relying on details.