alex000kim/nsfw_data_scraper

★ 12,558 | Shell | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((repo))
    What It Does
      Image dataset collection
      Content classification
      Moderation model training
    Data Pipeline
      URL collection scripts
      Docker image downloader
      Train and test split
    Categories
      Safe-for-work images
      Explicit content types
      Drawing and hentai
    Training
      fastai CNN model
      Jupyter notebook
      91 percent accuracy


Things people build with this

USE CASE 1

Download and organize a pre-labeled image dataset split into training and test sets for content moderation research.

USE CASE 2

Train a convolutional neural network classifier on the five image categories using the included notebook; the author reports about 91% accuracy.

USE CASE 3

Build a custom content filter by adding new image sources to the URL collection step and rerunning the pipeline.
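As a rough illustration of Use Case 3, the sketch below merges newly collected URLs into a per-category URL list without introducing duplicates. The file layout (one plain-text file of URLs per category) and the function name `merge_urls` are assumptions made for this example, not the repository's actual scripts, which are written in shell.

```python
from pathlib import Path


def merge_urls(url_file: Path, new_urls: list[str]) -> int:
    """Append URLs that are not already listed in url_file.

    Returns the number of URLs actually added. The one-URL-per-line
    layout is an assumption for illustration only.
    """
    url_file.parent.mkdir(parents=True, exist_ok=True)
    text = url_file.read_text() if url_file.exists() else ""
    seen = {line.strip() for line in text.splitlines() if line.strip()}

    added = []
    for url in new_urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            added.append(url)

    if added:
        with url_file.open("a") as fh:
            # Avoid gluing the first new URL onto an unterminated last line.
            if text and not text.endswith("\n"):
                fh.write("\n")
            fh.writelines(url + "\n" for url in added)
    return len(added)
```

After merging, the download and split steps would need to be rerun so that the new images end up in the training and test folders.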

Tech stack

Shell · Docker · Python · fastai · Jupyter

Getting it running

Difficulty · moderate | Time to first run · 1h+

The download can take several hours. Docker handles tool installation, so no manual Python environment setup is needed.

In plain English

This repository contains a set of shell scripts for collecting a large image dataset to train an image classifier that distinguishes between explicit and non-explicit content. The classifier is designed around five categories: pornography, hentai-style drawings, sexually explicit but non-pornographic images, neutral everyday images, and safe-for-work drawings.

The scripts are numbered and meant to be run in order. The first script collects URLs of images from various sources, primarily Reddit, using a tool called Ripme that can scrape image galleries from supported websites. A pre-collected set of URLs is already included in the repository, so you can skip that step unless you want to change the sources. Subsequent scripts download the actual images from those URLs, optionally pull in additional safe-for-work image datasets from existing public collections, and then split everything into training and test folders organized by category.

The whole collection process runs inside Docker, which handles the required tools so you do not need to install them manually. The README warns that the download can take several hours and suggests leaving it running overnight.

Once the data is collected, a Jupyter notebook is included for training a convolutional neural network on the images using the fastai library. The author reports reaching 91% accuracy with this approach. The README also notes that the dataset is noisy, meaning some images may be miscategorized, and that certain categories (drawings versus hentai, and pornography versus sexy) are more likely to be confused with each other.

This is a data collection and training toolkit for researchers or developers building content moderation systems.
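The final step of the pipeline described above, splitting downloaded images into training and test folders by category, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the repository's own implementation (which uses shell scripts): the directory layout `raw_dir/<category>` and `out_dir/<split>/<category>`, the 20% test fraction, and the name `split_category` are all hypothetical.

```python
import random
import shutil
from pathlib import Path

# File extensions treated as images; an assumption for this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_category(raw_dir, out_dir, category, test_fraction=0.2, seed=0):
    """Copy one category's images into out_dir/train and out_dir/test.

    Returns a dict with the number of files copied into each split.
    """
    src = Path(raw_dir) / category
    images = sorted(p for p in src.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(images)
    n_test = int(len(images) * test_fraction)
    splits = {"test": images[:n_test], "train": images[n_test:]}

    for split, files in splits.items():
        dest = Path(out_dir) / split / category
        dest.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, dest / f.name)
    return {split: len(files) for split, files in splits.items()}
```

Calling `split_category` once per category yields a `train/<category>` and `test/<category>` layout, a common arrangement for folder-based image classification datasets.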

Copy-paste prompts

Prompt 1

Run the nsfw_data_scraper pipeline from alex000kim/nsfw_data_scraper using the pre-collected URLs to download and split the image dataset without re-scraping.

Prompt 2

Train the fastai content moderation model from nsfw_data_scraper on the downloaded dataset and report accuracy per category.

Prompt 3

How do I add a new safe-for-work image source to the nsfw_data_scraper URL collection script?

Prompt 4

Set up Docker for alex000kim/nsfw_data_scraper and run the full image download pipeline, then verify the folder structure before training.
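For the "verify the folder structure" step in Prompt 4, a small check along the following lines may help before starting a long training run. The folder names below are assumptions based on the five categories described above and may not match the repository's actual directory names; `count_images` and `missing_or_empty` are hypothetical helpers.

```python
from pathlib import Path

# Assumed category folder names; check them against the actual dataset.
EXPECTED_CATEGORIES = ("drawings", "hentai", "neutral", "porn", "sexy")


def count_images(data_dir, splits=("train", "test"), categories=EXPECTED_CATEGORIES):
    """Count files under data_dir/<split>/<category> for every combination."""
    counts = {}
    for split in splits:
        for category in categories:
            folder = Path(data_dir) / split / category
            if folder.is_dir():
                counts[(split, category)] = sum(1 for p in folder.iterdir() if p.is_file())
            else:
                counts[(split, category)] = 0  # a missing folder counts as empty
    return counts


def missing_or_empty(counts):
    """Return the (split, category) pairs that contain no files."""
    return sorted(key for key, n in counts.items() if n == 0)
```

An empty result from `missing_or_empty(count_images("data"))` suggests every expected folder exists and holds at least one file; it does not check that the images are valid or correctly labeled.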


Verify against the repo before relying on details.