explaingit

idealo/imagededup

5,630 | Python | Audience · developer | Complexity · 2/5 | License | Setup · easy

TLDR

imagededup is a Python library that finds duplicate and near-duplicate images in a folder using fast hashing or a neural network. Point it at a directory and get back a list of duplicates.

Mindmap

mindmap
  root((imagededup))
    What it does
      Finds duplicate images
      Near-duplicate detection
      Folder scanning
    Methods
      Perceptual hashing
      Difference hashing
      Wavelet hashing
      CNN model
    Use cases
      Product image cleanup
      Dataset deduplication
      Photo library cleanup
    Tech
      Python
      pip install
      Apache 2.0 license

Things people build with this

USE CASE 1

Scan a product image folder and automatically find all duplicate or near-duplicate photos.

USE CASE 2

Use the CNN-based method to detect images that have been cropped, resized, or recolored but are still the same photo.

USE CASE 3

Evaluate which duplicate-detection method works best for your specific image dataset using the built-in benchmarking framework.

USE CASE 4

Visualize which images were flagged as duplicates by plotting them side by side.
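
For the evaluation use case above, the core idea can be sketched in plain Python. This is an illustrative sketch, not imagededup's evaluation API: the function name `precision_recall`, the pair-level scoring, and the toy filenames are all assumptions. It only assumes that both the known-correct duplicates and a method's results are dictionaries mapping each filename to a list of duplicates.

```python
def precision_recall(ground_truth, retrieved):
    """Compute precision and recall over unordered duplicate pairs.

    Both arguments map a filename to the list of filenames treated as its
    duplicates. (Hypothetical helper, not part of imagededup.)
    """
    def to_pairs(duplicate_map):
        # Collapse (a, b) and (b, a) into a single unordered pair.
        return {
            frozenset((name, other))
            for name, others in duplicate_map.items()
            for other in others
            if other != name
        }

    true_pairs = to_pairs(ground_truth)
    found_pairs = to_pairs(retrieved)
    correct = len(true_pairs & found_pairs)
    precision = correct / len(found_pairs) if found_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Toy data: the method finds one real pair, misses one, and invents one.
ground_truth = {"a.png": ["b.png", "c.png"], "b.png": ["a.png"],
                "c.png": ["a.png"], "d.png": []}
retrieved = {"a.png": ["b.png"], "b.png": ["a.png"],
             "c.png": ["d.png"], "d.png": ["c.png"]}
precision, recall = precision_recall(ground_truth, retrieved)
print(precision, recall)
```

Running the same scoring over the output of each detection method gives a like-for-like comparison on your own dataset. The library's built-in framework may define its metrics differently, so check its documentation before comparing numbers.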

Tech stack

Python · PyTorch · CNN

Getting it running

Difficulty · easy | Time to first run · 30 min

The CNN method downloads a pretrained model on first use. Requires Python 3.9+.

Apache 2.0: use freely for any purpose, including commercial, as long as you keep the copyright notice and license file.

In plain English

imagededup is a Python library for finding duplicate and near-duplicate images in a folder. It is built by idealo, a German e-commerce company, and was originally developed for cleaning up product image collections where the same photo might appear multiple times or be present in slightly altered versions.

The library offers two categories of detection methods. The first category uses image hashing algorithms. These convert each image into a short numeric fingerprint based on its visual content, then compare fingerprints to find matches. Four hashing methods are included: perceptual hashing, difference hashing, wavelet hashing, and average hashing. These are fast and work well when images are exact or nearly exact copies.

The second category uses a convolutional neural network (a type of AI model trained on images), which is better at finding near-duplicates where images have been cropped, resized, recolored, or otherwise transformed. You can use one of the included pretrained models or provide your own.

The basic workflow is: point the library at a directory of images, generate encodings (fingerprints) for all of them, then call a function to find which images match each other. You get back a dictionary mapping each image filename to a list of its duplicates. A utility function lets you visualize the results by plotting a given image alongside the duplicates found for it.

An evaluation framework is included for measuring how well a given method performs on a dataset where you already know the correct duplicate pairs, which helps you choose between methods for your specific use case.

The library works on Linux, macOS, and Windows and requires Python 3.9 or newer. Installation is via pip. It is licensed under the Apache 2.0 license.
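
The hashing workflow described above (fingerprint each image, compare fingerprints, return a dictionary of duplicates) can be sketched with the standard library alone. This is a simplified illustration of the idea, not imagededup's implementation: the function names, the toy pixel grids, and the `max_distance` default are assumptions, and a real run would first decode and shrink actual image files.

```python
from itertools import combinations


def difference_hash(pixels):
    """Fingerprint a grayscale image given as rows of brightness values.

    Each bit records whether a pixel is brighter than its right-hand neighbor.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def find_duplicates(encodings, max_distance=1):
    """Map every filename to the filenames whose fingerprints are close to it."""
    duplicates = {name: [] for name in encodings}
    for (name_a, hash_a), (name_b, hash_b) in combinations(encodings.items(), 2):
        if hamming_distance(hash_a, hash_b) <= max_distance:
            duplicates[name_a].append(name_b)
            duplicates[name_b].append(name_a)
    return duplicates


# Toy 3x4 "images": b.png is a.png slightly brightened, c.png is unrelated.
images = {
    "a.png": [[10, 20, 30, 40], [40, 30, 20, 10], [10, 50, 10, 50]],
    "b.png": [[12, 22, 32, 42], [42, 32, 22, 12], [12, 52, 12, 52]],
    "c.png": [[90, 10, 80, 20], [5, 60, 5, 60], [70, 60, 50, 40]],
}
encodings = {name: difference_hash(pixels) for name, pixels in images.items()}
duplicates = find_duplicates(encodings)
print(duplicates)
```

Because the fingerprint depends only on whether neighboring pixels get brighter or darker, a uniformly brightened copy produces the same bits as the original. Heavier edits such as cropping change many bits at once, which is where the CNN-based method is the better fit.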

Copy-paste prompts

Prompt 1
Using imagededup, write Python code that scans a folder of product images, generates perceptual hash encodings, and prints a list of all duplicate image filenames.
Prompt 2
I have a dataset with near-duplicate images that have been resized and recolored. Show me how to use imagededup's CNN method to find them.
Prompt 3
How do I use imagededup's evaluation framework to compare the accuracy of perceptual hashing vs the CNN method on my image dataset?
Prompt 4
I want to visualize the duplicates imagededup found. Show me how to plot an image alongside its detected near-duplicates using the built-in utility.
Prompt 5
What are the tradeoffs between imagededup's four hashing methods and when should I use the CNN method instead?

Verify against the repo before relying on details.