mlfoundations/open_clip

★ 13,800 | Python | Audience · researcher | Complexity · 4/5 | Setup · moderate

Mindmap

mindmap
  root((OpenCLIP))
    What it does
      Image text matching
      Zero shot classify
      Image search
    Models
      LAION 400M
      LAION 2B
      DataComp 1B
    Tech Stack
      Python
      PyTorch
      pip package
    Training
      Single machine
      Distributed FSDP2
    Audience
      AI researchers
      ML engineers


Things people build with this

USE CASE 1

Search a photo library by typing a plain-English description instead of using tags.

USE CASE 2

Classify images into categories without collecting any labeled training examples for those categories.

USE CASE 3

Load and compare dozens of pre-trained CLIP variants through a single consistent Python interface.

USE CASE 4

Train a new CLIP model from scratch on a custom large-scale image-text dataset.
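As a rough illustration of use case 1, the ranking step behind text-to-image search can be sketched in plain Python. The sketch assumes the query and every image have already been turned into embedding vectors by a CLIP model; the function names and the toy two-dimensional vectors are hypothetical and only stand in for real embeddings, which have hundreds of dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_matches(query_embedding, image_embeddings, k=5):
    """Return the k (name, score) pairs most similar to the query.

    image_embeddings maps a file name to its embedding vector.
    Scores are cosine similarities, sorted from best to worst.
    """
    q = normalize(query_embedding)
    scored = []
    for name, emb in image_embeddings.items():
        e = normalize(emb)
        score = sum(a * b for a, b in zip(q, e))
        scored.append((name, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: hypothetical 2-D "embeddings" for three photos.
library = {
    "beach.jpg": [1.0, 0.0],
    "forest.jpg": [0.0, 1.0],
    "sunset.jpg": [0.7, 0.7],
}
results = top_k_matches([1.0, 0.1], library, k=2)
```

In a real photo library the embeddings would come from the model's image and text encoders, and the image embeddings would normally be computed once and cached, so that each new query only costs one text encoding plus the ranking step.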

Tech stack

Python, PyTorch, pip

Getting it running

Difficulty · moderate | Time to first run · 30 min

Training large models requires a distributed compute cluster; inference-only use needs only a pip install and a few lines of Python.

In plain English

OpenCLIP is an open-source implementation of CLIP (Contrastive Language-Image Pre-training), a type of AI model originally created by OpenAI. CLIP is trained to understand the relationship between images and text, so it can compare a photo to a description and judge how well they match. This enables a range of applications: searching a collection of images with plain-text queries, classifying images into categories without labeled training examples for every category, and building systems that understand visual and written content together.

This repository provides code both to use pre-trained CLIP models and to train new ones from scratch. The project has trained dozens of models on publicly available large-scale image-text datasets, including LAION-400M (400 million image-text pairs), LAION-2B (2 billion pairs), and DataComp-1B. The best of these models reach zero-shot accuracy above 80% on ImageNet, a standard image recognition benchmark, without any ImageNet-specific training examples.

Models are available through a Python package (open_clip_torch, installable via pip) and can be loaded with a few lines of code. The same interface also loads the original OpenAI CLIP weights and community-trained alternatives such as SigLIP and DFN models.

The codebase supports training on single machines and on distributed computing clusters. The current main branch is in the middle of a significant refactor that adds support for a newer distributed training approach called FSDP2. The previous stable training pipeline remains available on the v3 branch. Users who only need to load and run pre-trained models for inference are unaffected by the training-side changes.

This summary is based on only part of the full README.
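To make "a few lines of code" concrete, here is a minimal inference sketch, assuming open_clip_torch, torch, and Pillow are installed. The function name and its parameters are hypothetical; the model name and pretrained tag are assumptions that can be checked against the output of `open_clip.list_pretrained()`. The heavy imports sit inside the function so that merely defining it does not require the packages or a weight download.

```python
def score_image_against_texts(image_path, texts,
                              model_name="ViT-B-32",
                              pretrained="laion2b_s34b_b79k"):
    """Return (text, probability) pairs for how well each text matches the image.

    model_name and pretrained are assumed values; verify them with
    open_clip.list_pretrained() before relying on them.
    """
    # Imported lazily: these packages are only needed when the function runs.
    import torch
    from PIL import Image
    import open_clip

    # Downloads the weights on first use, then builds the model and its
    # matching image preprocessing pipeline.
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    model.eval()
    tokenizer = open_clip.get_tokenizer(model_name)

    image = preprocess(Image.open(image_path)).unsqueeze(0)  # add batch dim
    tokens = tokenizer(list(texts))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(tokens)
        # Unit-normalize so the dot product is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Scale by 100 before the softmax, as in the usual CLIP recipe.
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return list(zip(texts, probs[0].tolist()))
```

A call such as `score_image_against_texts("photo.jpg", ["a dog", "a cat", "a diagram"])` (with a hypothetical file name) would return one probability per description. Passing category names as the texts turns the same function into a simple zero-shot classifier.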

Copy-paste prompts

Prompt 1

Using open_clip_torch, load the ViT-B-32 model pretrained on LAION-400M and compute the similarity score between an image and a list of text descriptions.

Prompt 2

Show me how to use OpenCLIP to do zero-shot image classification on my own photos using a list of category names I provide.

Prompt 3

I want to search a folder of images using a text query with OpenCLIP. Write Python code that embeds all images and returns the top 5 matches for a query string.

Prompt 4

How do I fine-tune an OpenCLIP model on my own image-text dataset using the distributed training script?


Verify against the repo before relying on details.