explaingit

compvis/stable-diffusion

72,976 | Jupyter Notebook | Audience · researcher | Complexity · 4/5 | License | Setup · hard

TLDR

The original research code for Stable Diffusion, an AI model that generates images from text prompts using latent diffusion, built for researchers and developers, not casual end users.

Mindmap

mindmap
  root((repo))
    What it Does
      Text to image
      Latent diffusion
      Research artifact
    How it Works
      CLIP text encoder
      Latent compression
      Noise refinement
    Tech Stack
      Python
      PyTorch
      CLIP
    Audience
      ML researchers
      Technical developers
    Use Cases
      Local image generation
      Model experimentation


Things people build with this

USE CASE 1

Run text-to-image generation locally on a GPU to produce images from written prompts for research or creative experiments.

USE CASE 2

Study how latent diffusion models work by reading and modifying the sampling and training code directly.

USE CASE 3

Experiment with the pretrained model weights to understand how text prompts influence image generation output.

Tech stack

Python · PyTorch · CLIP · Jupyter Notebook

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a GPU with at least 10GB of VRAM and separately downloaded model weights from Hugging Face.

Commercial use is permitted, but the license includes responsible-use conditions that restrict certain harmful applications.

In plain English

This is the original research repository for Stable Diffusion, an AI model that generates images from text descriptions. You type a written prompt like "a photograph of an astronaut riding a horse" and the model produces a realistic or artistic image matching that description. The core problem it solves is turning natural language into visual output, which has uses in art, design, prototyping, and creative exploration.

The model works using a technique called latent diffusion. Rather than working directly with full-size pixel images, it compresses images into a smaller mathematical representation called a latent space, then applies a diffusion process in that compressed space. Diffusion works by starting from random noise and gradually refining it, guided by a text encoder (specifically CLIP ViT-L/14) that translates your written prompt into numerical signals the model can follow. The result is decoded back into a 512x512 pixel image. This approach is more computationally efficient than operating on raw pixels, allowing the model to run on consumer GPUs with at least 10GB of video memory.

You would use this repository if you are a researcher or technically experienced developer who wants to run text-to-image generation locally, experiment with the model weights, or study how latent diffusion models work. It is not a polished user-facing application; it is a research artifact with command-line scripts and Jupyter Notebooks. End users looking for a friendlier experience would typically use this model through a tool like Hugging Face Diffusers instead.

The tech stack is Python, PyTorch, and CLIP, with the repository organized as Jupyter Notebooks and Python scripts. Model weights are distributed separately via Hugging Face under a license that permits commercial use but includes responsible-use conditions.
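To make the efficiency claim about latent compression concrete, here is a small arithmetic sketch comparing the size of a 512x512 RGB image with the size of its latent. The 8x spatial downsampling factor and the 4 latent channels are assumptions that this page does not state; check the model config in the repo before relying on them. The function and constant names are hypothetical and chosen for illustration only.

```python
# Shape arithmetic for latent diffusion at 512x512 resolution.
# ASSUMPTIONS (not stated on this page): the autoencoder downsamples each
# spatial dimension by 8 and produces 4 latent channels.
IMAGE_SIZE = 512
RGB_CHANNELS = 3
DOWNSAMPLE_FACTOR = 8  # assumed
LATENT_CHANNELS = 4    # assumed


def latent_shape(height, width, factor=DOWNSAMPLE_FACTOR, channels=LATENT_CHANNELS):
    """Return the (channels, height, width) shape of the compressed latent."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the factor")
    return (channels, height // factor, width // factor)


def element_count(shape):
    """Multiply all dimensions together to get the number of values."""
    count = 1
    for dim in shape:
        count *= dim
    return count


pixel_shape = (RGB_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
compressed_shape = latent_shape(IMAGE_SIZE, IMAGE_SIZE)
ratio = element_count(pixel_shape) / element_count(compressed_shape)

print("pixel space: ", pixel_shape, element_count(pixel_shape), "values")
print("latent space:", compressed_shape, element_count(compressed_shape), "values")
print("reduction factor:", ratio)
```

Under these assumptions the diffusion process handles far fewer values per step than it would in pixel space, which is the intuition behind the lower memory requirement.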

Copy-paste prompts

Prompt 1
Using compvis/stable-diffusion, write a Python script that loads the pretrained weights from Hugging Face and generates a 512x512 image from the prompt 'a sunset over the ocean, oil painting style'.
Prompt 2
How do I run the stable-diffusion sampling script from the command line? Give me the exact command with flags for a basic text-to-image generation.
Prompt 3
Walk me through the latent diffusion architecture in compvis/stable-diffusion. What does each key component do, and how do they connect?
Prompt 4
How do I change the classifier-free guidance scale in stable-diffusion to make the output follow my text prompt more or less strictly?
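As background for Prompt 4, classifier-free guidance is commonly described as blending two noise predictions, one made with the text prompt and one made without it, and the guidance scale controls how far the result is pushed toward the prompted one. The sketch below shows that blending rule on plain Python lists. It is an illustration of the general technique, not code from this repository, and `apply_guidance` is a hypothetical name.

```python
def apply_guidance(uncond_pred, cond_pred, scale):
    """Blend unconditional and prompt-conditioned noise predictions.

    Hypothetical helper illustrating the usual classifier-free guidance rule:
        guided = uncond + scale * (cond - uncond)
    scale = 0 ignores the prompt, scale = 1 uses the conditioned prediction
    unchanged, and larger values push further in the prompt's direction.
    """
    if len(uncond_pred) != len(cond_pred):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]


# Toy values standing in for two noise predictions at one denoising step.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

Higher scales generally make the output follow the prompt more strictly at the cost of variety; how the repository's scripts expose this setting should be checked against the scripts themselves.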

Verify against the repo before relying on details.