explaingit

lllyasviel/controlnet

33,886 | Python | Audience · developer | Complexity · 4/5 | Dormant | License | Setup · hard

TLDR

Add visual control to AI image generation by guiding it with sketches, poses, depth maps, or edge outlines instead of relying only on text descriptions.

Mindmap

mindmap
  root((repo))
    What it does
      Guide image generation
      Control composition
      Match poses and structure
    How it works
      Locked and trainable copies
      Zero convolution connectors
      Learn visual conditions
    Use cases
      Generate from sketches
      Replicate poses
      Match depth maps
      Follow edge outlines
    Tech stack
      Python
      Stable Diffusion
      Gradio interface
    Inputs and outputs
      Text prompts
      Visual conditions
      Generated images

Things people build with this

USE CASE 1

Generate portraits where the subject holds a specific pose you provide as a skeleton outline.

USE CASE 2

Create illustrations that follow the edges and composition of a hand-drawn sketch or line art.

USE CASE 3

Generate images with depth structure matching a reference photo's spatial layout.

USE CASE 4

Produce consistent character poses across multiple generated images for animation or storyboarding.
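Each use case above starts from a condition image that encodes structure. As a rough illustration of what an edge-outline condition looks like as data, here is a dependency-free sketch that thresholds a Sobel gradient. The Sobel filter is a stand-in chosen for brevity and is not necessarily the detector the repo uses; `sobel_edge_map` and its `threshold` parameter are hypothetical names.

```python
def sobel_edge_map(gray, threshold=1.0):
    """Return a binary edge map (1 = edge) for a grayscale image given as rows of floats."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    # Border pixels are left at 0 because the 3x3 window does not fit there.
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal gradient: right column minus left column, centre row weighted x2.
            gx = (gray[r - 1][c + 1] + 2 * gray[r][c + 1] + gray[r + 1][c + 1]
                  - gray[r - 1][c - 1] - 2 * gray[r][c - 1] - gray[r + 1][c - 1])
            # Vertical gradient: bottom row minus top row, centre column weighted x2.
            gy = (gray[r + 1][c - 1] + 2 * gray[r + 1][c] + gray[r + 1][c + 1]
                  - gray[r - 1][c - 1] - 2 * gray[r - 1][c] - gray[r - 1][c + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[r][c] = 1
    return edges


# Toy 5x6 image: dark left half, bright right half, so the only edge is vertical.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
edge_map = sobel_edge_map(image)

for row in edge_map:
    print("".join("#" if v else "." for v in row))
```

The resulting grid has the same height and width as the input, which is the property that matters here: a condition image lines up pixel for pixel with the image being generated.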

Tech stack

Python · Stable Diffusion 1.5 · Gradio · OpenPose · Midas · PyTorch

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires a PyTorch installation, several model downloads (Stable Diffusion, OpenPose, Midas), and a GPU for reasonable inference speed.
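Before downloading any models, it can help to confirm that the Python side of the environment is in place. The following is a minimal preflight sketch, not part of the repo: the module list is an assumption based on the stack named here (`torch` and `gradio` are the usual import names for PyTorch and Gradio), and `missing_modules` and `gpu_available` are hypothetical helpers.

```python
import importlib.util

# Assumed import names for the stack described above; adjust to the repo's own
# environment file, which remains the authoritative list.
REQUIRED_MODULES = ["torch", "gradio"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def gpu_available():
    """Report whether PyTorch can see a CUDA GPU; False if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the script still runs without PyTorch
    return torch.cuda.is_available()


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules found.")
    print("CUDA GPU visible:", gpu_available())
```

This only checks that the packages can be imported. It says nothing about whether the model checkpoints have been downloaded or placed where the repo expects them.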

The repository's code is released under the Apache-2.0 license, which permits commercial use; check the terms of the separately downloaded model weights before relying on them in a product.

In plain English

ControlNet addresses a practical limitation of AI image generators such as Stable Diffusion: you can describe what you want in text, but you have very little control over the exact composition, pose, or structure of the result. ControlNet lets you guide generation with visual signals such as edge outlines, human body poses, depth maps, or hand-drawn scribbles, so the generated image follows the structure you provide, not just your words.

It works by making a copy of part of the image-generation neural network. One copy is "locked" and stays unchanged, which preserves the original model's capability, while the other copy is "trainable" and learns to respond to the extra visual condition. The two copies are connected through "zero convolution" layers: 1x1 convolutions whose weights and biases start at zero, so at the beginning of training they output nothing and the original model's behavior is undisturbed. As training continues, these connectors gradually learn to inject the visual condition into the generation process.

You would use ControlNet when you want an image that matches a specific pose, follows the edges of a sketch you drew, mirrors the depth structure of a reference photo, or replicates the layout of a line drawing. Instead of prompting and hoping, you constrain the structure directly.

The stack is Python, built on top of Stable Diffusion 1.5 (the popular open-source image model), with Gradio providing interactive browser-based demos. Supporting tools include OpenPose for body pose detection, Midas for depth estimation, and various edge-detection algorithms. Training can run on consumer GPUs with limited memory.
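The locked copy, trainable copy, and zero convolution wiring described above can be illustrated with a toy, dependency-free sketch. Everything here is a simplification under stated assumptions: feature maps are nested lists shaped [channel][row][col], `block` is a hypothetical stand-in for a real network block, and the "trained" connector values are made up for illustration. It is not the repo's implementation.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution: each output pixel is a linear mix of the input channels at that pixel."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [bias[o] + sum(weight[o][i] * x[i][r][c] for i in range(len(x)))
             for c in range(width)]
            for r in range(height)
        ]
        for o in range(len(weight))
    ]


def zero_conv_params(channels):
    """Weights and biases that start at exactly zero (the zero convolution initialization)."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels


def add(a, b):
    """Element-wise sum of two feature maps of the same shape."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def block(x):
    """Hypothetical stand-in for a network block: doubles and shifts every feature."""
    return [[[2.0 * v + 1.0 for v in row] for row in ch] for ch in x]


def controlled_block(x, condition, z_in, z_out):
    """Locked output plus the trainable copy's output, both routed through zero convolutions."""
    locked_out = block(x)                                   # frozen original path
    trainable_in = add(x, conv1x1(condition, *z_in))        # condition enters via a zero conv
    trainable_out = block(trainable_in)                     # trainable copy (same weights at init)
    return add(locked_out, conv1x1(trainable_out, *z_out))  # result leaves via a zero conv


x = [[[0.5, -1.0], [2.0, 0.0]], [[1.0, 1.0], [0.0, 3.0]]]         # 2 channels, 2x2
condition = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # e.g. an edge map
blank = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]      # empty condition

baseline = block(x)
z_in, z_out = zero_conv_params(2), zero_conv_params(2)
unchanged_at_init = controlled_block(x, condition, z_in, z_out) == baseline

# Made-up "trained" connectors: once both have moved away from zero,
# the condition influences the output.
z_in_trained = ([[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0])
z_out_trained = ([[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
after_training = controlled_block(x, condition, z_in_trained, z_out_trained)
changed_after_training = after_training != baseline
condition_matters = after_training != controlled_block(x, blank, z_in_trained, z_out_trained)

print(unchanged_at_init, changed_after_training, condition_matters)
```

A zero-initialized connector can still learn: for y = w·x + b, the gradient of y with respect to w is x, which is generally nonzero even when w is zero, so the first optimizer step already moves the weight away from zero.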

Copy-paste prompts

Prompt 1
How do I set up ControlNet with Stable Diffusion to generate images from pose skeletons?
Prompt 2
Show me how to use a hand-drawn sketch as a visual condition to guide image generation with ControlNet.
Prompt 3
What are the different condition types (edge, pose, depth) I can use with ControlNet, and when should I use each one?
Prompt 4
How do I train a custom ControlNet model on my own dataset to control image generation with a specific visual style?
Prompt 5
Can I use ControlNet to generate images that match both a text description and a depth map at the same time?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.