explaingit

sunwood-ai-labs/sana

Audience · researcher
Complexity · 5/5
Active
Setup · hard

TLDR

A fork of NVIDIA's SANA repo with training and inference code for a family of efficient text-to-image and text-to-video diffusion models, including a 2.6B world model with camera control.

Mindmap

mindmap
  root((Sana))
    Inputs
      Text prompts
      Reference images
      Camera controls
    Outputs
      High-res images
      Short videos
      World model clips
    Use Cases
      Generate 1024px images
      Make 5s text-to-video clips
      Serve a SANA API
      Fine-tune SANA variants
    Tech Stack
      Python
      PyTorch
      Diffusers
      SGLang
      ComfyUI

Things people build with this

USE CASE 1

Generate images locally with the SANA Linear Diffusion Transformer

USE CASE 2

Produce 5-second text-to-video clips with SANA-Video

USE CASE 3

Serve a SANA model through SGLang with an OpenAI-compatible API

USE CASE 4

Post-train SANA with supervised fine-tuning or RL via Cosmos-RL
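
As a rough illustration of use case 3, the sketch below builds (but does not send) an HTTP request in the shape of the OpenAI Images API. It uses only the Python standard library. The base URL, port, and model name are placeholders, and the assumption that the SGLang deployment exposes the `/v1/images/generations` route should be checked against the repo's serving instructions before relying on it.

```python
import json
import urllib.request


def build_image_request(
    prompt: str,
    base_url: str = "http://localhost:30000",  # assumed local server address
    model: str = "sana",                       # hypothetical served model name
    size: str = "1024x1024",
) -> urllib.request.Request:
    """Build a POST request shaped like the OpenAI Images API.

    Nothing is sent here; pass the result to urllib.request.urlopen()
    once a server is actually running.
    """
    payload = {"model": model, "prompt": prompt, "n": 1, "size": size}
    return urllib.request.Request(
        url=f"{base_url}/v1/images/generations",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_image_request("a lighthouse at dusk, watercolor")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the payload first, and to switch to the official OpenAI Python client later by pointing its `base_url` at the same server.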

Tech stack

Python · PyTorch · Diffusers · SGLang · ComfyUI · CUDA

Getting it running

Difficulty · hard
Time to first run · 1 day+

Real use requires an NVIDIA GPU, a working PyTorch and CUDA toolchain, and multi-gigabyte SANA checkpoints downloaded from Hugging Face.
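
Before spending time on checkpoint downloads, it can help to check the prerequisites listed above. The following is a minimal preflight sketch, not part of the repository. The function name `preflight` and the reported fields are made up for illustration, and the script only reports what it finds, without enforcing any minimum versions or disk sizes.

```python
import importlib.util
import shutil


def preflight() -> dict:
    """Report whether the basic pieces for running SANA appear to be present."""
    report = {
        # find_spec checks for a package without importing it
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "diffusers_installed": importlib.util.find_spec("diffusers") is not None,
        # nvidia-smi on PATH is a weak hint that an NVIDIA driver is installed
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "cuda_available": False,
        # free space in the current directory, for multi-gigabyte checkpoints
        "free_disk_gb": shutil.disk_usage(".").free / 1e9,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # a broken install should not crash the check itself
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, value in preflight().items():
        print(f"{name}: {value}")
```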

In plain English

SANA is a codebase from NVIDIA Labs for generating images and short videos from text prompts. The repository contains the training and inference code for a family of related models: SANA, SANA-1.5, SANA-Sprint, SANA-Video, SANA-WM, and Sol-RL. Each one targets a different size, resolution, or use case, and several have been accepted at major machine learning conferences such as ICLR, ICML, and ICCV.

The stated focus is efficiency. The original SANA model is described as a Linear Diffusion Transformer, a design meant to keep high-resolution image generation fast. SANA-Sprint is a one-step diffusion variant aimed at very fast inference. SANA-Video covers text-to-video and text-plus-image-to-video, with a 5-second model and an experimental setup that can stretch generation toward minute-long, real-time clips. SANA-WM, the most recent addition, is a 2.6B-parameter controllable world model that produces 720p, one-minute videos with six-degree-of-freedom camera control, pitched as a baseline for world modeling and embodied AI work.

The project is wired into a wide ecosystem. There are hosted demos on Hugging Face and an MIT lab server, an API on Replicate, integration with ComfyUI, serving through SGLang with an OpenAI-compatible API, and recipes for post-training (supervised fine-tuning and reinforcement learning) through Cosmos-RL. Many of the models are also merged into the Hugging Face diffusers library.

This particular copy of the repository is a fork under the Sunwood-ai-labs account. The README is mirrored from the upstream NVlabs project and does not describe any fork-specific changes, so the content above describes the upstream SANA work it tracks.
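
The summary above does not say what makes the Diffusion Transformer "linear". In the general literature, linear attention replaces the softmax with a feature map so that the sum over tokens can be computed once and reused, making cost grow linearly with token count instead of quadratically. The pure-Python sketch below shows that idea with ReLU as the feature map. The choice of ReLU, the function names, and the normalization are assumptions for illustration, not the repository's code.

```python
from typing import List

Matrix = List[List[float]]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def quadratic_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Reference form: scores every query against every key, O(N^2 * d)."""
    out = []
    for qi in q:
        fq = [relu(x) for x in qi]
        weights = [sum(a * relu(b) for a, b in zip(fq, kj)) for kj in k]
        norm = sum(weights) + eps
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) / norm for c in range(len(v[0]))]
        )
    return out


def linear_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Same result, but keys and values are summarized once, O(N * d^2)."""
    d, dv = len(k[0]), len(v[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over tokens of relu(k_j) v_j^T
    k_sum = [0.0] * d                    # sum over tokens of relu(k_j)
    for kj, vj in zip(k, v):
        fk = [relu(x) for x in kj]
        for a in range(d):
            k_sum[a] += fk[a]
            for c in range(dv):
                kv[a][c] += fk[a] * vj[c]
    out = []
    for qi in q:
        fq = [relu(x) for x in qi]
        norm = sum(a * b for a, b in zip(fq, k_sum)) + eps
        out.append([sum(fq[a] * kv[a][c] for a in range(d)) / norm for c in range(dv)])
    return out
```

Both functions compute the same output up to floating-point rounding. The second one never builds the N by N score matrix, which is what matters when a high-resolution image becomes many thousands of tokens.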

Copy-paste prompts

Prompt 1
Walk me through running the SANA inference script on a single image prompt with a sensible default config
Prompt 2
Show me how to point ComfyUI at the SANA checkpoints from this repo
Prompt 3
Help me launch SANA through SGLang and hit it from the OpenAI Python client
Prompt 4
Explain how SANA-Sprint reaches one step inference and what trade-offs that brings
Prompt 5
Set up a small supervised fine-tune of SANA on my own image-caption pairs using the Cosmos-RL recipes

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.