explaingit

a-little-hoof/dsr

0 stars | Python | Audience · researcher | Complexity · 5/5 | Active | Setup · hard

TLDR

Reference code for the Rice University and Apple paper "Taming Outlier Tokens in Diffusion Transformers". It adds Dual Stage Registers (DSR) to both the encoder and the denoiser to improve FID on class-conditional ImageNet 256x256 generation.

Mindmap

mindmap
  root((DSR))
    Inputs
      ImageNet 1k 256x256
      SigLIP2 encoder
      Decoder checkpoint
      Normalization stats
    Outputs
      Trained DiT model
      FID and IS scores
      Sampled images
    Use Cases
      Reproduce paper results
      Try registers in your DiT
      Compare RAE-DiT backbones
    Tech Stack
      Python
      PyTorch
      CUDA
      TensorFlow
      SLURM

Things people build with this

USE CASE 1

Reproduce DSR results on class-conditional ImageNet 256 generation

USE CASE 2

Add register tokens to your own diffusion transformer training run

USE CASE 3

Benchmark RAE-DiT against the SigLIP2-So400M backbone

Tech stack

Python, PyTorch, CUDA, TensorFlow, SLURM, torchrun

Getting it running

Difficulty · hard | Time to first run · 1 day+

Needs 8 GPUs, ImageNet 1k at 256x256, two separate conda envs, and Google Drive artifacts before any training launcher will run.
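
Before launching anything, it can help to confirm that the dataset is in place. The following is a minimal sketch, assuming the torchvision ImageFolder convention of one subdirectory per class. The function name `summarize_imagefolder` and the extension list are illustrative and are not taken from the repo.

```python
from pathlib import Path

# Extensions treated as images here. This set is an assumption; adjust it
# to match your copy of the data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def summarize_imagefolder(root):
    """Count class subdirectories and image files under an ImageFolder-style root.

    Hypothetical helper for illustration, not part of the repo.
    """
    root = Path(root)
    # ImageFolder treats every immediate subdirectory as one class.
    class_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    num_images = sum(
        1
        for class_dir in class_dirs
        for f in class_dir.rglob("*")
        if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
    )
    return len(class_dirs), num_images
```

Called on a complete ImageNet-1k training split, this should report 1000 classes and 1,281,167 images. Anything else suggests a partial download or a different directory layout.
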

In plain English

This repo is the code release for a research paper from Rice University and Apple called "Taming Outlier Tokens in Diffusion Transformers". The work is about image generators built on diffusion transformers, models that build up an image step by step from random noise while attending to small patches of the picture. The authors noticed that a few of these patches end up acting as outliers, attracting most of the model's attention while carrying almost no useful local information. The problem shows up in both halves of a modern pipeline: the encoder that compresses images into latent codes and the denoiser that generates new images from those codes.

Their fix is called Dual Stage Registers, or DSR. The idea is to give each half of the pipeline a small set of extra slots, called registers, that can absorb the excess attention so it does not pollute the real image patches. There are two flavours: registers added to the encoder, which can be either fine-tuned or simply bolted on at test time, and registers added to the denoising transformer during training. The paper reports that DSR consistently improves FID, a standard generation quality score where lower is better, on ImageNet at 256x256 resolution, and that it matches the baseline's quality in roughly a quarter of the training epochs.

The repository releases the class-conditional ImageNet 256 training and sampling code for two backbones, RAE-DiT and RAE-DiT with a separate diffusion head, on top of two image encoders, SigLIP2-B and the larger SigLIP2-So400M. A results table from the paper shows the DSR variants beating the matching baseline on FID, Inception Score, precision, and recall across several settings.

The setup instructions ask for two conda environments. The main one uses Python 3.10 and PyTorch 2.8 with CUDA 12.9. The second is for the official ImageNet FID and Inception Score evaluator, which is pinned to TensorFlow 2.19.
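
To make the register idea concrete, here is a toy sketch of the mechanism. It assumes the usual register-token recipe: extra tokens join the sequence, take part in attention, and are thrown away at the output. It is written in plain Python so it runs without dependencies, uses identity query/key/value projections, and fixes the register values by hand. A real model would use PyTorch tensors and learned parameters, and nothing here (function names, register count, where registers sit in the sequence) is taken from the repo.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections.

    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


def attend_with_registers(patch_tokens, register_tokens):
    """Append registers to the sequence, attend, then drop the register outputs."""
    sequence = patch_tokens + register_tokens
    outputs, weights = self_attention(sequence)
    num_patches = len(patch_tokens)
    # Only the patch positions are kept; register outputs are discarded.
    patch_outputs = outputs[:num_patches]
    # Share of each patch's attention that lands on the register slots.
    register_mass = [sum(row[num_patches:]) for row in weights[:num_patches]]
    return patch_outputs, register_mass


patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
registers = [[2.0, 2.0]]  # fixed toy values; a real model would learn these
patch_outputs, register_mass = attend_with_registers(patches, registers)
```

The output has one vector per patch, so downstream code sees the same sequence length as before. `register_mass` shows how much of each patch's attention went to the register slot instead of to other patches.
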

Training expects ImageNet 1k at 256x256, either already arranged in a torchvision ImageFolder layout or downloaded from Kaggle. Two extra artifacts must be downloaded from a Google Drive link: a stage-one decoder checkpoint and a per-encoder normalization statistics file used to whiten the latents before they reach the diffusion model.

Training is run through torchrun with one of six provided shell launchers. Each launcher maps to a specific row in the paper's main results table and targets 800 epochs at a global batch size of 1024 on 8 GPUs. The launchers include SLURM headers for cluster use, which can be stripped when running torchrun directly.

The repo currently has zero stars and is clearly aimed at machine learning researchers with access to high-end GPUs.
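
The training figures above imply a concrete per-GPU load and step count. The sketch below works through that arithmetic. The ImageNet-1k training split size (1,281,167 images) is a known dataset fact. The `grad_accum_steps` parameter and the choice to drop the final incomplete batch are assumptions for illustration and may not match how the launchers are configured.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1k training split
GLOBAL_BATCH_SIZE = 1024           # from the launcher description
NUM_GPUS = 8
EPOCHS = 800


def training_budget(num_images, global_batch, num_gpus, epochs, grad_accum_steps=1):
    """Derive per-GPU batch size and optimizer step counts.

    grad_accum_steps is a hypothetical knob for running on fewer or smaller
    GPUs while keeping the same global batch size.
    """
    if global_batch % (num_gpus * grad_accum_steps) != 0:
        raise ValueError("global batch must divide evenly across GPUs and accumulation steps")
    per_gpu_batch = global_batch // (num_gpus * grad_accum_steps)
    # Assumes the final incomplete batch of each epoch is dropped.
    steps_per_epoch = num_images // global_batch
    return {
        "per_gpu_batch": per_gpu_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


budget = training_budget(IMAGENET_TRAIN_IMAGES, GLOBAL_BATCH_SIZE, NUM_GPUS, EPOCHS)
print(budget["per_gpu_batch"])    # 128
print(budget["steps_per_epoch"])  # 1251
print(budget["total_steps"])      # 1000800
```

Under these assumptions each GPU processes 128 images per step, and a full 800-epoch run is about one million optimizer steps.
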

Copy-paste prompts

Prompt 1
Set up the conda env with PyTorch 2.8 plus CUDA 12.9 and the TensorFlow 2.19 FID evaluator
Prompt 2
Walk me through which shell launcher maps to which row of the paper results table
Prompt 3
Show me where register tokens are inserted into the RAE-DiT denoiser
Prompt 4
Adapt the SLURM launcher to a single 8xH100 node using torchrun directly

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.