Reproduce DSR results on class-conditional ImageNet 256 generation
Add register tokens to your own diffusion transformer training run
Benchmark RAE-DiT against the SigLIP2-So400M backbone
Needs 8 GPUs, ImageNet 1k at 256x256, two separate conda envs, and Google Drive artifacts before any training launcher will run.
This repo is the code release for a research paper from Rice University and Apple, Taming Outlier Tokens in Diffusion Transformers. The work concerns image generators built on diffusion transformers, which build up an image step by step from random noise while attending over small patches of the picture. The authors noticed that a few of these patch tokens become outliers, attracting most of the model's attention while carrying almost no useful local information, and that this problem shows up in both halves of a modern pipeline: the encoder that compresses images into latent codes and the denoiser that generates new images from those codes.
Their fix is called Dual Stage Registers, or DSR. The idea is to give each half of the pipeline a small set of extra slots, called registers, where the excess attention can be parked without polluting the real image patches. There are two flavours: registers added to the encoder, which can be either fine-tuned or simply bolted on at test time, and registers added to the denoising transformer during training. The paper reports that DSR consistently improves FID, a standard generation quality score where lower is better, on ImageNet at 256 by 256 resolution, and that it matches the baseline's quality in roughly a quarter of the training epochs.
The repository releases the class-conditional ImageNet 256 training and sampling code for two backbones, RAE-DiT and RAE-DiT with a separate diffusion head, on top of two image encoders, SigLIP2-B and the larger SigLIP2-So400M. A results table from the paper shows the DSR variants beating the matching baseline on FID, Inception Score, precision, and recall across several settings.
The setup instructions ask for a conda environment with Python 3.10 and PyTorch 2.8 with CUDA 12.9, plus a separate environment for the official ImageNet FID and Inception Score evaluator, which is pinned to TensorFlow 2.19.
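The register idea described above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the function names (`self_attention`, `forward_with_registers`), the identity projections, and the sizes are assumptions chosen for illustration, and in a real model the registers would be learnable parameters inside a full transformer block. The sketch only shows the data flow: extra tokens are appended to the patch sequence, take part in attention, and are dropped before the output is used.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (illustrative only)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def forward_with_registers(patch_tokens, registers):
    """Append register tokens, run attention, then keep only the patch outputs."""
    sequence = patch_tokens + registers   # registers join the sequence
    mixed = self_attention(sequence)      # every token can attend to the registers
    return mixed[: len(patch_tokens)]     # register outputs are discarded


random.seed(0)
dim, num_patches, num_registers = 8, 16, 4  # hypothetical sizes
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
# Stand-in for learnable register parameters (small random initialisation).
registers = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(num_registers)]

out = forward_with_registers(patches, registers)
print(len(out), len(out[0]))  # 16 8
```

The output has the same shape as the patch input, so in this sketch registers change what attention can do internally without changing what downstream layers receive.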
Training expects ImageNet 1k at 256 by 256, either arranged in a torchvision ImageFolder layout or downloaded from Kaggle. Two extra artifacts must be downloaded from a Google Drive link: a stage-one decoder checkpoint and a per-encoder normalization statistics file that whitens the latents before they reach the diffusion model. Training is run through torchrun with one of six provided shell launchers, each mapping to a specific row in the paper's main results table and targeting 800 epochs at a global batch size of 1024 on 8 GPUs. The launchers include SLURM headers for cluster use, which can be stripped when running torchrun directly. The repo currently has zero stars and is clearly aimed at machine learning researchers with access to high-end GPUs.
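The role of the normalization statistics file mentioned above can be sketched as follows. The summary does not describe the file's contents, so this example assumes the simplest case: one mean and one standard deviation per latent channel. The helper names (`channel_stats`, `normalize`, `denormalize`) and the toy latents are hypothetical, and the repository may use a different scheme or file layout.

```python
import math


def channel_stats(latents):
    """Per-channel mean and standard deviation over a list of latent vectors."""
    count, dim = len(latents), len(latents[0])
    mean = [sum(v[d] for v in latents) / count for d in range(dim)]
    std = [
        math.sqrt(sum((v[d] - mean[d]) ** 2 for v in latents) / count)
        for d in range(dim)
    ]
    return mean, std


def normalize(latent, mean, std, eps=1e-6):
    """Shift and scale one latent so each channel has roughly zero mean, unit variance."""
    return [(x - m) / (s + eps) for x, m, s in zip(latent, mean, std)]


def denormalize(latent, mean, std, eps=1e-6):
    """Invert normalize(), e.g. before handing generated latents to a decoder."""
    return [x * (s + eps) + m for x, m, s in zip(latent, mean, std)]


# Toy encoder outputs with very different channel scales (assumed data).
latents = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
mean, std = channel_stats(latents)

normalized = [normalize(v, mean, std) for v in latents]
restored = denormalize(normalized[0], mean, std)
```

Under these assumptions, statistics would be computed once per encoder and reused for training and sampling, which would explain why the file is described as per-encoder.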
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.