a-little-hoof/dsr

Analysis updated 2026-06-24

★ 0
Python
Audience · researcher
Complexity · 5/5
Setup · hard

Mindmap

mindmap
  root((DSR))
    Inputs
      ImageNet 1k 256x256
      SigLIP2 encoder
      Decoder checkpoint
      Normalization stats
    Outputs
      Trained DiT model
      FID and IS scores
      Sampled images
    Use Cases
      Reproduce paper results
      Try registers in your DiT
      Compare RAE-DiT backbones
    Tech Stack
      Python
      PyTorch
      CUDA
      TensorFlow
      SLURM

What do people build with it?

USE CASE 1

Reproduce DSR results on class-conditional ImageNet 256 generation

USE CASE 2

Add register tokens to your own diffusion transformer training run

USE CASE 3

Benchmark the RAE-DiT backbones on top of the SigLIP2-So400M encoder

What is it built with?

Python, PyTorch, CUDA, TensorFlow, SLURM, torchrun

How does it compare?

	a-little-hoof/dsr	0xhassaan/nn-from-scratch	aashish2998/langchainmultiagentresearchsystem_project
Stars	0	0	0
Language	Python	Python	Python
Setup difficulty	hard	moderate	moderate
Complexity	5/5	4/5	2/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard
Time to first run · 1 day+

Needs 8 GPUs, ImageNet 1k at 256x256, two separate conda envs, and Google Drive artifacts before any training launcher will run.

In plain English

This repo is the code release for a research paper from Rice University and Apple called "Taming Outlier Tokens in Diffusion Transformers". The work is about image generators built on diffusion transformers, models that build up an image step by step from random noise while attending to small patches of the picture. The authors noticed that a few of these patches end up acting as outliers: they attract most of the model's attention while carrying almost no useful local information. The problem shows up in both halves of a modern pipeline, the encoder that compresses images into latent codes and the denoiser that generates new images from those codes.

Their fix is called Dual Stage Registers, or DSR. The idea is to give each half of the pipeline a small set of extra slots, called registers, that can absorb the excess attention without polluting the real image patches. There are two flavours: registers added to the encoder, which can be either fine-tuned or simply bolted on at test time, and registers added to the denoising transformer during training.

The paper reports that DSR consistently improves FID, a standard generation quality score where lower is better, on ImageNet at 256 by 256 resolution, and that it reaches the same quality as the baseline in roughly four times fewer training epochs. A results table from the paper shows the DSR variants beating the matching baseline on FID, Inception Score, precision, and recall across several settings.

The repository releases the class-conditional ImageNet 256 training and sampling code for two backbones, RAE-DiT and RAE-DiT with a separate diffusion head, on top of two image encoders, SigLIP2-B and the larger SigLIP2-So400M.

The setup instructions ask for a conda environment with Python 3.10 and PyTorch 2.8 with CUDA 12.9, plus a separate environment for the official ImageNet FID and Inception Score evaluator, which is pinned to TensorFlow 2.19.

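To make the register idea described above concrete, here is a minimal NumPy sketch of single-head self-attention with extra register tokens. It is an illustration only, not the repository's code: the function name `attention_with_registers`, the random weights, and the choice of 4 registers and 64 patches are all assumptions made for this example. In a real model the registers would be learned parameters and the layer would be multi-head.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_registers(patches, registers, w_q, w_k, w_v):
    """Single-head self-attention over the sequence [registers; patches].

    patches:   (num_patches, dim) image patch tokens
    registers: (num_registers, dim) extra tokens (hypothetical stand-ins
               for learned register parameters)
    Returns the updated patch tokens only, plus the full attention map.
    """
    # Registers join the sequence so every token can attend to them.
    tokens = np.concatenate([registers, patches], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Registers are discarded at the output; only patch tokens move on.
    num_registers = registers.shape[0]
    return out[num_registers:], attn


rng = np.random.default_rng(0)
dim, num_patches, num_registers = 16, 64, 4
patches = rng.standard_normal((num_patches, dim))
registers = rng.standard_normal((num_registers, dim))
w_q, w_k, w_v = (
    rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
)

out, attn = attention_with_registers(patches, registers, w_q, w_k, w_v)
print(out.shape)   # (64, 16): one output per patch, registers dropped
print(attn.shape)  # (68, 68): attention over 4 registers + 64 patches

# Average share of each patch's attention that lands on the registers.
register_mass = attn[num_registers:, :num_registers].sum(axis=1).mean()
print(float(register_mass))
```

The design point the sketch shows is that registers take part in attention but are dropped before the output, so whatever attention they soak up never has to be carried by a real image patch.
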
Training expects ImageNet 1k at 256 by 256, either in a torchvision ImageFolder layout or downloaded from Kaggle. Two extra artifacts must be downloaded from a Google Drive link: a stage-one decoder checkpoint and a per-encoder normalization statistics file that whitens the latents before they reach the diffusion model. Training is run through torchrun with one of six provided shell launchers, each mapping to a specific row in the paper's main results table and targeting 800 epochs at a global batch size of 1024 on 8 GPUs. The launchers include SLURM headers for cluster use, which can be stripped when running torchrun directly.

The repo currently has zero stars and is clearly aimed at machine learning researchers with access to high-end GPUs.
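
To get a feel for the schedule the launchers target, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not something taken from the repo: the helper `training_budget` is hypothetical, the dataset size is the standard ImageNet 1k training split (1,281,167 images), and dropping the last partial batch of each epoch is an assumption about the data loader.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # standard ILSVRC-2012 training split


def training_budget(epochs, global_batch, num_gpus,
                    dataset_size=IMAGENET_1K_TRAIN_IMAGES):
    """Estimate per-GPU batch size and optimizer steps for a run.

    Assumes data-parallel training with the global batch split evenly
    across GPUs and the last partial batch of each epoch dropped.
    """
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    steps_per_epoch = dataset_size // global_batch
    return {
        "per_gpu_batch": global_batch // num_gpus,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


budget = training_budget(epochs=800, global_batch=1024, num_gpus=8)
print(budget["per_gpu_batch"])    # 128
print(budget["steps_per_epoch"])  # 1251
print(budget["total_steps"])      # 1000800
```

Under these assumptions each GPU processes 128 images per step and the full 800-epoch run is about one million optimizer steps, which is consistent with the "1 day+" setup estimate on this page.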

Copy-paste prompts

Prompt 1

Set up the conda env with PyTorch 2.8 plus CUDA 12.9 and the TensorFlow 2.19 FID evaluator

Prompt 2

Walk me through which shell launcher maps to which row of the paper results table

Prompt 3

Show me where register tokens are inserted into the RAE-DiT denoiser

Prompt 4

Adapt the SLURM launcher to a single 8xH100 node using torchrun directly

Frequently asked questions

What is dsr?

Reference code for the Rice and Apple paper "Taming Outlier Tokens in Diffusion Transformers", which adds Dual Stage Registers to the encoder and the denoiser for better FID on ImageNet 256.

What language is dsr written in?

Mainly Python. The stack also includes PyTorch, CUDA, TensorFlow, SLURM, and torchrun.

How hard is dsr to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is dsr for?

Mainly machine learning researchers with access to high-end GPUs.

Verify against the repo before relying on details.