explaingit

nanovisionx/raev2

Analysis updated 2026-06-24

Stars · 70 | Language · Python | Audience · researcher | Complexity · 5/5 | Setup · hard

TLDR

Official PyTorch code for Improved Baselines with Representation Autoencoders, reaching state-of-the-art latent diffusion training in 80 epochs instead of 800 by reusing pretrained vision encoders.

Mindmap

mindmap
  root((RAEv2))
    Inputs
      Pretrained vision encoder
      Image dataset
      Diffusion config
    Outputs
      Trained autoencoder
      Latent diffusion transformer
      Reconstruction metrics
      Generated images
    Use Cases
      Train image generation faster
      Replace SD-VAE in a pipeline
      Navigation world models
      Text-to-image research
    Tech Stack
      Python
      PyTorch
      DINOv3
      uv
      WandB
      HuggingFace

What do people build with it?

USE CASE 1

Train a latent diffusion transformer on ImageNet in 80 epochs instead of 800

USE CASE 2

Swap SD-VAE or FLUX VAE for a representation autoencoder in your image generator

USE CASE 3

Build a world model for robot navigation on the RECON dataset

USE CASE 4

Compare 80+ pretrained vision encoders as autoencoder backbones with one config switch

What is it built with?

Python, PyTorch, DINOv3, uv, WandB

How does it compare?

                   nanovisionx/raev2   hiangx-robotics/metafine   wanshuiyin/aris-in-ai-offer
Stars              70                  70                         71
Language           Python              Python                     Python
Setup difficulty   hard                hard                       easy
Complexity         5/5                 5/5                        2/5
Audience           researcher          researcher                 researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Training requires a multi-GPU torchrun launch with BF16, and the datasets and checkpoints are large Hugging Face downloads.
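As a rough illustration of the launch requirement above, the sketch below assembles a single-node torchrun command with eight processes and sets a Weights and Biases project through an environment variable. The script name `train.py`, the config path, and the `--config` flag are placeholders rather than names taken from the repository. Only `torchrun`, `--standalone`, `--nproc_per_node`, and the `WANDB_PROJECT` variable are standard. How the repository enables BF16 is not covered here.

```python
import os
import shlex


def build_launch_command(script, config, gpus_per_node=8):
    """Assemble a single-node torchrun command as a list of arguments."""
    return [
        "torchrun",
        "--standalone",                       # single-node rendezvous
        f"--nproc_per_node={gpus_per_node}",  # one process per GPU
        script,                               # hypothetical training script
        "--config", config,                   # hypothetical flag and path
    ]


def build_environment(project, base_env=None):
    """Copy the environment and add the Weights and Biases project name."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_PROJECT"] = project
    return env


command = build_launch_command("train.py", "configs/example.yaml")
print(shlex.join(command))
```

The command list and environment could then be passed to `subprocess.run(command, env=env)`. Check the repository's README for the real entry point and arguments.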

In plain English

RAEv2 is the official PyTorch code release for an Adobe Research and NYU paper titled Improved Baselines with Representation Autoencoders. The paper deals with image generation models, in particular a class of models called latent diffusion transformers. These models first compress an image into a smaller numerical representation called a latent, then learn to generate new latents, which are decoded back into images. RAEv2 focuses on the compression part of that pipeline.

The headline claim, as stated by the README, is that RAEv2 reaches state-of-the-art generation and reconstruction scores in just 80 training epochs where earlier baselines needed 800, giving a more than 10 times speedup. The authors also report that the same training recipe improves text-to-image generation and a kind of world model used for robot navigation, suggesting the gains are not tied to one dataset. A hero image in the README shows the reconstruction-versus-generation trade-off curve compared to FLUX VAE, SD-VAE, and SDXL-VAE.

The code is split into two stages. Stage 1 trains the representation autoencoder itself. The repository supports more than 80 pretrained vision encoders from different families, including DINOv2, DINOv3, WebSSL, EUPE, MAE, iJEPA, MoCov3, CLIP, and SigLIP2, and uses configuration files to pick which encoder, which layers, and which dataset to train on. Helper scripts extract the exponential-moving-average decoder from the final checkpoint and compute the encoder statistics used to normalise the latents.

Stage 2 then trains the latent diffusion transformer on top of those frozen latents, with separate configs for ImageNet, text-to-image, and navigation world models, and at three settings labelled k=1, k=7, and k=23.

Setup uses the uv project manager for Python dependencies and the Hugging Face CLI for downloads.
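The two Stage 1 helper steps mentioned above, keeping an exponential moving average of the decoder and normalising latents with encoder statistics, can be pictured with a small standard-library sketch. This is a generic illustration on plain lists of floats, not the repository's implementation. The function names, the decay value, and the per-channel layout are assumptions made for the example.

```python
from statistics import fmean, pstdev


def ema_update(ema_weights, live_weights, decay=0.999):
    """One EMA step: blend the running average with the current weights."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, live_weights)]


def channel_statistics(latents):
    """Per-channel mean and population standard deviation.

    `latents` is a list of samples, each a list with one float per channel.
    """
    channels = list(zip(*latents))
    means = [fmean(c) for c in channels]
    stds = [pstdev(c) for c in channels]
    return means, stds


def normalise(latent, means, stds, eps=1e-6):
    """Shift and scale one latent so each channel is roughly zero-mean, unit-variance."""
    return [(x - m) / (s + eps) for x, m, s in zip(latent, means, stds)]


# Toy data: two samples with two channels each.
samples = [[0.0, 2.0], [2.0, 6.0]]
means, stds = channel_statistics(samples)
print(normalise(samples[0], means, stds))
```

In a real pipeline the statistics would be computed once over encoder outputs for the training set and then reused unchanged during Stage 2 training and sampling.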
The repository ships pre-processed 256 by 256 datasets covering ImageNet, two text-to-image sources called BLIP3o and RenderedText, a synthetic FLUX-image dataset called Scale-RAE, and a robot navigation dataset called RECON. Pretrained encoders, Stage 1 checkpoints, and Stage 2 checkpoints are all hosted under separate Hugging Face dataset and model collections that can be downloaded individually.

Training is multi-GPU through torchrun with eight processes per node, using BF16 mixed precision, model compilation, and Weights and Biases logging configured through environment variables.

Evaluation scripts compute reconstruction metrics such as rFID, PSNR, SSIM, and LPIPS for Stage 1, and the README also links to a sampling script that reconstructs a single image to compare the model with other VAEs.
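Of the reconstruction metrics listed above, PSNR is simple enough to show in a few lines. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), on flat lists of pixel values. It is an illustration of the metric, not the repository's evaluation script, and the function names are chosen for this example.

```python
import math


def mean_squared_error(original, reconstruction):
    """Average squared difference between two equal-length pixel sequences."""
    if len(original) != len(reconstruction) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)


def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; higher means a closer reconstruction."""
    error = mean_squared_error(original, reconstruction)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


# Toy example: a four-pixel image and a slightly perturbed reconstruction.
print(psnr([10, 20, 30, 40], [12, 18, 30, 41]))
```

rFID, SSIM, and LPIPS need feature extractors or windowed statistics and are better taken from an established library than reimplemented by hand.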

Copy-paste prompts

Prompt 1
Install RAEv2 with uv, download the Stage 1 checkpoint for DINOv3, and reconstruct a sample 256x256 image
Prompt 2
Run RAEv2 Stage 2 training on ImageNet at k=7 on an 8-GPU node and log to Weights and Biases
Prompt 3
Swap the DINOv3 encoder in RAEv2 for SigLIP2 and compare rFID, PSNR, SSIM, LPIPS on the validation set
Prompt 4
Adapt the RAEv2 Stage 2 text-to-image config to the BLIP3o dataset and produce sample generations
Prompt 5
Train an RAEv2 navigation world model on RECON and explain how the latents flow back into the diffusion transformer

Frequently asked questions

What is raev2?

Official PyTorch code for Improved Baselines with Representation Autoencoders, reaching state-of-the-art latent diffusion training in 80 epochs instead of 800 by reusing pretrained vision encoders.

What language is raev2 written in?

Mainly Python. The stack also includes PyTorch, DINOv3, uv, and WandB.

How hard is raev2 to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is raev2 for?

Mainly researchers.


Verify against the repo before relying on details.