Analysis updated 2026-06-24
Train a latent diffusion transformer on ImageNet in 80 epochs instead of 800
Swap SD-VAE or FLUX VAE for a representation autoencoder in your image generator
Build a world model for robot navigation on the RECON dataset
Compare 80+ pretrained vision encoders as autoencoder backbones with one config switch
| | nanovisionx/raev2 | hiangx-robotics/metafine | wanshuiyin/aris-in-ai-offer |
|---|---|---|---|
| Stars | 70 | 70 | 71 |
| Language | Python | Python | Python |
| Setup difficulty | hard | hard | easy |
| Complexity | 5/5 | 5/5 | 2/5 |
| Audience | researcher | researcher | researcher |
Figures from each repo's GitHub metadata at analysis time.
Training requires multi-GPU torchrun with BF16, and the datasets and checkpoints are large Hugging Face downloads.
RAEv2 is the official PyTorch code release for an Adobe Research and NYU paper titled Improved Baselines with Representation Autoencoders. The paper deals with image generation models, in particular a class of models called latent diffusion transformers. These models first compress an image into a smaller numerical representation called a latent, then learn to generate new latents, which are decoded back into images. RAEv2 focuses on the compression part of that pipeline. The headline claim, as stated by the README, is that RAEv2 reaches state-of-the-art generation and reconstruction scores in just 80 training epochs where earlier baselines needed 800, which the README describes as a more than 10 times speedup. The authors also report that the same training recipe improves text-to-image generation and a kind of world model used for robot navigation, suggesting the gains are not tied to one dataset. A hero image in the README shows the reconstruction-versus-generation trade-off curve compared to FLUX VAE, SD-VAE, and SDXL-VAE.

The code is split into two stages. Stage 1 trains the representation autoencoder itself. The repository supports more than 80 pretrained vision encoders from different families, including DINOv2, DINOv3, WebSSL, EUPE, MAE, iJEPA, MoCov3, CLIP, and SigLIP2, and uses configuration files to pick which encoder, which layers, and which dataset to train on. Helper scripts extract the exponential-moving-average decoder from the final checkpoint and compute the encoder statistics used to normalise the latents. Stage 2 then trains the latent diffusion transformer on top of those frozen latents, with separate configs for ImageNet, text-to-image, and navigation world models, each at three settings labelled k=1, k=7, and k=23.

Setup uses the uv project manager for Python dependencies and the Hugging Face CLI for downloads.
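The two Stage 1 helper steps mentioned above, keeping an exponential-moving-average copy of the weights and normalising latents with encoder statistics, can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code: the function names, the `(N, tokens, channels)` latent shape, the decay value, and the per-channel form of the statistics are all assumptions made for the example.

```python
import numpy as np


def ema_update(ema_weights, live_weights, decay=0.999):
    """Blend live weights into a running average, one array per parameter name.

    Hypothetical helper: the decay value and dict layout are illustrative only.
    """
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * live_weights[name]
        for name in ema_weights
    }


def latent_stats(latents):
    """Per-channel mean and std over latents shaped (N, tokens, channels)."""
    mean = latents.mean(axis=(0, 1))
    std = latents.std(axis=(0, 1))
    return mean, std


def normalise(latents, mean, std, eps=1e-6):
    """Shift and scale latents so each channel is roughly zero-mean, unit-std."""
    return (latents - mean) / (std + eps)


def denormalise(latents, mean, std, eps=1e-6):
    """Invert normalise() before handing latents to a decoder."""
    return latents * (std + eps) + mean


# Random stand-in for the outputs of a frozen encoder (assumed shape).
rng = np.random.default_rng(0)
latents = rng.normal(loc=3.0, scale=2.0, size=(16, 256, 8))

mean, std = latent_stats(latents)
z = normalise(latents, mean, std)       # what a diffusion model would train on
restored = denormalise(z, mean, std)    # what a decoder would receive

# One EMA step on a toy parameter dict.
ema = ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, decay=0.9)
```

In a setup like this, the statistics would be computed once over the training latents and stored, so that the diffusion model sees inputs on a consistent scale and generated latents can be mapped back before decoding.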
The repository ships pre-processed 256 by 256 datasets covering ImageNet, two text-to-image sources called BLIP3o and RenderedText, a synthetic FLUX-image dataset called Scale-RAE, and a robot navigation dataset called RECON. Pretrained encoders, Stage 1 checkpoints, and Stage 2 checkpoints are all hosted under separate Hugging Face dataset and model collections that can be downloaded individually. Training is multi-GPU through torchrun with eight processes per node, using BF16 mixed precision, model compilation, and Weights and Biases logging through environment variables. Evaluation scripts compute reconstruction metrics like rFID, PSNR, SSIM, and LPIPS for Stage 1, and the README also links to a sampling script that reconstructs a single image to compare the model with other VAEs.
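Of the reconstruction metrics listed above, PSNR is the simplest to state directly: it is the log-scaled ratio of the squared peak pixel value to the mean squared error between the original and the reconstruction. The sketch below is a generic NumPy version for 8-bit images, not the repository's evaluation script; the function name and the `max_value` default are assumptions for illustration.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    if reference.shape != reconstruction.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy 256 by 256 RGB images: one all zeros, one off by exactly 1 everywhere.
original = np.zeros((256, 256, 3), dtype=np.uint8)
off_by_one = np.ones((256, 256, 3), dtype=np.uint8)

score = psnr(original, off_by_one)
perfect = psnr(original, original)
```

Higher PSNR means a closer pixel-level match. rFID, SSIM, and LPIPS measure distributional, structural, and perceptual similarity respectively and need more machinery than fits in a short example.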
Official PyTorch code for Improved Baselines with Representation Autoencoders, reaching state-of-the-art latent diffusion results in 80 training epochs instead of 800 by reusing pretrained vision encoders.
Mainly Python. The stack also includes PyTorch and DINOv3.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.