explaingit

foundationvision/var

8,683 | Jupyter Notebook | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

VAR is an AI image generation project whose paper won a Best Paper Award at NeurIPS 2024. It builds images coarse-to-fine across scales using next-scale prediction and outperforms diffusion models on several benchmarks. Pretrained models from 310M to 2.3B parameters are available on Hugging Face.

Mindmap

mindmap
  root((VAR))
    What It Does
      Next-scale prediction
      Coarse to fine images
      Beats diffusion models
    Models Available
      310M parameters
      2.3B parameters
      Hugging Face weights
    How To Use
      Demo notebook
      Pretrained inference
      Train from scratch
    Research Context
      NeurIPS 2024 Best Paper
      Scaling laws
      Infinity follow-on

Things people build with this

USE CASE 1

Generate high-quality images using pretrained VAR models via the included Jupyter notebook without writing much code.

USE CASE 2

Reproduce or extend next-scale image generation research using the provided training scripts and ImageNet dataset.

USE CASE 3

Compare VAR's coarse-to-fine image generation against diffusion-based methods on standard FID benchmarks.
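
For the comparison in Use Case 3, it can help to see what the Fréchet distance behind FID measures. The following is a minimal sketch, not the evaluation code from the repository. It assumes feature vectors have already been extracted (real FID uses features from an Inception network) and simplifies each covariance matrix to its diagonal so that it runs with the Python standard library alone. The function name `frechet_distance_diagonal` is hypothetical.

```python
from statistics import fmean, pstdev


def frechet_distance_diagonal(feats_a, feats_b):
    """Frechet distance between two Gaussians fitted to feature sets.

    Simplifying assumption: covariances are diagonal, so the trace term
    Tr(S1 + S2 - 2 * sqrt(S1 * S2)) reduces to sum((sd1 - sd2) ** 2).
    Real FID uses the full covariance of Inception features.
    """
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        # Squared difference of the per-dimension means.
        mean_term = (fmean(col_a) - fmean(col_b)) ** 2
        # Squared difference of the per-dimension (population) std devs.
        spread_term = (pstdev(col_a) - pstdev(col_b)) ** 2
        total += mean_term + spread_term
    return total


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 2.0]]
    fake = [[1.0, 0.0], [3.0, 2.0]]
    print(frechet_distance_diagonal(real, real))
    print(frechet_distance_diagonal(real, fake))
```

Two identical feature sets give a distance of zero, and the value grows as the generated features drift from the real ones, which is why lower FID is better.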

Tech stack

Python | PyTorch | Jupyter Notebook | Hugging Face

Getting it running

Difficulty · hard | Time to first run · 30 min

Inference with pretrained weights requires a GPU; training from scratch needs the full ImageNet dataset and substantial compute resources.

In plain English

VAR, short for Visual Autoregressive Modeling, is a research project from 2024 that introduced a new way to generate images with artificial intelligence. It won the Best Paper Award at NeurIPS 2024, one of the most prestigious AI research conferences. The paper argues for an approach to image generation that competes with, and in some benchmarks outperforms, the diffusion-based methods (like Stable Diffusion) that have dominated AI image generation in recent years.

The core idea is a shift in how image generation is framed. Most autoregressive image models generate an image pixel-by-pixel or token-by-token in a left-to-right, top-to-bottom order, similar to how a language model writes text one word at a time. VAR instead generates images coarse-to-fine: it first predicts a very low-resolution version of the whole image, then progressively refines it at increasing resolutions until the final image is complete. The paper calls this "next-scale prediction" rather than "next-token prediction."

The repository provides pretrained models of several sizes, ranging from 310 million to 2.3 billion parameters, available for download from Hugging Face. A Jupyter notebook is included so you can load a model and generate images without writing much code yourself. Larger models produce better results as measured by a standard quality metric called FID (lower is better), and the paper documents that these improvements follow predictable scaling laws similar to those observed in large language models.

Training the model from scratch requires the ImageNet dataset and substantial compute. The README includes training scripts and configuration details for researchers who want to reproduce or extend the work. For most people, the pretrained weights and the demo notebook are the practical entry point.

A follow-on project called Infinity, also linked from this repository, extends the VAR approach to text-to-image generation and was accepted at CVPR 2025.
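
The coarse-to-fine idea described above can be illustrated with a toy loop. This is only a sketch under loud assumptions and not the repository's implementation: the scale schedule `SCALES`, the helper names, and `toy_predict_scale` are all hypothetical, random noise stands in for the transformer, and continuous values stand in for the discrete tokens a real model would predict. What it does show is the control flow: each step emits an entire map at one resolution, conditioned on everything coarser, and adds it to a running canvas.

```python
import random

# Hypothetical schedule of map side lengths, coarse to fine.
SCALES = (1, 2, 4, 8)


def upsample_nearest(grid, size):
    """Nearest-neighbour upsample of a square grid to size x size."""
    n = len(grid)
    return [[grid[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def downsample_mean(grid, size):
    """Average-pool a square grid to size x size (size must divide its side)."""
    factor = len(grid) // size
    return [[sum(grid[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(size)]
            for r in range(size)]


def toy_predict_scale(canvas, size, rng):
    """Stand-in for the model: emit a whole size x size map in one step.

    The map depends on the canvas built from coarser scales (pooled to the
    target size) plus noise whose amplitude shrinks at finer scales.
    """
    pooled = downsample_mean(canvas, size)
    return [[rng.gauss(0.0, 1.0 / size) - 0.1 * pooled[r][c]
             for c in range(size)]
            for r in range(size)]


def generate(scales=SCALES, seed=0):
    """Build an image coarse-to-fine; returns (canvas, number_of_steps)."""
    rng = random.Random(seed)
    final = scales[-1]
    canvas = [[0.0] * final for _ in range(final)]
    steps = 0
    for size in scales:
        residual = toy_predict_scale(canvas, size, rng)
        up = upsample_nearest(residual, final)
        # Accumulate this scale's contribution onto the running canvas.
        canvas = [[canvas[r][c] + up[r][c] for c in range(final)]
                  for r in range(final)]
        steps += 1
    return canvas, steps


if __name__ == "__main__":
    image, steps = generate()
    print(f"{len(image)}x{len(image[0])} canvas built in {steps} steps")
```

With this toy schedule the 8x8 canvas is finished after 4 prediction steps, whereas a raster-order model emitting one position at a time would need 64 steps for the same grid.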

Copy-paste prompts

Prompt 1
Using the foundationvision/var pretrained VAR-d30 model from Hugging Face, write a Python script to generate an image from a class label using the demo notebook as a reference.
Prompt 2
Show me how to load a VAR model checkpoint and run next-scale prediction inference to generate a 256x256 image in Python.
Prompt 3
Write a script using the VAR repository to generate a batch of images from ImageNet class labels and compute FID scores.
Prompt 4
How do I set up the VAR training pipeline from scratch using ImageNet, following the configuration in the foundationvision/var repository?
Prompt 5
Explain how VAR's next-scale prediction differs from standard left-to-right token prediction for image generation, based on the VAR NeurIPS 2024 paper.

Verify against the repo before relying on details.