google-research/vision_transformer

★ 12,518 | Jupyter Notebook | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((vision_transformer))
    What it does
      Image recognition
      Patch-based input
      Pre-trained models
    Architectures
      Vision Transformer ViT
      MLP-Mixer
    Tech stack
      Python and JAX
      Flax framework
      Jupyter Notebooks
    Usage
      Fine-tuning
      Colab experiments
      Cloud VM training

Things people build with this

USE CASE 1

Fine-tune a pre-trained ViT model on your own image dataset to build a custom image classifier without training from scratch.

USE CASE 2

Run the included Colab notebooks to experiment with Vision Transformer inference directly in a browser.

USE CASE 3

Compare MLP-Mixer against ViT on your image task to choose the architecture that fits your compute budget.

Tech stack

Python, JAX, Flax, Jupyter Notebook

Getting it running

Difficulty · hard | Time to first run · 1h+

Serious training requires JAX with GPU or TPU support and a cloud VM. The Colab notebooks allow quick browser-based experiments with no local setup.

No license information is mentioned in the explanation.

In plain English

This repository, published by Google Research, contains the code and pre-trained models from several research papers on image recognition. The central idea behind the Vision Transformer (ViT) is to treat an image the way a language model treats a sequence of words: the image is sliced into small patches, and those patches are fed through the same kind of model architecture used in natural language processing. This was a notable departure from how image recognition had traditionally been done, and the papers demonstrate that the approach can match or outperform older methods when trained on large datasets.

Alongside the Vision Transformer models, the repository includes MLP-Mixer, a related architecture built from multi-layer perceptrons (MLPs) rather than the attention mechanism. The repository also covers follow-up research on how to train these models more effectively, including which data volumes, augmentation techniques, and regularization strategies produce the best results.

All the models were pre-trained on large image datasets and are made available for fine-tuning, which means taking a pre-trained model and continuing to train it on a smaller, task-specific dataset.

The code is written in JAX and Flax, two Python-based frameworks for numerical computing and neural network research developed at Google. The repository includes interactive Jupyter notebooks hosted on Google Colab, which let people experiment with the models in a browser without setting up a local environment. For more serious training runs, the README walks through setting up a cloud-based virtual machine.

This summary covers only part of the README, which is longer than the excerpt it was based on.
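To make the patch idea concrete, here is a minimal pure-Python sketch of slicing an image into non-overlapping patches and flattening each one into a vector. It is an illustration under stated assumptions, not code from the repository: the function names `split_into_patches` and `make_dummy_image` are hypothetical, the image is a plain nested list rather than a JAX array, and the repository's own patch embedding may be implemented differently. The 224×224 input and 16×16 patch size are assumed because they match the "/16" in model names such as ViT-B/16.

```python
# Illustrative sketch only: pure-Python patch extraction, not the repository's code.

def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of patches in row-major order. Each patch is a flat list
    of patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append all channel values
            patches.append(patch)
    return patches


def make_dummy_image(height, width, channels=3):
    """Build a hypothetical image whose pixel values encode their position."""
    return [
        [[row * width + col] * channels for col in range(width)]
        for row in range(height)
    ]


image = make_dummy_image(224, 224)            # assumed input resolution
patches = split_into_patches(image, patch_size=16)

print(len(patches))     # 196 patches: (224 / 16) * (224 / 16)
print(len(patches[0]))  # 768 values per patch: 16 * 16 * 3
```

The resulting list of 196 vectors plays the role that a sequence of word tokens plays in a language model. In a real model each flattened patch would then be projected to the model's hidden size and combined with position information before entering the Transformer layers.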

Copy-paste prompts

Prompt 1

Using google-research/vision_transformer, how do I fine-tune the ViT-B/16 pre-trained model on my custom 10-class image dataset with JAX and Flax?

Prompt 2

Show me how to load a Vision Transformer checkpoint from this repo and run inference on a single image file.

Prompt 3

What data augmentation and regularization strategies does the vision_transformer repo recommend for getting the best fine-tuning results on a small dataset?

Prompt 4

How do I set up a Google Cloud VM to run a full training job using the google-research/vision_transformer codebase?

Verify against the repo before relying on details.