explaingit

tencent-hunyuan/gear

Analysis updated 2026-05-18

Stars · 60 | Language · Python | Audience · researcher | Complexity · 5/5 | Setup · hard

TLDR

A research codebase from Tencent Hunyuan that trains an image tokenizer and autoregressive generator together end-to-end, achieving roughly 10x faster convergence to strong image quality compared to separate training.

Mindmap

mindmap
  root((GEAR))
    Core Idea
      End-to-end training
      Dual-path gradient
      Tokenizer guided by AR
    Tokenizer Types
      VQ-16
      LFQ-16
      IBQ-16
    Results
      10x faster convergence
      Better gFID ImageNet
      Text-to-image gains
    Setup
      PyTorch and CUDA
      Conda environment
      HuggingFace weights

What do people build with it?

USE CASE 1

Replace the separately-trained tokenizer in an autoregressive image generation pipeline with a GEAR-tuned one to improve generation quality faster

USE CASE 2

Reproduce the ImageNet class-conditional and text-to-image results from the GEAR paper using the provided training and evaluation scripts

USE CASE 3

Fine-tune the released GEAR tokenizer weights on your own image dataset and drop them into a standard AR generation pipeline

What is it built with?

Python · PyTorch · CUDA · Hugging Face · conda

How does it compare?

Repo | tencent-hunyuan/gear | 0xh4ku/manga-pdf-to-epub | ayyouboss0011/sherlockmaps
Stars | 60 | 60 | 60
Language | Python | Python | Python
Setup difficulty | hard | moderate | moderate
Complexity | 5/5 | 2/5 | 3/5
Audience | researcher | general | data

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Requires an NVIDIA GPU with CUDA. Benchmark evaluation needs multiple separate conda environments, and training needs the full ImageNet-1K dataset.
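The requirements above can be checked before committing to a long setup. The following is a minimal preflight sketch, not part of the GEAR repository: the function name `preflight`, the path `data/imagenet/train`, and the assumption that the ImageNet-1K training split is stored as 1000 class subdirectories are all illustrative. It only checks that `nvidia-smi` and `conda` are on the PATH and counts class folders; it does not verify CUDA versions or any layout the repo's scripts may actually expect.

```python
import shutil
from pathlib import Path


def preflight(imagenet_train_dir):
    """Return basic environment checks as a dict (all checks are heuristics)."""
    train_dir = Path(imagenet_train_dir)
    # Count immediate subdirectories; a missing directory counts as zero.
    if train_dir.is_dir():
        class_dirs = [p for p in train_dir.iterdir() if p.is_dir()]
    else:
        class_dirs = []
    return {
        # nvidia-smi ships with the NVIDIA driver, so its presence hints at a usable GPU.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "conda_on_path": shutil.which("conda") is not None,
        "imagenet_class_dirs": len(class_dirs),
        # Assumption: one folder per class, 1000 classes in ImageNet-1K.
        "imagenet_looks_complete": len(class_dirs) == 1000,
    }


if __name__ == "__main__":
    # Hypothetical path; point this at your own ImageNet-1K train split.
    for name, value in preflight("data/imagenet/train").items():
        print(f"{name}: {value}")
```

A failed check here does not prove the setup is broken, and a passing one does not prove it works; the repo's own instructions remain the reference.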

In plain English

GEAR is a research project from Tencent Hunyuan and Peking University that proposes a new way to train AI image generation models. It accompanies a published paper and provides the official PyTorch code.

Most modern AI image generators that use autoregressive (token-by-token) generation follow a two-step pipeline: first a "tokenizer" compresses images into a sequence of discrete codes (tokens), and then a separate model learns to predict those codes in order to generate new images. These two components are almost always trained independently. GEAR's core contribution is training them together in a single end-to-end pass, so the tokenizer learns to produce tokens that are easier for the generator to predict.

The technical challenge is that the tokenization step chooses the single best code for each image patch (an argmax operation). That choice is not differentiable, so it normally blocks gradient information from flowing back into the tokenizer during generator training. GEAR works around this with a dual-path approach: one path uses the hard discrete codes to train the generator as usual, while a second path uses a softened, differentiable version of the same step to carry a gradient signal back that updates only the tokenizer. The two remain decoupled, so neither interferes with the other's training objective.

The practical result, shown on the standard ImageNet benchmark, is roughly 10 times faster convergence to a strong generation quality score (gFID) compared to training the tokenizer and generator separately. On a text-to-image task the improvement is even larger on certain metrics.

The repository releases pre-trained tokenizer weights for three quantizer variants (VQ, LFQ, IBQ) and provides training and evaluation scripts. An NVIDIA GPU with CUDA support is required to run any part of it. This project is intended for AI researchers, not general users.
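A toy example can show why the hard code choice blocks gradients while a softened version does not. This is a minimal sketch in plain Python, not GEAR's implementation: it assumes a generic softmax relaxation over negative squared distances, and the names `hard_quantize`, `soft_assignment`, `soft_quantize`, the three-entry codebook, and the `temperature` parameter are all hypothetical. It nudges an input slightly and compares how each path responds.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_quantize(z, codebook):
    """Hard path: index of the nearest code (a discrete argmin over distances)."""
    distances = [squared_distance(z, c) for c in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)


def soft_assignment(z, codebook, temperature=1.0):
    """Soft path: softmax over negative distances, a smooth function of z."""
    logits = [-squared_distance(z, c) / temperature for c in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(z, codebook, temperature=1.0):
    """Weighted mix of codebook entries using the soft assignment weights."""
    weights = soft_assignment(z, codebook, temperature)
    return [sum(w * c[d] for w, c in zip(weights, codebook)) for d in range(len(z))]


# Illustrative values only: three 2-D codes and one encoder output.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.2, 0.1]
z_nudged = [0.2 + 1e-3, 0.1]  # small move toward code 1

hard_before = hard_quantize(z, codebook)
hard_after = hard_quantize(z_nudged, codebook)
soft_before = soft_assignment(z, codebook)
soft_after = soft_assignment(z_nudged, codebook)

# The hard index is piecewise constant, so the nudge leaves it unchanged.
print("hard index changed:", hard_before != hard_after)
# The soft weights shift slightly, which is what lets a gradient reach z.
print("soft weight shift for code 1:", soft_after[1] - soft_before[1])
```

In an autograd framework, the same contrast means the hard path contributes no gradient with respect to the encoder output, while the soft path does. How GEAR actually constructs its soft branch and routes that gradient to the tokenizer is specified in the paper and repository, not here.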

Copy-paste prompts

Prompt 1
I want to download the GEAR-VQ tokenizer checkpoint and use it to run inference on ImageNet images. Show me the exact huggingface-cli download command and how to call the inference script.
Prompt 2
Explain the GEAR dual-path training trick in plain terms: why can't you train the tokenizer and generator end-to-end normally, and how does the soft branch solve that?
Prompt 3
I have an existing LlamaGen VQ-16 autoregressive model. How do I fine-tune its tokenizer with GEAR and what datasets and GPU memory do I need?
Prompt 4
What are the differences between the VQ, LFQ, and IBQ tokenizer variants in GEAR and which one gives the best gFID on ImageNet class-conditional generation?

Frequently asked questions

What is gear?

A research codebase from Tencent Hunyuan that trains an image tokenizer and autoregressive generator together end-to-end, achieving roughly 10x faster convergence to strong image quality compared to separate training.

What language is gear written in?

Mainly Python. The stack also includes PyTorch, CUDA, Hugging Face, and conda.

How hard is gear to set up?

Setup difficulty is rated hard, with roughly 1 day or more to a first successful run.

Who is gear for?

Mainly researchers.

Verify against the repo before relying on details.