facebookincubator/aitemplate

★ 4,717 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((aitemplate))
    What it does
      Model compilation
      Operator fusion
      GPU code generation
    Platforms
      NVIDIA CUDA
      AMD ROCm
    Supported models
      BERT
      Stable Diffusion
      Vision Transformer
    Use cases
      Fast inference
      No runtime deps
      GPU optimization

Things people build with this

USE CASE 1

Compile a PyTorch Stable Diffusion model into an optimized GPU binary for much faster image generation on NVIDIA Ampere hardware.

USE CASE 2

Speed up BERT or Vision Transformer inference on AMD GPUs without writing custom CUDA kernels.

USE CASE 3

Convert an existing PyTorch model to AIT format using FX2AIT to benchmark the speed improvement before committing to a migration.

Tech stack

Python, CUDA, ROCm, C++, PyTorch

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires an NVIDIA GPU (Ampere or newer) with CUDA, or an AMD MI-210/MI-250 with ROCm, plus either Docker or a correctly matched CUDA or ROCm compiler. There is no CPU fallback.

In plain English

AITemplate (AIT) is a Python framework from Meta that takes a trained neural network model and compiles it into highly optimized C++ GPU code for fast inference. The idea is that rather than running a model through a general-purpose deep learning runtime, you generate a self-contained program specifically tuned for that model and the GPU hardware it will run on. The result is faster execution and no dependency on third-party libraries like cuBLAS or TensorRT at runtime.

It targets two GPU platforms: NVIDIA GPUs (via CUDA, with a focus on Ampere-generation cards and newer) and AMD GPUs (via ROCm/HIP, tested on the MI-210 and MI-250). The framework specializes in half-precision floating-point arithmetic using the dedicated tensor cores these GPUs provide for matrix math. The README describes performance results on models including ResNet, BERT, Vision Transformer, and Stable Diffusion.

A key part of the framework is operator fusion. Rather than executing each neural network operation one at a time, AIT merges sequences of operations into single GPU kernel calls, which reduces overhead and memory traffic. It supports horizontal fusion (merging parallel operations with different input sizes), vertical fusion (folding element-wise operations into matrix operations), and memory fusion (combining data rearrangement steps like splits and concatenations).

A companion tool called FX2AIT converts existing PyTorch models into AIT format. It handles partial conversion for models that include operations AIT does not yet support, keeping those unsupported parts running in PyTorch. The generated AIT runtime can accept PyTorch tensors directly as input without copying data.

Installation requires Docker or a correctly matched CUDA or ROCm compiler. The project is under active development, with planned work on dynamic input shapes, int8 and fp8 quantization, and integration with PyTorch 2.
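To make the vertical fusion idea above more concrete, here is a minimal conceptual sketch in plain Python. It is not AITemplate's API and not the code AIT generates; the function names (`unfused_linear_relu`, `fused_linear_relu`) and the tiny matrices are hypothetical, chosen only for illustration. The unfused version makes three passes and writes a full intermediate matrix after each one, while the fused version applies the bias add and ReLU to each matrix-multiply result before it is stored, which is roughly what folding element-wise operations into a matrix operation means.

```python
# Conceptual sketch only: plain Python, not AITemplate's API or generated code.

def matmul(a, b):
    """Naive matrix multiply on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def unfused_linear_relu(x, w, bias):
    """Three separate passes, each producing a full intermediate matrix."""
    y = matmul(x, w)                                             # pass 1: matrix multiply
    y = [[v + bias[j] for j, v in enumerate(row)] for row in y]  # pass 2: bias add
    y = [[max(v, 0.0) for v in row] for row in y]                # pass 3: ReLU
    return y


def fused_linear_relu(x, w, bias):
    """One pass: bias add and ReLU are applied to each output element
    right after its dot product, so no intermediate matrix is stored."""
    rows, inner, cols = len(x), len(w), len(w[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(x[i][k] * w[k][j] for k in range(inner))
            out[i][j] = max(acc + bias[j], 0.0)
    return out


# Hypothetical toy inputs: a 2x2 activation, a 2x2 weight, and a bias vector.
x = [[1.0, -2.0], [0.5, 3.0]]
w = [[2.0, 0.0], [1.0, -1.0]]
bias = [0.5, -0.25]

# Both versions compute the same result; only the number of passes differs.
assert unfused_linear_relu(x, w, bias) == fused_linear_relu(x, w, bias)
```

On a GPU, each pass in the unfused version would typically be its own kernel launch with a round trip through device memory, which is the overhead and memory traffic that fusion is meant to reduce.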

Copy-paste prompts

Prompt 1

Walk me through compiling a simple PyTorch model using AITemplate's FX2AIT tool on an NVIDIA Ampere GPU inside the Docker environment.

Prompt 2

How does AITemplate's operator fusion reduce memory usage compared to running each layer separately in PyTorch? Give a concrete example.

Prompt 3

Show me how to set up the AITemplate Docker environment on a machine with an A100 GPU and run the Stable Diffusion inference example.

Prompt 4

I have a custom PyTorch model with some unsupported ops. How does FX2AIT handle partial conversion and which parts still run in PyTorch?

Verify against the repo before relying on details.