explaingit

nvidia/apex

8,959 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

A Python library from NVIDIA that speeds up AI model training on NVIDIA GPUs by mixing 16-bit and 32-bit math (mixed precision) and by spreading computation across multiple GPUs or machines.

Mindmap

mindmap
  root((repo))
    What it does
      Mixed precision
      Distributed training
      GPU optimization
    Tech
      PyTorch
      CUDA extensions
      C++ compiled
    Use cases
      Faster training
      Multi-GPU scaling
      Memory reduction
    Setup
      GPU required
      CUDA toolkit
      Python-only option

Things people build with this

USE CASE 1

Reduce memory usage and speed up a large PyTorch model's training time by switching to mixed-precision without rewriting your training loop.
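As a hedged illustration of this use case, the sketch below shows the two touch points that Apex's `amp` module adds to an ordinary training loop: `amp.initialize` wraps the model and optimizer, and `amp.scale_loss` wraps the backward pass. The helper names `enable_mixed_precision` and `train_step` are hypothetical, and `apex` is imported lazily so the file still loads on machines without it. Depending on the installed Apex version, `apex.amp` may be deprecated or unavailable, in which case PyTorch's built-in automatic mixed precision is the closer fit.

```python
def enable_mixed_precision(model, optimizer, opt_level="O1"):
    """Wrap an existing model/optimizer pair with Apex AMP.

    Hypothetical helper. Assumes `model` has already been moved to a CUDA
    device and that the installed Apex build still ships `apex.amp`.
    """
    from apex import amp  # lazy import: only needed when this is called

    return amp.initialize(model, optimizer, opt_level=opt_level)


def train_step(model, optimizer, loss_fn, inputs, targets):
    """Run one optimization step with loss scaling (hypothetical helper)."""
    from apex import amp

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # scale_loss yields a scaled copy of the loss so small fp16 gradients
    # do not underflow; gradients are unscaled when the context exits.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss
```

The rest of the loop (data loading, logging, checkpointing) stays as it was; only the setup call and the backward pass change.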

USE CASE 2

Train a neural network across multiple GPUs or machines to finish in hours instead of days.
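To make the idea concrete, here is a plain-Python sketch of data parallelism, the scheme that distributed training wrappers automate: each worker takes a slice of the batch, computes its own gradient, and the gradients are averaged so every worker applies the same update. It is an illustration only, not Apex's implementation; `shard`, `local_gradient`, `all_reduce_mean`, and the toy model `y = w * x` are assumptions made for the example.

```python
def shard(items, world_size, rank):
    """Return the slice of `items` handled by worker `rank` (round-robin)."""
    return items[rank::world_size]


def local_gradient(w, samples):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)


# Toy dataset generated by y = 3x, split across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
world_size = 4
w = 0.0

grads = [local_gradient(w, shard(data, world_size, r)) for r in range(world_size)]
averaged = all_reduce_mean(grads)

# With equal-sized shards, the averaged gradient matches the full-batch one.
full_batch = local_gradient(w, data)
print(averaged, full_batch)
```

On real hardware each shard's gradient is computed on a different GPU at the same time, which is where the speedup comes from.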

USE CASE 3

Fit a model into GPU memory that was previously too large to train at full 32-bit precision.
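A back-of-the-envelope calculation shows where the savings come from: a tensor stored as 16-bit floats takes half the bytes of the same tensor stored as 32-bit floats. The sketch below uses a made-up activation shape; actual savings depend on which tensors the chosen mixed-precision mode keeps in 32-bit.

```python
from math import prod

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def tensor_bytes(shape, dtype):
    """Bytes needed to store a dense tensor of the given shape and dtype."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


# Hypothetical activation tensor: batch 64, sequence 1024, hidden size 4096.
shape = (64, 1024, 4096)
for dtype in ("fp32", "fp16"):
    gib = tensor_bytes(shape, dtype) / 2**30
    print(f"{dtype}: {gib:.2f} GiB")
```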

Tech stack

Python · PyTorch · CUDA · C++

Getting it running

Difficulty · hard | Time to first run · 1h+

Full installation compiles C++ and CUDA extensions and requires a CUDA toolkit and an NVIDIA GPU. A slower Python-only install is also available.

In plain English

Apex is a collection of tools from NVIDIA that makes training AI models faster and more efficient when using PyTorch on NVIDIA GPUs. PyTorch is a widely used framework for building and training machine learning models, and Apex adds features that NVIDIA developed specifically to get better performance out of its hardware.

The two main things Apex offers are mixed precision training and distributed training. Mixed precision means the model uses a combination of 16-bit and 32-bit numbers during training instead of always using 32-bit. This reduces memory usage and speeds up computation on modern NVIDIA GPUs, which have dedicated hardware for 16-bit math. Distributed training means spreading the work across multiple GPUs, or even multiple machines, so that large models train faster because the computation runs in parallel.

NVIDIA maintains Apex as a place to release optimized utilities quickly, before they might eventually be folded into the main PyTorch project. Some of the code here has already been incorporated into PyTorch itself, and more is planned to follow.

Apex can be installed on a machine with a compatible NVIDIA GPU, or used through NVIDIA's pre-built container images. The full-performance version compiles custom C++ and CUDA extensions during installation, which requires a working CUDA toolkit. A simpler Python-only install is also available but runs slower because it skips the low-level compiled components.

This is a developer-facing library used during model training, not an end-user application. It is primarily useful for research teams or engineers who are training large neural networks and want to reduce training time or train models that would not otherwise fit in GPU memory.
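Returning to the mixed precision idea described above: one reason the precision is mixed rather than purely 16-bit is that very small values, such as some gradients, round to zero in half precision. The sketch below demonstrates this with Python's standard `struct` module (format code `e` is IEEE 754 half precision) and shows the common remedy of scaling a value up before the 16-bit step and dividing it back out in higher precision. The gradient value and the scale factor 65536 are illustrative assumptions, not settings read from Apex.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8   # illustrative gradient, below fp16's smallest subnormal (~6e-8)
scale = 65536.0    # illustrative loss-scale factor (2**16)

# Stored directly in fp16, the gradient underflows to zero.
print(to_fp16(tiny_grad))

# Scaled up first, it survives the fp16 round trip; unscaling in higher
# precision recovers it to within fp16's relative rounding error.
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)
```

Keeping the unscaled result and the weight update in 32-bit is what lets the 16-bit parts of the computation stay fast without losing these small contributions.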

Copy-paste prompts

Prompt 1
I have a standard PyTorch training loop. Show me exactly how to wrap my model and optimizer with Apex's AMP to enable mixed-precision training with minimal code changes.
Prompt 2
How do I use Apex's DistributedDataParallel to train a PyTorch model across 4 GPUs on a single machine? Show the launch command and the code changes needed.
Prompt 3
My model runs out of GPU memory during training. Walk me through which Apex features I should try first to reduce memory usage, and in what order.
Prompt 4
What is the difference between Apex's O1 and O2 mixed-precision optimization levels, and when should I choose each one?

Verify against the repo before relying on details.