nvidia/nccl

★ 4,702 · C++


In plain English

NCCL (pronounced "Nickel") is a library from NVIDIA that lets multiple GPUs exchange data efficiently. Training large AI models is commonly spread across many GPUs at once, and those GPUs must constantly share intermediate results. NCCL provides the low-level communication routines that handle this exchange as fast as the hardware allows.

The library supports the standard collective operations of distributed computing: all-reduce, all-gather, reduce, broadcast, and reduce-scatter, plus direct send and receive between any two GPUs. These are the building blocks that AI training frameworks such as PyTorch and TensorFlow use under the hood when running across multiple GPUs.

NCCL is optimized for the interconnects that link GPUs together, including PCIe, NVLink (NVIDIA's fast direct GPU interconnect), NVSwitch (a switching chip for large GPU clusters), and network fabrics such as InfiniBand. It works whether the GPUs are all in one machine or spread across many machines in a data center.

To use it, download pre-built packages from NVIDIA's developer site or compile the source yourself using the provided Makefile. Installation packages are available for Debian, Ubuntu, and Red Hat systems. A separate repository holds the test suite, which you can use to benchmark communication bandwidth after setup.
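To make the collective operations listed above concrete, here is a minimal sketch of what each one computes. It is a pure-Python simulation in which each "rank" (one GPU) is a plain list; it does not call NCCL, and every function name in it is hypothetical and chosen only for illustration. In the real C API the corresponding calls are ncclAllReduce, ncclReduce, ncclBroadcast, ncclAllGather and ncclReduceScatter, which run on GPU buffers and perform the actual data movement over the interconnect.

```python
from functools import reduce
import operator

# Illustration only: each inner list stands in for one GPU's buffer.
# These helpers are hypothetical and model the *result* of each collective,
# not how NCCL moves the data.

def elementwise(buffers, op=operator.add):
    """Combine all ranks' buffers element by element with `op`."""
    return [reduce(op, column) for column in zip(*buffers)]

def all_reduce(buffers, op=operator.add):
    """Every rank ends up with the full element-wise reduction."""
    result = elementwise(buffers, op)
    return [list(result) for _ in buffers]

def reduce_to_root(buffers, root=0, op=operator.add):
    """Only `root` receives the reduction; other ranks get nothing (None)."""
    result = elementwise(buffers, op)
    return [list(result) if rank == root else None
            for rank in range(len(buffers))]

def broadcast(buffers, root=0):
    """Every rank ends up with a copy of the root's buffer."""
    return [list(buffers[root]) for _ in buffers]

def all_gather(buffers):
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]

def reduce_scatter(buffers, op=operator.add):
    """Reduce element-wise, then give rank i the i-th equal-sized chunk."""
    result = elementwise(buffers, op)
    chunk = len(result) // len(buffers)
    return [result[i * chunk:(i + 1) * chunk] for i in range(len(buffers))]

# Two simulated ranks, two elements each.
ranks = [[1, 2], [10, 20]]
summed = all_reduce(ranks)          # both ranks hold the element-wise sum
on_root = reduce_to_root(ranks, 0)  # only rank 0 holds the sum
copied = broadcast(ranks, root=1)   # both ranks hold rank 1's data
joined = all_gather(ranks)          # both ranks hold rank 0's data then rank 1's
split = reduce_scatter(ranks)       # rank i holds chunk i of the sum
```

All-reduce is the one training frameworks lean on most: summing gradients so that every GPU applies the same update. Its result matches a reduce-scatter followed by an all-gather, which is why those two operations appear alongside it.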



Verify against the repo before relying on details.