Learn how GPU clusters are built and how chips communicate during large model training
Build a basic distributed training setup from scratch using PyTorch with guided exercises
Understand how to speed up model inference and optimize resource use in production
Course content is primarily in Chinese; video lectures are hosted on Bilibili and YouTube.
AIInfra is an open-source course, written primarily in Chinese, that teaches the infrastructure layer behind large AI models. The name stands for AI Infrastructure: the hardware and software stack underneath large language models, from the chips in a cluster up to the training and inference pipelines that run on them. The course is published as Markdown articles, PDF slides, and Jupyter notebooks with code exercises, with accompanying video lectures on Bilibili and YouTube.
The course is organized into eight modules. The first gives an overview of how large model systems work, including scaling laws, which describe how model performance changes as compute or data is added. The second covers AI compute clusters: how racks of GPUs or other AI chips are connected, how performance is measured, and how clusters with tens of thousands of chips are built and operated. The third covers networking and storage: how data moves between chips during training and how checkpoints and datasets are stored.
Later modules address containers and cloud-native tooling (using Docker and Kubernetes to manage AI workloads), distributed training strategies (splitting a model across many devices and keeping them in sync), and inference optimization (making a trained model respond faster and more efficiently). The final two modules cover the algorithms and data that go into large model training, including how prompts affect model behavior, and current applications of large models.
Each section mixes conceptual explanations with practical code exercises. The hands-on notebooks cover topics such as breaking down the structure of a Transformer model, calculating how much computation a model requires, and building a basic distributed training setup from scratch using PyTorch.
The project is maintained by a creator who goes by ZOMI and is actively being expanded.
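To make "calculating how much computation a model requires" concrete, here is a minimal sketch of that kind of estimate. It is not taken from the course notebooks. It assumes two widely used rules of thumb: a decoder-only Transformer has roughly 12 * n_layers * d_model^2 non-embedding parameters, and training costs roughly 6 FLOPs per parameter per token. The function names, the example model shape, and the cluster figures (chip count, peak throughput, utilization) are all hypothetical values chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def transformer_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count of a decoder-only Transformer.

    Assumes the common approximation 12 * n_layers * d_model**2
    (attention projections plus a 4x-wide feed-forward block per layer).
    """
    return 12 * n_layers * d_model ** 2


def training_flops(n_params: int, n_tokens: int) -> int:
    """Rough total training compute, assuming ~6 FLOPs per parameter per token
    (about 2 for the forward pass and 4 for the backward pass)."""
    return 6 * n_params * n_tokens


def training_days(total_flops: float, n_chips: int,
                  peak_flops_per_chip: float, utilization: float) -> float:
    """Wall-clock days on a cluster, given the fraction of peak throughput
    that is actually sustained (utilization is an assumed input, 0 to 1)."""
    sustained = n_chips * peak_flops_per_chip * utilization
    return total_flops / sustained / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical model and cluster, for illustration only.
    params = transformer_params(n_layers=32, d_model=4096)
    flops = training_flops(params, n_tokens=10**12)
    days = training_days(flops, n_chips=1024,
                         peak_flops_per_chip=3e14, utilization=0.4)
    print(f"parameters: {params:,}")
    print(f"training FLOPs: {flops:.3e}")
    print(f"estimated days: {days:.2f}")
```

The estimate is only as good as the utilization figure: real clusters lose throughput to communication, checkpointing, and failures, which are the subjects of the cluster, networking, and distributed training modules.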
Contributions are welcome.
Verify against the repo before relying on details.