infrasys-ai/aiinfra

★ 7,000 | Jupyter Notebook | Audience · researcher | Complexity · 3/5 | Setup · easy

Mindmap

mindmap
  root((aiinfra))
    What it covers
      Scaling laws
      GPU clusters
      Distributed training
      Inference optimization
    Formats
      Markdown articles
      Jupyter notebooks
      PDF slides
      Video lectures
    Tech
      PyTorch exercises
      Docker Kubernetes
      Transformer internals
    Audience
      AI engineers
      Researchers
      Students

Things people build with this

USE CASE 1

Learn how GPU clusters are built and how chips communicate during large model training

USE CASE 2

Build a basic distributed training setup from scratch using PyTorch with guided exercises

USE CASE 3

Understand how to speed up model inference and optimize resource use in production

Tech stack

Python, PyTorch, Docker, Kubernetes, Jupyter Notebook

Getting it running

Difficulty · easy | Time to first run · 30 min

The course content is primarily in Chinese; video lectures are hosted on Bilibili and YouTube.

In plain English

AIInfra is an open-source course, written primarily in Chinese, that teaches the infrastructure layer behind large AI models. The name stands for AI Infrastructure, meaning the hardware and software stack that sits underneath large language models and handles everything from the chips in a cluster up to the training and inference pipelines that run on top. The course content is published as Markdown articles, PDF slides, and Jupyter notebooks with code exercises, and accompanying video lectures are hosted on Bilibili and YouTube.

The course is organized into eight modules. The first gives an overview of how large model systems work, including a discussion of scaling laws, which describe how model performance changes as you add more compute or data. The next module covers AI compute clusters: how racks of GPUs or other AI chips are connected together, how performance is measured, and how massive clusters with tens of thousands of chips are built and operated. A third module covers networking and storage, explaining how data moves between chips during training and how checkpoints and datasets are stored. Later modules address containers and cloud-native tooling (using Docker and Kubernetes to manage AI workloads), distributed training strategies (splitting a model across many devices and keeping them in sync), and inference optimization (making a trained model respond faster and more efficiently). The final two modules cover the algorithms and data that go into large model training, including how prompts affect model behavior, and current applications of large models.

Each section includes a mix of conceptual explanations and practical code exercises. The hands-on notebooks cover things like breaking down the structure of a Transformer model, calculating how much computation a model requires, and building a basic distributed training setup from scratch using PyTorch.

The project is maintained by a creator who goes by ZOMI and is actively being expanded. Contributions are welcome.
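
To give a feel for the "how much computation does a model require" exercise mentioned above, here is a minimal sketch in Python. It is not taken from the aiinfra notebooks. It assumes a decoder-only Transformer with a 4x MLP expansion and tied input/output embeddings, and it uses the widely cited rule of thumb of roughly 6 floating-point operations per parameter per training token. The function names and the GPT-2-small-like configuration are illustrative assumptions.

```python
def transformer_param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count for a decoder-only Transformer.

    Assumptions (hypothetical, for illustration): 4x MLP expansion, tied
    input/output embeddings; biases, layer norms, and positional embeddings
    are ignored because they are small by comparison.
    """
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 8 * d_model * d_model         # two matrices of shape d_model x 4*d_model
    embeddings = vocab_size * d_model   # token embedding table
    return n_layers * (attention + mlp) + embeddings


def training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute: about 6 FLOPs per parameter per token
    (roughly 2 for the forward pass and 4 for the backward pass)."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    # Illustrative configuration, similar in size to GPT-2 small.
    params = transformer_param_count(n_layers=12, d_model=768, vocab_size=50257)
    flops = training_flops(params, n_tokens=10_000_000_000)
    print(f"parameters: {params:,}")
    print(f"training FLOPs for 10B tokens: {flops:.3e}")
```

Dividing the FLOPs estimate by the sustained throughput of a cluster gives a first-order guess at training time. Real utilization is usually well below a chip's peak figure, so treat the result as a lower bound.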

Copy-paste prompts

Prompt 1

Based on the aiinfra course material, explain how data moves between GPUs during distributed training and what the main bottlenecks are

Prompt 2

Using the aiinfra notebooks as a guide, show me how to calculate how much compute a Transformer model requires given its parameter count

Prompt 3

Walk me through setting up a simple distributed training job with PyTorch using the concepts from the aiinfra course

Prompt 4

Explain scaling laws in plain English as covered in the aiinfra course: how does model performance change as you add more compute or data?

Verify against the repo before relying on details.