This repository is a small teaching lab for GPU kernels that implement the math behind large language models. It contains hand-written versions of three building blocks: a row-wise softmax, an FP16 matrix multiply (GEMM), and a FlashAttention-style fused attention forward pass. The kernels are written in Triton, a Python-based language for GPU programming, with PyTorch and CUDA used around them for setup and comparison.

The author describes it as a cleaned-up reconstruction of earlier experiments. The README deliberately publishes no benchmark numbers, because results vary widely with the GPU, driver, Triton version, and tensor shape. Instead, the project ships a local benchmark harness that you run yourself.

The code is laid out under src/triton_llm_kernel_lab: the three kernels in a kernels/ folder, a PyTorch reference implementation for comparison, a benchmark CLI called bench.py, and a configs.py file that holds LLM-shaped test sizes split into prefill, decode, GEMM, and softmax groups. There are two sets of tests: one runs on CPU and checks the reference code; the other requires a CUDA GPU and compares each custom kernel against the PyTorch version, reporting maximum absolute error.

Installation expects Linux with an NVIDIA GPU for the real kernels, using pip install -e with the gpu,dev extras. On a CPU-only machine you can still install the dev extra and run the reference tests. The benchmark harness defaults to 50 warmup iterations and 200 timed iterations, and prints latency, estimated TFLOPS, estimated memory bandwidth, and max error per kernel.
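
The warmup-then-time pattern and the max-error comparison described above can be illustrated with a minimal CPU-only sketch in plain Python. It is not the repository's bench.py: the function names (softmax_rows, max_abs_error, bench, gemm_tflops) are hypothetical, and only the 50/200 iteration defaults come from the description. The TFLOPS estimate assumes the usual 2*M*N*K operation count for a matrix multiply.

```python
import math
import time


def softmax_rows(rows):
    """Numerically stable row-wise softmax over a list of lists."""
    out = []
    for row in rows:
        m = max(row)  # subtract the row max so exp() cannot overflow
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def max_abs_error(a, b):
    """Largest absolute element-wise difference between two 2-D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def bench(fn, warmup=50, iters=200):
    """Mean latency in seconds: untimed warmup calls, then timed calls.

    Note: GPU kernel launches are asynchronous, so a GPU harness would
    also need to synchronize before reading the clock. This sketch
    only times ordinary CPU functions.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


def gemm_tflops(m, n, k, seconds):
    """Estimated TFLOPS for an (m x k) @ (k x n) product, 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
    reference = softmax_rows(data)
    # Softmax is shift-invariant, so a shifted input stands in for a
    # "candidate kernel" that should match the reference.
    candidate = softmax_rows([[x + 10.0 for x in row] for row in data])
    latency = bench(lambda: softmax_rows(data))
    print(f"max abs error: {max_abs_error(reference, candidate):.3e}")
    print(f"mean latency:  {latency * 1e6:.2f} us")
```

The same structure carries over to a GPU setting: the reference would be a PyTorch implementation, the candidate a custom kernel, and the reported latency would feed an estimate such as gemm_tflops.
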
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.