zengxiao-he/triton-llm-kernel-lab

Analysis updated 2026-05-18

★ 29 | Python | Audience · developer | Complexity · 5/5 | Setup · hard

What do people build with it?

USE CASE 1

Study readable, hand-written Triton kernels for LLM inference math

USE CASE 2

Benchmark custom GPU kernels against PyTorch reference implementations

USE CASE 3

Learn Triton kernel patterns through a compact teaching-focused codebase

What is it built with?

Python, Triton, PyTorch, CUDA

How does it compare?

	zengxiao-he/triton-llm-kernel-lab	dabit3/agent-hooks-in-depth	darksp33d/hyperhives-macos-infostealer-analysis
Stars	29	29	29
Language	Python	Python	Python
Setup difficulty	hard	moderate	hard
Complexity	5/5	3/5	4/5
Audience	developer	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1h+

The real kernels require Linux with an NVIDIA GPU. A CPU-only setup runs only the reference tests.

In plain English

This repository is a small teaching lab for the GPU code behind large language model math. It contains hand-written versions of three building blocks: a row-wise softmax, an FP16 matrix multiply (GEMM), and a FlashAttention-style fused attention forward pass. The kernels are written in Triton, a Python-based language for GPU programming, with PyTorch and CUDA used around them for setup and comparison.

The author describes it as a cleaned-up reconstruction of earlier experiments. The README deliberately publishes no benchmark numbers, because results vary widely with the GPU, driver, Triton version, and tensor shape. Instead, the project ships a local benchmark tool that you run yourself to get numbers for your own machine.

The code lives under src/triton_llm_kernel_lab: the three kernels in a kernels/ folder, a PyTorch reference implementation for comparison, a benchmark CLI called bench.py, and a configs.py file that holds LLM-shaped test sizes split into prefill, decode, GEMM, and softmax groups. There are two sets of tests: one runs on CPU and checks the reference code; the other requires a CUDA GPU and compares each custom kernel against the PyTorch version, reporting maximum absolute error.

Installation expects Linux with an NVIDIA GPU for the real kernels, using pip install -e with the gpu,dev extras. On a CPU-only machine you can still install the dev extra and run the reference tests. The benchmark tool defaults to 50 warmup iterations and 200 timed iterations, and prints latency, estimated TFLOPS, estimated memory bandwidth, and max error per kernel.
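To make the benchmark description concrete, here is a minimal pure-Python sketch of the warmup-then-time pattern, the max-absolute-error check, and a TFLOPS estimate. It is not the repo's bench.py: the function names (matmul_ref, matmul_candidate, bench, gemm_tflops) are hypothetical, nested lists stand in for GPU tensors, and the 2·M·N·K FLOP count is the conventional GEMM estimate, assumed here rather than confirmed from the source.

```python
import time


def matmul_ref(a, b):
    """Naive reference matmul on nested lists (stand-in for a PyTorch reference)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def matmul_candidate(a, b):
    """Hypothetical 'custom kernel': same math, computed via a transposed b."""
    b_t = list(zip(*b))  # rows of b_t are the columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def max_abs_error(out, ref):
    """Largest elementwise absolute difference between two matrices."""
    return max(
        abs(x - y) for out_row, ref_row in zip(out, ref) for x, y in zip(out_row, ref_row)
    )


def bench(fn, args, warmup=50, iters=200):
    """Run untimed warmup calls, then return mean seconds per timed call."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters


def gemm_tflops(m, n, k, seconds):
    """Estimated TFLOPS, assuming 2*M*N*K floating-point operations per GEMM."""
    return 2.0 * m * n * k / seconds / 1e12


# Small 8x8 inputs keep the pure-Python demo fast.
a = [[float(i + j) for j in range(8)] for i in range(8)]
b = [[float(i - j) for j in range(8)] for i in range(8)]

latency = bench(matmul_candidate, (a, b))
error = max_abs_error(matmul_candidate(a, b), matmul_ref(a, b))
print(f"latency: {latency * 1e6:.1f} us")
print(f"estimated TFLOPS: {gemm_tflops(8, 8, 8, latency):.6f}")
print(f"max abs error: {error}")
```

On a real GPU, kernel launches are asynchronous, so a timing loop like this would also need to synchronize the device (for example with torch.cuda.synchronize() or CUDA events) before reading the clock. The warmup pass matters there too, since the first Triton calls include compilation time.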

Copy-paste prompts

Prompt 1

Explain how the row-wise softmax Triton kernel in this repo works

Prompt 2

Walk me through the FlashAttention-style fused attention kernel here

Prompt 3

Help me set up this repo and run the benchmark tool on my GPU

Prompt 4

How does the FP16 GEMM kernel achieve tiling and L2 reuse?

Frequently asked questions

What is triton-llm-kernel-lab?

A small teaching lab with hand-written Triton GPU kernels for softmax, matrix multiplication, and fused attention used in LLM inference.

What language is triton-llm-kernel-lab written in?

Mainly Python. The stack also includes Triton, PyTorch, and CUDA.

How hard is triton-llm-kernel-lab to set up?

Setup difficulty is rated hard, with roughly 1h+ to a first successful run.

Who is triton-llm-kernel-lab for?

Mainly developers.

Verify against the repo before relying on details.