Study how a real GPU dispatches threads and manages memory by reading a simplified but working implementation.
Run the included matrix addition and matrix multiplication kernels in simulation to see how parallel thread execution works step by step.
Use the execution trace tooling to visualize what happens inside the GPU when a kernel runs.
Extend the custom instruction set or add warp scheduling to go deeper after mastering the basics.
Requires a Verilog simulation environment. There is no standard package install, so you need to set up the toolchain before running kernels.
Tiny-gpu is a minimal GPU implementation written in Verilog, a hardware description language used to design digital circuits. The project is explicitly built for learning: it strips away the complexity of real graphics cards to expose the core architectural ideas that all GPUs share, including the kind of general-purpose computing chips used in AI training.

The README opens by noting that while there are many resources for learning how CPUs work at a hardware level, the GPU market is so competitive that low-level architectural details stay proprietary. This project fills that gap by building a simplified but functional GPU from scratch in under 15 well-commented files.

The architecture covers the main components found in a real GPU: a dispatcher that breaks work into thread groups (called blocks) and assigns them to compute cores, memory controllers that manage the bottleneck between the cores and external memory, a cache for storing recently fetched data to avoid redundant memory trips, and individual compute cores, each of which contains a scheduler, an instruction fetcher, a decoder, and per-thread resources (ALU for arithmetic, LSU for memory loads and stores, a program counter, and register files). The register files hold data specific to each thread, which is how the same instruction can operate on different data in parallel across many threads at once.

The project also includes a custom instruction set (ISA), working example kernels for matrix addition and matrix multiplication, and tooling to simulate kernel execution and view execution traces. The documentation explains not just how to use it, but why each design decision was made. The repo notes areas where production GPUs go further, such as warp scheduling and pipelining, and points to those as next steps for anyone who wants to go deeper after working through the basics.
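The dispatch-and-execute model described above can be sketched in a few lines of Python. This is an illustrative software model only, not code from the repo. Every name here (`Thread`, `dispatch`, `matadd_kernel`, `run_matadd`) is hypothetical, and the round-robin assignment of blocks to cores is an assumption made for the sketch. It shows a dispatcher splitting a launch into blocks, and each thread using its own register values to pick the element it works on while running the same kernel code.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    """Hypothetical per-thread registers: the values that differ between threads."""
    block_idx: int
    block_dim: int
    thread_idx: int


def dispatch(total_threads, threads_per_block):
    """Split a kernel launch into blocks of threads (a stand-in for the dispatcher)."""
    num_blocks = (total_threads + threads_per_block - 1) // threads_per_block
    blocks = []
    for b in range(num_blocks):
        # The last block may be partially filled.
        count = min(threads_per_block, total_threads - b * threads_per_block)
        blocks.append([Thread(b, threads_per_block, t) for t in range(count)])
    return blocks


def matadd_kernel(thread, a, b, out):
    """Same code for every thread; only the register values change the result."""
    i = thread.block_idx * thread.block_dim + thread.thread_idx
    out[i] = a[i] + b[i]


def run_matadd(a, b, threads_per_block=4, num_cores=2):
    """Run one thread per element and record a tiny (core, block, threads) trace."""
    out = [0] * len(a)
    trace = []
    for n, block in enumerate(dispatch(len(a), threads_per_block)):
        core = n % num_cores  # assumption: round-robin block-to-core assignment
        for thread in block:  # hardware would run these in parallel; we loop
            matadd_kernel(thread, a, b, out)
        trace.append((core, block[0].block_idx, len(block)))
    return out, trace


if __name__ == "__main__":
    a = [0, 1, 2, 3, 4, 5, 6, 7]
    b = [0, 1, 2, 3, 4, 5, 6, 7]
    result, trace = run_matadd(a, b)
    print(result)  # [0, 2, 4, 6, 8, 10, 12, 14]
    print(trace)   # [(0, 0, 4), (1, 1, 4)]
```

The inner loop runs threads one after another, which is the main simplification: in hardware, the threads of a block execute the same instruction at the same time, each reading from its own register file.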
← adam-maj on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.