explaingit

adam-maj/tiny-gpu

12,388 | SystemVerilog | Audience · developer | Complexity · 5/5 | Setup · moderate

TLDR

A minimal GPU built in Verilog (a hardware description language) specifically to teach how GPUs work internally, covering the same core ideas as real AI training chips, in under 15 readable files.

Mindmap

mindmap
  root((tiny-gpu))
    Architecture
      Dispatcher
      Compute cores
      Memory controller
      Cache
    Per-core components
      Scheduler
      ALU
      LSU
      Register files
    Learning tools
      Custom ISA
      Simulation tooling
      Execution traces
    Example kernels
      Matrix addition
      Matrix multiplication
    Audience
      CS students
      Hardware learners

Things people build with this

USE CASE 1

Study how a real GPU dispatches threads and manages memory by reading a simplified but working implementation.

USE CASE 2

Run the included matrix addition and matrix multiplication kernels in simulation to see how parallel thread execution works step by step.

USE CASE 3

Use the execution trace tooling to visualize what happens inside the GPU when a kernel runs.

USE CASE 4

Extend the custom instruction set or add warp scheduling to go deeper after mastering the basics.

Tech stack

SystemVerilog, Verilog

Getting it running

Difficulty · moderate | Time to first run · 1h+

Requires a Verilog simulation environment. There is no standard package install, so you need to set up the toolchain before running kernels.

The license is not specified in this explanation; check the repository directly before using it in a project.

In plain English

Tiny-gpu is a minimal GPU implementation written in Verilog, a hardware description language used to design digital circuits. The project is explicitly built for learning: it strips away the complexity of real graphics cards to expose the core architectural ideas that all GPUs share, including the kind of general-purpose computing chips used in AI training.

The README opens by noting that while there are many resources for learning how CPUs work at a hardware level, the GPU market is so competitive that low-level architectural details stay proprietary. This project fills that gap by building a simplified but functional GPU from scratch in under 15 well-commented files.

The architecture covers the main components found in a real GPU: a dispatcher that breaks work into thread groups (called blocks) and assigns them to compute cores, memory controllers that manage the bottleneck between the cores and external memory, a cache for storing recently fetched data to avoid redundant memory trips, and individual compute cores, each of which contains a scheduler, an instruction fetcher, a decoder, and per-thread resources (ALU for arithmetic, LSU for memory loads and stores, a program counter, and register files). The register files hold data specific to each thread, which is how the same instruction can operate on different data in parallel across many threads at once.

The project also includes a custom instruction set (ISA), working example kernels for matrix addition and matrix multiplication, and tooling to simulate kernel execution and view execution traces. The documentation explains not just how to use it, but why each design decision was made. The repo notes areas where production GPUs go further, such as warp scheduling and pipelining, and points to those as next steps for anyone who wants to go deeper after working through the basics.
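The dispatch and per-thread register ideas described above can be illustrated with a small Python model. This is only a software sketch under stated assumptions, not the repo's hardware design: the names `Thread`, `dispatch`, and `run_matadd` are hypothetical, the round-robin block assignment is a simplification chosen for clarity, and the global index formula (block index times block size plus thread index) is the conventional GPU scheme assumed here rather than taken from the explanation.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Per-thread state: each thread owns a private register file."""
    block_idx: int
    thread_idx: int
    regs: dict = field(default_factory=dict)


def dispatch(total_threads, threads_per_block, num_cores):
    """Split a kernel launch into blocks and hand them to cores.

    Assumption: blocks are assigned round-robin. A real dispatcher
    would typically hand out blocks as cores become free.
    """
    num_blocks = -(-total_threads // threads_per_block)  # ceiling division
    assignments = {core: [] for core in range(num_cores)}
    for block in range(num_blocks):
        first = block * threads_per_block
        # The last block may be partially filled.
        count = min(threads_per_block, total_threads - first)
        threads = [Thread(block, t) for t in range(count)]
        assignments[block % num_cores].append(threads)
    return assignments


def run_matadd(a, b, threads_per_block=4, num_cores=2):
    """Element-wise addition: every thread runs the same steps on its own index."""
    out = [0] * len(a)
    plan = dispatch(len(a), threads_per_block, num_cores)
    for core, blocks in plan.items():
        for threads in blocks:
            # Each loop below models one instruction applied to all threads
            # of the block; only the register contents differ per thread.
            for t in threads:
                t.regs["i"] = t.block_idx * threads_per_block + t.thread_idx
            for t in threads:
                t.regs["a"] = a[t.regs["i"]]  # load, as an LSU would
                t.regs["b"] = b[t.regs["i"]]
            for t in threads:
                t.regs["c"] = t.regs["a"] + t.regs["b"]  # add, as an ALU would
            for t in threads:
                out[t.regs["i"]] = t.regs["c"]  # store, as an LSU would
    return out
```

Calling `run_matadd(list(range(8)), list(range(8)))` launches eight threads as two blocks of four, one block per core in this model. The point of the sketch is that no thread sees another thread's registers, so a single instruction stream produces a different result for each element.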

Copy-paste prompts

Prompt 1
Walk me through how tiny-gpu's dispatcher breaks a matrix multiplication kernel into thread blocks and assigns them to compute cores.
Prompt 2
I want to write a new kernel for tiny-gpu that does element-wise vector addition. Show me the custom ISA instructions I need and how to set up the simulation.
Prompt 3
Explain the role of the LSU (Load Store Unit) in tiny-gpu and how it interacts with the memory controller to avoid redundant fetches.
Prompt 4
How would I add a simple pipeline stage to tiny-gpu's compute core? Show me which Verilog files to modify and what changes are needed.

Verify against the repo before relying on details.