explaingit

xlite-dev/leetcuda

10,960 · Cuda · Audience · developer · Complexity · 5/5 · Setup · hard

TLDR

A collection of 200+ working CUDA GPU programming examples organized by difficulty, covering high-performance matrix multiplication and Flash Attention, aimed at developers who want to learn GPU kernel development from first principles.

Mindmap

mindmap
  root((leetcuda))
    What it does
      200+ CUDA kernels
      Learning by difficulty
      Blog post links
    Tech stack
      CUDA
      C++
      PyTorch
    Key topics
      HGEMM matrix math
      Flash Attention
      Tensor Cores
    Audience
      GPU learners
      ML engineers

Things people build with this

USE CASE 1

Learn GPU kernel programming by working through 200+ examples that escalate from basic operations to advanced Tensor Core techniques.

USE CASE 2

Study high-performance HGEMM implementations that match 98-100% of NVIDIA's cuBLAS library performance.

USE CASE 3

Implement Flash Attention in CUDA to make transformer model inference faster and more memory-efficient.

USE CASE 4

Use the linked blog posts alongside working code to understand GPU architecture concepts in practice.

Tech stack

CUDA · C++ · Python · PyTorch

Getting it running

Difficulty · hard · Time to first run · 1h+

Requires an NVIDIA GPU with CUDA toolkit installed and matching compiler toolchain.
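As a quick sanity check before building anything, the sketch below looks for the two command-line tools that normally ship with the NVIDIA driver and the CUDA toolkit (`nvidia-smi` and `nvcc`). The helper name `check_cuda_toolchain` is hypothetical and not part of the repository. Finding both tools on `PATH` does not prove that the driver, toolkit, and compiler versions match; it only rules out the most common missing-install case.

```python
import shutil


def check_cuda_toolchain(tools=("nvidia-smi", "nvcc")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Report which of the expected tools were found.
    for tool, path in check_cuda_toolchain().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If either tool is reported as missing, installing or repairing the NVIDIA driver and CUDA toolkit is likely the first step.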

In plain English

LeetCUDA is a collection of learning notes and working code examples for CUDA, a programming model used to run computations on NVIDIA graphics cards (GPUs). GPUs can process many operations in parallel, which makes them central to deep learning, scientific computing, and large-scale matrix calculations. Writing GPU code directly is considerably more complex than writing standard CPU code, and this repository is aimed at helping developers learn how to do it.

The collection includes more than 200 CUDA kernel implementations, organized by difficulty from easy through progressively harder levels. A kernel is a function that runs on the GPU. The examples range from basic operations to advanced techniques like matrix multiplication using Tensor Cores, which are specialized circuits on modern NVIDIA GPUs designed specifically to accelerate the kind of math used in neural networks.

Two major areas get extended treatment. The first is HGEMM, which stands for half-precision general matrix multiplication, a fundamental operation in training and running AI models. The implementations here reportedly reach 98 to 100 percent of the performance of NVIDIA's own cuBLAS library, which is the standard reference for GPU-accelerated linear algebra. The second area is Flash Attention, an algorithm that makes the attention mechanism in transformer models (the architecture behind most modern language models) faster and more memory-efficient.

The repository also links to more than 100 blog posts covering related GPU programming topics. PyTorch, a widely used Python library for machine learning, appears throughout the examples as a companion tool. The intended audience is developers who already have programming experience and want to learn GPU kernel development from first principles. The full README is longer than what was shown.
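To make the memory-efficiency claim about Flash Attention more concrete, here is a minimal CPU-only sketch in plain Python. It is not code from the repository. The function names `naive_attention` and `tiled_attention` are hypothetical, and the sketch handles a single query vector for brevity. It illustrates the general idea usually associated with Flash Attention: keys and values are visited one tile at a time while a running maximum and running sum are maintained (an "online softmax"), so the full row of attention scores never has to be stored at once.

```python
import math


def naive_attention(q, K, V):
    """Reference version: computes every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention(q, K, V, tile=2):
    """Same result, but K and V are consumed tile by tile with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf      # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])      # unnormalized output, relative to running_max
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (factor is 0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any tile size. In a GPU kernel the per-tile work would typically run in fast on-chip memory, which is where the speed and memory savings come from. This sketch only demonstrates the arithmetic, not the performance.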

Copy-paste prompts

Prompt 1
I want to write my first CUDA matrix multiplication kernel. Walk me through a basic implementation using the leetcuda examples as a reference, starting from the kernel function signature.
Prompt 2
Using the leetcuda HGEMM examples, help me understand what Tensor Cores are and how to write a kernel that uses them for half-precision matrix multiplication.
Prompt 3
Explain how the leetcuda Flash Attention implementation works and what makes it more memory-efficient than a naive attention kernel.
Prompt 4
I have an NVIDIA GPU and want to benchmark my CUDA kernel against cuBLAS. Show me how to set up the timing and comparison using PyTorch alongside leetcuda examples.
Prompt 5
Help me write a CUDA reduction kernel that sums all elements of a large array, using the easy-level leetcuda examples as a guide for the pattern.

Verify against the repo before relying on details.