explaingit

karpathy/llm.c

29,933 | Cuda | Audience · developer | Complexity · 4/5 | Quiet | License | Setup · hard

TLDR

A minimal, from-scratch implementation of GPT-2/GPT-3 training in C and CUDA, cutting out framework overhead to show exactly how language model training works.

Mindmap

mindmap
  root((repo))
    What it does
      Train language models
      No heavy frameworks
      GPU acceleration
    Tech stack
      C language
      CUDA
      Python comparison
    Use cases
      Learn training internals
      Fast GPU training
      Systems programming
    Audience
      Systems programmers
      AI researchers
      GPU enthusiasts

Things people build with this

USE CASE 1

Train GPT-2 or GPT-3 style models on your own data (e.g., a Shakespeare corpus) with minimal dependencies.

USE CASE 2

Understand exactly how language model training works by reading clean, direct C code without framework abstractions.

USE CASE 3

Run fast GPU-accelerated training on NVIDIA hardware without PyTorch or TensorFlow overhead.

Tech stack

C | CUDA | Python | NVIDIA GPU

Getting it running

Difficulty · hard | Time to first run · 1 day+

GPU training requires an NVIDIA GPU with the CUDA toolkit installed, a C compiler, and some familiarity with GPU programming. The project is built from source.

Use freely for any purpose including commercial, as long as you keep the copyright notice.

In plain English

llm.c implements large language model (LLM) training in plain C and CUDA, without depending on large frameworks like PyTorch. The goal is to reproduce GPT-2 and GPT-3 class models using code that is small, direct, and easy to read. Most AI training code relies on heavyweight libraries that can run to hundreds of megabytes. This project cuts all that away: the core full-precision reference implementation fits in roughly 1,000 lines of C, and the optimized GPU version uses CUDA (a programming interface for NVIDIA graphics cards) with the aim of running faster than standard framework-based training.

The repository also includes a parallel implementation in Python for comparison and testing. You can run it on a CPU alone (useful for learning but slow for serious training), or on one or more NVIDIA GPUs for real training speed. It supports training on small datasets like a Shakespeare text corpus, and comes with scripts to download and tokenize data automatically.

Someone would use this if they want to understand how LLM training works at a low level without layers of abstraction hiding the details, if they are a systems programmer curious about GPU computing and AI, or if they want fast training without framework overhead.
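To make "training without a framework" concrete, here is a minimal sketch in C of the forward, backward, and update steps written out by hand. It is a toy, not code from llm.c: the model is a one-weight linear fit rather than a transformer, the optimizer is plain gradient descent, and the names `LinearModel`, `train_step`, and `train` are hypothetical. It only illustrates the general shape of a hand-written training loop, where every gradient is computed explicitly instead of by an autograd engine.

```c
#include <stddef.h>

/* Toy model: y_hat = w * x + b. Hypothetical type, not taken from llm.c. */
typedef struct {
    float w;
    float b;
} LinearModel;

/* One training step over a batch of n (x, y) pairs.
   Forward pass, mean squared error loss, hand-derived gradients,
   then a plain gradient descent update.
   Returns the loss measured before the update. */
float train_step(LinearModel *m, const float *x, const float *y,
                 size_t n, float lr) {
    float loss = 0.0f, dw = 0.0f, db = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y_hat = m->w * x[i] + m->b;  /* forward pass */
        float err = y_hat - y[i];
        loss += err * err;                 /* squared error */
        dw += 2.0f * err * x[i];           /* d(err^2)/dw */
        db += 2.0f * err;                  /* d(err^2)/db */
    }
    loss /= (float)n;
    dw /= (float)n;
    db /= (float)n;
    m->w -= lr * dw;                       /* parameter update */
    m->b -= lr * db;
    return loss;
}

/* Runs `steps` training steps and returns the loss reported by the last one. */
float train(LinearModel *m, const float *x, const float *y,
            size_t n, float lr, int steps) {
    float loss = 0.0f;
    for (int s = 0; s < steps; s++) {
        loss = train_step(m, x, y, n, lr);
    }
    return loss;
}
```

Calling `train` on points sampled from y = 2x + 1 with a small learning rate should drive `w` toward 2 and `b` toward 1. A real GPT-2 trainer follows the same forward, backward, update rhythm, but with many layers and a far larger parameter set, which is where the GPU version earns its keep.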

Copy-paste prompts

Prompt 1
How do I set up and run llm.c to train a language model on my own text dataset?
Prompt 2
Walk me through the core training loop in llm.c. What are the main steps from data loading to backpropagation?
Prompt 3
How does the CUDA version of llm.c compare in speed to PyTorch for the same model, and why is it faster?
Prompt 4
I want to modify llm.c to add a new feature (e.g., a different optimizer or a new layer type). Where in the code should I start?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.