explaingit

karpathy/nanoGPT

58,291 · Python · Audience · researcher · Complexity · 3/5 · Quiet · License · Setup · moderate

TLDR

A minimal, readable Python implementation of GPT-2 that trains language models to predict and generate text. Learn how GPT works by reading ~600 lines of clean code.

Mindmap

mindmap
  root((nanoGPT))
    What it does
      Train GPT models
      Fine-tune on text
      Generate predictions
    How it works
      Model definition
      Training loop
      Distributed training
    Use cases
      Learn GPT internals
      Research experiments
      Custom fine-tuning
    Tech stack
      Python
      PyTorch
      CUDA optional
    Hardware options
      Single GPU fast
      CPU or Mac slow
      Multi-GPU distributed

Things people build with this

USE CASE 1

Train a character-level language model on Shakespeare in 3 minutes on a single GPU to understand GPT training.

USE CASE 2

Fine-tune a pre-trained GPT-2 model on your own text dataset to generate domain-specific completions.

USE CASE 3

Reproduce GPT-2's 124M parameter model on benchmark datasets to verify training techniques work correctly.

USE CASE 4

Modify the model architecture or training loop to experiment with new ideas in language model research.
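To make Use Case 1 more concrete, here is a minimal sketch of what "character-level" means in practice: every distinct character becomes one integer token, and each training example pairs a window of tokens with the same window shifted one step ahead. The toy corpus and the helper names (`encode`, `decode`, `make_example`) are assumptions for illustration and are not taken from the repository; check the repo's data preparation script for the actual code.

```python
# Toy corpus standing in for the Shakespeare text file (an assumption for illustration).
text = "to be or not to be"

# Vocabulary: every distinct character, sorted so id assignment is stable.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)


def make_example(ids, start, block_size):
    """Return (inputs, targets); targets are the inputs shifted one step ahead."""
    x = ids[start : start + block_size]
    y = ids[start + 1 : start + 1 + block_size]
    return x, y


ids = encode(text)
x, y = make_example(ids, start=0, block_size=8)
print(repr(decode(x)))  # the input window
print(repr(decode(y)))  # the same window shifted by one character
```

At every position the model sees the characters so far in `x` and is trained to predict the matching character in `y`, which is why the two windows overlap in all but one position.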

Tech stack

Python · PyTorch · CUDA

Getting it running

Difficulty · moderate
Time to first run · 30 min

Requires PyTorch installation and CUDA setup for GPU training; CPU-only fallback possible but slow.

MIT license allows free use for any purpose, including commercial, as long as you include the original copyright notice.

In plain English

nanoGPT is a minimal Python codebase for training and fine-tuning GPT-style language models, designed to be readable and hackable rather than production-hardened. GPT models are neural networks that learn to predict the next token (roughly, the next word or word piece) in a sequence and can be fine-tuned to generate text that continues a given prompt. The project reimplements the original GPT-2 architecture in as few lines as possible while still achieving the same training results, which makes the internals easy to understand and modify. The README notes that this repository is now deprecated and that its successor, nanochat, is the recommended alternative for new users.

The project consists of two main files: a roughly 300-line training loop and a roughly 300-line model definition. Despite this simplicity, it can reproduce GPT-2 with 124 million parameters on standard benchmark datasets when run on appropriate hardware, which takes around 4 days on 8 high-end GPUs. For experimentation on smaller hardware, it includes examples for training a character-level model on Shakespeare's works in about 3 minutes on a single GPU, or more slowly on a CPU or Apple Silicon Mac. The code supports distributed training across multiple GPUs using PyTorch's built-in parallelism tools, and it can load pre-trained GPT-2 weights from OpenAI as a starting point for fine-tuning. The tech stack is Python with PyTorch as the deep learning framework.

You would use nanoGPT in three situations: when you want to understand how GPT training works from first principles by reading clean, commented code; when you want a starting point for language model research; or when you need to fine-tune a GPT-style model on a custom dataset without wading through a large framework.
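The "124 million parameters" figure can be checked with a short back-of-the-envelope calculation. The sketch below is not code from the repository. It assumes the published GPT-2 small configuration (12 layers, embedding width 768, vocabulary of 50,257 tokens, context length 1,024), bias terms on every linear layer and layer norm, and an output head that shares its weights with the token embedding. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, block_size):
    """Count parameters of a GPT-2 style model with a weight-tied output head."""
    d = n_embd
    token_emb = vocab_size * d           # token embedding table (shared with the output head)
    pos_emb = block_size * d             # learned position embedding table
    layer_norm = 2 * d                   # scale and bias
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused q/k/v projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # expand to 4*d, then project back to d
    block = 2 * layer_norm + attn + mlp  # one transformer block
    return token_emb + pos_emb + n_layer * block + layer_norm  # plus the final layer norm


# Assumed GPT-2 small hyperparameters (not stated in the text above).
total = gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to 124,439,808, which rounds to the 124M quoted above. Most of that sits in the twelve transformer blocks, with the token embedding table contributing roughly 38.6M.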

Copy-paste prompts

Prompt 1
I want to understand how GPT models are trained. Walk me through the nanoGPT training loop and explain what each section does.
Prompt 2
Show me how to fine-tune nanoGPT on my own text file. What changes do I need to make to the code?
Prompt 3
How do I load OpenAI's pre-trained GPT-2 weights into nanoGPT and use them as a starting point?
Prompt 4
Explain the model architecture in nanoGPT. What are the key components and how do they work together?
Prompt 5
I want to train nanoGPT across multiple GPUs. What PyTorch features does it use for distributed training?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.