Train a GPT-2-equivalent language model from scratch on a GPU cluster for under $100 in roughly two hours.
Finetune a pretrained language model for custom chatbot behavior and evaluate its performance.
Study the complete pipeline of language model development from tokenization through inference with a web chat interface.
Experiment with neural network architecture by adjusting the single depth parameter, from which the other hyperparameters are derived automatically to keep the model compute-optimal.
Requires GPU cluster setup, PyTorch/CUDA configuration, and distributed training infrastructure (torchrun).
nanochat is a minimal, experimental toolkit for training large language models (the type of AI that powers chatbots like ChatGPT) from scratch on a single cluster of high-powered GPUs. The headline claim is that you can reproduce a model with the same capability as GPT-2, a landmark AI model from 2019 that cost approximately $43,000 to train, for under $100 today, in roughly two hours, thanks to seven years of hardware and software improvements.

The project covers every stage of building a language model: tokenization (converting raw text into numbers the model can process), pretraining (the initial training phase where the model reads a huge amount of text to learn language patterns), finetuning (adjusting the model for specific behavior), evaluation (measuring how good the model is), and inference (actually generating text). It also includes a web-based chat interface so you can talk to your trained model just as you would with ChatGPT.

The design philosophy is deliberately simple. All the complexity knobs are reduced to a single parameter called depth, which is the number of layers in the neural network. Setting that one number determines all the other settings (network width, learning rate, training duration, and more), so that the resulting model is compute-optimal without requiring expert tuning.

This is a project for machine learning researchers and engineers who want to study and experiment with how language models are built at a low level. It is not a consumer product: you need access to rented GPU servers (typically eight H100 or A100 GPUs) and familiarity with Python and the command line.

The tech stack is Python with PyTorch, the dominant deep learning framework. Dependencies are managed with uv. Training is distributed across multiple GPUs using PyTorch's torchrun utility.
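To make the single-knob idea concrete, here is a minimal sketch of how several settings could be derived from depth alone. It is not nanochat's actual code. The function name `derive_config`, the default vocabulary size, and every scaling rule (width of 64 per layer, about 128 dimensions per attention head, about 20 training tokens per parameter) are assumptions chosen for illustration. The learning rate and other settings mentioned above are left out. Check the repo for the real formulas.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedConfig:
    depth: int          # number of transformer layers (the only user-facing knob)
    width: int          # model (embedding) dimension
    num_heads: int      # attention heads
    approx_params: int  # rough parameter count
    train_tokens: int   # training budget in tokens


def derive_config(depth: int, vocab_size: int = 65_536) -> DerivedConfig:
    """Derive illustrative settings from depth. All rules here are assumptions."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")

    # Assumption: width grows linearly with depth (64 dimensions per layer).
    width = depth * 64
    # Assumption: roughly 128 dimensions per attention head.
    num_heads = max(1, math.ceil(width / 128))
    # Rough estimate: about 12 * width^2 weights per layer (attention + MLP),
    # plus separate input and output embedding matrices.
    approx_params = 12 * depth * width**2 + 2 * vocab_size * width
    # Assumption: a compute-optimal budget of about 20 tokens per parameter.
    train_tokens = 20 * approx_params

    return DerivedConfig(depth, width, num_heads, approx_params, train_tokens)


if __name__ == "__main__":
    for d in (4, 12, 20):
        cfg = derive_config(d)
        print(
            f"depth={cfg.depth:2d} width={cfg.width:4d} heads={cfg.num_heads:2d} "
            f"params~{cfg.approx_params / 1e6:.0f}M tokens~{cfg.train_tokens / 1e9:.1f}B"
        )
```

Under these assumed rules, raising depth increases width, parameter count, and token budget together, which is the sense in which one number can stand in for a whole configuration.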
Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.