explaingit

intologyai/nanogpt-bench

13 · Python · Audience · researcher · Complexity · 5/5 · Active · Setup · hard

TLDR

Benchmark that drops an AI coding agent into a sandboxed container and measures how much it can speed up the NanoGPT pretraining speedrun versus human leaderboard records.

Mindmap

mindmap
  root((NanoGPT-Bench))
    Inputs
      Human anchor record
      Compute budget
      Agent API key
    Outputs
      Validated speedups
      Run logs
      Timestamped workspaces
    Use Cases
      Score frontier coding agents
      Compare agent prompting styles
      Refresh benchmark anchor
    Tech Stack
      Python
      Docker
      CUDA
      H100

Things people build with this

USE CASE 1

Benchmark a frontier coding agent like Codex or Claude Code on autonomous ML research

USE CASE 2

Compare prompting strategies such as Autoresearch against a default Claude Code setup

USE CASE 3

Refresh the benchmark by swapping in a newer human anchor record

USE CASE 4

Reproduce the reported result that no agent recovered more than 10% of the human speedup

Tech stack

Python · Docker · CUDA · PyTorch

Getting it running

Difficulty · hard · Time to first run · 1 day+

Practical runs need H100-class GPUs, Docker with CUDA, large dataset prefetch, and an API key for the agent being evaluated.

In plain English

NanoGPT-Bench is a benchmark for testing how well AI coding agents can do long, open-ended machine-learning research on their own. It is built on top of an existing community project called the NanoGPT Speedrun, in which people compete to pretrain a small GPT-2 model as fast as possible. The leaderboard has a long history of human submissions, each one a small algorithmic improvement, and this benchmark uses that history as a yardstick for autonomous agents.

In a run, an AI agent is dropped into a sandboxed container with a strong starting point taken from the human leaderboard, no internet access, and a fixed compute budget. The agent then has to come up with its own ideas to make the training script faster. To check a candidate, it calls a submit command inside the container. The submitter does two things: it runs an LLM judge that checks the change against the speedrun competition rules, then it retimes the candidate ten times to confirm any speedup is statistically significant. Both the starting record and the compute budget are knobs, so the benchmark can be refreshed later without contamination.

The project tested three frontier coding agents: Codex backed by GPT-5.4 xhigh, Claude Code backed by Opus 4.6 Max, and a second Claude Code variant using a prompting style from a project called Autoresearch. Each had 512 H100-hours of compute and started from the human world record set on September 3rd, 2025. None of them recovered more than 10 percent of the speedup that humans found over the following five months, and the agents spent most of their compute on tuning hyperparameters, while around 77 percent of human records involved real algorithmic changes.

The repository is organized into a host-side harness under nanogpt/, a Docker image under image/ that holds the training environment and the submit validator, and human_baselines/ with snapshots of historical record submissions.

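The retiming half of the submit step described above can be illustrated with a small sketch. This summary does not say which statistical test the validator uses, so the code below assumes a one-sample, one-sided t-test of ten candidate timings against a fixed baseline time at the 5 percent level. The function name `is_significant_speedup` and the choice of test are hypothetical and are not taken from the repository.

```python
import math
import statistics

# One-sided critical value of Student's t for df = 9 (ten samples), alpha = 0.05.
T_CRIT_DF9_ALPHA05 = 1.833


def is_significant_speedup(candidate_times, baseline_time):
    """Return True if the mean candidate time is significantly below baseline_time.

    Assumption (not from the repo): a one-sample, one-sided t-test
    on exactly ten retimings.
    """
    if len(candidate_times) != 10:
        raise ValueError("this sketch expects exactly ten retimings")
    mean = statistics.mean(candidate_times)
    stdev = statistics.stdev(candidate_times)  # sample standard deviation
    if stdev == 0:
        # No spread at all: any strictly lower time counts as faster.
        return mean < baseline_time
    standard_error = stdev / math.sqrt(len(candidate_times))
    t_stat = (baseline_time - mean) / standard_error
    return t_stat > T_CRIT_DF9_ALPHA05
```

Retiming several times matters because a single fast run can be noise: a candidate whose ten timings scatter widely around the baseline fails this check even if its average is slightly lower.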
To run the benchmark, you build the Docker image once, which prefetches nine FineWeb10B training shards plus a validation shard. Then you export the API key for the agent you want to test, set BENCHMARK_SESSION_HOURS, and run one of the launcher scripts under nanogpt/run/. Each launcher copies the anchor record into a fresh timestamped workspace, mounts a shared data volume, and streams logs as the agent works.
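As a rough illustration of what a launcher does before the agent starts, the sketch below reads `BENCHMARK_SESSION_HOURS` and copies an anchor record into a fresh timestamped workspace. Only the environment variable name comes from the description above. The function names, the `run-<timestamp>` directory layout, and the assumption that the variable holds a positive number are hypothetical, and the real launcher scripts under nanogpt/run/ may work differently.

```python
import os
import shutil
from datetime import datetime, timezone
from pathlib import Path


def session_hours(env=os.environ):
    """Read BENCHMARK_SESSION_HOURS and fail early if it is missing or invalid.

    Assumption: the value parses as a positive number of hours.
    """
    raw = env.get("BENCHMARK_SESSION_HOURS")
    if raw is None:
        raise RuntimeError("BENCHMARK_SESSION_HOURS is not set")
    hours = float(raw)
    if hours <= 0:
        raise ValueError("BENCHMARK_SESSION_HOURS must be positive")
    return hours


def prepare_workspace(anchor_dir, workspaces_root):
    """Copy the anchor record into a new timestamped directory and return its path.

    The "run-<UTC timestamp>" naming is a hypothetical layout for illustration.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    workspace = Path(workspaces_root) / f"run-{stamp}"
    # copytree creates the destination and raises if it already exists,
    # so an earlier run's workspace is never overwritten.
    shutil.copytree(anchor_dir, workspace)
    return workspace
```

Giving every run its own copy of the anchor record keeps the agent's edits isolated, so results from different runs can be compared against the same starting point.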

Copy-paste prompts

Prompt 1
Walk me through building the NanoGPT-Bench Docker image and prefetching the FineWeb10B shards
Prompt 2
Show me how to launch a Claude Code run with a 512 H100-hour budget using the scripts under nanogpt/run/
Prompt 3
Explain how the submit command validates a candidate with an LLM judge and ten retimings
Prompt 4
How do I plug a new agent into the harness so it can call the in-container submit command?
Prompt 5
What does the human_baselines folder contain, and how is the anchor record copied into a fresh workspace?

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.