Benchmark a frontier coding agent like Codex or Claude Code on autonomous ML research
Compare prompting strategies such as Autoresearch against a default Claude Code setup
Refresh the benchmark by swapping in a newer human anchor record
Reproduce the reported result that no agent recovered more than 10% of the human speedup
Practical runs need H100-class GPUs, Docker with CUDA, large dataset prefetch, and an API key for the agent being evaluated.
NanoGPT-Bench is a benchmark for testing how well AI coding agents can do long, open-ended machine-learning research on their own. It is built on top of an existing community project called the NanoGPT Speedrun, in which people compete to pretrain a small GPT-2 model as fast as possible. The leaderboard has a long history of human submissions, each one a small algorithmic improvement, and this benchmark uses that history as a yardstick for autonomous agents.
In a run, an AI agent is dropped into a sandboxed container with a strong starting point taken from the human leaderboard, no internet access, and a fixed compute budget. The agent then has to come up with its own ideas to make the training script faster. To check a candidate, it calls a submit command inside the container. The submitter does two things: it runs an LLM judge that checks the change against the speedrun competition rules, then it retimes the candidate ten times to confirm any speedup is statistically significant. Both the starting record and the compute budget are knobs, so the benchmark can be refreshed later without contamination.
The project tested three frontier coding agents: Codex backed by GPT-5.4 xhigh, Claude Code backed by Opus 4.6 Max, and a second Claude Code variant using a prompting style from a project called Autoresearch. Each had 512 H100-hours of compute and started from the human world record set on September 3rd, 2025. None of them recovered more than 10 percent of the speedup that humans found over the following five months, and the agents spent most of their compute on tuning hyperparameters, while around 77 percent of human records involved real algorithmic changes.
The repository is organized into a host-side harness under nanogpt/, a Docker image under image/ that holds the training environment and the submit validator, and human_baselines/ with snapshots of historical record submissions.
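The retiming check and the "fraction of human speedup recovered" figure described above can be illustrated with a small sketch. This is not the repository's validator: the summary does not say which statistical test the submitter uses or exactly how the recovered fraction is computed. The sketch assumes a one-sided one-sample t-test at the 5 percent level over the ten retimed runs, and it assumes the recovered fraction is measured on wall-clock training time. The function names is_significant_speedup and fraction_recovered are hypothetical.

```python
import math
import statistics

# One-sided critical t value for alpha = 0.05 with 9 degrees of freedom,
# i.e. for exactly 10 timed runs. The choice of test is an assumption.
T_CRIT_DF9_ALPHA05 = 1.833


def is_significant_speedup(baseline_seconds, candidate_times):
    """Return True if the mean of 10 candidate timings is significantly below baseline."""
    n = len(candidate_times)
    if n != 10:
        raise ValueError("this sketch hard-codes the critical value for 10 runs")
    mean = statistics.mean(candidate_times)
    sd = statistics.stdev(candidate_times)  # sample standard deviation
    if sd == 0:
        # No run-to-run noise at all: any strictly faster mean counts.
        return mean < baseline_seconds
    t_stat = (baseline_seconds - mean) / (sd / math.sqrt(n))
    return t_stat > T_CRIT_DF9_ALPHA05


def fraction_recovered(anchor_seconds, agent_seconds, human_best_seconds):
    """Share of the human gain (anchor record to later record) that the agent reached.

    Measuring the gain as a difference in training time is an assumption.
    """
    human_gain = anchor_seconds - human_best_seconds
    if human_gain <= 0:
        raise ValueError("human_best_seconds must be faster than anchor_seconds")
    return (anchor_seconds - agent_seconds) / human_gain
```

Under these assumptions, an agent that starts from a 100-second anchor and reaches 99 seconds, while humans later reached 90 seconds, has recovered one tenth of the human speedup.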
To run the benchmark, you build the Docker image once, which prefetches nine FineWeb10B training shards plus a validation shard. Then you export the API key for the agent you want to test, set BENCHMARK_SESSION_HOURS, and run one of the launcher scripts under nanogpt/run/. Each launcher copies the anchor record into a fresh timestamped workspace, mounts a shared data volume, and streams logs as the agent works.
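The launch steps above (check the environment, then copy the anchor record into a fresh timestamped workspace) can be sketched in Python as follows. This is an illustration only, not the code in nanogpt/run/. BENCHMARK_SESSION_HOURS comes from the description above; treating it as a positive number is an assumption, and so are the workspace naming scheme and the helper names preflight and make_workspace. The API key variable name is passed in by the caller because the summary does not name it.

```python
import os
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preflight(api_key_var, env=None):
    """Check that the agent's API key and the session length are set.

    api_key_var is whatever variable the chosen agent expects (an assumption
    supplied by the caller). Returns the session length in hours.
    """
    env = os.environ if env is None else env
    required = (api_key_var, "BENCHMARK_SESSION_HOURS")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    hours = float(env["BENCHMARK_SESSION_HOURS"])
    if hours <= 0:
        raise ValueError("BENCHMARK_SESSION_HOURS must be positive")
    return hours


def make_workspace(anchor_dir, workspaces_root, now=None):
    """Copy the anchor record into a new timestamped directory and return its path."""
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme; the real launchers may name workspaces differently.
    dest = Path(workspaces_root) / now.strftime("run-%Y%m%dT%H%M%SZ")
    # copytree raises FileExistsError if dest exists, so an earlier run is never overwritten.
    shutil.copytree(anchor_dir, dest)
    return dest
```

Copying into a new directory on every launch keeps each session's edits isolated from the anchor record and from earlier runs.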
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.