intologyai/nanogpt-bench

Analysis updated 2026-06-24

★ 13 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((NanoGPT-Bench))
    Inputs
      Human anchor record
      Compute budget
      Agent API key
    Outputs
      Validated speedups
      Run logs
      Timestamped workspaces
    Use Cases
      Score frontier coding agents
      Compare agent prompting styles
      Refresh benchmark anchor
    Tech Stack
      Python
      Docker
      CUDA
      H100

What do people build with it?

USE CASE 1

Benchmark a frontier coding agent like Codex or Claude Code on autonomous ML research

USE CASE 2

Compare prompting strategies such as Autoresearch against a default Claude Code setup

USE CASE 3

Refresh the benchmark by swapping in a newer human anchor record

USE CASE 4

Reproduce the reported result that no agent recovered more than 10% of the human speedup

What is it built with?

Python, Docker, CUDA, PyTorch

How does it compare?

	intologyai/nanogpt-bench	1lystore/awaek	actashui/sjtu-ppt-template-skill
Stars	13	13	13
Language	Python	Python	Python
Setup difficulty	hard	moderate	moderate
Complexity	5/5	2/5	2/5
Audience	researcher	vibe coder	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Practical runs need H100-class GPUs, Docker with CUDA, large dataset prefetch, and an API key for the agent being evaluated.

In plain English

NanoGPT-Bench is a benchmark for testing how well AI coding agents can do long, open-ended machine-learning research on their own. It is built on top of an existing community project called the NanoGPT Speedrun, in which people compete to pretrain a small GPT-2 model as fast as possible. The leaderboard has a long history of human submissions, each one a small algorithmic improvement, and this benchmark uses that history as a yardstick for autonomous agents.

In a run, an AI agent is dropped into a sandboxed container with a strong starting point taken from the human leaderboard, no internet access, and a fixed compute budget. The agent then has to come up with its own ideas to make the training script faster. To check a candidate, it calls a submit command inside the container. The submitter does two things: it runs an LLM judge that checks the change against the speedrun competition rules, then it retimes the candidate ten times to confirm any speedup is statistically significant. Both the starting record and the compute budget are knobs, so the benchmark can be refreshed later without contamination.

The project tested three frontier coding agents: Codex backed by GPT-5.4 xhigh, Claude Code backed by Opus 4.6 Max, and a second Claude Code variant using a prompting style from a project called Autoresearch. Each had 512 H100-hours of compute and started from the human world record set on September 3rd, 2025. None of them recovered more than 10 percent of the speedup that humans found over the following five months, and the agents spent most of their compute on tuning hyperparameters, while around 77 percent of human records involved real algorithmic changes.

The repository is organized into a host-side harness under nanogpt/, a Docker image under image/ that holds the training environment and the submit validator, and human_baselines/ with snapshots of historical record submissions.
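The two-stage submit check described above (rules judge first, then ten retimings) can be illustrated with a minimal sketch. This is not the repository's validator: the function names, the choice of a one-sided one-sample t-test against a fixed anchor time, and the 0.01 significance level are all assumptions made for illustration, since the source only says the speedup must be "statistically significant".

```python
import math
import statistics

# One-sided critical value of Student's t for 9 degrees of freedom
# at alpha = 0.01 (assumed significance level, not taken from the repo).
T_CRIT_DF9_ALPHA_01 = 2.821


def is_significant_speedup(candidate_times, anchor_time):
    """Return True if the mean candidate time is significantly below anchor_time.

    Hypothetical helper: tests H0 "true mean >= anchor_time" against
    H1 "true mean < anchor_time" using ten timing samples.
    """
    if len(candidate_times) != 10:
        raise ValueError("this sketch assumes exactly ten retimings")
    mean = statistics.mean(candidate_times)
    stdev = statistics.stdev(candidate_times)  # sample standard deviation
    if stdev == 0:
        # No variance at all: fall back to a plain comparison.
        return mean < anchor_time
    std_err = stdev / math.sqrt(len(candidate_times))
    t_stat = (anchor_time - mean) / std_err
    return t_stat > T_CRIT_DF9_ALPHA_01


def validate_submission(judge_approved, candidate_times, anchor_time):
    """Accept a candidate only if the judge approved it AND the speedup is significant."""
    if not judge_approved:
        return False  # rule violations are rejected before any timing is considered
    return is_significant_speedup(candidate_times, anchor_time)
```

In this sketch the judge verdict is passed in as a plain boolean, so the statistical part can be exercised without any LLM call. A real validator would probably compare against retimed anchor runs rather than a single fixed number.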
To run the benchmark, you build the Docker image once, which prefetches nine FineWeb10B training shards plus a validation shard. Then you export the API key for the agent you want to test, set BENCHMARK_SESSION_HOURS, and run one of the launcher scripts under nanogpt/run/. Each launcher copies the anchor record into a fresh timestamped workspace, mounts a shared data volume, and streams logs as the agent works.
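The launcher behaviour described above (copy the anchor record into a fresh timestamped workspace, then run with the API key and BENCHMARK_SESSION_HOURS set) can be sketched roughly as follows. Only the BENCHMARK_SESSION_HOURS variable name comes from the description; the function names, the `run-<timestamp>` directory naming, and the API key variable passed in by the caller are hypothetical. The actual launcher scripts under nanogpt/run/ may work differently.

```python
import os
import shutil
from datetime import datetime, timezone
from pathlib import Path


def prepare_workspace(anchor_dir, workspaces_root):
    """Copy the anchor record into a new timestamped directory and return its path.

    Hypothetical helper; the directory naming scheme is an assumption.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    workspace = Path(workspaces_root) / f"run-{stamp}"
    # copytree creates missing parent directories and raises if the
    # destination already exists, so each run gets a fresh workspace.
    shutil.copytree(anchor_dir, workspace)
    return workspace


def launch_env(api_key_var, api_key, session_hours, base_env=None):
    """Build the environment for a launcher process without mutating os.environ."""
    env = dict(os.environ if base_env is None else base_env)
    env[api_key_var] = api_key  # variable name depends on the agent being tested
    env["BENCHMARK_SESSION_HOURS"] = str(session_hours)
    return env
```

The returned environment could then be passed to `subprocess.run(..., env=env)` together with whichever launcher script is being used. Keeping the copy and the environment setup as separate functions makes each step easy to check on its own before spending GPU time.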

Copy-paste prompts

Prompt 1

Walk me through building the NanoGPT-Bench Docker image and prefetching the FineWeb10B shards

Prompt 2

Show me how to launch a Claude Code run with a 512 H100-hour budget using the scripts under nanogpt/run/

Prompt 3

Explain how the submit command validates a candidate with an LLM judge and ten retimings

Prompt 4

How do I plug a new agent into the harness so it can call the in-container submit command

Prompt 5

What does the human_baselines folder contain and how is the anchor record copied into a fresh workspace

Frequently asked questions

What is nanogpt-bench?

A benchmark that drops an AI coding agent into a sandboxed container and measures how much it can speed up the NanoGPT pretraining speedrun relative to human leaderboard records.

What language is nanogpt-bench written in?

Mainly Python. The stack also includes Docker, CUDA, and PyTorch.

How hard is nanogpt-bench to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is nanogpt-bench for?

Mainly researchers.

Verify against the repo before relying on details.