Score a base language model on Parrot vs Intelligence answers
Plot how scores shift across pre-training checkpoints
Add a new probe task with paired Parrot and Intelligence answers
Reproduce the Olmo-3-1025-7B eval from the blog post
Needs a GPU host with vLLM and a Hugging Face model checkpoint; the eval dataset auto-downloads from the hub.
GDsuite is a small evaluation kit for researchers studying how large language models learn during pre-training. The author calls it a toy eval suite for tracing generalization dynamics. Every task asks the same kind of question: faced with a tricky prompt, does the model copy a surface pattern it has seen (the README calls this Parrot behavior), or does it apply real reasoning (Intelligence behavior)?
The README table lists six tasks. Three of them probe in-context learning. Flipped Answer flips the sentiment labels in the in-context examples relative to the usual mapping, to see if the model still copies the old mapping. Repetitive Answer feeds three examples that all share the same numeric answer, to see if the model just repeats it. Successive Answer chains arithmetic examples whose answers form a sequence, to see if the model continues the sequence instead of solving the new problem.
The other three tasks cover different angles. Truthy Answer tests whether the model picks an answer that sounds true over one that is actually true. Intuitive Answer is a zero-shot test using the bat-and-ball puzzle, to see if the model gives the gut answer of 10 cents instead of the correct 5 cents. Multi-hop Persona QA checks whether the model links separate facts into a coherent persona or treats them as disconnected. Each item lists what a Parrot model would say and what an Intelligence model would say, so the result for an item is simply which of the two answers the model gave.
To use it, clone the repo, install vllm, torch, transformers, pyyaml, datasets, and huggingface_hub, then run run_eval.py with a model name and an output directory. The README shows an example using an early checkpoint of allenai/Olmo-3-1025-7B. The eval data lives on the Hugging Face hub under jiaxin-wen/generalization-dynamics-evals, and the script downloads it on first run, so no manual data setup is needed.
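The paired-answer idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ProbeItem` fields, the `Q:`/`A:` prompt layout, the example arithmetic, and the string-matching scorer are hypothetical and are not taken from the repo, whose actual data schema and scoring method should be checked in run_eval.py.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeItem:
    """Hypothetical probe: a prompt plus the two answers it separates."""
    prompt: str
    parrot_answer: str        # what a surface-pattern copier would say
    intelligence_answer: str  # what solving the problem actually yields


def few_shot_prompt(examples, query):
    """Join (question, answer) shots and leave the final answer open."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])


def _mentions(answer, text):
    """True if `answer` occurs in `text` as a whole token, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(answer.lower()) + r"(?!\w)"
    return re.search(pattern, text.lower()) is not None


def classify(item, completion):
    """Label a completion 'parrot', 'intelligence', or 'other'."""
    parrot = _mentions(item.parrot_answer, completion)
    smart = _mentions(item.intelligence_answer, completion)
    if parrot == smart:  # neither answer, or both: ambiguous
        return "other"
    return "parrot" if parrot else "intelligence"


def parrot_rate(item, completions):
    """Fraction of completions labeled 'parrot' (0.0 for an empty list)."""
    if not completions:
        return 0.0
    labels = [classify(item, c) for c in completions]
    return labels.count("parrot") / len(labels)


# Repetitive Answer style: three shots share the answer 12, the query does not.
repetitive = ProbeItem(
    prompt=few_shot_prompt(
        [("6 + 6 = ?", "12"), ("3 * 4 = ?", "12"), ("20 - 8 = ?", "12")],
        "3 + 4 = ?",
    ),
    parrot_answer="12",
    intelligence_answer="7",
)

# Intuitive Answer style: zero-shot bat-and-ball puzzle.
bat_and_ball = ProbeItem(
    prompt=(
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
        "than the ball. How much does the ball cost?"
    ),
    parrot_answer="10 cents",
    intelligence_answer="5 cents",
)
```

Calling `classify(bat_and_ball, completion)` on each sampled completion and tracking `parrot_rate` per checkpoint would give one curve per task. Whole-token matching is used so that a reply such as "25 cents" is not miscounted as the Intelligence answer "5 cents".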
The README links to a longer blog post for the full theory and gives a citation entry for the work.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.