explaingit

jiaxin-wen/gdsuite

9 | Python | Audience · researcher | Complexity · 3/5 | Active | Setup · moderate

TLDR

Toy eval suite of six tasks that probes whether a language model copies surface patterns (Parrot) or applies real reasoning (Intelligence) during pre-training.

Mindmap

mindmap
  root((GDsuite))
    Inputs
      HF model name
      Eval data from HF hub
    Outputs
      Parrot vs Intelligence labels
      Per-task scores
    Use Cases
      Track generalization across checkpoints
      Compare base models
      Probe in-context learning
    Tech Stack
      Python
      vLLM
      PyTorch
      Transformers

Things people build with this

USE CASE 1

Score a base language model on Parrot vs Intelligence answers

USE CASE 2

Plot how scores shift across pre-training checkpoints

USE CASE 3

Add a new probe task with paired Parrot and Intelligence answers

USE CASE 4

Reproduce the Olmo-3-1025-7B eval from the blog post

Tech stack

Python · vLLM · PyTorch · Transformers · Hugging Face

Getting it running

Difficulty · moderate | Time to first run · 1h+

Needs a GPU host with vLLM and a Hugging Face model checkpoint; the eval dataset auto-downloads from the hub.
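Before launching a run on a GPU host, it can help to confirm that the packages the README's install step names (vllm, torch, transformers, pyyaml, datasets, huggingface_hub) are importable. The following is a minimal sketch and is not part of the repo; the function name `missing_modules` is hypothetical, and the only library call used is the standard-library `importlib.util.find_spec`.

```python
from importlib.util import find_spec

# Import names for the packages the README's install step lists.
# The pyyaml distribution is imported under the name "yaml".
REQUIRED_MODULES = [
    "vllm",
    "torch",
    "transformers",
    "yaml",
    "datasets",
    "huggingface_hub",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

This only checks that the modules can be located. It does not verify versions, GPU availability, or that vLLM can load a given checkpoint.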

In plain English

GDsuite is a small evaluation kit aimed at researchers studying how large language models learn during pre-training. The author calls it a toy eval suite for tracing generalization dynamics. Each task in the suite asks the same kind of question: when faced with a tricky prompt, does the model copy a surface pattern it has seen (the README calls this Parrot behavior), or does it apply real reasoning (Intelligence behavior)?

The README table lists six tasks. Three of them probe in-context learning. Flipped Answer flips sentiment labels from the training examples to see if the model still copies the old mapping. Repetitive Answer feeds three examples that all share the same numeric answer to see if the model just repeats it. Successive Answer chains arithmetic examples whose answers form a sequence to see if the model continues the sequence instead of solving the new problem.

The other three tasks cover different angles. Truthy Answer tests whether the model picks an answer that sounds true over one that is actually true. Intuitive Answer is a zero-shot test using the bat-and-ball puzzle to see if the model gives the gut answer of 10 cents instead of the correct 5 cents. Multi-hop Persona QA checks whether the model links separate facts into a coherent persona or treats them as disconnected.

Each item lists what a Parrot model would say and what an Intelligence model would say, so the evaluation result is simply whether the model gave the Parrot answer or the Intelligence answer.

To use it, you clone the repo, install vllm, torch, transformers, pyyaml, datasets, and huggingface_hub, then run run_eval.py with a model name and an output directory. The README shows an example using an early checkpoint of allenai/Olmo-3-1025-7B. The eval data itself lives on the Hugging Face hub under jiaxin-wen/generalization-dynamics-evals, and the script downloads it on first run, so no manual data setup is needed.

The README links to a longer blog post for the full theory and gives a citation entry for the work.
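The paired-answer idea can be pictured with a small sketch. This is not the repo's code: `ProbeItem`, `label_response`, `successive_item`, `bat_and_ball_cents`, and `label_rates` are hypothetical names, the exact-match comparison is an assumed scoring rule, and the example prompt is invented for illustration. The bat-and-ball numbers follow the classic wording of the puzzle (the pair costs $1.10 and the bat costs $1.00 more than the ball), which may differ from the prompt used in the suite.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProbeItem:
    """One hypothetical probe item with its two reference answers."""
    prompt: str
    parrot_answer: str
    intelligence_answer: str


def label_response(item: ProbeItem, response: str) -> str:
    """Label a response 'parrot', 'intelligence', or 'other' by exact match (assumed rule)."""
    text = response.strip().lower()
    if text == item.intelligence_answer.lower():
        return "intelligence"
    if text == item.parrot_answer.lower():
        return "parrot"
    return "other"


def successive_item() -> ProbeItem:
    """Invented Successive Answer style item: shot answers run 3, 4, 5."""
    shots = ["Q: 1 + 2 = ?\nA: 3", "Q: 2 + 2 = ?\nA: 4", "Q: 1 + 4 = ?\nA: 5"]
    query = "Q: 7 + 5 = ?\nA:"
    # Continuing the sequence gives 6; solving the new problem gives 12.
    return ProbeItem("\n\n".join(shots + [query]), parrot_answer="6", intelligence_answer="12")


def bat_and_ball_cents(total_cents: int = 110, difference_cents: int = 100) -> int:
    """Solve ball + (ball + difference) = total for the ball's price in cents."""
    return (total_cents - difference_cents) // 2


def label_rates(labels: list[str]) -> dict[str, float]:
    """Fraction of responses per label; all zeros for an empty list."""
    counts = Counter(labels)
    total = len(labels)
    names = ("parrot", "intelligence", "other")
    return {name: (counts[name] / total if total else 0.0) for name in names}


if __name__ == "__main__":
    item = successive_item()
    print(label_response(item, "12"))
    print(bat_and_ball_cents())
```

With the default arguments, `bat_and_ball_cents()` returns 5: the ball costs 5 cents and the bat 105 cents, which sum to 110. The gut answer of 10 cents would make the bat 110 cents and the total 120.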

Copy-paste prompts

Prompt 1
Run GDsuite against a local checkpoint of Olmo-3-1025-7B and save the results
Prompt 2
Walk me through how each of the six tasks separates Parrot from Intelligence answers
Prompt 3
Add a seventh task to GDsuite that probes a different reasoning trap
Prompt 4
Plot Parrot vs Intelligence rates across pre-training checkpoints from one model family
Prompt 5
Explain the bat-and-ball Intuitive Answer task and why 10 cents is the Parrot reply

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.