jiaxin-wen/gdsuite

Analysis updated 2026-06-24

★ 8 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((GDsuite))
    Inputs
      HF model name
      Eval data from HF hub
    Outputs
      Parrot vs Intelligence labels
      Per-task scores
    Use Cases
      Track generalization across checkpoints
      Compare base models
      Probe in-context learning
    Tech Stack
      Python
      vLLM
      PyTorch
      Transformers

What do people build with it?

USE CASE 1

Score a base language model on Parrot vs Intelligence answers

USE CASE 2

Plot how scores shift across pre-training checkpoints

USE CASE 3

Add a new probe task with paired Parrot and Intelligence answers

USE CASE 4

Reproduce the Olmo-3-1025-7B eval from the blog post
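
For use case 2, one plausible way to summarize per-checkpoint results is a Parrot rate: the share of items where the model gave the Parrot answer. The sketch below is only an illustration. The function name, the checkpoint names, and the label lists are hypothetical placeholders, not GDsuite's actual output format or real results.

```python
from collections import Counter


def parrot_rate(labels):
    """Fraction of items labeled 'parrot'; returns 0.0 for an empty list."""
    if not labels:
        return 0.0
    return Counter(labels)["parrot"] / len(labels)


# Placeholder labels keyed by hypothetical checkpoint names (not real results).
labels_by_checkpoint = {
    "step-A": ["parrot", "parrot", "intelligence", "parrot"],
    "step-B": ["parrot", "intelligence", "intelligence", "intelligence"],
}

# One rate per checkpoint, ready to hand to any plotting library.
rates = {name: parrot_rate(labels) for name, labels in labels_by_checkpoint.items()}
for name, rate in rates.items():
    print(f"{name}: parrot rate = {rate:.2f}")
```

The resulting dictionary maps each checkpoint to a single number, so plotting it against training step is a one-line call in most plotting libraries.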

What is it built with?

Python, vLLM, PyTorch, Transformers, Hugging Face

How does it compare?

	jiaxin-wen/gdsuite	adam-s/car-diagnosis	bongobongo2020/krea2-character-lora-trainer
Stars	8	8	8
Language	Python	Python	Python
Setup difficulty	moderate	moderate	moderate
Complexity	3/5	3/5	3/5
Audience	researcher	researcher	vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 1h+

Needs a GPU host with vLLM and a Hugging Face model checkpoint. The eval dataset downloads automatically from the hub.
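
Before a long first run, it can help to confirm that the Python dependencies this page lists for GDsuite are importable. The check below is a minimal sketch and not part of the repo. The helper name is hypothetical, and the module list assumes the usual import names (the `pyyaml` package is imported as `yaml`). It does not check for a GPU.

```python
import importlib.util

# Importable module names for the dependencies listed on this page.
# Note: the PyPI package "pyyaml" is imported as "yaml".
REQUIRED_MODULES = ["vllm", "torch", "transformers", "yaml", "datasets", "huggingface_hub"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

`importlib.util.find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as vLLM and PyTorch.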

In plain English

GDsuite is a small evaluation kit for researchers studying how large language models learn during pre-training. The author calls it a toy eval suite for tracing generalization dynamics. Every task asks the same kind of question: faced with a tricky prompt, does the model copy a surface pattern it has seen (the README calls this Parrot behavior), or does it apply real reasoning (Intelligence behavior)?

The README table lists six tasks. Three of them probe in-context learning. Flipped Answer flips the sentiment labels in the examples to see whether the model still copies the old mapping. Repetitive Answer feeds three examples that all share the same numeric answer to see whether the model just repeats it. Successive Answer chains arithmetic examples whose answers form a sequence to see whether the model continues the sequence instead of solving the new problem.

The other three tasks cover different angles. Truthy Answer tests whether the model picks an answer that sounds true over one that is actually true. Intuitive Answer is a zero-shot test using the bat-and-ball puzzle to see whether the model gives the gut answer of 10 cents instead of the correct 5 cents. Multi-hop Persona QA checks whether the model links separate facts into a coherent persona or treats them as disconnected.

Each item lists what a Parrot model would say and what an Intelligence model would say, so the result for an item is simply which of the two answers the model gave.

To use it, clone the repo, install vllm, torch, transformers, pyyaml, datasets, and huggingface_hub, then run run_eval.py with a model name and an output directory. The README shows an example using an early checkpoint of allenai/Olmo-3-1025-7B. The eval data lives on the Hugging Face hub under jiaxin-wen/generalization-dynamics-evals, and the script downloads it on first run, so no manual data setup is needed.

The README links to a longer blog post for the full theory and gives a citation entry for the work.
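
The paired-answer idea can be sketched in a few lines. This is an illustration under stated assumptions, not GDsuite's scoring code: the `ProbeItem` layout, the `label_response` helper, and the exact-match rule are all hypothetical, and the prompt uses the classic wording of the bat-and-ball puzzle, which may differ from the suite's text.

```python
from dataclasses import dataclass


@dataclass
class ProbeItem:
    """Hypothetical item layout: one prompt plus the two reference answers."""
    prompt: str
    parrot_answer: str
    intelligence_answer: str


def label_response(item: ProbeItem, response: str) -> str:
    """Return 'parrot', 'intelligence', or 'other' by exact match after normalization."""
    normalized = response.strip().lower()
    if normalized == item.intelligence_answer.strip().lower():
        return "intelligence"
    if normalized == item.parrot_answer.strip().lower():
        return "parrot"
    return "other"


# Bat-and-ball in integer cents: the pair costs 110 and the bat costs 100 more.
# ball + (ball + 100) = 110, so ball = (110 - 100) / 2.
total_cents, difference_cents = 110, 100
ball_cents = (total_cents - difference_cents) // 2

item = ProbeItem(
    prompt=(
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
        "than the ball. How much does the ball cost?"
    ),
    parrot_answer="10 cents",  # the gut answer: 10 + 110 would total 120, not 110
    intelligence_answer=f"{ball_cents} cents",
)

print(label_response(item, " 5 cents "))
print(label_response(item, "10 cents"))
```

Exact matching is the simplest possible rule. A real harness might instead compare model likelihoods of the two answers or parse free-form text, so treat the matching step as a stand-in.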

Copy-paste prompts

Prompt 1

Run GDsuite against a local checkpoint of Olmo-3-1025-7B and save the results

Prompt 2

Walk me through how each of the six tasks separates Parrot from Intelligence answers

Prompt 3

Add a seventh task to GDsuite that probes a different reasoning trap

Prompt 4

Plot Parrot vs Intelligence rates across pre-training checkpoints from one model family

Prompt 5

Explain the bat-and-ball Intuitive Answer task and why 10 cents is the Parrot reply

Frequently asked questions

What is gdsuite?

A toy eval suite of six tasks for tracing generalization during pre-training. It probes whether a language model copies surface patterns (Parrot) or applies real reasoning (Intelligence).

What language is gdsuite written in?

Mainly Python. The stack also includes vLLM, PyTorch, and Transformers.

How hard is gdsuite to set up?

Setup difficulty is rated moderate. Expect an hour or more to reach a first successful run.

Who is gdsuite for?

Mainly researchers.

Verify against the repo before relying on details.