Self-improve a Qwen3-0.6B model on gsm8k with synthetic data
Run a 10-hour RL training budget on any Prime hub environment
Probe where a student model fails and target those gaps
Compare synthetic vs real-data-only training as a control
You need a GPU plus three Python libs (verifiers, prime, prime-rl) and a working Prime hub environment before the skill can train anything.
This repository is a Claude Code skill, which is a recipe that Claude can load and follow to do a particular job. In this case, the job is to take a small open-source language model and make it better at a task by training it on questions and answers that Claude itself invents, again and again, until a time budget runs out.
The loop works like this. First, Claude looks at where the small model, called the student, is failing on real training examples. It then writes a generated dataset of 500 to 1000 rows aimed at those weak spots, wraps that dataset in an environment compatible with a library called verifiers, and runs 100 more steps of reinforcement-learning training using another library called prime-rl. After each round, the student is tested on the real held-out test set, not on the synthetic data. The loop continues until either a wall-clock budget such as ten hours is reached or a maximum number of iterations is hit. After the loop, two control runs check whether the gain is real and whether it beats training on real data only.
The README shows one example result. Using a tiny student model called Qwen3-0.6B on the gsm8k maths dataset, real-data training scored 78.54 percent accuracy. After adding about 700 generated rows on top, the score rose to 81.58 percent, a gain of 3.04 percentage points from the synthetic pass alone.
The skill is described as dataset-agnostic, meaning it does not assume the task is maths or code or question answering. You point its --hub-id flag at any Prime hub environment with a working verifiers rubric, and the skill inspects that environment in its first phase and mirrors its parser, rubric and system prompt thereafter. Flags let you change the student model, the budget, the maximum iterations, the rollout batch size, and the starting checkpoint.
Installation needs three Python libraries: verifiers, prime, and prime-rl, installable through uv, pip, or local clones. You then copy the skill folder into your Claude Code skills directory and invoke it with a slash command followed by the dataset name. The project is released under the MIT licence.
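The loop described above (probe the student, generate targeted rows, train, evaluate on the held-out set, stop on budget or iteration cap) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual code: the function name self_improve, the LoopResult class, and the four callables (probe, generate, train, evaluate) are hypothetical placeholders, and nothing here calls verifiers or prime-rl.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    """Hypothetical container for the outcome of one self-improvement run."""
    scores: List[float] = field(default_factory=list)  # held-out score per round
    stop_reason: str = ""  # "budget" or "max_iterations"


def self_improve(
    probe: Callable[[], list],           # find examples the student gets wrong
    generate: Callable[[list], list],    # write synthetic rows aimed at them
    train: Callable[[list], None],       # run one more block of RL training
    evaluate: Callable[[], float],       # score on the real held-out test set
    budget_seconds: float,
    max_iterations: int,
    clock: Callable[[], float] = time.monotonic,
) -> LoopResult:
    """Repeat probe -> generate -> train -> evaluate until a limit is hit."""
    result = LoopResult()
    start = clock()
    for _ in range(max_iterations):
        # Check the wall-clock budget before starting another round.
        if clock() - start >= budget_seconds:
            result.stop_reason = "budget"
            return result
        failures = probe()
        rows = generate(failures)
        train(rows)
        # Evaluation uses the held-out set, never the synthetic rows.
        result.scores.append(evaluate())
    result.stop_reason = "max_iterations"
    return result


if __name__ == "__main__":
    # Stub components stand in for the real probing, generation and training.
    outcome = self_improve(
        probe=lambda: ["example failure"],
        generate=lambda failures: [{"question": f, "answer": ""} for f in failures],
        train=lambda rows: None,
        evaluate=lambda: 0.0,
        budget_seconds=10 * 3600,  # ten-hour budget, as in the example above
        max_iterations=3,
    )
    print(outcome)
```

The clock is passed in as a parameter so the stopping rule can be exercised without waiting for real time to pass. The budget is checked only at the start of each round, so in this sketch a round that has already started runs to completion even if it overshoots the budget.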
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.