explaingit

vivekvkashyap/synthetic-self-improve-rl

22 | Audience · researcher | Complexity · 4/5 | Active | License | Setup · hard

TLDR

A Claude Code skill that runs a loop where Claude invents synthetic training data, retrains a small student model with prime-rl, and tests on real data until a time budget runs out.

Mindmap

mindmap
  root((synthetic-self-improve-rl))
    Inputs
      Prime hub dataset
      Student checkpoint
      Time budget
    Outputs
      Improved model
      Synthetic rows
      Eval scores
    Use Cases
      Boost a small LLM
      Generate task data
      Run targeted RL
    Tech Stack
      Python
      prime-rl
      verifiers
      Claude Code

Things people build with this

USE CASE 1

Self-improve a Qwen3-0.6B model on gsm8k with synthetic data

USE CASE 2

Run a 10-hour RL training budget on any Prime hub environment

USE CASE 3

Probe where a student model fails and target those gaps

USE CASE 4

Compare synthetic vs real-data-only training as a control
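
For the control comparison in use case 4, the difference between a real-data-only run and a run with synthetic rows can be summarized in a couple of numbers. The sketch below is illustrative only: the function name `summarize_gain` and the "relative error reduction" metric are assumptions, not something the skill is known to report. The two accuracy figures are the example quoted from the README (78.54 percent real-data only, 81.58 percent with synthetic rows).

```python
def summarize_gain(baseline_pct: float, treated_pct: float) -> dict:
    """Compare two accuracy percentages from a control run and a treated run.

    Hypothetical helper for illustration; not part of the skill.
    """
    gain = treated_pct - baseline_pct          # absolute gain in percentage points
    baseline_error = 100.0 - baseline_pct      # error rate of the control run
    # Fraction of the control run's errors that the treated run removed.
    relative = gain / baseline_error if baseline_error > 0 else 0.0
    return {
        "gain_points": round(gain, 2),
        "relative_error_reduction": round(relative, 4),
    }


# Figures quoted from the README example (Qwen3-0.6B on gsm8k).
summary = summarize_gain(baseline_pct=78.54, treated_pct=81.58)
print(summary)
```

Reporting the gain relative to the remaining error as well as in raw points makes it easier to compare runs that start from different baselines.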

Tech stack

Python · prime-rl · verifiers · Claude Code

Getting it running

Difficulty · hard | Time to first run · 1 day+

You need a GPU plus three Python libs (verifiers, prime, prime-rl) and a working Prime hub environment before the skill can train anything.
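
Before attempting a first run, it can help to check that the prerequisites are visible from your Python environment. The following is a minimal sketch, not part of the skill. The import names `verifiers` and `prime_rl` and the executable name `prime` are assumptions about how the three libraries expose themselves, so adjust them to match your install.

```python
import shutil
from importlib.util import find_spec
from typing import Iterable, List

# Assumed import and executable names; verify against your own environment.
ASSUMED_MODULES = ("verifiers", "prime_rl")
ASSUMED_EXECUTABLES = ("prime",)


def missing_modules(names: Iterable[str] = ASSUMED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


def missing_executables(names: Iterable[str] = ASSUMED_EXECUTABLES) -> List[str]:
    """Return the command names that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing modules:", missing_modules())
    print("Missing executables:", missing_executables())
```

An empty list from both checks only shows the packages are discoverable. It does not confirm that a GPU or a working Prime hub environment is available.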

MIT licensed: free to use, modify, and redistribute commercially as long as the copyright notice is included.

In plain English

This repository is a Claude Code skill, which is a recipe that Claude can load and follow to do a particular job. In this case, the job is to take a small open-source language model and make it better at a task by training it on questions and answers that Claude itself invents, again and again, until a time budget runs out.

The loop works like this. First, Claude looks at where the small model, called the student, is failing on real training examples. It then writes a generated dataset of 500 to 1000 rows aimed at those weak spots, wraps that dataset in an environment compatible with a library called verifiers, and runs 100 more steps of reinforcement-learning training using another library called prime-rl. After each round, the student is tested on the real held-out test set, not on the synthetic data. The loop continues until either a wall-clock budget such as ten hours is reached or a maximum number of iterations is hit. After the loop, two control runs check whether the gain is real and whether it beats training on real data only.

The README shows one example result. Using a tiny student model called Qwen3-0.6B on the gsm8k maths dataset, real-data training scored 78.54 percent accuracy. After adding about 700 generated rows on top, the score rose to 81.58 percent, a gain of just over three points from the synthetic pass alone.

The skill is described as dataset-agnostic, meaning it does not assume the task is maths or code or question answering. You point its --hub-id flag at any Prime hub environment with a working verifiers rubric, and the skill inspects that environment in its first phase and mirrors its parser, rubric and system prompt thereafter. Flags let you change the student model, the budget, the maximum iterations, the rollout batch size, and the starting checkpoint.

Installation needs three Python libraries: verifiers, prime, and prime-rl, installable through uv, pip, or local clones. You then copy the skill folder into your Claude Code skills directory and invoke it with a slash command followed by the dataset name. The project is released under the MIT licence.
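
The control flow of the loop described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's implementation: `self_improve_loop`, `LoopResult`, and the four callables are hypothetical names, and the real skill drives these phases through Claude Code, verifiers, and prime-rl rather than through Python callbacks. Only the stopping rule (wall-clock budget or maximum iterations), the 100 training steps per round, and the evaluation on real held-out data come from the description above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    iterations: int = 0
    scores: List[float] = field(default_factory=list)  # one real-test score per round


def self_improve_loop(
    probe_failures: Callable[[], list],            # hypothetical: find student failures on real data
    generate_rows: Callable[[list], list],         # hypothetical: write synthetic rows for those gaps
    train_steps: Callable[[list, int], None],      # hypothetical: run RL steps on the rows
    evaluate_on_real_test: Callable[[], float],    # hypothetical: score on the real held-out set
    budget_seconds: float,
    max_iterations: int,
    steps_per_round: int = 100,
    clock: Callable[[], float] = time.monotonic,
) -> LoopResult:
    """Repeat probe, generate, train, evaluate until the budget or iteration cap is hit."""
    start = clock()
    result = LoopResult()
    while result.iterations < max_iterations and clock() - start < budget_seconds:
        failures = probe_failures()
        rows = generate_rows(failures)
        train_steps(rows, steps_per_round)
        # Evaluation always uses real held-out data, never the synthetic rows.
        result.scores.append(evaluate_on_real_test())
        result.iterations += 1
    return result
```

The budget is checked only at the start of each round, so a round that begins just inside the budget still runs to completion. Whether the actual skill interrupts a round in progress is not stated in the description.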

Copy-paste prompts

Prompt 1
Run the synthetic-self-improve-rl skill on gsm8k with Qwen3-0.6B and a 4 hour budget
Prompt 2
Adapt the skill to point at a custom Prime hub environment for code generation
Prompt 3
Add a flag to control the number of synthetic rows generated per iteration
Prompt 4
Replace prime-rl with TRL as the training backend and keep the same loop
Prompt 5
Plot the eval accuracy curve across iterations from the skill output logs

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.