vivekvkashyap/synthetic-self-improve-rl

Analysis updated 2026-06-24

★ 19 · Audience · researcher · Complexity · 4/5 · License · MIT · Setup · hard

Mindmap

mindmap
  root((synthetic-self-improve-rl))
    Inputs
      Prime hub dataset
      Student checkpoint
      Time budget
    Outputs
      Improved model
      Synthetic rows
      Eval scores
    Use Cases
      Boost a small LLM
      Generate task data
      Run targeted RL
    Tech Stack
      Python
      prime-rl
      verifiers
      Claude Code


What do people build with it?

USE CASE 1

Self-improve a Qwen3-0.6B model on gsm8k with synthetic data

USE CASE 2

Run a 10-hour RL training budget on any Prime hub environment

USE CASE 3

Probe where a student model fails and target those gaps

USE CASE 4

Compare synthetic vs real-data-only training as a control

What is it built with?

Python · prime-rl · verifiers · Claude Code

How does it compare?

	vivekvkashyap/synthetic-self-improve-rl	16nic/comfyui-agnes-ai	6c696e68/gpt_signup_hybrid
Stars	19	19	19
Language	—	Python	Python
Setup difficulty	hard	moderate	hard
Complexity	4/5	2/5	4/5
Audience	researcher	vibe coder	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard · Time to first run · 1 day+

You need a GPU plus three Python libs (verifiers, prime, prime-rl) and a working Prime hub environment before the skill can train anything.
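Before invoking the skill, it can help to confirm that the three libraries are visible to the Python interpreter. The following is a minimal sketch, not part of the repository: the import names in `ASSUMED_MODULES` are guesses based on the package names and may differ from the real module names, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# ASSUMPTION: these import names are inferred from the package names
# (verifiers, prime, prime-rl); the installed module names may differ.
ASSUMED_MODULES = ["verifiers", "prime", "prime_rl"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All assumed modules are importable.")
```

An empty result only shows that the modules can be located; it does not confirm a usable GPU or a working Prime hub environment.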

MIT licensed: free to use, modify, and redistribute commercially as long as the copyright notice is included.

In plain English

This repository is a Claude Code skill, which is a recipe that Claude can load and follow to do a particular job. In this case, the job is to take a small open-source language model and make it better at a task by training it on questions and answers that Claude itself invents, again and again, until a time budget runs out.

The loop works like this. First, Claude looks at where the small model, called the student, is failing on real training examples. It then writes a generated dataset of 500 to 1000 rows aimed at those weak spots, wraps that dataset in an environment compatible with a library called verifiers, and runs 100 more steps of reinforcement-learning training using another library called prime-rl. After each round, the student is tested on the real held-out test set, not on the synthetic data. The loop continues until either a wall-clock budget such as ten hours is reached or a maximum number of iterations is hit. After the loop, two control runs check whether the gain is real and whether it beats training on real data only.

The README shows one example result. Using a tiny student model called Qwen3-0.6B on the gsm8k maths dataset, real-data training scored 78.54 percent accuracy. After adding about 700 generated rows on top, the score rose to 81.58 percent, a gain of just over three points from the synthetic pass alone.

The skill is described as dataset-agnostic, meaning it does not assume the task is maths or code or question answering. You point its --hub-id flag at any Prime hub environment with a working verifiers rubric, and the skill inspects that environment in its first phase and mirrors its parser, rubric and system prompt thereafter. Flags let you change the student model, the budget, the maximum iterations, the rollout batch size, and the starting checkpoint.

Installation needs three Python libraries: verifiers, prime, and prime-rl, installable through uv, pip, or local clones. You then copy the skill folder into your Claude Code skills directory and invoke it with a slash command followed by the dataset name. The project is released under the MIT licence.
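The control flow of the loop described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's actual code: `self_improve`, `LoopResult`, `gain_points`, and the four callables are hypothetical names, and the real probing, data generation, and training are done by Claude, verifiers, and prime-rl rather than by these stubs.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    scores: List[float] = field(default_factory=list)  # held-out score after each round
    iterations: int = 0


def self_improve(
    probe_failures: Callable[[], list],         # hypothetical: find where the student fails
    generate_rows: Callable[[list], list],      # hypothetical: write targeted synthetic rows
    train_steps: Callable[[list, int], None],   # hypothetical: run N more RL steps
    evaluate: Callable[[], float],              # hypothetical: score on the real test set
    budget_seconds: float,
    max_iterations: int,
    steps_per_round: int = 100,
    clock: Callable[[], float] = time.monotonic,
) -> LoopResult:
    """Repeat probe -> generate -> train -> evaluate until a stop condition is met."""
    start = clock()
    result = LoopResult()
    # Stop when either the iteration cap or the wall-clock budget is reached.
    while result.iterations < max_iterations and clock() - start < budget_seconds:
        failures = probe_failures()
        rows = generate_rows(failures)
        train_steps(rows, steps_per_round)
        result.scores.append(evaluate())  # always evaluated on real data, not synthetic
        result.iterations += 1
    return result


def gain_points(baseline: float, improved: float) -> float:
    """Percentage-point difference between two accuracy scores."""
    return round(improved - baseline, 2)


if __name__ == "__main__":
    # Figures quoted from the README example: 78.54 -> 81.58 percent on gsm8k.
    print(gain_points(78.54, 81.58))
```

Note that the budget is checked only between rounds in this sketch, so a round that starts just before the deadline still runs to completion.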

Copy-paste prompts

Prompt 1

Run the synthetic-self-improve-rl skill on gsm8k with Qwen3-0.6B and a 4 hour budget

Prompt 2

Adapt the skill to point at a custom Prime hub environment for code generation

Prompt 3

Add a flag to control the number of synthetic rows generated per iteration

Prompt 4

Replace prime-rl with TRL as the training backend and keep the same loop

Prompt 5

Plot the eval accuracy curve across iterations from the skill output logs

Frequently asked questions

What is synthetic-self-improve-rl?

A Claude Code skill that runs a loop where Claude invents synthetic training data, retrains a small student model with prime-rl, and tests on real data until a time budget runs out.

What license does synthetic-self-improve-rl use?

MIT licensed: free to use, modify, and redistribute commercially as long as the copyright notice is included.

How hard is synthetic-self-improve-rl to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is synthetic-self-improve-rl for?

Mainly researchers.


Verify against the repo before relying on details.