explaingit

jiayi-pan/tinyzero

13,096 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

A minimal research project that recreates the DeepSeek R1 Zero self-improving AI reasoning experiment on math tasks, reproducible for under $30 using two GPUs.

Mindmap

mindmap
  root((TinyZero))
    What it does
      Recreates DeepSeek R1 Zero
      Self-improving reasoning
    Tasks
      Countdown math
      Multiplication
    Training method
      Reinforcement learning
      No worked examples
      Reward on correct answer
    Requirements
      Python and veRL
      Two GPUs for 3B

Things people build with this

USE CASE 1

Train a small language model to solve math problems through reinforcement learning without providing worked examples.

USE CASE 2

Reproduce the DeepSeek R1 Zero self-verification experiment at a fraction of the original computing cost.

USE CASE 3

Use as a concrete runnable reference when learning how reinforcement learning is applied to language model training.

Tech stack

Python | veRL | PyTorch

Getting it running

Difficulty · hard | Time to first run · 1 day+

Two GPUs are required for the 3B-parameter model, which is the size that shows meaningful reasoning improvements. A single GPU supports only models up to 1.5B parameters.

In plain English

TinyZero is a research project that recreates a specific AI training experiment called DeepSeek R1 Zero, scaled down so it can run without a massive computing budget. The original DeepSeek R1 Zero showed that an AI model could teach itself to reason through problems step by step purely through trial and error, without being shown examples of correct reasoning first. TinyZero demonstrates that same effect on two math tasks: countdown (reaching a target number using arithmetic) and multiplication.

The core idea is reinforcement learning, which means the model is trained by giving it a problem, letting it attempt a solution, and then rewarding it when the answer is right. No step-by-step worked examples are provided. Over many training rounds, a 3-billion-parameter language model gradually develops what the authors describe as self-verification and search abilities, meaning it starts checking its own work and exploring different approaches before committing to an answer. The authors call this moment of capability emerging the "Aha moment," and they say you can reproduce it yourself for under $30 in cloud computing costs.

The project is built on top of an existing training library called veRL and uses Qwen2.5 series base models. Setup involves installing Python dependencies and running shell scripts to prepare data and launch training. Single-GPU training works for smaller models up to 1.5 billion parameters, while larger 3-billion-parameter models require two GPUs and show more meaningful reasoning improvements. The README provides exact commands for both configurations.

One thing to note: as of the time of archival, the authors have deprecated this repository and recommend using the veRL library directly for new reinforcement learning experiments. TinyZero remains available for reference, and the full training logs from the original experiments are publicly accessible online.

If you want to understand how modern AI reasoning models are trained without reading dense research papers, TinyZero is a concrete, runnable example that walks through the full process from data preparation to training completion.
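
To make "rewarding it when the answer is right" concrete, here is a minimal sketch of what an outcome-only reward for the countdown task could look like. It is an illustration, not the code in the repo: the function name `countdown_reward`, the `<answer>...</answer>` tag convention, the rule that every given number is used exactly once, and the 0.0/1.0 reward values are all assumptions made for this example.

```python
import ast
import operator
import re
from collections import Counter

# Assumed convention (not taken from the repo): the model wraps its final
# equation in <answer>...</answer> tags at the end of its reasoning.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Only the four basic arithmetic operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed arithmetic expression.

    Returns (value, list of integer literals used). Raises ValueError for
    anything other than integers combined with +, -, *, /.
    """
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        return node.value, [node.value]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left, left_nums = _evaluate(node.left)
        right, right_nums = _evaluate(node.right)
        return _OPS[type(node.op)](left, right), left_nums + right_nums
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target):
    """Hypothetical reward: 1.0 for a valid, correct equation, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no final answer found
    expr = matches[-1].strip()  # score only the last answer given
    try:
        value, used = _evaluate(ast.parse(expr, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or disallowed expression
    if Counter(used) != Counter(numbers):
        return 0.0  # assumed rule: each given number used exactly once
    return 1.0 if abs(value - target) < 1e-9 else 0.0
```

A call such as `countdown_reward("... <answer>(25 + 3) * 2</answer>", [2, 3, 25], 56)` scores only the final equation. The reasoning text before the tag is never graded, which is the sense in which no worked examples are provided: the model is free to find whatever intermediate steps lead to a correct answer.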

Copy-paste prompts

Prompt 1
Using TinyZero, show me the exact commands to prepare the countdown dataset and start single-GPU training on the 1.5B model.
Prompt 2
Explain the reward function used in TinyZero for the countdown task and how it signals correct or incorrect answers to the model.
Prompt 3
How does TinyZero implement the reinforcement learning training loop using veRL, and where in the code does the model receive its reward?
Prompt 4
What changes would I need to make to TinyZero to train on a custom arithmetic dataset instead of countdown or multiplication?
Prompt 5
Why does TinyZero require two GPUs for the 3B model but only one for 1.5B, and what difference does it make to the results?

Verify against the repo before relying on details.