yinqibai962/d-arl

Analysis updated 2026-06-24

★ 24 | Python | Audience · researcher | Complexity · 5/5 | License · Apache 2.0 | Setup · hard

Mindmap

mindmap
  root((D-ARL))
    Inputs
      Math benchmarks
      Code benchmarks
      Qwen3 base models
      Shell config scripts
    Outputs
      Trained reasoning models
      Eval scores
      Training logs
    Use Cases
      Async RL research
      Reasoning model training
      Slurm cluster runs
      Baseline ablations
    Tech Stack
      Python
      PyTorch
      verl
      Conda
      Slurm

What do people build with it?

USE CASE 1

Reproduce D-ARL results on AIME, MATH-500, and LiveCodeBench

USE CASE 2

Compare asynchronous RL with and without the variance-guided sample filter

USE CASE 3

Launch D-ARL training jobs on an existing Slurm cluster

USE CASE 4

Adapt the per-version optimization trick to a different verl-based RL pipeline

What is it built with?

Python · PyTorch · verl · Conda · Slurm

How does it compare?

	yinqibai962/d-arl	18597990650-lab/multi-agent-game	agents365-ai/cloakfetch
Stars	24	24	24
Language	Python	Python	Python
Setup difficulty	hard	moderate	moderate
Complexity	5/5	3/5	3/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a conda environment, the verl framework, multiple GPUs, and ideally a Slurm cluster to reproduce paper-scale runs on Qwen3 models.
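Before committing to a day-long setup, it can help to check which of these prerequisites are already present on a machine. The sketch below is not part of the repository: the function name `preflight` and the two-GPU threshold are assumptions chosen for illustration. It only looks for the `conda` and `sbatch` executables, the `torch` and `verl` Python packages, and the number of visible GPUs.

```python
import importlib.util
import shutil


def preflight():
    """Report which prerequisites are visible from the current environment.

    ASSUMPTION: this helper and its threshold of two GPUs are illustrative;
    the repository does not ship such a check.
    """
    report = {
        "conda": shutil.which("conda") is not None,    # conda executable on PATH
        "sbatch": shutil.which("sbatch") is not None,  # Slurm submit command on PATH
        "torch": importlib.util.find_spec("torch") is not None,
        "verl": importlib.util.find_spec("verl") is not None,
    }
    gpu_count = 0
    if report["torch"]:
        try:
            import torch
            gpu_count = torch.cuda.device_count()
        except Exception:
            gpu_count = 0  # a broken install counts as "no GPUs visible"
    report["multi_gpu"] = gpu_count >= 2
    return report


report = preflight()
for name, ok in report.items():
    print(f"{name:10s} {'ok' if ok else 'missing'}")
```

A missing `sbatch` is not fatal, since Slurm is only needed for cluster-scale runs.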

Apache 2.0 license: you can use, modify, and ship it commercially, provided you retain the license text and attribution notices. The license also includes an express patent grant.

In plain English

D-ARL is the source code for a research paper accepted to the ICML 2026 conference. The work is about training large language models so they get better at reasoning, using a technique called reinforcement learning. In reinforcement learning the model practices solving problems, gets rewarded when it does well, and then updates itself based on those rewards.

The paper tackles a specific problem in a faster style of training called asynchronous reinforcement learning. In this style, one part of the system keeps generating new practice attempts (rollouts) while another part keeps updating the model, so neither has to wait for the other. The catch is that by the time a rollout is used for an update, the model that generated it is already out of date. The data no longer matches the current model, and training can become unstable.

D-ARL offers three pieces of machinery to address this. It keeps a memory buffer of the most recent past versions of the model and replays their data. It then picks out the practice samples that still line up well with the current model, guided by a measure called variance. Finally, it uses an optimization method that treats data from different past model versions as distinct sources instead of mashing them together.

The code is built on an existing open-source framework called verl from Volcano Engine, which already handles a lot of the plumbing for training language models with reinforcement learning. To use D-ARL you create a Python conda environment, install the requirements, and run one of the provided shell scripts. There are scripts for the D-ARL configuration as well as baseline scripts that turn the new features off, so you can compare results. A separate script is included for launching jobs on a Slurm cluster.

The paper tests the method on six public reasoning benchmarks: grade-school math, the AIME competition, MATH-500, and LightEval, plus the code benchmarks LiveCodeBench and HumanEval. The base models used are Qwen3-1.7B and Qwen3-4B. The repository also includes documentation pages on one-step off-policy training, fully asynchronous training, and rollout importance sampling.
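The buffer and selection ideas described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation. The names `Rollout`, `VersionedReplayBuffer`, and `select_low_variance` are hypothetical, and the selection rule used here (drop a version's batch when the variance of its importance ratios exceeds a threshold) is an assumption about what "guided by variance" could mean, not the paper's exact criterion.

```python
import math
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Rollout:
    version: int          # policy version that generated this sample
    logp_behavior: float  # sequence log-prob under the generating policy
    logp_current: float   # sequence log-prob under the current policy
    reward: float


class VersionedReplayBuffer:
    """Holds rollouts from only the most recent `max_versions` policy versions."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._by_version = {}

    def add(self, version, rollouts):
        self._by_version.setdefault(version, []).extend(rollouts)
        while len(self._by_version) > self.max_versions:
            del self._by_version[min(self._by_version)]  # evict the stalest version

    def by_version(self):
        return dict(self._by_version)


def importance_ratio(r):
    # How much more (or less) likely the current policy finds this sample.
    return math.exp(r.logp_current - r.logp_behavior)


def select_low_variance(buffer, max_ratio_variance):
    """ASSUMED rule: keep a version's batch only if its ratios are stable."""
    kept = {}
    for version, rollouts in buffer.by_version().items():
        if not rollouts:
            continue
        ratios = [importance_ratio(r) for r in rollouts]
        if pvariance(ratios) <= max_ratio_variance:
            kept[version] = rollouts
    return kept


buffer = VersionedReplayBuffer(max_versions=2)
buffer.add(1, [Rollout(1, -1.0, -3.0, 1.0)])
buffer.add(2, [Rollout(2, -1.0, -2.5, 1.0), Rollout(2, -1.0, -1.0, 0.0)])
buffer.add(3, [Rollout(3, -1.0, -1.0, 1.0), Rollout(3, -2.0, -2.1, 0.0)])

kept = select_low_variance(buffer, max_ratio_variance=0.05)
print(sorted(buffer.by_version()), sorted(kept))
```

In this toy run, version 1 is evicted because the buffer holds only two versions, and version 2 is filtered out because its ratios have drifted apart. The result stays keyed by version, so a later optimization step could treat each version's data as its own source rather than pooling everything.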

Copy-paste prompts

Prompt 1

Walk me through how D-ARL keeps a buffer of past model versions and replays their rollouts during training

Prompt 2

Explain the variance-guided sample selection step in D-ARL and how it decides which rollouts to keep

Prompt 3

Write a Slurm batch script that runs the D-ARL config on Qwen3-4B across two nodes

Prompt 4

Diff the baseline shell scripts against the D-ARL script in this repo and summarize what flags change

Prompt 5

Show how to plug a new benchmark into the D-ARL evaluation pipeline

Frequently asked questions

What is d-arl?

ICML 2026 paper code for asynchronous reinforcement learning of LLM reasoning, with a replay buffer, variance-guided sample selection, and per-version optimization on top of verl.

What language is d-arl written in?

Mainly Python. The stack also includes PyTorch, verl, Conda, and Slurm.

What license does d-arl use?

Apache 2.0 license: you can use, modify, and ship it commercially, provided you retain the license text and attribution notices. The license also includes an express patent grant.

How hard is d-arl to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is d-arl for?

Mainly researchers.

Verify against the repo before relying on details.