explaingit

yinqibai962/d-arl

24 Python Audience · researcher Complexity · 5/5 Active License Setup · hard

TLDR

ICML 2026 paper code for asynchronous reinforcement learning of LLM reasoning, with a replay buffer, variance-guided sample selection, and per-version optimization on top of verl.

Mindmap

mindmap
  root((D-ARL))
    Inputs
      Math benchmarks
      Code benchmarks
      Qwen3 base models
      Shell config scripts
    Outputs
      Trained reasoning models
      Eval scores
      Training logs
    Use Cases
      Async RL research
      Reasoning model training
      Slurm cluster runs
      Baseline ablations
    Tech Stack
      Python
      PyTorch
      verl
      Conda
      Slurm

Things people build with this

USE CASE 1

Reproduce D-ARL results on AIME, MATH-500, and LiveCodeBench

USE CASE 2

Compare asynchronous RL with and without the variance-guided sample filter

USE CASE 3

Launch D-ARL training jobs on an existing Slurm cluster

USE CASE 4

Adapt the per-version optimization trick to a different verl-based RL pipeline

Tech stack

Python, PyTorch, verl, Conda, Slurm

Getting it running

Difficulty · hard Time to first run · 1 day+

Needs a conda environment, the verl framework, multiple GPUs, and ideally a Slurm cluster to reproduce paper-scale runs on Qwen3 models.

Apache 2.0 license, so you can use, modify, and ship it commercially as long as you keep the license and copyright notices. The license also includes a patent grant.

In plain English

D-ARL is the source code for a research paper accepted to ICML 2026. The work is about training large language models to reason better using reinforcement learning: the model practices solving problems, is rewarded when it does well, and updates itself based on those rewards.

The paper tackles a specific problem in a faster style of training called asynchronous reinforcement learning. In this style, one part of the system keeps generating new practice attempts (rollouts) while another part keeps updating the model, so neither has to wait for the other. The catch is that by the time an attempt is used for an update, the model that generated it is already out of date. The data no longer matches the current model, and training can become unstable.

D-ARL offers three pieces of machinery to address this. It keeps a replay buffer holding data from the most recent past versions of the model and replays it. It then picks out the samples that still line up well with the current model, guided by a variance measure. Finally, it uses an optimization method that treats data from different past model versions as distinct sources instead of pooling them together.

The code is built on verl, an existing open-source framework from Volcano Engine that already handles much of the plumbing for training language models with reinforcement learning. To use D-ARL you create a conda environment, install the requirements, and run one of the provided shell scripts. There are scripts for the D-ARL configuration as well as baseline scripts that turn the new features off, so you can compare results. A separate script is included for launching jobs on a Slurm cluster.

The paper tests the method on six public reasoning benchmarks: grade-school math, the AIME competition, MATH-500, and LightEval, plus the code benchmarks LiveCodeBench and HumanEval. The base models used are Qwen3-1.7B and Qwen3-4B. The repository also includes documentation pages on one-step off-policy training, fully asynchronous training, and rollout importance sampling.
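
The three mechanisms described above (a buffer of recent model versions, variance-guided selection, and per-version optimization) can be illustrated with a toy sketch. This is not the repository's code and not the paper's exact rule. Every name here (`VersionedReplayBuffer`, `select_versions`, `per_version_objective`, the sample fields, and the threshold) is hypothetical. The selection criterion, which keeps a version only when the variance of its importance ratios is small, and the equal weighting of versions in the objective are both assumptions made for illustration.

```python
import math
from collections import deque
from statistics import pvariance


class VersionedReplayBuffer:
    """Holds rollouts from only the most recent `max_versions` policy versions."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._batches = deque()  # (version, samples) pairs, oldest first

    def add(self, version, samples):
        self._batches.append((version, list(samples)))
        while len(self._batches) > self.max_versions:
            self._batches.popleft()  # evict the stalest version

    def by_version(self):
        return {version: samples for version, samples in self._batches}


def importance_ratio(sample):
    # Ratio of the current policy's probability to the generating policy's.
    return math.exp(sample["current_logp"] - sample["behavior_logp"])


def select_versions(groups, max_ratio_variance):
    """Assumed criterion: keep a version if its ratio variance is small."""
    kept = {}
    for version, samples in groups.items():
        ratios = [importance_ratio(s) for s in samples]
        if pvariance(ratios) <= max_ratio_variance:
            kept[version] = samples
    return kept


def per_version_objective(groups):
    """Average a surrogate objective within each version, then across versions,
    so that no single version dominates by sheer sample count (assumption)."""
    per_version = []
    for samples in groups.values():
        terms = [importance_ratio(s) * s["advantage"] for s in samples]
        per_version.append(sum(terms) / len(terms))
    return sum(per_version) / len(per_version) if per_version else 0.0


# Toy data: version 3 matches the current policy, version 2 has drifted.
buffer = VersionedReplayBuffer(max_versions=2)
buffer.add(1, [{"behavior_logp": -1.0, "current_logp": -2.0, "advantage": 1.0}])
buffer.add(2, [
    {"behavior_logp": -1.0, "current_logp": -1.0, "advantage": 1.0},
    {"behavior_logp": -1.0, "current_logp": -3.0, "advantage": 1.0},
])
buffer.add(3, [
    {"behavior_logp": -1.0, "current_logp": -1.0, "advantage": 1.0},
    {"behavior_logp": -1.0, "current_logp": -1.0, "advantage": -0.5},
])

kept = select_versions(buffer.by_version(), max_ratio_variance=0.05)
objective = per_version_objective(kept)
print(sorted(buffer.by_version()), list(kept), objective)
```

In this toy run the buffer evicts version 1 once version 3 arrives, the filter drops version 2 because its importance ratios disagree with each other, and the objective is computed from version 3 alone. A real pipeline would work on token-level log-probabilities from the training framework. Check the repository's scripts and documentation for the actual selection rule and loss.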

Copy-paste prompts

Prompt 1
Walk me through how D-ARL keeps a buffer of past model versions and replays their rollouts during training
Prompt 2
Explain the variance-guided sample selection step in D-ARL and how it decides which rollouts to keep
Prompt 3
Write a Slurm batch script that runs the D-ARL config on Qwen3-4B across two nodes
Prompt 4
Diff the baseline shell scripts against the D-ARL script in this repo and summarize what flags change
Prompt 5
Show how to plug a new benchmark into the D-ARL evaluation pipeline

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.