Reproduce D-ARL results on AIME, MATH-500, and LiveCodeBench
Compare asynchronous RL with and without the variance-guided sample filter
Launch D-ARL training jobs on an existing Slurm cluster
Adapt the per-version optimization trick to a different verl-based RL pipeline
Needs a conda environment, the verl framework, multiple GPUs, and ideally a Slurm cluster to reproduce paper-scale runs on Qwen3 models.
D-ARL is the source code for a research paper accepted to the ICML 2026 conference. The work is about training large language models to reason better using reinforcement learning: the model attempts problems, is rewarded when it does well, and updates itself based on those rewards.

The paper tackles a specific problem in a faster style of training called asynchronous reinforcement learning. In this style, one part of the system keeps generating new sample solutions while another part keeps updating the model, so neither has to wait for the other. The catch is that by the time a sample is used for an update, the model that generated it is already out of date. The data no longer matches the current model, and training can become unstable.

D-ARL offers three pieces of machinery to address this. It keeps a memory buffer of the most recent past versions of the model and replays their data. It then picks out the samples that still line up well with the current model, guided by a variance measure. Finally, it uses an optimization method that treats data from different past model versions as distinct sources instead of pooling them.

The code is built on verl, an existing open-source framework from Volcano Engine that already handles much of the plumbing for training language models with reinforcement learning. To use D-ARL you create a Python conda environment, install the requirements, and run one of the provided shell scripts. There are scripts for the D-ARL configuration as well as baseline scripts that turn the new features off, so you can compare results. A separate script is included for launching jobs on a Slurm cluster.

The paper tests the method on six public reasoning benchmarks: grade-school math, the AIME competition, MATH-500, and LightEval, plus the code benchmarks LiveCodeBench and HumanEval. The base models used are Qwen3-1.7B and Qwen3-4B. The repository also includes documentation pages on one-step off-policy training, fully asynchronous training, and rollout importance sampling.
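
The three mechanisms described above (a buffer of recent model versions, a variance-guided sample filter, and per-version handling of data) can be illustrated with a minimal Python sketch. This is not the repository's implementation. The class and function names, the sample field names `current_logprobs` and `behavior_logprobs`, and the choice of variance of per-token log-ratios as the filtering statistic are all assumptions made for illustration; this summary does not say which variance D-ARL actually uses.

```python
from statistics import pvariance


class VersionedBuffer:
    """Hypothetical buffer keeping samples from the newest `max_versions` model versions."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._by_version = {}  # model version -> list of samples

    def add(self, version, sample):
        self._by_version.setdefault(version, []).append(sample)
        # Evict the oldest version(s) once too many are stored.
        while len(self._by_version) > self.max_versions:
            del self._by_version[min(self._by_version)]

    def groups(self):
        """Return a copy of the stored samples, keyed by version in ascending order."""
        return {v: list(self._by_version[v]) for v in sorted(self._by_version)}


def log_ratio_variance(sample):
    """Variance of per-token log-ratios between current and generating model.

    Assumed statistic: a low value suggests the sample still matches the
    current model reasonably well.
    """
    ratios = [
        cur - old
        for cur, old in zip(sample["current_logprobs"], sample["behavior_logprobs"])
    ]
    return pvariance(ratios)


def filter_by_variance(samples, max_variance):
    """Keep only samples whose log-ratio variance is at most `max_variance`."""
    return [s for s in samples if log_ratio_variance(s) <= max_variance]


def per_version_batches(buffer, max_variance):
    """Yield (version, kept_samples) so each version is treated as its own source."""
    for version, samples in buffer.groups().items():
        kept = filter_by_variance(samples, max_variance)
        if kept:
            yield version, kept
```

Under these assumptions, a training loop would iterate over `per_version_batches(...)` and compute an update per version rather than concatenating all buffered samples into a single batch. How the per-version updates are combined is left out here because the summary does not describe it.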
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.