xiaoxuannlp/golongrl

26 · Python

TLDR

GoLongRL is the code release accompanying a research paper from Kwai-Klear about training language models to handle very long pieces of text.

In plain English

GoLongRL is the code release accompanying a research paper from Kwai-Klear on training language models to handle very long texts. The problem it targets is that a model asked to read a document hundreds of pages long often loses track of the content. The authors argue that existing reinforcement-learning fine-tuning recipes for long context focus almost entirely on tricky retrieval puzzles, such as multi-hop question chains, and ignore other things a reader has to do, such as summarizing, ranking items, or aggregating numbers across the document.

The project's first contribution is a training dataset of 23,000 samples covering nine task types: precise retrieval, comprehension, exhaustive retrieval, numerical reasoning, structured extraction, structured matching, graded ranking, sequence ordering, and summarization. Each task type is paired with its own natural scoring rule (such as exact match, F1, ROUGE-L, or pairwise comparison) rather than being squeezed into a single yes-or-no reward. The dataset is published on Hugging Face as Kwai-Klear/GoLongRL.

The second contribution is a training tweak called TMN-Reweight, where TMN is short for Task-Mixed Normalization. When a model is trained on many task types at once, their score scales differ, and the usual normalization step inside the GRPO reinforcement learning algorithm can confuse a hard prompt with a high-variance task type. TMN-Reweight normalizes within each reward-type group instead of globally, and adds a weight that pays more attention to prompts of medium difficulty. The README reports that this gives a small but steady improvement over vanilla GRPO in the authors' tests.

Two trained models are released, GoLongRL-4B and GoLongRL-30B-A3B. The 30B-A3B version is a mixture-of-experts model that, according to the paper's table, reaches an average score of 69.8 across six long-context benchmarks (DocMath, LongBench-V2, Frames, MRCR, CorpusQA, and LBV1-QA), which the authors describe as comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking despite a much smaller active parameter budget.

The repository contains the training code (built on the verl framework, running across 16 nodes of 8 GPUs each, with SGLang for asynchronous model serving) and the evaluation code (using a suite called QwenLong-Benchmarks that covers long-context tasks, general benchmarks like MMLU-Pro and AIME, and memory benchmarks). Example shell scripts launch training of Qwen3-4B with either plain GRPO or TMN-GRPO with difficulty reweighting.
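
To make the idea of per-task scoring rules concrete, here is a minimal sketch of what exact match, token-level F1, ROUGE-L, and pairwise comparison could look like as reward functions. It is not the repository's code: the function names, the SCORERS table, and the pairing of task types with rules are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def exact_match(pred, gold):
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def rouge_l(pred, gold):
    """ROUGE-L F-measure from the longest common subsequence of tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    # Standard LCS dynamic program, kept to a single rolling row.
    row = [0] * (len(g) + 1)
    for tok in p:
        prev_diag = 0
        for j, gtok in enumerate(g, start=1):
            cur = row[j]
            row[j] = prev_diag + 1 if tok == gtok else max(row[j], row[j - 1])
            prev_diag = cur
    lcs = row[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


def pairwise_agreement(pred_order, gold_order):
    """Fraction of gold item pairs that the prediction orders the same way."""
    pos = {item: i for i, item in enumerate(pred_order)}
    pairs = list(combinations(gold_order, 2))
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if a in pos and b in pos and pos[a] < pos[b])
    return agree / len(pairs)


# Hypothetical task-to-rule pairing; the real assignment lives in the
# repository and may differ.
SCORERS = {
    "precise_retrieval": exact_match,
    "comprehension": token_f1,
    "summarization": rouge_l,
    "sequence_ordering": pairwise_agreement,
}


def score(task_type, pred, gold):
    """Route a prediction to the scoring rule registered for its task type."""
    return SCORERS[task_type](pred, gold)
```

Under these assumptions, score("precise_retrieval", " Paris ", "paris") yields a binary reward, while the other three rules return graded values in [0, 1], which is where the mismatch in score scales comes from.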
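
The TMN-Reweight description above can also be sketched in code. The summary does not give the paper's exact formula, so everything specific here is an assumption: rewards are taken to lie in [0, 1], each reward is centered on its prompt's mean as in vanilla GRPO, the scale is a standard deviation pooled over all rollouts sharing a reward type, and the difficulty weight is the illustrative choice 4p(1-p), where p is the prompt's mean reward. The function name tmn_style_advantages and the rollout dictionary keys are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def tmn_style_advantages(rollouts, eps=1e-6):
    """Return one advantage per rollout.

    rollouts: list of dicts with keys 'prompt_id', 'reward_type', 'reward'.
    Assumed formulation for illustration, not the authors' implementation.
    """
    # 1. Per-prompt mean reward, the usual GRPO group baseline.
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt_id"]].append(r["reward"])
    prompt_mean = {pid: sum(v) / len(v) for pid, v in by_prompt.items()}

    # 2. Center each reward on its prompt mean, then pool the centered
    #    values by reward type to estimate one scale per type.
    centered = [r["reward"] - prompt_mean[r["prompt_id"]] for r in rollouts]
    by_type = defaultdict(list)
    for r, c in zip(rollouts, centered):
        by_type[r["reward_type"]].append(c)
    type_std = {
        t: sqrt(sum(c * c for c in cs) / len(cs)) for t, cs in by_type.items()
    }

    # 3. Difficulty weight (assumed form): 4p(1-p) peaks at p = 0.5 and
    #    vanishes for prompts that are always solved or never solved.
    advantages = []
    for r, c in zip(rollouts, centered):
        p = prompt_mean[r["prompt_id"]]
        weight = 4.0 * p * (1.0 - p)
        advantages.append(weight * c / (type_std[r["reward_type"]] + eps))
    return advantages
```

In this sketch a prompt whose rollouts all succeed gets zero advantage, and the divisor depends only on the reward type, so a high-variance scoring rule no longer looks like a hard prompt.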

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.