xiaoxuannlp/golongrl

Analysis updated 2026-06-24

★ 26 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((GoLongRL))
    Inputs
      23k sample dataset
      Long documents
      Qwen3 base models
    Outputs
      GoLongRL-4B model
      GoLongRL-30B-A3B MoE
      Benchmark scores
    Use Cases
      Train long-context LLMs
      Reproduce paper results
      Benchmark long-context tasks
    Tech Stack
      Python
      verl
      SGLang
      GRPO
    Task Types
      Retrieval
      Summarization
      Numerical reasoning
      Ranking


What do people build with it?

USE CASE 1

Reproduce GoLongRL training across 16 nodes of 8 GPUs

USE CASE 2

Fine-tune a different base model on the 23k Kwai-Klear/GoLongRL dataset

USE CASE 3

Try TMN-Reweight normalization on top of vanilla GRPO

USE CASE 4

Evaluate any long-context model on QwenLong-Benchmarks

What is it built with?

Python, PyTorch, verl, SGLang, GRPO

How does it compare?

	xiaoxuannlp/golongrl	alicankiraz1/gemma-4-31b-mtp-vllm-server	chrisjohnson89/comfyui-neuralbooru
Stars	26	26	26
Language	Python	Python	Python
Setup difficulty	hard	hard	hard
Complexity	5/5	4/5	3/5
Audience	researcher	ops devops	vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

The full training run expects 16 nodes of 8 GPUs each (128 GPUs in total) with verl and SGLang, so reproducing the paper requires access to a large cluster.

In plain English

GoLongRL is the code release accompanying a research paper from Kwai-Klear about training language models to handle very long pieces of text. The problem they target is that when an AI model is asked to read a document of hundreds of pages, it often loses track. Existing fine-tuning recipes that use reinforcement learning for long context, the authors argue, focus almost entirely on tricky retrieval puzzles, like multi-hop question chains, while ignoring other things a reader has to do, such as summarizing, ranking items, or aggregating numbers across the document.

The project's first contribution is a training dataset of 23,000 samples covering nine different task types: precise retrieval, comprehension, exhaustive retrieval, numerical reasoning, structured extraction, structured matching, graded ranking, sequence ordering, and summarization. Crucially, each task type is paired with its own natural scoring rule (such as exact match, F1, ROUGE-L, or pairwise comparison) rather than being squeezed into a single yes-or-no reward. The dataset is published on Hugging Face under the name Kwai-Klear/GoLongRL.

The second contribution is a training tweak called TMN-Reweight, short for Task-Mixed Normalization. When a model is trained on so many different task types at once, the score scales differ, and the usual normalization step inside the GRPO reinforcement learning algorithm can confuse a hard prompt with a high-variance task type. TMN-Reweight normalizes inside each reward-type group instead of globally, and adds a weight that pays more attention to prompts of medium difficulty. The README reports this gives a small but steady improvement over vanilla GRPO in the authors' tests.

Two trained models are released, GoLongRL-4B and GoLongRL-30B-A3B. The 30B-A3B version is a mixture-of-experts model that, according to the paper's table, reaches an average score of 69.8 across six long-context benchmarks including DocMath, LongBench-V2, Frames, MRCR, CorpusQA, and LBV1-QA, which the authors describe as comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking despite using a much smaller active parameter budget.

The repository contains the training code (built on top of the verl framework, running across 16 nodes of 8 GPUs with SGLang for asynchronous model serving) and the evaluation code (using a suite called QwenLong-Benchmarks that covers long-context tasks, general benchmarks like MMLU-Pro and AIME, and memory benchmarks). Example shell scripts launch training of Qwen3-4B with either plain GRPO or with TMN-GRPO and difficulty reweighting.
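To make the TMN-Reweight idea more concrete, here is a minimal Python sketch of the difference between per-prompt normalization and normalization inside a reward-type group. It is not the repository's code. The function names, the rollout dictionary layout, the pooled-standard-deviation scaling, and the `4 * p * (1 - p)` difficulty weight are all illustrative assumptions, and rewards are assumed to lie in [0, 1]. The actual formula should be checked against the paper and the repo.

```python
from collections import defaultdict
from statistics import mean, pstdev

EPS = 1e-6  # avoids division by zero when all rewards in a group are equal


def vanilla_grpo_advantages(rollouts):
    """GRPO-style advantage: (reward - prompt mean) / prompt std, per prompt."""
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt"]].append(r["reward"])
    return [
        (r["reward"] - mean(by_prompt[r["prompt"]]))
        / (pstdev(by_prompt[r["prompt"]]) + EPS)
        for r in rollouts
    ]


def group_normalized_advantages(rollouts):
    """Hypothetical TMN-like variant (an assumption, not the authors' formula).

    Rewards are centred on their prompt mean, scaled by a standard deviation
    pooled over every prompt that shares the same reward type, and multiplied
    by a weight that peaks for prompts of medium difficulty.
    """
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt"]].append(r["reward"])

    # Pool the centred rewards of all prompts that use the same scoring rule.
    centred_by_type = defaultdict(list)
    for r in rollouts:
        centred = r["reward"] - mean(by_prompt[r["prompt"]])
        centred_by_type[r["reward_type"]].append(centred)
    scale = {t: pstdev(v) + EPS for t, v in centred_by_type.items()}

    advantages = []
    for r in rollouts:
        p = mean(by_prompt[r["prompt"]])  # mean reward, used as a difficulty proxy
        weight = 4.0 * p * (1.0 - p)      # assumed weight: 1 at p=0.5, 0 at p=0 or 1
        advantages.append(weight * (r["reward"] - p) / scale[r["reward_type"]])
    return advantages


def make_rollouts(prompt, reward_type, rewards):
    """Build one rollout record per sampled response (hypothetical layout)."""
    return [
        {"prompt": prompt, "reward_type": reward_type, "reward": x} for x in rewards
    ]


rollouts = (
    make_rollouts("a", "exact_match", [1.0, 0.0, 0.0, 0.0])
    + make_rollouts("b", "exact_match", [1.0, 1.0, 1.0, 0.0])
    + make_rollouts("c", "rouge_l", [0.42, 0.40, 0.38, 0.40])  # nearly identical scores
    + make_rollouts("d", "rouge_l", [0.90, 0.10, 0.50, 0.50])  # widely spread scores
)

vanilla = vanilla_grpo_advantages(rollouts)
grouped = group_normalized_advantages(rollouts)

# Compare the advantage of the first rollout of prompt "c" under both schemes.
print(round(vanilla[8], 3), round(grouped[8], 3))
```

In this toy data, prompt "c" has four nearly identical ROUGE-L scores. Per-prompt normalization divides by that prompt's tiny spread and inflates a 0.02 difference into a large advantage, whereas scaling by the spread of the whole ROUGE-L group keeps it small.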

Copy-paste prompts

Prompt 1

Walk me through how TMN-Reweight in GoLongRL differs from vanilla GRPO normalization with a worked example

Prompt 2

Set up a single-node debug run of GoLongRL training on Qwen3-4B and list every config I need to change

Prompt 3

Explain the nine task types in the GoLongRL dataset and which reward function each one uses

Prompt 4

Compare GoLongRL-30B-A3B against DeepSeek-R1-0528 across DocMath, LongBench-V2, Frames, MRCR, and CorpusQA

Prompt 5

Show me how to plug a new task type into GoLongRL with its own scoring rule and reward group

Frequently asked questions

What is golongrl?

Code, dataset, and two model checkpoints for training language models on long documents across nine task types using a reweighted reinforcement learning recipe called TMN-GRPO.

What language is golongrl written in?

Mainly Python. The stack also includes PyTorch and verl.

How hard is golongrl to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is golongrl for?

Mainly researchers.


Verify against the repo before relying on details.