redai-infra/pipo

★ 22 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((PIPO))
    What it does
      Compress input token pairs
      Predict extra output tokens
      Confidence filtering
    Training
      Fine-tuning on teacher outputs
      Distillation 9B to 4B
      Math and coding tasks
    Tech stack
      Python
      SGLang inference
      ms-swift training
      Hugging Face
    Supported models
      Qwen 3.5 4B
      Qwen 3.5 9B
    Research
      arXiv paper
      Reasoning benchmarks


Things people build with this

USE CASE 1

Reproduce the PIPO paper's speed benchmarks on Qwen 3.5 models using the provided evaluation scripts.

USE CASE 2

Fine-tune a smaller 4B model with PIPO-style distillation from a 9B teacher on your own math or coding dataset.

USE CASE 3

Run PIPO inference with SGLang to get faster token generation on a supported Qwen 3.5 checkpoint.

USE CASE 4

Download pre-trained PIPO checkpoints from Hugging Face and merge weights for evaluation.

Tech stack

Python, SGLang, ms-swift, Hugging Face, PyTorch

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires downloading large Qwen 3.5 model weights from Hugging Face and significant GPU memory for both training stages. Only Qwen 3.5 backbones are currently supported.
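To put "significant GPU memory" in rough numbers, the sketch below estimates the memory taken by the model weights alone. It assumes nominal parameter counts of 4 and 9 billion and 2 bytes per parameter (bf16 or fp16). Both are assumptions rather than figures from the repository, and the helper name `weight_memory_gib` is made up for illustration. Activations, gradients, optimizer state, and the inference KV cache all come on top of this, so treat the result as a lower bound.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights only, in GiB.

    bytes_per_param=2 assumes bf16/fp16 storage (an assumption, not a
    setting taken from the PIPO repository).
    """
    return num_params * bytes_per_param / 1024**3


# Nominal sizes; the real parameter counts of the checkpoints may differ.
student_gib = weight_memory_gib(4e9)
teacher_gib = weight_memory_gib(9e9)

# During distillation the teacher and the student are both resident,
# so their weights add up before any training state is counted.
distillation_weights_gib = student_gib + teacher_gib
```

Under these assumptions the student needs about 7.5 GiB and the teacher about 16.8 GiB for weights, so the distillation stage starts from roughly 24 GiB before gradients and optimizer state are added.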

No license information was mentioned in the explanation.

In plain English

PIPO, short for Pair-In, Pair-Out, is a research project proposing a new approach to making large language model inference faster without giving up accuracy. It was developed jointly by Renmin University of China, Xiaohongshu, and other institutions, and the results are described in an arXiv paper.

The key idea pairs two operations that are normally developed separately. On the input side, the model compresses pairs of text tokens into a single internal representation, so it processes fewer units per step. On the output side, a secondary component predicts an extra token alongside each main prediction, so the model produces more text per forward pass. A small confidence module then decides whether each extra predicted token is reliable enough to keep, removing the need for the separate, expensive verification step that other approaches require.

Training follows two stages: a standard fine-tuning step using a teacher model's outputs, followed by a distillation phase where a larger 9B-parameter model guides a smaller 4B model on math and coding problems. The code builds on top of two existing open-source projects: SGLang for inference and ms-swift for training. The repository includes scripts to download checkpoints and datasets from Hugging Face, merge trained model weights, run evaluations on several reasoning benchmarks, and reproduce the training runs.

Experiments reported in the paper used Qwen 3.5 models at 4B and 9B parameter sizes. At the time of writing, only Qwen 3.5 backbones are supported, and a few known limitations exist around memory requirements during training and certain inference cache optimizations that are currently disabled in the PIPO inference path.
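The pair-in, pair-out idea described above can be illustrated with a toy sketch. This is not the repository's implementation. Mean pooling as the way to merge a token pair, the 0.9 confidence threshold, and every function name here (`compress_pairs`, `tokens_from_step`, `decode`, `toy_step`) are assumptions chosen for illustration. The stand-in "model" simply counts upward and reports low confidence whenever its main prediction is a multiple of 3.

```python
from typing import Callable, List, Sequence, Tuple


def compress_pairs(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pair-in: merge adjacent token embeddings two at a time.

    Mean pooling is an assumption; the real model learns its own merge.
    """
    merged = []
    for i in range(0, len(embeddings) - 1, 2):
        left, right = embeddings[i], embeddings[i + 1]
        merged.append([(a + b) / 2.0 for a, b in zip(left, right)])
    if len(embeddings) % 2 == 1:
        merged.append(list(embeddings[-1]))  # trailing token has no partner
    return merged


def tokens_from_step(main_token: int, extra_token: int,
                     confidence: float, threshold: float = 0.9) -> List[int]:
    """Pair-out: always keep the main token, keep the extra one only if trusted."""
    if confidence >= threshold:
        return [main_token, extra_token]
    return [main_token]


StepFn = Callable[[List[int]], Tuple[int, int, float]]


def decode(step_fn: StepFn, prompt: List[int], max_new_tokens: int,
           threshold: float = 0.9) -> Tuple[List[int], int]:
    """Generate tokens and count how many forward passes were needed."""
    output: List[int] = []
    passes = 0
    while len(output) < max_new_tokens:
        main_tok, extra_tok, conf = step_fn(prompt + output)
        passes += 1
        output.extend(tokens_from_step(main_tok, extra_tok, conf, threshold))
    return output[:max_new_tokens], passes


def toy_step(context: List[int]) -> Tuple[int, int, float]:
    """Hypothetical stand-in for one forward pass of a PIPO-style model."""
    nxt = context[-1] + 1
    confidence = 0.95 if nxt % 3 else 0.5  # unsure on multiples of 3
    return nxt, nxt + 1, confidence


pooled = compress_pairs([[1.0, 3.0], [3.0, 5.0], [10.0, 0.0]])
tokens, passes = decode(toy_step, prompt=[0], max_new_tokens=6)
```

In this toy run, three embeddings are pooled into two units, and six tokens are produced in four forward passes instead of six, because the extra token is accepted in two of the steps. The accept-or-reject decision uses only the confidence score, with no second verification pass.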

Copy-paste prompts

Prompt 1

I want to reproduce PIPO inference speedup on a Qwen 3.5 4B model. Walk me through downloading the checkpoint from Hugging Face, merging weights, and running the SGLang-based inference server.

Prompt 2

Show me how to run the PIPO training pipeline: first the fine-tuning stage with a teacher model, then the distillation stage using the 9B Qwen model as teacher and 4B as student.

Prompt 3

How does PIPO's confidence module decide whether to keep or discard the extra predicted token? Explain the mechanism and where it is implemented in the code.

Prompt 4

What evaluation benchmarks does PIPO support and how do I run them to compare PIPO vs baseline inference speed and accuracy?

Verify against the repo before relying on details.