explaingit

yedaotian9/lite-opd

16 · Python · Audience: researcher · Complexity: 5/5 · Active · Setup: hard

TLDR

Research code for on-policy distillation that trains a smaller LLM to match a larger teacher model, fitting the full loop on a single GPU.

Mindmap

mindmap
  root((Lite-OPD))
    Inputs
      Student model
      Teacher model
      Prompt set
    Outputs
      Distilled student weights
      Training logs
      Benchmarks
    Use Cases
      Distill a smaller LLM
      Compare loss functions
      Study OPD code
    Tech Stack
      Python
      CUDA
      SGLang
      FlashInfer

Things people build with this

USE CASE 1

Distill a Qwen 3 or Llama 3 student model from a larger teacher on a single GPU.

USE CASE 2

Compare the three full-vocabulary loss functions on the same student and teacher pair.

USE CASE 3

Read the single-file training loop as a reference implementation of on-policy distillation.

USE CASE 4

Benchmark on-policy distillation against ms-swift on the H20 GPU configuration the README describes.

Tech stack

Python · CUDA · PyTorch · SGLang · FlashInfer

Getting it running

Difficulty: hard · Time to first run: 1 day+

Needs a GPU large enough to host student and teacher in one process, plus building CUDA and C kernels from source.

In plain English

Lite-OPD is a research codebase that helps a smaller language model learn from a larger one, a process the project calls on-policy distillation. The smaller student model writes answers to prompts in real time, the larger teacher model scores those answers, and each training step pushes the student to match the teacher's choices more closely. The README says the project supports models from the Qwen 2.5, Qwen 3, Llama 3.x, and Gemma 3 families, and three different loss functions over the full vocabulary.

The author's stated goal is to keep the code easy to read and modify, even at the cost of features. The training loop lives in a single file with no plugin system, no callbacks, and no deep configuration layers, so changing the behaviour means editing the code directly. The model that generates the responses and the model being trained are the same set of weights in the same process, so there is no copying step between writing answers and updating the model.

A training step does four things in order. First, the student generates several responses to a prompt using an embedded inference engine. Second, the teacher reads those responses and computes its own probabilities at every position in each response. Third, a KL divergence loss compares the two probability distributions, and the gradient is pushed back through the student in chunks so that peak memory does not grow with the response length. Finally, the optimizer updates the student's weights, and the inference engine sees the new weights immediately because they live in shared memory.

The codebase is described as roughly 9000 lines of Python plus 2100 lines of CUDA and C kernels. The full loop is reported to fit on a single GPU, which the author calls a low hardware barrier for lab work. The README also includes a benchmark against another framework, ms-swift, reporting modest speedups for the configurations tested on a pair of H20 GPUs.

The acknowledgements credit several open-source projects whose code the inference engine builds on: SGLang, mini-sglang, and FlashInfer.
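
The loss step described above can be illustrated with a small sketch. This is not Lite-OPD's code. It is a pure-Python toy with hypothetical names (`log_softmax`, `kl_per_position`, `chunked_kl_mean`) and a hypothetical `chunk_size` parameter. It assumes the loss direction is KL(student || teacher), which the summary does not specify. It shows only that a per-position, full-vocabulary KL can be accumulated chunk by chunk along the response and still give the same mean as processing the whole response at once.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_per_position(student_logits, teacher_logits):
    """KL(student || teacher) over the full vocabulary at one position.

    The direction is an assumption made for this sketch.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def chunked_kl_mean(student_seq, teacher_seq, chunk_size=2):
    """Mean per-position KL, accumulated one chunk of positions at a time."""
    total = 0.0
    n = len(student_seq)
    for start in range(0, n, chunk_size):
        end = start + chunk_size
        # Only this chunk's positions are processed in this iteration.
        pairs = zip(student_seq[start:end], teacher_seq[start:end])
        total += sum(kl_per_position(s, t) for s, t in pairs)
    return total / n

# Toy response: 3 positions, vocabulary of 3 tokens (made-up logits).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
teacher = [[1.5, 0.7, -0.5], [0.1, 0.2, 0.3], [3.0, 0.0, 0.0]]

loss = chunked_kl_mean(student, teacher, chunk_size=2)
full = sum(kl_per_position(s, t) for s, t in zip(student, teacher)) / len(student)
print(f"chunked={loss:.6f} full={full:.6f}")
```

In a real GPU implementation the memory saving would come from running the backward pass for each chunk before the next chunk's full-vocabulary logits are materialized. The toy above does not model that and only demonstrates the accumulation.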

Copy-paste prompts

Prompt 1
Set up Lite-OPD to distill a Qwen 2.5 1.5B student from a Qwen 2.5 7B teacher on one H100, and show me the exact command.
Prompt 2
Walk me through the four steps of a Lite-OPD training step and point to the line in the single-file loop where the KL divergence is computed.
Prompt 3
Swap the loss function in Lite-OPD from KL divergence to a reverse-KL variant and explain which lines to edit.
Prompt 4
Reproduce the Lite-OPD vs ms-swift benchmark on a pair of H20 GPUs and report tokens-per-second per step.
Prompt 5
Modify Lite-OPD to log per-position teacher entropy during training and write a small matplotlib plot script.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.