Distill a Qwen 3 or Llama 3 student model from a larger teacher on a single GPU.
Compare the three full-vocabulary loss functions on the same student and teacher pair.
Read the single-file training loop as a reference implementation of on-policy distillation.
Benchmark on-policy distillation against ms-swift on the H20 GPU configuration the README describes.
Requires a GPU large enough to host both the student and the teacher in one process, and the CUDA and C kernels must be built from source.
Lite-OPD is a research codebase that helps a smaller language model learn from a larger one, a process the project calls on-policy distillation. The smaller student model writes its own answers to prompts during training, the larger teacher model scores those answers, and each training step pushes the student to match the teacher's choices more closely. The README says the project supports models from the Qwen 2.5, Qwen 3, Llama 3.x, and Gemma 3 families, and three different loss functions over the full vocabulary.
The author's stated goal is to keep the code easy to read and modify, even at the cost of features. The training loop lives in a single file with no plugin system, no callbacks, and no deep configuration layers, so changing the behaviour means editing the code directly. The model that generates the responses and the model being trained are the same set of weights in the same process, so there is no copying step between writing answers and updating the model.
A training step does four things in order. First, the student generates several responses to a prompt using an embedded inference engine. Second, the teacher reads those responses and computes its own probabilities at every position in the response. Third, a KL divergence loss compares the two probability distributions, and the gradient is pushed back through the student in chunks so that peak memory does not grow with the response length. Fourth, the optimizer updates the student's weights, and the inference engine sees the new weights immediately because they live in shared memory.
The codebase is described as roughly 9000 lines of Python plus 2100 lines of CUDA and C kernels. The full loop is reported to fit on a single GPU, which the author calls a low hardware barrier for lab work. The README also includes a benchmark against another framework called ms-swift, reporting modest speedups for the configurations tested on a pair of H20 GPUs.
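The loss step of the training loop described above, a KL divergence over the full vocabulary accumulated chunk by chunk, can be illustrated with a minimal pure-Python sketch. This is not Lite-OPD's code. The function names (`log_softmax`, `reverse_kl`, `chunked_mean_kl`) are hypothetical, plain lists stand in for GPU tensors, and the direction KL(student || teacher) is an assumption, since the summary does not name the three losses the project implements. The sketch covers only the forward loss; the real loop would also run the backward pass chunk by chunk.

```python
import math
from itertools import islice


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position, summed over the full vocabulary.

    The direction is an assumption made for this sketch.
    """
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))


def chunked_mean_kl(student_rows, teacher_rows, chunk_size=2):
    """Mean per-position KL, consuming positions chunk_size at a time.

    The inputs may be lazy iterators, so only one chunk of full-vocabulary
    rows needs to exist in memory at any moment.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    pairs = zip(student_rows, teacher_rows)
    total, count = 0.0, 0
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            break
        total += sum(reverse_kl(s, t) for s, t in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no positions to score")
    return total / count


if __name__ == "__main__":
    # Toy logits: 3 response positions, vocabulary of 3 tokens.
    student = [[1.0, 2.0, 0.5], [0.1, 0.2, 0.3], [2.0, -1.0, 0.0]]
    teacher = [[0.5, 2.5, 0.0], [0.3, 0.2, 0.1], [1.0, 1.0, 1.0]]
    print(chunked_mean_kl(student, teacher, chunk_size=2))
```

Because the accumulated value is a plain sum over positions, the result does not depend on `chunk_size`; the chunk size changes only how many full-vocabulary rows are alive at once, which is the property that keeps peak memory from growing with response length.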
The acknowledgements credit several open-source projects whose code the inference engine builds on: SGLang, mini-sglang, and FlashInfer.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.