yedaotian9/lite-opd

Analysis updated 2026-06-24

★ 16 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((Lite-OPD))
    Inputs
      Student model
      Teacher model
      Prompt set
    Outputs
      Distilled student weights
      Training logs
      Benchmarks
    Use Cases
      Distill a smaller LLM
      Compare loss functions
      Study OPD code
    Tech Stack
      Python
      CUDA
      SGLang
      FlashInfer

What do people build with it?

USE CASE 1

Distill a Qwen 3 or Llama 3 student model from a larger teacher on a single GPU.

USE CASE 2

Compare the three full-vocabulary loss functions on the same student and teacher pair.

USE CASE 3

Read the single-file training loop as a reference implementation of on-policy distillation.

USE CASE 4

Benchmark on-policy distillation against ms-swift on the H20 GPU configuration the README describes.

What is it built with?

Python, CUDA, PyTorch, SGLang, FlashInfer

How does it compare?

	yedaotian9/lite-opd	adya84/ha-world-cup-2026	afk-surf/safeclipper
Stars	16	16	16
Language	Python	Python	Python
Setup difficulty	hard	easy	moderate
Complexity	5/5	2/5	3/5
Audience	researcher	general	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a GPU large enough to host student and teacher in one process, plus building CUDA and C kernels from source.

In plain English

Lite-OPD is research code that helps a smaller language model learn from a larger one, a process the project calls on-policy distillation. The smaller student model writes answers to prompts in real time, the larger teacher model then scores those answers, and training pushes the student to match the teacher's choices more closely at each step. The README says the project supports models from the Qwen 2.5, Qwen 3, Llama 3.x, and Gemma 3 families, and three different loss functions over the full vocabulary.

The author's stated goal is to keep the code easy to read and modify, even at the cost of features. The training loop lives in a single file with no plugin system, no callbacks, and no deep configuration layers, so changing the behaviour means editing the code directly. The model that generates the responses and the model being trained are the same set of weights in the same process, so there is no copying step between writing answers and updating the model.

A training step does four things in order. The student first generates several responses to a prompt using an embedded inference engine. The teacher then reads those responses and computes its own probabilities at every position in the response. A KL divergence loss compares the two probability distributions, and the gradient is pushed back through the student in chunks so that peak memory does not grow with the response length. Finally, the optimizer updates the student's weights, and the inference engine sees the new weights immediately because they live in shared memory.

The codebase is described as roughly 9000 lines of Python plus 2100 lines of CUDA and C kernels. The full loop is reported to fit on a single GPU, which the author calls a low hardware barrier for lab work. The README also includes a benchmark against another framework called ms-swift, reporting modest speedups for the configurations tested on a pair of H20 GPUs.

The acknowledgements credit several open-source projects whose code the inference engine builds on: SGLang, mini-sglang, and FlashInfer.
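
The KL comparison in the training step described above can be illustrated with a small, dependency-free sketch. This is not Lite-OPD's code: the function names (`log_softmax`, `kl_divergence`, `chunked_mean_kl`), the toy logits, and the choice of forward versus reverse KL are all assumptions made for illustration, since the summary does not name the three loss functions. The sketch only shows the arithmetic of a full-vocabulary KL averaged over response positions, visited one chunk at a time. The real project does this on the GPU and chunks the backward pass as well.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's full vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(log_p, log_q):
    """KL(p || q) for two log-probability vectors over the same vocabulary."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def chunked_mean_kl(student_logits, teacher_logits, chunk_size=2, reverse=True):
    """Mean per-position KL, processing the response in chunks of positions.

    reverse=True  computes KL(student || teacher)  (assumed variant)
    reverse=False computes KL(teacher || student)  (assumed variant)
    Only one chunk of positions is handled at a time, so the working set
    depends on chunk_size rather than on the full response length.
    """
    total = 0.0
    n = len(student_logits)
    for start in range(0, n, chunk_size):
        s_chunk = student_logits[start:start + chunk_size]
        t_chunk = teacher_logits[start:start + chunk_size]
        for s_row, t_row in zip(s_chunk, t_chunk):
            log_s, log_t = log_softmax(s_row), log_softmax(t_row)
            if reverse:
                total += kl_divergence(log_s, log_t)
            else:
                total += kl_divergence(log_t, log_s)
    return total / n

# Toy example: 3 response positions, vocabulary of 4 tokens (hypothetical values).
student = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 3.0, 0.2, 0.2]]
teacher = [[1.5, 0.7, 0.0, -0.5], [0.3, -0.2, 0.1, 0.0], [0.5, 2.5, 0.5, 0.0]]
loss = chunked_mean_kl(student, teacher, chunk_size=2)
```

Changing `chunk_size` does not change the result, only how many positions are processed together, which is the property that lets a chunked implementation keep peak memory independent of response length.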

Copy-paste prompts

Prompt 1

Set up Lite-OPD to distill a Qwen 2.5 1.5B student from a Qwen 2.5 7B teacher on one H100, and show me the exact command.

Prompt 2

Walk me through the four steps of a Lite-OPD training step and point to the line in the single-file loop where the KL divergence is computed.

Prompt 3

Swap the loss function in Lite-OPD from KL divergence to a reverse-KL variant and explain which lines to edit.

Prompt 4

Reproduce the Lite-OPD vs ms-swift benchmark on a pair of H20 GPUs and report tokens-per-second per step.

Prompt 5

Modify Lite-OPD to log per-position teacher entropy during training and write a small matplotlib plot script.

Frequently asked questions

What is lite-opd?

Research code for on-policy distillation that trains a smaller LLM to match a larger teacher model, fitting the full loop on a single GPU.

What language is lite-opd written in?

Mainly Python. The stack also includes CUDA, PyTorch, SGLang, and FlashInfer.

How hard is lite-opd to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is lite-opd for?

Mainly researchers.

Verify against the repo before relying on details.