explaingit

zunhaisu/oscar-kv-quant

23 | C++ | Audience · researcher | Complexity · 5/5 | Active | Setup · hard

TLDR

Research code for OScaR, a 2-bit KV cache quantization method for LLMs that uses a rotation plus token scaling fix to match 16-bit accuracy without retraining.

Mindmap

mindmap
  root((OScaR))
    Inputs
      LLM weights
      KV cache tensors
      Benchmark sets
    Outputs
      2 bit KV cache
      Decoded tokens
      Throughput metrics
    Use Cases
      Shrink long context memory
      Speed up decoding
      Run multi modal LLMs cheaper
    Tech Stack
      C++
      CUDA
      PyTorch
      CUTLASS

Things people build with this

USE CASE 1

Quantize the KV cache of Llama 3.1 8B or Qwen3 8B to 2 bits with near 16-bit accuracy

USE CASE 2

Reproduce the LongBench-E, OCRBench, and MMAU-Pro numbers from the OScaR paper

USE CASE 3

Cut LLM serving memory by roughly 5x on long context workloads

USE CASE 4

Apply Canalized Rotation and Omni-Token Scaling to a multi modal model like Qwen3-VL
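
The "roughly 5x" memory figure in use case 3 can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128) and a hypothetical 2-bit layout that stores one 16-bit scale and one 16-bit zero point per group of 32 values. Neither the group size nor the metadata format comes from this page, so treat the result as illustrative, not as OScaR's actual packing.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes needed to hold keys and values across all layers."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = keys + values
    return values * bits_per_value / 8


def effective_bits(code_bits, group_size, metadata_bits_per_group):
    """Bits per stored value once per-group metadata is amortized."""
    return code_bits + metadata_bits_per_group / group_size


# Assumed model shape (Llama 3.1 8B) and a 128K-token context.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
n_tokens = 128 * 1024

bf16_bytes = kv_cache_bytes(**shape, n_tokens=n_tokens, bits_per_value=16)

# Hypothetical layout: 2-bit codes plus 32 bits of scale/zero-point per 32 values.
q_bits = effective_bits(code_bits=2, group_size=32, metadata_bits_per_group=32)
q_bytes = kv_cache_bytes(**shape, n_tokens=n_tokens, bits_per_value=q_bits)

print(f"BF16 cache:        {bf16_bytes / 2**30:.1f} GiB")
print(f"2-bit + metadata:  {q_bytes / 2**30:.1f} GiB")
print(f"reduction:         {bf16_bytes / q_bytes:.2f}x")
```

Under these assumptions the cache for a single 128K-token sequence shrinks from 16 GiB to 3 GiB, a reduction of about 5.3x. The point is that scale and zero-point metadata keep a 2-bit cache well short of the naive 8x; the match with the reported figure does not confirm the layout the repository actually uses.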

Tech stack

C++, CUDA, PyTorch, Python, CUTLASS

Getting it running

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA 12.4 GPU stack, custom kernels on top of BitDecoding and HadaCore, plus a CUTLASS submodule and flash-attn 2.8.3.
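
Before building the custom kernels, it can save time to confirm that the Python side of the environment matches the versions listed above. The following is a minimal sketch using only the standard library; the distribution names `torch` and `flash-attn` and the helper names `version_matches` and `check_package` are assumptions made for illustration and are not part of the repository.

```python
import sys
from importlib import metadata

# Version prefixes taken from the requirements listed on this page.
EXPECTED = {"torch": "2.6", "flash-attn": "2.8.3"}


def version_matches(found, prefix):
    """True if `found` equals `prefix` or extends it (e.g. 2.6.0+cu124 for 2.6)."""
    return found == prefix or found.startswith((prefix + ".", prefix + "+"))


def check_package(name, prefix):
    """Return (status, installed_version_or_None) for one distribution."""
    try:
        found = metadata.version(name)
    except metadata.PackageNotFoundError:
        return "missing", None
    return ("ok" if version_matches(found, prefix) else "mismatch"), found


if sys.version_info[:2] != (3, 10):
    print(f"warning: running Python {sys.version_info.major}.{sys.version_info.minor}, page lists 3.10")

for name, prefix in EXPECTED.items():
    status, found = check_package(name, prefix)
    print(f"{name}: {status} (found {found}, expected {prefix})")
```

This only inspects installed package metadata. It does not verify the CUDA toolkit, the GPU driver, or the CUTLASS submodule, all of which still need to be checked against the repository's own instructions.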

In plain English

OScaR is a research project from teams at Tsinghua, Hong Kong University, Meituan's LongCat team, and Edinburgh that tries to shrink the memory cost of running large language models without hurting their accuracy. When a large language model reads a long input or holds a long conversation, it stores intermediate values called the KV cache. That cache grows with the length of the input and can dominate memory use. The standard trick is quantization, which stores those values in fewer bits, for example 2 bits per number instead of 16. The trade-off is that aggressive quantization usually loses accuracy.

The paper, posted on arXiv on May 20, 2026, identifies what the authors see as the real reason 2-bit quantization tends to fail. They call it Token Norm Imbalance, or TNI: across the tokens in a sequence, some have much larger magnitudes than the others, and the usual per-channel quantization scheme cannot represent both groups well at once. In plain language, a few outlier tokens drag down the precision of the rest. The authors show that TNI appears in text-only models, in multimodal models, and in omni-modal models that handle audio and images alongside text.

OScaR, short for Omni-Scaled Canalized Rotation, is the proposed fix. It is described as following Occam's razor: it does only two things, a Canalized Rotation step and an Omni-Token Scaling step. It needs neither extra training data nor a calibration pass on real samples. The reported results put 2-bit OScaR roughly even with the original 16-bit baseline on LongBench-E, OCRBench, and the MMAU-Pro audio benchmark, across models like Llama 3.1 8B, Qwen3 8B, Qwen3-VL, and Qwen3-Omni 30B.

On the implementation side, the authors built custom CUDA kernels on top of BitDecoding and HadaCore, and report 3.0x faster decoding, 5.3x lower memory use, and 4.1x higher throughput against a BF16 FlashDecoding-v2 baseline. The repository ships the code, an evaluation suite, and an installation flow built around the uv package manager, Python 3.10, PyTorch 2.6 with CUDA 12.4, and flash-attn 2.8.3, plus a CUTLASS submodule. Integration with the vLLM and SGLang serving frameworks is listed as upcoming.
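
The Token Norm Imbalance effect described above can be reproduced on synthetic data. The toy sketch below quantizes a small key matrix to 2 bits with one min/max range per channel, first as-is and then after dividing each token by its own L2 norm. It is a generic illustration only: the per-token normalization stands in for the idea of token scaling and is not the paper's Omni-Token Scaling, the rotation step is omitted entirely, and the matrix sizes and the 50x outlier factor are arbitrary assumptions.

```python
import math
import random


def quantize_per_channel(matrix, bits=2):
    """Uniform min/max quantization with one range per channel (column).

    Returns dequantized values so the error can be measured directly.
    """
    levels = (1 << bits) - 1  # 2 bits -> 4 codes -> 3 steps
    n_tokens, n_channels = len(matrix), len(matrix[0])
    out = [[0.0] * n_channels for _ in range(n_tokens)]
    for c in range(n_channels):
        column = [row[c] for row in matrix]
        lo, hi = min(column), max(column)
        step = (hi - lo) / levels or 1.0  # guard against a constant channel
        for t in range(n_tokens):
            code = round((matrix[t][c] - lo) / step)
            out[t][c] = lo + code * step
    return out


def quantize_with_token_scaling(matrix, bits=2):
    """Normalize each token to unit L2 norm, quantize, then restore the norm."""
    norms = [math.sqrt(sum(x * x for x in row)) or 1.0 for row in matrix]
    unit = [[x / n for x in row] for row, n in zip(matrix, norms)]
    dequant = quantize_per_channel(unit, bits)
    return [[x * n for x in row] for row, n in zip(dequant, norms)]


def mean_relative_error(original, approx):
    """Average over tokens of ||x - x_hat|| / ||x||."""
    total = 0.0
    for row, row_hat in zip(original, approx):
        err = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, row_hat)))
        total += err / math.sqrt(sum(a * a for a in row))
    return total / len(original)


random.seed(0)
n_tokens, n_channels = 64, 16
keys = [[random.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(n_tokens)]
for t in (0, 1):  # two synthetic outlier tokens with a 50x larger norm
    keys[t] = [50.0 * x for x in keys[t]]

plain_err = mean_relative_error(keys, quantize_per_channel(keys))
scaled_err = mean_relative_error(keys, quantize_with_token_scaling(keys))
print(f"per-channel only:    {plain_err:.3f}")
print(f"with token scaling:  {scaled_err:.3f}")
```

In the unscaled case the two outlier tokens stretch each channel's range, so the four available levels land far from where the ordinary tokens live and their relative error becomes large. Normalizing per token puts every row on the same scale before the per-channel ranges are computed, at the cost of storing one extra norm per token.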

Copy-paste prompts

Prompt 1
Install OScaR-KV-Quant with uv, Python 3.10, PyTorch 2.6 CUDA 12.4, and flash-attn 2.8.3
Prompt 2
Run the OScaR evaluation suite on LongBench-E with Qwen3 8B at 2 bit KV cache
Prompt 3
Explain Token Norm Imbalance from the OScaR paper and how Canalized Rotation fixes it
Prompt 4
Wire the OScaR CUDA kernels into a custom decoding loop on top of BitDecoding
Prompt 5
Track the upcoming vLLM and SGLang integration plans for OScaR-KV-Quant

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.