zunhaisu/oscar-kv-quant

Analysis updated 2026-06-24

★ 22 | C++ | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((OScaR))
    Inputs
      LLM weights
      KV cache tensors
      Benchmark sets
    Outputs
      2 bit KV cache
      Decoded tokens
      Throughput metrics
    Use Cases
      Shrink long context memory
      Speed up decoding
      Run multi modal LLMs cheaper
    Tech Stack
      C++
      CUDA
      PyTorch
      CUTLASS


What do people build with it?

USE CASE 1

Quantize the KV cache of Llama 3.1 8B or Qwen3 8B to 2 bits with near 16-bit accuracy

USE CASE 2

Reproduce the LongBench-E, OCRBench, and MMAU-Pro numbers from the OScaR paper

USE CASE 3

Cut LLM serving memory by roughly 5x on long context workloads

USE CASE 4

Apply Canalized Rotation and Omni-Token Scaling to a multimodal model such as Qwen3-VL
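For a sense of scale behind the memory use case, here is a rough back-of-the-envelope sketch of KV cache size. The helper name `kv_cache_bytes` and the 128K-token context are illustrative assumptions; the layer and head counts are the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128). The sketch ignores quantization metadata such as scales and zero points, so it gives an idealized 8x ratio rather than the roughly 5x figure quoted above.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Bytes needed for keys and values of one sequence.

    Idealized: ignores scales, zero points, and any values kept at
    higher precision, so real 2-bit savings will be smaller.
    """
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return n_values * bits_per_value / 8


# Llama 3.1 8B shape; the 128K context length is an illustrative assumption.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
seq_len = 128 * 1024
GiB = 1024 ** 3

fp16 = kv_cache_bytes(seq_len, bits_per_value=16, **shape)
int2 = kv_cache_bytes(seq_len, bits_per_value=2, **shape)

print(f"16-bit: {fp16 / GiB:.1f} GiB, 2-bit: {int2 / GiB:.1f} GiB")
# prints "16-bit: 16.0 GiB, 2-bit: 2.0 GiB"
```

The gap between this idealized 8x and a measured figure would come from whatever bookkeeping the real kernels store alongside the 2-bit codes, which this sketch does not model.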

What is it built with?

C++, CUDA, PyTorch, Python, CUTLASS

How does it compare?

	zunhaisu/oscar-kv-quant	akshayanirmal2005-cmyk/smart-health-track	swordfatih/reflect
Stars	22	23	21
Language	C++	C++	C++
Setup difficulty	hard	moderate	hard
Complexity	5/5	3/5	4/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA 12.4 GPU stack, custom kernels on top of BitDecoding and HadaCore, plus a CUTLASS submodule and flash-attn 2.8.3.

In plain English

OScaR is a research project from teams at Tsinghua, Hong Kong University, Meituan's LongCat team, and Edinburgh that tries to shrink the memory cost of running large language models without hurting their accuracy. When a large language model reads a long input or holds a long conversation, it stores intermediate values called the KV cache. That cache grows with the length of the input and can dominate memory use. The standard trick is quantization, which stores those values in fewer bits, such as 2 bits per number instead of 16. The trade-off is that aggressive quantization usually loses accuracy.

The paper, posted on arXiv on May 20, 2026, identifies what the authors see as the real reason 2-bit quantization tends to fail. They call it Token Norm Imbalance, or TNI: across the tokens in a sequence, some have norms far from those of the rest, and the usual per-channel quantization scheme cannot represent both groups well at once. In plain language, a few outlier tokens drag down the precision of the rest. The authors show that TNI appears in text-only models, in multimodal models, and in omni-modal models that handle audio and images alongside text.

OScaR, short for Omni-Scaled Canalized Rotation, is the proposed fix. It is described as following Occam's razor: it does only two things, a Canalized Rotation step and an Omni-Token Scaling step. It needs neither extra training data nor a calibration pass on real samples. The reported results put 2-bit OScaR roughly even with the original 16-bit baseline on LongBench-E, OCRBench, and the MMAU-Pro audio benchmark, across models such as Llama 3.1 8B, Qwen3 8B, Qwen3-VL, and Qwen3-Omni 30B. On the implementation side, the authors built custom CUDA kernels on top of BitDecoding and HadaCore, and report 3.0x faster decoding, 5.3x less memory, and 4.1x higher throughput against a BF16 FlashDecoding-v2 baseline.
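To make the Token Norm Imbalance argument concrete, the toy sketch below quantizes a small matrix of tokens to 2 bits per channel, once as-is and once after dividing each token by its own scale. This is not OScaR's algorithm: the per-token max-abs scaling is a simplified stand-in for the idea of token scaling, the rotation step is not modeled at all, and every name here (`quantize_per_channel`, `mean_rel_error`, the synthetic data) is hypothetical.

```python
import random


def quantize_per_channel(rows, bits=2):
    """Asymmetric min/max quantization with one (scale, offset) per column.

    Returns the dequantized values so the error can be measured directly.
    """
    levels = 2 ** bits - 1
    n_cols = len(rows[0])
    out = [[0.0] * n_cols for _ in rows]
    for c in range(n_cols):
        col = [row[c] for row in rows]
        lo, hi = min(col), max(col)
        scale = (hi - lo) / levels or 1.0   # guard against a constant column
        for t, x in enumerate(col):
            code = round((x - lo) / scale)  # integer code in [0, levels]
            out[t][c] = code * scale + lo   # dequantize
    return out


def mean_rel_error(orig, deq, skip=()):
    """Mean per-token relative L2 error, skipping the listed token indices."""
    errs = []
    for t, (o, d) in enumerate(zip(orig, deq)):
        if t in skip:
            continue
        num = sum((a - b) ** 2 for a, b in zip(o, d)) ** 0.5
        den = sum(a * a for a in o) ** 0.5
        errs.append(num / den)
    return sum(errs) / len(errs)


# Synthetic data (an assumption): 32 tokens x 64 channels, one outlier token.
random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(64)] for _ in range(32)]
tokens[0] = [50.0 * x for x in tokens[0]]   # token 0 has ~50x the norm

# 1) Plain per-channel quantization: the outlier stretches every column's range.
plain = quantize_per_channel(tokens)

# 2) Stand-in token scaling: normalize each token, quantize, then rescale.
scales = [max(abs(x) for x in row) for row in tokens]
scaled = [[x / s for x in row] for row, s in zip(tokens, scales)]
rescaled = [[x * s for x in row]
            for row, s in zip(quantize_per_channel(scaled), scales)]

err_plain = mean_rel_error(tokens, plain, skip={0})
err_scaled = mean_rel_error(tokens, rescaled, skip={0})
print(f"error on normal tokens, plain:  {err_plain:.2f}")
print(f"error on normal tokens, scaled: {err_scaled:.2f}")
```

In the plain case the single large token sets the min or max of nearly every column, so the four available levels are spread over a range the ordinary tokens barely occupy. Equalizing token magnitudes first lets those levels cover the ordinary tokens, at the cost of storing one extra scale per token.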
The repository ships the code, an evaluation suite, and an installation flow built around the uv package manager, Python 3.10, PyTorch 2.6 with CUDA 12.4, and flash-attn 2.8.3, plus a CUTLASS submodule. Integration with the vLLM and SGLang serving frameworks is listed as upcoming.
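Because the stack is pinned fairly tightly, a quick preflight check can save time before building the kernels. The sketch below compares an environment against the versions named above. The distribution names `torch` and `flash-attn` and both helper functions are assumptions for illustration, not part of the repository, and the check does not cover the CUDA toolkit or the CUTLASS submodule.

```python
import sys
from importlib import metadata

# Version pins taken from the text above; distribution names are assumptions.
REQUIRED = {"torch": "2.6", "flash-attn": "2.8.3"}
REQUIRED_PYTHON = (3, 10)


def installed_versions(names):
    """Look up installed versions, using None for missing packages."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(found, python_version, required=REQUIRED):
    """Return human-readable problems; an empty list means everything matches."""
    problems = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        want = ".".join(map(str, REQUIRED_PYTHON))
        have = ".".join(map(str, python_version[:2]))
        problems.append(f"python: have {have}, need {want}")
    for name, pin in required.items():
        have = found.get(name)
        if have is None:
            problems.append(f"{name}: not installed, need {pin}")
        elif not (have == pin or have.startswith((pin + ".", pin + "+"))):
            # Accept suffixes such as "2.6.0+cu124" but reject e.g. "2.60".
            problems.append(f"{name}: have {have}, need {pin}")
    return problems


if __name__ == "__main__":
    issues = find_mismatches(installed_versions(REQUIRED), sys.version_info)
    for line in issues or ["environment matches the pinned versions"]:
        print(line)
```

Keeping the comparison logic separate from the lookup makes it easy to test against hand-written version dictionaries without installing anything.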

Copy-paste prompts

Prompt 1

Install OScaR-KV-Quant with uv, Python 3.10, PyTorch 2.6 CUDA 12.4, and flash-attn 2.8.3

Prompt 2

Run the OScaR evaluation suite on LongBench-E with Qwen3 8B at 2 bit KV cache

Prompt 3

Explain Token Norm Imbalance from the OScaR paper and how Canalized Rotation fixes it

Prompt 4

Wire the OScaR CUDA kernels into a custom decoding loop on top of BitDecoding

Prompt 5

Track the upcoming vLLM and SGLang integration plans for OScaR-KV-Quant

Frequently asked questions

What is oscar-kv-quant?

Research code for OScaR, a 2-bit KV cache quantization method for LLMs that uses a rotation plus token scaling fix to match 16-bit accuracy without retraining.

What language is oscar-kv-quant written in?

Mainly C++. The stack also includes CUDA, PyTorch, Python, and CUTLASS.

How hard is oscar-kv-quant to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is oscar-kv-quant for?

Mainly researchers.


Verify against the repo before relying on details.