Compress the KV cache of a Qwen3 or GLM model down to about 2.28 bits per element while keeping accuracy
Reproduce the GPQA, HumanEval, LiveCodeBench, AIME 25, and MATH-500 results from the paper
Plug OSCAR-quantized inference into an SGLang serving setup to save GPU memory at long contexts
Download a precomputed rotation file from the Rotation Zoo and skip the calibration step
You need CUDA 12.8 or 12.9, Python 3.12, Hugging Face access, and at least one H100 80 GB GPU for the smaller models; the larger ones need four or eight H100s.
OSCAR is a research project from FutureMLS-Lab that tackles one of the biggest memory bottlenecks of large language models: the so-called KV cache. When a model generates text, it has to remember the keys and values for every token already in the conversation. That memory grows with context length, and on long inputs it can swallow more GPU memory than the model weights themselves. Storing each number in 16 bits (BF16) is the safe default. OSCAR stores them in about 2 bits instead (roughly 2.28 effective bits per element), using about seven times less memory, while keeping the answers almost as accurate as the BF16 baseline.
The trick is in the name: Offline Spectral Covariance-Aware Rotation. Before running the model, OSCAR feeds it a small calibration set, watches the Q, K, and V activations that attention produces, and computes a rotation matrix and a clipping threshold for each layer. The rotation lines up the data so that the two bits per number land where attention is actually looking, rather than wasting precision on directions the model does not use. A small slice of the very first and most recent tokens is kept in full BF16 as a safety net.
The README shows results on five reasoning and coding benchmarks (GPQA, HumanEval, LiveCodeBench v6, AIME 25, MATH-500) and on four models: Qwen3-4B Thinking, Qwen3-8B, Qwen3-32B, and the very large GLM-4.7-FP8 at 358 billion parameters. At about 2.28 effective bits per KV element, OSCAR stays within a few points of BF16, while older 2-bit methods such as QuaRot-INT2 and naive INT2 collapse on the reasoning tasks. It also matches or beats a 4-bit baseline at roughly half the storage.
The code plugs into the open-source SGLang inference framework. The repo lays out a three-phase pipeline: dump Q, K, and V activations from a calibration run, fit the rotation, then evaluate. On a single H100 80 GB the whole Qwen3-8B example takes about twenty minutes. Bigger models need four or eight H100s, CUDA 12.8 or 12.9, Python 3.12, and Hugging Face access to the weights.
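To make the "about seven times less memory" figure concrete, here is a minimal back-of-the-envelope sketch in Python. It is not code from the OSCAR repo. The function name `kv_cache_bytes` is hypothetical, and the layer count, KV-head count, head dimension, and context length are illustrative assumptions for an 8B-class model with grouped-query attention; check the actual model config before relying on them. Only the 16-bit and 2.28-bit figures come from the text above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bits_per_element):
    """Bytes needed to store keys and values for num_tokens tokens (hypothetical helper)."""
    # Each token stores one key vector and one value vector per layer and per KV head.
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    return num_tokens * elements_per_token * bits_per_element / 8


# Assumed, illustrative model shape (not taken from the OSCAR README).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_TOKENS = 32_768

# Bit widths quoted in the text: BF16 baseline vs. about 2.28 effective bits.
BF16_BITS = 16
OSCAR_BITS = 2.28

bf16_bytes = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BF16_BITS)
oscar_bytes = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, OSCAR_BITS)
ratio = bf16_bytes / oscar_bytes

GIB = 1024 ** 3
print(f"BF16 KV cache:      {bf16_bytes / GIB:.2f} GiB")
print(f"2.28-bit KV cache:  {oscar_bytes / GIB:.2f} GiB")
print(f"Compression ratio:  {ratio:.2f}x")
```

Under these assumed shapes, a single 32,768-token sequence needs 4.50 GiB of KV cache in BF16 and about 0.64 GiB at 2.28 bits, a ratio of roughly 7.02x. The ratio depends only on the two bit widths (16 / 2.28), so it holds for any model shape; the absolute sizes scale linearly with context length and batch size.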
For users who do not want to recompute the rotations themselves, the authors publish a "Rotation Zoo" on Hugging Face so the calibrated rotation files can be downloaded directly.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.