explaingit

futuremls-lab/oscar

12
Python
Audience · researcher
Complexity · 5/5
Active
Setup · hard

TLDR

Research code for OSCAR, a 2-bit KV cache compression method for LLM inference that uses offline spectral covariance-aware rotation and plugs into SGLang.

Mindmap

mindmap
  root((OSCAR))
    Inputs
      Calibration prompts
      Model weights from HuggingFace
      Q K V activations
    Outputs
      Per layer rotation matrices
      Clipping thresholds
      Compressed KV cache
    Use Cases
      Serve long context LLMs with less GPU memory
      Reproduce paper benchmarks
      Download precomputed rotations from Rotation Zoo
    Tech Stack
      Python
      PyTorch
      SGLang
      CUDA
      H100

Things people build with this

USE CASE 1

Compress the KV cache of a Qwen3 or GLM model down to about 2.28 bits per element while staying within a few points of BF16 accuracy

USE CASE 2

Reproduce the GPQA, HumanEval, LiveCodeBench, AIME 25, and MATH-500 results from the paper

USE CASE 3

Plug OSCAR-quantized inference into an SGLang serving setup to save GPU memory at long contexts

USE CASE 4

Download a precomputed rotation file from the Rotation Zoo and skip the calibration step

Tech stack

Python, PyTorch, SGLang, CUDA, HuggingFace

Getting it running

Difficulty · hard
Time to first run · 1 day+

You need CUDA 12.8 or 12.9, Python 3.12, HuggingFace access, and at least one H100 80GB for small models, with multi-GPU rigs needed for the large ones.

In plain English

OSCAR is a research project from FutureMLS-Lab that tackles one of the biggest memory bottlenecks in large language model inference: the so-called KV cache. When a model generates text, it has to keep the keys and values for every token already in the context. That memory grows with context length, and on long inputs it can take more GPU memory than the model weights themselves. Storing each number in 16 bits (BF16) is the safe default. OSCAR stores them in 2 bits instead, at about 2.28 effective bits per element, which is roughly seven times less memory than BF16, while keeping answers almost as accurate as the BF16 baseline.

The trick is in the name: Offline Spectral Covariance-Aware Rotation. Before running the model, OSCAR feeds it a small calibration set, watches the Q, K, and V activations that attention produces, and computes a rotation matrix and a clipping threshold for each layer. The rotation lines up the data so that the two bits per number land where attention is actually looking, rather than wasting precision on directions the model does not use. A small slice of the very first and most recent tokens is kept in full BF16 as a safety net.

The README shows results on five reasoning and coding benchmarks (GPQA, HumanEval, LiveCodeBench v6, AIME 25, MATH-500) and on four models: Qwen3-4B Thinking, Qwen3-8B, Qwen3-32B, and the very large GLM-4.7-FP8 at 358 billion parameters. At about 2.28 effective bits per KV element, OSCAR stays within a few points of BF16, while older 2-bit methods such as QuaRot-INT2 and naive INT2 collapse on the reasoning tasks. It also matches or beats a 4-bit baseline at roughly half the storage.

The code plugs into the open-source SGLang inference framework. The repo lays out a three-phase pipeline: dump Q, K, and V activations from a calibration run, fit the rotation, then evaluate. On a single H100 80 GB the whole Qwen3-8B example takes about twenty minutes. Bigger models need four or eight H100s, CUDA 12.8 or 12.9, Python 3.12, and HuggingFace access to the weights.
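
The memory figures in this section can be checked with a few lines of arithmetic. The sketch below is not taken from the repo: the function name and the layer, head, and context numbers are illustrative assumptions (roughly the shape of an 8B-class model with grouped-query attention), so read the real values from the model's `config.json` before relying on the result. Only the 16-bit and 2.28-bit figures come from the text above.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Size of the KV cache in bytes for one sequence."""
    # 2 = one key vector and one value vector per token, per layer, per KV head
    elements = 2 * tokens * layers * kv_heads * head_dim
    return elements * bits_per_element / 8


GIB = 1024 ** 3

# Hypothetical model shape and context length, chosen only for illustration.
cfg = dict(layers=36, kv_heads=8, head_dim=128)
tokens = 131072  # a 128k-token context

bf16 = kv_cache_bytes(tokens, bits_per_element=16, **cfg)
low_bit = kv_cache_bytes(tokens, bits_per_element=2.28, **cfg)

print(f"BF16 cache:     {bf16 / GIB:.1f} GiB")     # 18.0 GiB
print(f"2.28-bit cache: {low_bit / GIB:.1f} GiB")  # 2.6 GiB
print(f"ratio:          {bf16 / low_bit:.1f}x")    # 7.0x
```

The ratio depends only on the bit widths (16 / 2.28), which is where the "about seven times" figure comes from; the absolute sizes scale with whatever model shape and context length you plug in.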
For users who do not want to recompute the rotations themselves, the authors publish a "Rotation Zoo" on Hugging Face so the calibrated rotation files can be downloaded directly.
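
To make the calibration step more concrete, here is a toy NumPy sketch of the general idea: fit an orthogonal rotation from the covariance of calibration activations, pick a clipping threshold, then quantize rotated vectors to 2 bits. It is a sketch under stated assumptions and not the authors' algorithm. The eigenvector rotation, the 99th-percentile clip, the uniform 4-level grid, and every function name are stand-ins chosen for illustration; the repo defines how OSCAR actually builds its rotations and thresholds.

```python
import numpy as np


def fit_rotation(calib):
    """Orthogonal matrix built from eigenvectors of the calibration covariance."""
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(calib) - 1)
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input, so columns are orthonormal
    return eigvecs


def fit_clip(rotated, percentile=99.0):
    """One clipping threshold per layer; the percentile rule is an assumption."""
    return float(np.percentile(np.abs(rotated), percentile))


def quantize_2bit(x, clip):
    """Clip to [-clip, clip] and map onto 4 evenly spaced levels (codes 0..3)."""
    step = 2.0 * clip / 3.0
    return np.round((np.clip(x, -clip, clip) + clip) / step).astype(np.uint8)


def dequantize_2bit(codes, clip):
    """Map codes 0..3 back to values in [-clip, clip]."""
    step = 2.0 * clip / 3.0
    return codes.astype(np.float64) * step - clip


# Synthetic stand-in for key activations of one layer (not real model data).
rng = np.random.default_rng(0)
head_dim = 8
mixing = rng.standard_normal((head_dim, head_dim))
calib_keys = rng.standard_normal((512, head_dim)) @ mixing

# Offline: fit the rotation and threshold once from calibration data.
rotation = fit_rotation(calib_keys)
clip = fit_clip(calib_keys @ rotation)

# Online: rotate new keys, store 2-bit codes, undo the rotation when reading.
new_keys = rng.standard_normal((4, head_dim)) @ mixing
codes = quantize_2bit(new_keys @ rotation, clip)
restored = dequantize_2bit(codes, clip) @ rotation.T

print(codes.shape, codes.dtype)
```

Because the rotation is orthogonal, multiplying by its transpose undoes it exactly, so all of the reconstruction error comes from clipping and rounding. A downloaded Rotation Zoo file would take the place of the offline fitting step in a flow like this.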

Copy-paste prompts

Prompt 1
Run the three-phase OSCAR pipeline on Qwen3-8B on a single H100 80GB and report the effective bits per KV element
Prompt 2
Wire OSCAR into an SGLang server and benchmark long-context generation memory against BF16
Prompt 3
Download a Rotation Zoo file for Qwen3-32B and skip recalibration in the OSCAR evaluation step
Prompt 4
Compare OSCAR at 2 bits to QuaRot-INT2 and a 4-bit baseline on LiveCodeBench v6 using this repo

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.