futuremls-lab/oscar

Analysis updated 2026-06-24

★ 12 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((OSCAR))
    Inputs
      Calibration prompts
      Model weights from HuggingFace
      Q K V activations
    Outputs
      Per layer rotation matrices
      Clipping thresholds
      Compressed KV cache
    Use Cases
      Serve long context LLMs with less GPU memory
      Reproduce paper benchmarks
      Download precomputed rotations from Rotation Zoo
    Tech Stack
      Python
      PyTorch
      SGLang
      CUDA
      H100


What do people build with it?

USE CASE 1

Compress the KV cache of a Qwen3 or GLM model down to about 2.28 bits per element while keeping accuracy

USE CASE 2

Reproduce the GPQA, HumanEval, LiveCodeBench, AIME 25, and MATH-500 results from the paper

USE CASE 3

Plug OSCAR-quantized inference into an SGLang serving setup to save GPU memory at long contexts

USE CASE 4

Download a precomputed rotation file from the Rotation Zoo and skip the calibration step

What is it built with?

Python, PyTorch, SGLang, CUDA, HuggingFace

How does it compare?

	futuremls-lab/oscar	aim-uofa/reasonmatch	arpecop/kokobook
Stars	12	12	12
Language	Python	Python	Python
Setup difficulty	hard	hard	hard
Complexity	5/5	5/5	3/5
Audience	researcher	researcher	general

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

You need CUDA 12.8 or 12.9, Python 3.12, HuggingFace access, and at least one H100 80GB for small models, with multi-GPU rigs needed for the large ones.
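
The version requirements above can be turned into a quick preflight check. This is a minimal sketch, not something shipped with the repo: the function name `check_environment` and the constants are hypothetical, and the CUDA version is passed in by hand. If PyTorch is installed, the string in `torch.version.cuda` could be parsed and passed instead.

```python
import sys

# Versions taken from the requirements stated above.
REQUIRED_PYTHON = (3, 12)
SUPPORTED_CUDA = {(12, 8), (12, 9)}


def check_environment(python_version, cuda_version):
    """Return a list of problems; an empty list means both versions match.

    python_version: tuple such as (3, 12, 1) or sys.version_info
    cuda_version:   tuple such as (12, 8), or None if CUDA was not found
    """
    problems = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, 3.12 expected"
        )
    if cuda_version is None:
        problems.append("CUDA not found")
    elif tuple(cuda_version[:2]) not in SUPPORTED_CUDA:
        problems.append(
            f"CUDA {cuda_version[0]}.{cuda_version[1]} found, 12.8 or 12.9 expected"
        )
    return problems


# Example: check the running interpreter against a hand-entered CUDA version.
for problem in check_environment(sys.version_info, (12, 8)):
    print("WARNING:", problem)
```

The check only covers versions. GPU model, GPU count, and HuggingFace access still have to be confirmed separately.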

In plain English

OSCAR is a research project from FutureMLS-Lab that tackles one of the biggest memory bottlenecks of large language models: the so-called KV cache. When a model generates text, it has to remember the keys and values for every token already in the conversation. That memory grows with context length, and on long inputs it can swallow more GPU memory than the model weights themselves. Storing each number in 16 bits (BF16) is the safe default. OSCAR stores them in about 2 bits instead, using roughly seven times less memory, while keeping the answers almost as accurate as the BF16 baseline.

The trick is in the name: Offline Spectral Covariance-Aware Rotation. Before running the model, OSCAR feeds it a small calibration set, watches the Q, K, and V activations that attention produces, and computes a rotation matrix and a clipping threshold for each layer. The rotation lines up the data so that the two bits per number land where attention is actually looking, rather than wasting precision on directions the model does not use. A small slice of the very first and most recent tokens is kept in full BF16 as a safety net.

The README shows results on five reasoning and coding benchmarks (GPQA, HumanEval, LiveCodeBench v6, AIME 25, MATH-500) and on four models: Qwen3-4B Thinking, Qwen3-8B, Qwen3-32B, and the very large GLM-4.7-FP8 at 358 billion parameters. At about 2.28 effective bits per KV element, OSCAR stays within a few points of BF16, while older 2-bit methods such as QuaRot-INT2 and naive INT2 collapse on the reasoning tasks. It also matches or beats a 4-bit baseline at roughly half the storage.

The code plugs into the open source SGLang inference framework. The repo lays out a three-phase pipeline: dump Q, K, and V activations from a calibration run, fit the rotation, then evaluate. On a single H100 80 GB the whole Qwen3-8B example takes about twenty minutes. Bigger models need four or eight H100s, CUDA 12.8 or 12.9, Python 3.12, and Hugging Face access to the weights.
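
The "about seven times less memory" figure can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function `kv_cache_bytes` is hypothetical, and the model shape (36 layers, 8 KV heads, head dimension 128) and the 32,768-token context are assumed values rather than numbers taken from the repo. It also takes the 2.28 effective bits as given and does not model how the BF16-kept tokens contribute to it.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_element):
    """Bytes needed to hold the keys and values for num_tokens tokens."""
    # Factor 2 covers keys plus values.
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return elements * bits_per_element / 8


# Assumed model shape and context length, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, TOKENS = 36, 8, 128, 32_768

bf16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 16)
oscar_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 2.28)

print(f"BF16 cache:     {bf16_bytes / 2**30:.2f} GiB")
print(f"2.28-bit cache: {oscar_bytes / 2**30:.2f} GiB")
print(f"Reduction:      {bf16_bytes / oscar_bytes:.2f}x")
```

Whatever shape is plugged in, the reduction is 16 / 2.28, which is a little over 7, because the model dimensions cancel out of the ratio.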
For users who do not want to recompute the rotations themselves, the authors publish a "Rotation Zoo" on Hugging Face so the calibrated rotation files can be downloaded directly.
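
The rotate, clip, and quantize idea described in this section can be illustrated with a toy example. This is a sketch under strong assumptions, not OSCAR's implementation: it works on a single 2-D point, uses a hand-picked rotation angle instead of one fitted from calibration activations, and all function names are hypothetical. It only shows the mechanics: an orthogonal rotation, clipping to a threshold, mapping each coordinate to one of four levels (2 bits), and undoing the steps.

```python
import math


def rotate(point, theta):
    """Rotate a 2-D point by angle theta (an orthogonal transform)."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def quantize_2bit(value, clip):
    """Clip to [-clip, clip], then map to one of 4 evenly spaced levels (codes 0..3)."""
    step = 2.0 * clip / 3.0
    clipped = max(-clip, min(clip, value))
    return int(round((clipped + clip) / step))


def dequantize_2bit(code, clip):
    """Map a 2-bit code back to its level in [-clip, clip]."""
    step = 2.0 * clip / 3.0
    return -clip + code * step


def compress_roundtrip(point, theta, clip):
    """Rotate, quantize each coordinate to 2 bits, dequantize, rotate back."""
    rotated = rotate(point, theta)
    codes = [quantize_2bit(v, clip) for v in rotated]
    restored = tuple(dequantize_2bit(c, clip) for c in codes)
    return codes, rotate(restored, -theta)


# Hand-picked values for illustration; a real method would fit these per layer.
codes, approx = compress_roundtrip((1.0, 1.0), math.pi / 4, 1.5)
print("2-bit codes:", codes)
print("reconstruction:", approx)
```

Because the rotation preserves distances, the reconstruction error in the original space equals the quantization error in the rotated space. For in-range values that is at most half a quantization step per coordinate, which is why the choice of rotation and clipping threshold matters so much at 2 bits.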

Copy-paste prompts

Prompt 1

Run the three-phase OSCAR pipeline on Qwen3-8B on a single H100 80GB and report the effective bits per KV element

Prompt 2

Wire OSCAR into an SGLang server and benchmark long-context generation memory against BF16

Prompt 3

Download a Rotation Zoo file for Qwen3-32B and skip recalibration in the OSCAR evaluation step

Prompt 4

Compare OSCAR at 2 bits to QuaRot-INT2 and a 4-bit baseline on LiveCodeBench v6 using this repo

Frequently asked questions

What is oscar?

Research code for OSCAR, a 2-bit KV cache compression method for LLM inference that uses offline spectral covariance-aware rotation and plugs into SGLang.

What language is oscar written in?

Mainly Python. The stack also includes PyTorch and SGLang.

How hard is oscar to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is oscar for?

Mainly researchers.



Verify against the repo before relying on details.