Reproduce the ICML 2026 paper Q-K=V experiment results at 300M or 1.2B parameter scale.
Apply the Q-K=V attention variant to your own transformer model to cut KV cache memory by 50% with minimal accuracy loss.
Use the included computation profiler to verify the theoretical KV cache reduction numbers from the paper against your own hardware.
Requires eight NVIDIA A100 GPUs for the paper experiments; single-GPU training at these scales is described as impractical.
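To make "apply the Q-K=V attention variant to your own model" more concrete, here is a minimal NumPy sketch of single-head causal attention in which one projection supplies both keys and values, while the query projection stays separate. It is an illustration under assumptions, not the repository's code: the function name `shared_kv_attention`, the weight names `w_q` and `w_kv`, and the toy shapes are all hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv):
    """Single-head causal attention with one merged key/value projection.

    Hypothetical sketch: `w_kv` plays the role of both the key and the
    value projection, so only one tensor per token needs to be cached.
    """
    q = x @ w_q                      # (seq, d_head) queries, separate weights
    kv = x @ w_kv                    # (seq, d_head) used as keys AND values
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ kv, kv  # output and the single cached tensor

# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_head))
w_kv = rng.standard_normal((d_model, d_head))

out, kv = shared_kv_attention(x, w_q, w_kv)
print(out.shape, kv.shape)
```

Standard attention would return two cached tensors (keys and values) of this shape; here only `kv` is kept, which is where the halved cache comes from.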
This repository contains code accompanying a research paper published at the International Conference on Machine Learning (ICML) 2026. The paper asks whether the standard attention mechanism inside AI language models actually needs three separate sets of learned weights (called query, key, and value projections), or whether some can be shared to save memory without hurting much performance.

The research tests four variations. In the standard setup all three projections are distinct. The paper's main finding is a variant called Q-K=V, where the query projection stays separate but the key and value projections are merged into one. This approach reduces the size of a component called the KV cache (which stores intermediate results during text generation) by 50%, while only increasing text prediction error by about 3%. When combined with a technique called multi-query attention, the cache reduction reaches roughly 97% at the 300 million parameter scale.

Experiments were run at two model sizes, 300 million and 1.2 billion parameters, across 12 tasks covering image recognition, anomaly detection, and language modeling. Training data is SlimPajama, an open text dataset. Training at this scale required eight NVIDIA A100 GPUs and took between one and three days per experiment.

The repository contains one Python training script per variant and model size, shared configuration files for architecture and optimizer settings, a dataset download script, an evaluation script for standard language benchmarks, and a computation profiler that reproduces the theoretical analysis tables from the paper. Checkpoints are saved every 1,000 training steps.

This is a research code release intended for AI researchers who want to reproduce the paper's results or apply the findings to their own models. Running these experiments requires substantial GPU hardware. Single-GPU training is technically possible at these scales but is described as impractical. The project has 17 stars on GitHub.
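The cache figures quoted in the description can be sanity-checked with simple arithmetic. The sketch below is not the repository's profiler; `kv_cache_bytes` and every shape value are hypothetical. In particular, the head count of 16 is an assumption chosen for illustration, not a number taken from the paper. Under that assumption, sharing one tensor for keys and values halves the cache, and additionally keeping a single key/value head (multi-query attention) removes about 96.9% of it, which is consistent with the "roughly 97%" figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2, tensors_per_token=2):
    """Bytes needed to cache attention state during generation.

    tensors_per_token is 2 for standard attention (separate K and V)
    and 1 when a single shared tensor serves as both key and value.
    """
    return (n_layers * n_kv_heads * head_dim * seq_len * batch
            * bytes_per_elem * tensors_per_token)

# Hypothetical model shape, NOT taken from the paper or the repository.
cfg = dict(n_layers=24, head_dim=64, seq_len=2048)
n_heads = 16  # assumed head count

baseline = kv_cache_bytes(n_kv_heads=n_heads, **cfg)
shared = kv_cache_bytes(n_kv_heads=n_heads, tensors_per_token=1, **cfg)
shared_mqa = kv_cache_bytes(n_kv_heads=1, tensors_per_token=1, **cfg)

reduction_shared = 1 - shared / baseline          # merged K/V only
reduction_shared_mqa = 1 - shared_mqa / baseline  # merged K/V plus MQA

print(f"baseline cache:        {baseline / 2**20:.1f} MiB")
print(f"merged K/V reduction:  {reduction_shared:.1%}")
print(f"merged K/V + MQA:      {reduction_shared_mqa:.1%}")
```

The relative reductions depend only on the number of cached tensors per token and the number of key/value heads, so the layer count, head dimension, and sequence length cancel out of the percentages.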
Verify against the repo before relying on details.