brainchip-inc/do-transformers-need-3-projections

★ 17PythonAudience · researcherComplexity · 5/5Setup · hard

Mindmap

mindmap
  root((attention projections))
    Research question
      Share Q K V weights
      Reduce KV cache
      Minimal accuracy loss
    Variants tested
      Standard QKV
      Q-K equals V merged
      Multi-query attention
      Combined approach
    Results
      50 percent cache cut
      3 percent error increase
      97 percent at MQA scale
    Experiments
      300M parameters
      1.2B parameters
      12 tasks
    Tech stack
      Python
      PyTorch
      NVIDIA A100

mindmap root((attention projections)) Research question Share Q K V weights Reduce KV cache Minimal accuracy loss Variants tested Standard QKV Q-K equals V merged Multi-query attention Combined approach Results 50 percent cache cut 3 percent error increase 97 percent at MQA scale Experiments 300M parameters 1.2B parameters 12 tasks Tech stack Python PyTorch NVIDIA A100

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Reproduce the ICML 2026 paper Q-K=V experiment results at 300M or 1.2B parameter scale.

USE CASE 2

Apply the Q-K=V attention variant to your own transformer model to cut KV cache memory by 50% with minimal accuracy loss.

USE CASE 3

Use the included computation profiler to verify the theoretical KV cache reduction numbers from the paper against your own hardware.

Tech stack

PythonPyTorchCUDANVIDIA A100SlimPajama

Getting it running

Difficulty · hard Time to first run · 1day+

Requires eight NVIDIA A100 GPUs for the paper experiments, single-GPU training at these scales is described as impractical.

In plain English

This repository contains code accompanying a research paper published at the International Conference on Machine Learning (ICML) 2026. The paper asks whether the standard attention mechanism inside AI language models actually needs three separate sets of learned weights (called query, key, and value projections), or whether some can be shared to save memory without hurting much performance. The research tests four variations. In the standard setup all three projections are distinct. The paper's main finding is a variant called Q-K=V, where the query projection stays separate but the key and value projections are merged into one. This approach reduces the size of a component called the KV cache (which stores intermediate results during text generation) by 50%, while only increasing text prediction error by about 3%. When combined with a technique called multi-query attention, the cache reduction reaches roughly 97% at the 300 million parameter scale. Experiments were run at two model sizes, 300 million and 1.2 billion parameters, across 12 tasks covering image recognition, anomaly detection, and language modeling. Training data is SlimPajama, an open text dataset. Training at this scale required eight NVIDIA A100 GPUs and took between one and three days per experiment. The repository contains one Python training script per variant and model size, shared configuration files for architecture and optimizer settings, a dataset download script, an evaluation script for standard language benchmarks, and a computation profiler that reproduces the theoretical analysis tables from the paper. Checkpoints are saved every 1,000 training steps. This is a research code release intended for AI researchers who want to reproduce the paper's results or apply the findings to their own models. Running these experiments requires substantial GPU hardware. Single-GPU training is technically possible at these scales but is described as impractical. The project has 17 stars on GitHub.

Copy-paste prompts

Prompt 1

I want to reproduce the Q-K=V experiment from this repo at the 300M parameter scale. What hardware do I need and which training script should I run first?

Prompt 2

Explain the difference between the four attention projection variants tested in this paper and show me which config file to edit to switch between them.

Prompt 3

I want to apply the Q-K=V projection sharing technique to my own transformer architecture. Walk me through the code changes in the training script that implement this.

Prompt 4

Run the profiler script from this repo and explain what each column in the output table corresponds to in the paper theoretical analysis.

Open on GitHub → Explain another repo

← brainchip-inc on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.