tilde-research/comp-muon-release

★ 14 | Python | Audience · researcher | Complexity · 4/5 | License · Apache 2.0 | Setup · moderate

Mindmap

mindmap
  root((repo))
    What it does
      Partner whitening
      Attention layer updates
      Spectral normalization
    Tech stack
      Python
      PyTorch
    Use cases
      Transformer training
      Research experiments
      Custom optimizers
    Audience
      ML researchers
      AI engineers
    Key functions
      cm_qk query-key
      cm_ov output-value


Things people build with this

USE CASE 1

Train transformer attention layers more precisely by accounting for how paired weight matrices interact during updates.

USE CASE 2

Drop Compositional Muon into an existing training loop alongside your current optimizer to handle only the attention pairs.

USE CASE 3

Reproduce or extend research on partner-aware optimizers for language or vision transformer models.

USE CASE 4

Experiment with spectral normalization techniques applied to query-key and output-value matrix pairs.
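As a reference point for the spectral normalization experiments in use case 4, the sketch below shows the exact operation that Muon-style optimizers aim for: keep an update's singular vectors and set every singular value to 1. The function name `orthogonalize` and the example shapes are assumptions made for illustration. Muon implementations typically approximate this with a Newton-Schulz iteration instead of a full SVD, and this summary does not say which variant the repository uses.

```python
import torch


def orthogonalize(update: torch.Tensor) -> torch.Tensor:
    """Return U @ Vh: same singular vectors, all singular values set to 1.

    Hypothetical helper for illustration; not part of the repository's API.
    """
    U, _, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh


torch.manual_seed(0)
dtype = torch.float64

# A stand-in gradient whose columns differ in scale by several orders of magnitude.
column_scales = torch.tensor([10.0, 1.0, 0.1, 0.01], dtype=dtype)
grad = torch.randn(6, 4, dtype=dtype) * column_scales

normalized = orthogonalize(grad)

# Singular values before and after: the spread collapses to all ones.
before = torch.linalg.svdvals(grad)
after = torch.linalg.svdvals(normalized)
print("before:", before)
print("after: ", after)
```

After this step, every direction in the update carries the same magnitude, which is the baseline behavior that a partner-aware variant would then modify.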

Tech stack

Python, PyTorch

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires PyTorch. The demo in src/main.py runs a small transformer out of the box. The caller must manage momentum state and update non-attention parameters with a separate optimizer.
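The two caller responsibilities mentioned above (owning the momentum buffers and routing non-attention parameters to another optimizer) can be sketched as follows. Everything here is illustrative: `TinyAttention` is a made-up module, and `placeholder_pair_step` is a hypothetical stand-in that applies plain momentum SGD. It is not the Compositional Muon update, and it does not reflect the real signatures of `cm_qk` or `cm_ov`, which should be taken from the repository.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyAttention(nn.Module):
    """Single-head self-attention block, used only for illustration."""

    def __init__(self, d_model: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.o = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)
        scores = self.q(h) @ self.k(h).transpose(-2, -1) / h.shape[-1] ** 0.5
        return x + self.o(torch.softmax(scores, dim=-1) @ self.v(h))


def placeholder_pair_step(w_a, w_b, buf_a, buf_b, lr=0.02, beta=0.95):
    """Hypothetical stand-in for a pair update: momentum SGD, applied in place.

    NOT the Compositional Muon rule; it only marks where a pair update would go.
    """
    with torch.no_grad():
        for w, buf in ((w_a, buf_a), (w_b, buf_b)):
            buf.mul_(beta).add_(w.grad)  # update the caller-owned momentum buffer
            w.add_(buf, alpha=-lr)       # in-place weight update


model = TinyAttention()
attn = {"q": model.q.weight, "k": model.k.weight,
        "v": model.v.weight, "o": model.o.weight}

# Everything that is not an attention weight goes to a conventional optimizer.
attn_ids = {id(p) for p in attn.values()}
other_params = [p for p in model.parameters() if id(p) not in attn_ids]
adamw = torch.optim.AdamW(other_params, lr=1e-3)

# The caller creates and keeps one momentum buffer per attention matrix.
momentum = {name: torch.zeros_like(p) for name, p in attn.items()}

x = torch.randn(4, 10, 16)
for _ in range(3):
    model.zero_grad()
    loss = model(x).pow(2).mean()
    loss.backward()
    placeholder_pair_step(attn["q"], attn["k"], momentum["q"], momentum["k"])
    placeholder_pair_step(attn["o"], attn["v"], momentum["o"], momentum["v"])
    adamw.step()  # LayerNorm parameters only
```

Splitting by parameter identity instead of by name keeps the partition correct even if modules are renamed, and keeping the buffers in a plain dictionary makes them easy to save alongside the AdamW state.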

Apache 2.0: free to use, modify, and distribute, including in commercial projects, as long as you include the original license notice.

In plain English

This repository is a Python implementation of Compositional Muon, a research-oriented method for training the attention layers of transformer models more precisely. Transformers are the model architecture behind most modern AI language and vision systems. Attention layers are a core component: they determine how different parts of an input relate to each other.

Training a transformer means repeatedly adjusting millions of numerical parameters based on how wrong the model's outputs are. Most training methods treat each parameter matrix independently when deciding how to update it. The problem with attention layers is that the model never actually sees the individual matrices: it only sees the result of multiplying pairs of them together. Compositional Muon accounts for this by using information about one matrix in a pair to shape the update applied to the other. The README calls this approach partner whitening.

The method builds on an existing optimizer called Muon, which applies a kind of spectral normalization to each matrix's gradient-based update before using it. Compositional Muon extends that idea to the two matrix pairs that make up transformer attention: the query-key pair and the output-value pair. The update rule for each matrix is adjusted based on the geometry of its partner, so that the effective step size adapts to how stretched or compressed the partner matrix is.

The code provides two functions, cm_qk and cm_ov, which take the relevant weight matrices, their gradients, and momentum buffers as arguments and apply the update in place. The caller is responsible for managing momentum state. These two functions handle only the attention pairs; all other model parameters are updated with a separate optimizer of the caller's choice. A runnable demo in src/main.py shows a small transformer trained with this combination. The library requires PyTorch and is released under the Apache 2.0 license.
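The claim that the model only sees the product of a matrix pair can be checked numerically. The sketch below is a minimal illustration under assumed names and shapes (`W_q`, `W_k`, an 8-dimensional model with a 4-dimensional head, and the convention that the logits depend on `W_q @ W_k.T`). It shows two things: different factor pairs can give exactly the same product, and the gradient reaching one factor is the product-level gradient multiplied by its partner. It does not implement partner whitening itself.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64
d_model, d_head = 8, 4

# Hypothetical query and key projections for a single attention head.
W_q = torch.randn(d_model, d_head, dtype=dtype)
W_k = torch.randn(d_model, d_head, dtype=dtype)
product = W_q @ W_k.T  # the only quantity the attention logits depend on

# Re-mix the head dimension with an invertible matrix A (rotation times uneven scaling).
Q, _ = torch.linalg.qr(torch.randn(d_head, d_head, dtype=dtype))
A = Q @ torch.diag(torch.tensor([0.25, 0.5, 2.0, 4.0], dtype=dtype))
W_q_alt = W_q @ A
W_k_alt = W_k @ torch.linalg.inv(A).T

# Different factors, identical product: the model cannot tell the two pairs apart.
product_alt = W_q_alt @ W_k_alt.T
same_product = torch.allclose(product, product_alt)

# The gradient for W_q is the product-level gradient multiplied by the partner W_k.
G_P = torch.randn(d_model, d_model, dtype=dtype)  # stand-in for dLoss/dProduct
W_q.requires_grad_(True)
loss = ((W_q @ W_k.T) * G_P).sum()
loss.backward()
grad_matches = torch.allclose(W_q.grad, G_P @ W_k)

print("same product:", same_product)
print("grad equals G_P @ W_k:", grad_matches)
```

Because `W_k` multiplies the gradient that `W_q` receives, a stretched or compressed partner rescales the raw update, which is the effect that a partner-aware rule sets out to correct.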

Copy-paste prompts

Prompt 1

I want to use the cm_qk and cm_ov functions from Compositional Muon to train a transformer. Show me how to integrate them into a PyTorch training loop, including how to manage the momentum buffers they need.

Prompt 2

Explain partner whitening as used in tilde-research/comp-muon-release. How does using one matrix's geometry to shape the update of its partner improve transformer attention training?

Prompt 3

I have an existing transformer training script using AdamW. Show me step-by-step how to add Compositional Muon for the attention weight pairs while keeping AdamW for all other parameters.

Prompt 4

Walk me through the src/main.py demo in tilde-research/comp-muon-release. What transformer is being trained, how are cm_qk and cm_ov called, and what should I expect to see?

Prompt 5

How does Compositional Muon differ from the base Muon optimizer? Explain what extra information it uses and why that matters for query-key and output-value matrix pairs.


Verify against the repo before relying on details.