nohwai-software/raytention

Analysis updated 2026-05-18

★ 1 | Python | Audience · researcher | Complexity · 5/5 | License · AGPL-3.0 | Setup · hard

Mindmap

mindmap
  root((RayTention))
    What it does
      Replaces KV cache
      Fixed memory cost
      Matches quality
    7 Signals
      Centroid
      Temporal centroid
      Top-1 key
      Entropy
    Tech Stack
      Python
      PyTorch
      CUDA
    Limitations
      Slower prototype
      CUDA kernel pending


What do people build with it?

USE CASE 1

Test whether RayTention's fixed-memory attention matches standard transformer quality on your own dataset.

USE CASE 2

Build long-context inference servers that can serve more concurrent users on the same GPU hardware.

USE CASE 3

Study a KV-cache-free attention mechanism as a research starting point for your own architecture work.
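The serving use case rests on a memory argument that can be sketched as back-of-the-envelope arithmetic. The model shape below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical assumption, not RayTention's benchmark configuration, and the helper names are invented for illustration. It only shows how a standard KV cache grows with context length while a 642-value summary does not.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def sequences_that_fit(memory_bytes, per_sequence_bytes):
    """How many sequences of a given size fit in a memory budget."""
    return memory_bytes // per_sequence_bytes


# Hypothetical model shape (an assumption, not taken from the repo).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
GPU_BYTES = 80 * 10**9  # assumed 80 GB budget, ignoring weights and activations

if __name__ == "__main__":
    for tokens in (16_000, 1_000_000):
        size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
        fit = sequences_that_fit(GPU_BYTES, size)
        print(f"{tokens:>9} tokens: {size / 1e9:.1f} GB per sequence, {fit} fit")

    # One 642-value summary, assumed to be stored as 32-bit floats.
    print(f"one 642-value summary: {642 * 4} bytes at any length")
```

Under these assumptions the cache is about 8.4 GB per sequence at 16,000 tokens and about 524 GB at 1 million tokens, the same order as the figure of over 500 gigabytes quoted for this repo. How many summaries RayTention keeps per layer or head is not stated here, so the last line is a lower bound on its state, not its total footprint.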

What is it built with?

Python, PyTorch, CUDA, Rust

How does it compare?

	nohwai-software/raytention	a-bissell/unleash-lite	abhiinnovates/whatsapp-hr-assistant
Stars	1	1	1
Language	Python	Python	Python
Setup difficulty	hard	hard	hard
Complexity	5/5	4/5	3/5
Audience	researcher	researcher	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 30 min

Requires a CUDA-capable GPU and Python 3.10 or later. There is no CPU fallback.
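Before attempting a run, the two stated requirements can be checked with a short script. This is a minimal sketch, not part of the repo: the function name `check_environment` is hypothetical, and it assumes PyTorch is the package that exposes the GPU, using only `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Return a list of problems; an empty list means the basics are in place."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or later is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch  # imported lazily so the check also works without it

        if not torch.cuda.is_available():
            problems.append("PyTorch cannot see a CUDA device")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    if issues:
        for issue in issues:
            print("missing:", issue)
    else:
        print("environment looks ready")
```

An empty result only confirms the prerequisites named above. It says nothing about GPU memory, driver versions, or any other dependency the repo may need.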

AGPL-3.0: use and modify freely, but any derivative work or service must also be released under AGPL-3.0.

In plain English

RayTention is a research project that proposes a different way for AI language models to process text. Standard language models use more and more memory as they read longer text. That memory, called a KV cache, can reach hundreds of gigabytes for very long documents. RayTention is designed to replace it with a small, fixed summary that stays the same size no matter how long the text is.

The core idea is to compress the entire history of a conversation or document into seven signals totaling 642 values, instead of storing every word's representation in full. The signals capture things like the overall topic of what was just read, the most recently seen token, the single piece of context most relevant right now, and how focused or scattered the model's attention is. A small neural network then processes these signals to produce the model's output.

The practical result is that RayTention uses roughly 102 megabytes of GPU memory for inference at any context length, from 16,000 tokens to 1 million. Standard approaches need over 500 gigabytes for the same 1 million token context. According to benchmarks in the repo, RayTention matches standard attention in text prediction quality after 2,000 training steps while using a fraction of the memory. It is slower in the current Python prototype because it lacks the hardware-optimized kernels that standard attention benefits from, but the authors note that a Rust and CUDA version already reaches much higher speeds.

The project is written in Python and uses PyTorch. Running the benchmark requires a CUDA-capable GPU and Python 3.10 or later. The repo includes a single benchmark script that trains both a standard transformer and a RayTention model on the same data, then compares memory usage and speed. A native CUDA kernel is listed as future work.

The license is AGPL-3.0, a copyleft license that requires derivative works to be released under the same terms. The architecture is also subject to a pending U.S. patent application.
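The idea of a summary that stays the same size as more text arrives can be illustrated with a toy sketch. This is not RayTention's implementation: the class `RunningSummary`, its `decay` parameter, and the choice of two signals (an overall centroid and a recency-weighted centroid) are hypothetical stand-ins, and the `entropy` helper only shows one common way to measure how focused or scattered a set of attention weights is.

```python
import math


class RunningSummary:
    """Fixed-size summary of a stream of vectors (illustrative only)."""

    def __init__(self, dim, decay=0.9):
        self.count = 0
        self.decay = decay
        self.centroid = [0.0] * dim  # mean of every vector seen so far
        self.recent = [0.0] * dim    # weighted mean that favours new vectors

    def update(self, vec):
        self.count += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.centroid = [c + (x - c) / self.count
                         for c, x in zip(self.centroid, vec)]
        if self.count == 1:
            self.recent = list(vec)
        else:
            self.recent = [self.decay * r + (1 - self.decay) * x
                           for r, x in zip(self.recent, vec)]

    def size(self):
        """Number of stored values; independent of how many vectors were seen."""
        return len(self.centroid) + len(self.recent)


def entropy(weights):
    """Shannon entropy in nats; low means focused, high means scattered."""
    return -sum(w * math.log(w) for w in weights if w > 0)


if __name__ == "__main__":
    summary = RunningSummary(dim=4)
    for step in range(10_000):
        summary.update([float(step % 7), 1.0, 0.0, -1.0])
    print("values stored:", summary.size())
    print("focused:", entropy([0.97, 0.01, 0.01, 0.01]))
    print("scattered:", entropy([0.25, 0.25, 0.25, 0.25]))
```

The stored state is eight values whether the stream holds ten vectors or ten thousand, which is the property the 642-value summary relies on. Signals that depend on the current query, such as the single most relevant piece of context, need more machinery than this sketch attempts.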

Copy-paste prompts

Prompt 1

How do I run the RayTention benchmark to compare it against a standard transformer on my GPU?

Prompt 2

Explain how RayTention's 7 geometric signals replace the KV cache in a transformer model.

Prompt 3

How does RayTention achieve fixed O(1) memory while standard attention grows with context length?

Prompt 4

What would I need to implement to build a CUDA kernel for RayTention's L2 distance computation?

Frequently asked questions

What is raytention?

A research prototype that replaces the growing memory cost of AI language models with a fixed 642-value summary, matching standard attention quality at a fraction of the GPU memory.

What language is raytention written in?

Mainly Python. The stack also includes PyTorch and CUDA.

What license does raytention use?

AGPL-3.0: use and modify freely, but any derivative work or service must also be released under AGPL-3.0.

How hard is raytention to set up?

Setup difficulty is rated hard, with roughly 30 minutes to a first successful run.

Who is raytention for?

Mainly researchers.


Verify against the repo before relying on details.