thetom/turboquant_plus

★ 6,780 | Python | Audience · researcher | Complexity · 4/5 | Setup · moderate

Mindmap

mindmap
  root((TurboQuant+))
    What it does
      KV cache compression
      Memory reduction
      Large model support
    Cache Formats
      turbo4 3.8x smaller
      turbo3 medium
      turbo2 6.4x smaller
    Hardware Support
      Apple Silicon Mac
      NVIDIA cards
      AMD cards
    Key Findings
      Value cache safe at 2-bit
      Keys drive quality loss
      Layer protection helps


Things people build with this

USE CASE 1

Run a 104 billion parameter AI model at 128K context length on a single MacBook by applying KV cache compression.

USE CASE 2

Reduce GPU memory usage when running large language models locally on an NVIDIA or AMD card.

USE CASE 3

Test turbo2, turbo3, and turbo4 cache formats to find the right compression-to-quality tradeoff for a specific model.
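
One way to organize such a comparison is to generate one command line per key/value format pair and run the same prompt through each. The sketch below only builds the commands. It assumes the project keeps llama.cpp's upstream `--cache-type-k` and `--cache-type-v` flags and accepts `turbo2`, `turbo3`, and `turbo4` as values; the binary name, model path, and accepted values are assumptions to check against the repo.

```python
import shlex


def build_llama_cmd(model_path, cache_type_k, cache_type_v,
                    ctx_size=8192, binary="llama-cli"):
    """Return an argv list for one cache-format configuration.

    The flag names are those of upstream llama.cpp. Whether this project
    accepts "turbo2"/"turbo3"/"turbo4" as values is an assumption.
    """
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx_size),
        "--cache-type-k", cache_type_k,
        "--cache-type-v", cache_type_v,
    ]


# Hypothetical test matrix: keep keys at higher precision and vary the
# value format, then add one fully compressed run for contrast.
CANDIDATES = [
    ("q8_0", "q8_0"),      # baseline
    ("q8_0", "turbo4"),
    ("q8_0", "turbo3"),
    ("q8_0", "turbo2"),
    ("turbo4", "turbo4"),
]

for k_fmt, v_fmt in CANDIDATES:
    # "model.gguf" is a placeholder path
    print(shlex.join(build_llama_cmd("model.gguf", k_fmt, v_fmt)))
```

Keeping the key format fixed while varying the value format isolates the effect of each side of the cache, which matches the way the project's findings are reported.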

Tech stack

Python · llama.cpp · CUDA

Getting it running

Difficulty · moderate | Time to first run · 30 min

Prebuilt binaries are available for Mac and Windows; Linux users need to build from source with the llama.cpp dependencies.

In plain English

TurboQuant+ is a Python project that compresses the memory AI language models need while they generate text. When a model generates a response, it stores temporary data called a KV cache (short for key-value cache). On large models this cache can grow very large, limiting how much text the model can process at once. TurboQuant+ applies a compression technique from a 2026 Google research paper to shrink that cache by 3.8 to 6.4 times, so the same model fits into less memory with only a small quality penalty.

The project builds on llama.cpp, a widely used tool for running AI models on ordinary hardware. It adds new cache formats called turbo2, turbo3, and turbo4, named after the nominal number of bits used per value. The highest-compression format, turbo2, uses 2.5 bits per value and achieves a 6.4x reduction in cache size. The turbo4 format gets 3.8x compression with almost no measurable quality loss compared to the standard 8-bit format.

Three findings stand out from the team's experiments. First, compressing the value side of the cache down to 2 bits has no detectable effect on output quality as long as the key side stays at higher precision. Second, all quality degradation traces back to compressing the key cache, not the value cache. Third, protecting the first and last two transformer layers at higher precision recovers a large share of the quality difference, usually between 37 and 91 percent.

The project has been tested on Apple Silicon Macs, NVIDIA cards ranging from RTX 3080 Ti to RTX 5090, and AMD cards. It supports running models as large as 104 billion parameters at 128K context length on a single MacBook. Prebuilt binaries for Mac and Windows can be downloaded and run without any build tools.
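
The compression figures above can be checked with simple arithmetic. The sketch below assumes the ratios are measured against a 16-bit (fp16) cache, an inference from the fact that 16 / 2.5 = 6.4 matches the turbo2 figure; the source does not state the baseline. The model shape in the example is hypothetical and not taken from the project.

```python
# Back-of-the-envelope KV cache arithmetic.
# Assumption: quoted ratios are relative to a 16-bit (fp16) cache.
BASELINE_BITS = 16.0


def compression_ratio(bits_per_value):
    """Cache size reduction relative to the assumed 16-bit baseline."""
    return BASELINE_BITS / bits_per_value


def effective_bits(ratio):
    """Bits per value implied by a given compression ratio."""
    return BASELINE_BITS / ratio


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Total KV cache size in bytes.

    Keys and values each hold one head_dim vector per KV head,
    per layer, per token, hence the leading factor of 2.
    """
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bits_per_value / 8


print(compression_ratio(2.5))         # 6.4
print(round(effective_bits(3.8), 2))  # 4.21

# Hypothetical model shape at 128K context (illustrative only).
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, context_len=131072)
for name, bits in [("fp16", 16.0),
                   ("turbo4 (implied)", effective_bits(3.8)),
                   ("turbo2", 2.5)]:
    gib = kv_cache_bytes(**shape, bits_per_value=bits) / 2**30
    print(f"{name}: {gib:.1f} GiB")
# fp16: 32.0 GiB
# turbo4 (implied): 8.4 GiB
# turbo2: 5.0 GiB
```

Under these assumptions a 3.8x ratio corresponds to roughly 4.2 bits per value, consistent with the turbo4 name. The cache grows linearly with context length, which is why long-context runs benefit most from compression.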

Copy-paste prompts

Prompt 1

How do I configure TurboQuant+ to run a 70B model with turbo4 cache compression on an NVIDIA RTX 3080?

Prompt 2

Show me how to enable first-and-last-layer protection in TurboQuant+ to recover output quality when using turbo2 compression.

Prompt 3

Build a TurboQuant+ setup on Apple Silicon Mac to run a large language model within available unified memory.

Prompt 4

Compare turbo2 vs turbo4 cache formats in TurboQuant+ by running the same prompt and measuring the quality difference.

Prompt 5

Install TurboQuant+ prebuilt binaries on Windows and run a model without needing any build tools or CUDA setup.


Verify against the repo before relying on details.