Run a 104 billion parameter AI model at 128K context length on a single MacBook by applying KV cache compression.
Reduce GPU memory usage when running large language models locally on an NVIDIA or AMD card.
Test turbo2, turbo3, and turbo4 cache formats to find the right compression-to-quality tradeoff for a specific model.
Prebuilt binaries are available for Mac and Windows; Linux users need to build from source with the llama.cpp dependencies.
TurboQuant+ is a Python project focused on compressing the memory that AI language models need while they generate text. When a model generates a response, it stores temporary data called a KV cache (short for key-value cache). With large models and long inputs this cache can become very large, which limits how much text the model can process at once. TurboQuant+ applies a compression technique from a 2026 Google research paper to shrink that cache by 3.8 to 6.4 times, so the same model fits into less memory with only a small quality penalty.
The project builds on top of llama.cpp, a widely used tool for running AI models on ordinary hardware. It adds new cache formats called turbo2, turbo3, and turbo4, named after the approximate number of bits used per value. The highest-compression format, turbo2, uses only 2.5 bits per value and achieves a 6.4x reduction in cache size. The turbo4 format gets 3.8x compression with almost no measurable quality loss compared to the standard 8-bit format.
Three findings stand out from the team's experiments. First, compressing the value side of the cache down to 2 bits has no detectable effect on output quality as long as the key side stays at higher precision. Second, all quality degradation traces back to compressing the key cache, not the value cache. Third, keeping the first and last two transformer layers at higher precision recovers a large share of the quality lost to compression, usually between 37 and 91 percent.
The project has been tested on Apple Silicon Macs, NVIDIA cards ranging from the RTX 3080 Ti to the RTX 5090, and AMD cards. It supports running models as large as 104 billion parameters at 128K context length on a single MacBook. Prebuilt binaries for Mac and Windows can be downloaded and run without any build tools.
The full README is longer than what was shown.
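The compression figures above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is not code from the repo. It assumes the 6.4x figure is measured against a 16-bit cache (which is consistent with 16 / 2.5 = 6.4), and the model dimensions in `ModelShape`, the 8-bit "high precision" setting, and the reading of "first and last two layers" as two layers at each end are all hypothetical choices made for illustration. It ignores any per-block metadata a real format might store.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical attention dimensions (illustrative, not from the repo)."""
    n_layers: int = 64
    n_kv_heads: int = 8
    head_dim: int = 128


def compression_ratio(bits_per_value: float, baseline_bits: float = 16.0) -> float:
    """Size reduction relative to an assumed 16-bit baseline cache."""
    return baseline_bits / bits_per_value


def kv_cache_bytes(shape: ModelShape, context_len: int,
                   key_bits: float, value_bits: float) -> float:
    """Bytes for keys plus values across all layers at uniform precision."""
    values_per_side = (shape.n_layers * shape.n_kv_heads
                       * shape.head_dim * context_len)
    return values_per_side * (key_bits + value_bits) / 8


def mean_bits_with_protected_layers(n_layers: int, low_bits: float,
                                    high_bits: float,
                                    protected_per_end: int = 2) -> float:
    """Average bits per value when the outermost layers stay at high precision."""
    protected = min(n_layers, 2 * protected_per_end)
    total = protected * high_bits + (n_layers - protected) * low_bits
    return total / n_layers


shape = ModelShape()
context = 128 * 1024  # 128K tokens

baseline = kv_cache_bytes(shape, context, 16, 16)
turbo2_like = kv_cache_bytes(shape, context, 2.5, 2.5)
asymmetric = kv_cache_bytes(shape, context, 8, 2.5)  # keys high, values low

print(f"16-bit baseline : {baseline / GIB:.1f} GiB")
print(f"2.5 bits K and V: {turbo2_like / GIB:.1f} GiB "
      f"({baseline / turbo2_like:.1f}x smaller)")
print(f"8-bit K, 2.5 V  : {asymmetric / GIB:.1f} GiB "
      f"({baseline / asymmetric:.1f}x smaller)")
# Bits per value implied by a 3.8x ratio under the same 16-bit assumption.
print(f"bits implied by 3.8x: {16 / 3.8:.2f}")
print("mean bits with 4 protected layers: "
      f"{mean_bits_with_protected_layers(shape.n_layers, 2.5, 8.0):.2f}")
```

For these made-up dimensions the 16-bit cache comes to 32 GiB at 128K context and the 2.5-bit cache to 5 GiB, which is the 6.4x ratio quoted above. The last line shows why protecting a few outer layers is cheap: on a 64-layer model it raises the average from 2.5 to roughly 2.8 bits per value.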
Verify against the repo before relying on details.