explaingit

hikarioyama/vllm-nvfp4-kv-sm120

12 · Python · Audience · ops devops · Complexity · 5/5 · Setup · hard

TLDR

A patch for vLLM that enables NVFP4 KV cache compression on Blackwell-architecture GPUs like the RTX PRO 6000, fitting roughly 78 percent more context into the same GPU memory.

Mindmap

mindmap
  root((vllm nvfp4 patch))
    What it does
      NVFP4 KV cache
      Blackwell GPU support
      Memory efficiency
    Results
      1.78x more tokens
      Same decoding speed
      Two-GPU validated
    Installation
      Docker container
      Shell script patch
      Version-pinned deps
    Testing
      Included test scripts
      Verify before prod


Things people build with this

USE CASE 1

Fit roughly 78 percent more context into GPU memory on an RTX PRO 6000 without buying additional hardware.

USE CASE 2

Deploy the patch via Docker to run vLLM with NVFP4 KV cache on a Blackwell GPU setup.

USE CASE 3

Verify the patch works on your specific model shape using the included test scripts before enabling it in production.

Tech stack

Python · vLLM · FlashInfer · Docker · CUDA

Getting it running

Difficulty · hard Time to first run · 1h+

Requires NVIDIA Blackwell GPU, version-pinned vLLM and FlashInfer releases, and Docker or manual file patching.
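The requirements above can be checked before patching. The following is a minimal pre-flight sketch, not part of the repository. It assumes that "sm120" in the repo name refers to CUDA compute capability 12.0 and that FlashInfer is installed under the distribution name `flashinfer-python`. The version strings are hypothetical placeholders and should be replaced with the pins the repository actually names.

```python
from importlib import metadata

# Assumption: "sm120" in the repo name means CUDA compute capability 12.0.
REQUIRED_CAPABILITY = (12, 0)


def is_sm120(capability):
    """Return True when a (major, minor) compute capability equals 12.0."""
    return tuple(capability) == REQUIRED_CAPABILITY


def check_pins(pins, lookup=metadata.version):
    """Compare installed package versions against pinned ones.

    Returns {package: (wanted, found)} for every mismatch; found is None
    when the package is not installed at all.
    """
    mismatches = {}
    for package, wanted in pins.items():
        try:
            found = lookup(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical placeholder pins, NOT the repository's real ones.
    pins = {"vllm": "0.0.0", "flashinfer-python": "0.0.0"}
    print("version mismatches:", check_pins(pins))
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` tuple that `is_sm120` expects. An empty dictionary from `check_pins` means every pinned package matches.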

Apache 2.0 license. Use, modify, and distribute freely, including for commercial purposes, with attribution and license notice kept.

In plain English

This repository is a patch for vLLM, an open-source tool used to run large AI language models at high throughput. The patch unlocks a specific memory-saving feature on a particular class of NVIDIA GPU, the RTX PRO 6000 based on NVIDIA's Blackwell architecture.

The feature in question is called NVFP4 KV cache. When a language model processes text, it stores intermediate data called the KV cache, which grows with the length of the conversation or document being processed. Storing that cache in a highly compressed format (NVFP4) means you can fit more of it in GPU memory, which allows longer conversations or more simultaneous users.

The issue is that vLLM's built-in support for NVFP4 KV cache depends on pre-compiled GPU code that NVIDIA has not released for this GPU family. The patch works around that by routing the NVFP4 computation through a different, more general code path that does work on these GPUs.

The practical result, measured by the author on a two-GPU setup running a 198-billion-parameter model: the patch fits about 1.78 times as many tokens in the KV cache compared to the previous best option, at roughly the same decoding speed (within a few percent). That means about 78 percent more context length capacity, or more concurrent users, without buying more hardware.

The patch modifies four files in vLLM and FlashInfer (a GPU attention library). It is version-pinned to specific releases of both libraries. Installation is either through Docker, by building a patched container image, or by running a shell script that copies the modified files over the installed versions. The repository includes test scripts so you can verify the patch works correctly on your specific model shape before using it in production.

This is a low-level engineering patch intended for people already running vLLM on Blackwell GPUs and looking to increase how much context their deployment can hold. It is Apache-2.0 licensed.
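The 1.78x figure is roughly what simple storage arithmetic would predict. The sketch below is a back-of-the-envelope estimate under two assumptions that the text above does not state: that the "previous best option" is an 8-bit (FP8) KV cache, and that NVFP4 stores 4-bit values plus one 8-bit scale factor per 16-element block. It ignores per-tensor scales and allocator overhead, so it is an estimate rather than a reproduction of the author's measurement.

```python
from fractions import Fraction


def bits_per_element(value_bits, scale_bits=0, block_size=1):
    """Effective storage bits per cached element, including per-block scale overhead."""
    return Fraction(value_bits) + Fraction(scale_bits, block_size)


def capacity_ratio(baseline_bits, compressed_bits):
    """How many times more elements fit in the same memory budget."""
    return Fraction(baseline_bits) / Fraction(compressed_bits)


# Assumption: 4-bit values with one 8-bit scale per 16 elements (4.5 bits each).
NVFP4_BITS = bits_per_element(4, scale_bits=8, block_size=16)
# Assumption: the baseline is a plain 8-bit (FP8) KV cache.
FP8_BITS = bits_per_element(8)

ratio = capacity_ratio(FP8_BITS, NVFP4_BITS)  # exactly 16/9
print(round(float(ratio), 2))                 # 1.78
print(round((float(ratio) - 1) * 100))        # 78 (percent more tokens)
```

Under these assumptions the ratio is 16/9, which rounds to the 1.78x and 78 percent quoted above.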

Copy-paste prompts

Prompt 1
Walk me through applying the vllm-nvfp4-kv-sm120 patch to a two-GPU RTX PRO 6000 setup running a 198B parameter model via Docker.
Prompt 2
I have vLLM installed on a Blackwell GPU. Help me run the vllm-nvfp4-kv-sm120 shell script to patch the installed files and then verify with the included tests.
Prompt 3
Compare KV cache token capacity before and after applying vllm-nvfp4-kv-sm120 on my Blackwell GPU and show me how to measure the throughput difference.
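Prompt 2 refers to a shell script that copies the patched files over the installed ones. As a rough illustration of that overlay-with-backup idea, here is a Python sketch. It is not the repository's script: the function name, the `.orig` backup suffix, and the directory layout are all hypothetical, and the real file list comes from the repository.

```python
import shutil
from pathlib import Path


def overlay_files(patched_root, installed_root, relative_paths, backup_suffix=".orig"):
    """Copy patched files over installed ones, keeping a one-time backup of each original.

    All paths are validated first so a missing file aborts before anything is changed.
    """
    patched_root, installed_root = Path(patched_root), Path(installed_root)
    pairs = []
    for rel in relative_paths:
        source, target = patched_root / rel, installed_root / rel
        if not source.is_file():
            raise FileNotFoundError(f"patched file missing: {source}")
        if not target.is_file():
            # A missing target usually means the installed version differs from the pin.
            raise FileNotFoundError(f"installed file missing: {target}")
        pairs.append((source, target))

    replaced = []
    for source, target in pairs:
        backup = target.with_name(target.name + backup_suffix)
        if not backup.exists():  # never overwrite the pristine copy on a re-run
            shutil.copy2(target, backup)
        shutil.copy2(source, target)
        replaced.append(target)
    return replaced
```

Keeping the original next to the patched file makes it possible to roll back by copying each `.orig` file over its patched counterpart.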

Verify against the repo before relying on details.