explaingit

bit-incarnas/nvfp4-mtp-conversions

0 Audience · researcher Complexity · 5/5 Active License Setup · hard

TLDR

Methodology repo for converting open-weight LLMs into NVFP4 GGUF format with the Multi-Token Prediction head intact, targeting NVIDIA Blackwell GPUs and llama.cpp.

Mindmap

mindmap
  root((nvfp4-mtp-conversions))
    Inputs
      Open-weight models
      llama.cpp patches
      Benchmark scripts
    Outputs
      NVFP4 GGUF method
      Benchmark numbers
      Report PDF
    Use Cases
      Reproduce conversion
      Bench MTP speedup
      Cite methodology
    Tech Stack
      llama.cpp
      CUDA
      NVFP4
      Python

Things people build with this

USE CASE 1

Reproduce the NVFP4 GGUF conversion of Qwen3.5-122B-A10B with the MTP head preserved.

USE CASE 2

Benchmark self-speculative decoding throughput on a Blackwell card using the included scripts.

USE CASE 3

Apply the patches folder to a local llama.cpp checkout to fix the Qwen3.5 MTP path bug.

USE CASE 4

Cite the BibTeX entry in a paper on four-bit quantization or multi-token prediction.
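Use case 3 can be sketched as a small dry-run-then-apply helper. This is a minimal sketch, not the repo's own tooling: the `*.patch` file extension, the sorted application order, and both directory arguments are assumptions for illustration. It relies only on `git apply` and `git apply --check`, which are standard git commands.

```python
import subprocess
from pathlib import Path
from typing import List


def patch_commands(patch_dir: Path, check_only: bool = True) -> List[List[str]]:
    """Build one `git apply` command per *.patch file, in sorted filename order.

    Assumption: patches are stored as *.patch files; adjust the glob if the
    repo uses a different naming scheme.
    """
    commands = []
    for patch in sorted(patch_dir.glob("*.patch")):
        cmd = ["git", "apply"]
        if check_only:
            cmd.append("--check")  # verify the patch applies without touching files
        cmd.append(str(patch.resolve()))
        commands.append(cmd)
    return commands


def apply_patches(patch_dir: Path, llama_cpp_dir: Path) -> None:
    """Dry-run every patch first, then apply them inside the llama.cpp checkout."""
    for check_only in (True, False):
        for cmd in patch_commands(patch_dir, check_only):
            # check=True raises CalledProcessError if a patch does not apply.
            subprocess.run(cmd, cwd=llama_cpp_dir, check=True)
```

Running the `--check` pass over all patches before applying any of them means a checkout that has drifted from the expected llama.cpp revision fails early, before any file is modified.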

Tech stack

llama.cpp · CUDA · Python · GGUF

Getting it running

Difficulty · hard Time to first run · 1 day+

Requires a Blackwell-class NVIDIA GPU, a custom llama.cpp build with the included patches, and 96 GB of VRAM for the headline Qwen3.5 model.

MIT license. You can use, modify, and redistribute the code and report for almost any purpose as long as you keep the copyright notice.

In plain English

This repository documents how a person who publishes under the name Incarnas converts large open-weight language models into a particular packaged format. The format is called NVFP4 GGUF, and it is designed to run efficiently on NVIDIA Blackwell-class GPUs. Rather than holding the converted model files themselves, the repo holds the methodology: a written report, the local code patches needed to do the conversion, the benchmark scripts, the raw benchmark numbers, and the plots derived from those numbers.
Two technical ideas come up repeatedly in the README. NVFP4 is a four-bit numerical format that NVIDIA's newer hardware can compute on directly, which the project enables through a flag called BLACKWELL_NATIVE_FP4. MTP, or Multi-Token Prediction, is a head on top of the model that can predict several tokens at once, used here to speed up generation through what is called self-speculative decoding inside the llama.cpp runtime. The project keeps the MTP head intact during conversion so that this speed-up still works.
The actual model files live separately on Hugging Face under the Incarnas account, and each release links back to this repo for the methodology. The first release listed is Qwen3.5-122B-A10B-NVFP4-MTP-GGUF, dated 2026-05-16, which the README says delivers a 46 percent improvement in long-decode throughput against the same model without MTP on a Blackwell Pro card with 96 gigabytes of memory.
The patches folder holds local edits to llama.cpp that the methodology depends on. One patch fixes a bug specific to the Qwen3.5 MTP path. The README notes that this patch was later merged upstream and will be retired once the next release is built against a post-fix version of llama.cpp. The report itself lives at paper/REPORT.md with a PDF rendering alongside.
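The self-speculative decoding idea mentioned above can be illustrated with a toy draft-and-verify loop. This is a minimal sketch under loud assumptions: `main_model` and `draft_head` are hypothetical stand-ins invented for illustration (simple arithmetic functions, not neural networks), the loop is the greedy variant of speculative decoding, and none of it reflects how llama.cpp implements the MTP path. It only shows why a cheap draft that is usually right lets the full model emit several tokens per verification pass without changing the output.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def main_model(ctx: List[int]) -> int:
    """Toy 'full model' (hypothetical): deterministic next token from the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_head(ctx: List[int]) -> int:
    """Toy 'MTP head' (hypothetical): agrees with main_model unless the
    context length is a multiple of 4, where it is deliberately wrong."""
    tok = main_model(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. Draft k tokens cheaply.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Verify. A real runtime scores all drafted positions in one batched
        #    forward pass; this toy loops but counts it as a single pass.
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                out.append(expected)  # keep the target's token, discard the rest
                break
            out.append(tok)
            ctx.append(tok)
        else:
            out.append(target(ctx))  # every draft accepted: one bonus token
    return out[:goal], passes
```

Calling `speculative_decode(main_model, draft_head, [1, 2, 3], 20)` returns the same tokens as `greedy_decode(main_model, [1, 2, 3], 20)` while using fewer verification passes than generated tokens. The gain depends entirely on how often the draft is accepted, which is why dropping or damaging the MTP head during conversion would remove the speed-up.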
The benchmark rig is documented in detail: an RTX PRO 6000 Blackwell Max-Q workstation card with 96 gigabytes of GDDR7, a 24-core Threadripper, 128 gigabytes of DDR5 ECC memory, and a Linux kernel with a current NVIDIA driver. The bench scripts call the llama-server HTTP endpoint directly without depending on vendor harnesses. The repository is released under the MIT license, and a BibTeX citation block is included for academic reference.
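A benchmark that talks to llama-server over HTTP could look roughly like the following. This is a hedged sketch, not the repo's bench scripts: the `/completion` path, the `n_predict` request field, and the `timings.predicted_n` and `timings.predicted_ms` response fields are assumptions about the llama-server API that should be checked against the llama.cpp version in use, and `run_once` is a hypothetical helper name.

```python
import json
import urllib.request


def decode_tokens_per_second(timings: dict) -> float:
    """Decode throughput from a timings dict.

    Assumes `predicted_n` is the number of generated tokens and
    `predicted_ms` is the generation time in milliseconds.
    """
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def percent_improvement(baseline_tps: float, candidate_tps: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


def run_once(base_url: str, prompt: str, n_predict: int = 512) -> float:
    """POST one completion request and return decode tokens/s.

    Requires a running llama-server at base_url (for example a local instance).
    """
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return decode_tokens_per_second(payload["timings"])
```

To compare runs, one would start the server once with MTP disabled and once with it enabled, call `run_once` with the same prompt against each, and pass the two results to `percent_improvement`. A 46 percent figure corresponds to the MTP run producing 1.46 times the tokens per second of the baseline.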

Copy-paste prompts

Prompt 1
Walk me through reproducing the NVFP4 MTP conversion of Qwen3.5-122B using bit-incarnas/nvfp4-mtp-conversions on an RTX PRO 6000.
Prompt 2
Explain how the BLACKWELL_NATIVE_FP4 flag interacts with the Multi-Token Prediction head inside llama.cpp.
Prompt 3
Run the benchmark scripts in this repo against llama-server and compare long-decode throughput with and without MTP.
Prompt 4
Apply the Qwen3.5 MTP patch from the patches folder to a current llama.cpp checkout and explain what it fixes.
Prompt 5
Sketch the math behind self-speculative decoding and why keeping the MTP head intact preserves the 46% speedup.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.