bit-incarnas/nvfp4-mtp-conversions

Analysis updated 2026-06-24

★ 0Audience · researcherComplexity · 5/5LicenseSetup · hard

Mindmap

mindmap
  root((nvfp4-mtp-conversions))
    Inputs
      Open-weight models
      llama.cpp patches
      Benchmark scripts
    Outputs
      NVFP4 GGUF method
      Benchmark numbers
      Report PDF
    Use Cases
      Reproduce conversion
      Bench MTP speedup
      Cite methodology
    Tech Stack
      llama.cpp
      CUDA
      NVFP4
      Python

mindmap root((nvfp4-mtp-conversions)) Inputs Open-weight models llama.cpp patches Benchmark scripts Outputs NVFP4 GGUF method Benchmark numbers Report PDF Use Cases Reproduce conversion Bench MTP speedup Cite methodology Tech Stack llama.cpp CUDA NVFP4 Python

Click or tap to explore — scroll the page freely

What do people build with it?

USE CASE 1

Reproduce the NVFP4 GGUF conversion of Qwen3.5-122B-A10B with the MTP head preserved.

USE CASE 2

Benchmark self-speculative decoding throughput on a Blackwell card using the included scripts.

USE CASE 3

Apply the patches folder to a local llama.cpp checkout to fix the Qwen3.5 MTP path bug.

USE CASE 4

Cite the BibTeX entry in a paper on four-bit quantization or multi-token prediction.

What is it built with?

llama.cppCUDAPythonGGUF

How does it compare?

	bit-incarnas/nvfp4-mtp-conversions	0xhassaan/nn-from-scratch	0xzgbot/hermes-comfyui-skills
Stars	0	0	0
Language	—	Python	—
Setup difficulty	hard	moderate	easy
Complexity	5/5	4/5	1/5
Audience	researcher	developer	designer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard Time to first run · 1day+

Requires a Blackwell-class NVIDIA GPU, custom llama.cpp build with the included patches, and 96GB VRAM for the headline Qwen3.5 model.

MIT license. You can use, modify, and redistribute the code and report for almost any purpose as long as you keep the copyright notice.

In plain English

This repository documents how a person who publishes under the name Incarnas converts large open-weight language models into a particular packaged format. The format is called NVFP4 GGUF, and it is designed to run efficiently on NVIDIA Blackwell-class GPUs. Rather than holding the converted model files themselves, the repo holds the methodology: a written report, the local code patches needed to do the conversion, the benchmark scripts, the raw benchmark numbers, and the plots derived from those numbers. Two technical ideas come up repeatedly in the README. NVFP4 is a four-bit numerical format that NVIDIA's newer hardware can compute on directly, which the project enables through a flag called BLACKWELL_NATIVE_FP4. MTP, or Multi-Token Prediction, is a head on top of the model that can predict several tokens at once, used here to speed up generation through what is called self-speculative decoding inside the llama.cpp runtime. The project keeps the MTP head intact during conversion so that this speed-up still works. The actual model files live separately on Hugging Face under the Incarnas account, and each release links back to this repo for the methodology. The first release listed is Qwen3.5-122B-A10B-NVFP4-MTP-GGUF, dated 2026-05-16, which the README says delivers a 46 percent improvement in long-decode throughput against the same model without MTP on a Blackwell Pro card with 96 gigabytes of memory. The patches folder holds local edits to llama.cpp that the methodology depends on. One patch fixes a bug specific to the Qwen3.5 MTP path. The README notes that this patch was later merged upstream and will be retired once the next release is built against a post-fix version of llama.cpp. The report itself lives at paper/REPORT.md with a PDF rendering alongside. The benchmark rig is documented in detail: an RTX PRO 6000 Blackwell Max-Q workstation card with 96 gigabytes of GDDR7, a 24-core Threadripper, 128 gigabytes of DDR5 ECC memory, and a Linux kernel with a current NVIDIA driver. The bench scripts call the llama-server HTTP endpoint directly without depending on vendor harnesses. The repository is released under the MIT license, and a BibTeX citation block is included for academic reference.

Copy-paste prompts

Prompt 1

Walk me through reproducing the NVFP4 MTP conversion of Qwen3.5-122B using bit-incarnas/nvfp4-mtp-conversions on an RTX PRO 6000.

Prompt 2

Explain how the BLACKWELL_NATIVE_FP4 flag interacts with the Multi-Token Prediction head inside llama.cpp.

Prompt 3

Run the benchmark scripts in this repo against llama-server and compare long-decode throughput with and without MTP.

Prompt 4

Apply the Qwen3.5 MTP patch from the patches folder to a current llama.cpp checkout and explain what it fixes.

Prompt 5

Sketch the math behind self-speculative decoding and why keeping the MTP head intact preserves the 46% speedup.

Frequently asked questions

What is nvfp4-mtp-conversions?

Methodology repo for converting open-weight LLMs into NVFP4 GGUF format with the Multi-Token Prediction head intact, targeting NVIDIA Blackwell GPUs and llama.cpp.

What license does nvfp4-mtp-conversions use?

MIT license. You can use, modify, and redistribute the code and report for almost any purpose as long as you keep the copyright notice.

How hard is nvfp4-mtp-conversions to set up?

Setup difficulty is rated hard, with roughly 1day+ to a first successful run.

Who is nvfp4-mtp-conversions for?

Mainly researcher.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Verify against the repo before relying on details.