Reproduce the NVFP4 GGUF conversion of Qwen3.5-122B-A10B with the MTP head preserved.
Benchmark self-speculative decoding throughput on a Blackwell card using the included scripts.
Apply the patches in the patches folder to a local llama.cpp checkout to fix the Qwen3.5 MTP path bug.
Cite the BibTeX entry in a paper on four-bit quantization or multi-token prediction.
Requires a Blackwell-class NVIDIA GPU, a custom llama.cpp build with the included patches, and 96 GB of VRAM for the headline Qwen3.5 model.
This repository documents how a person who publishes under the name Incarnas converts large open-weight language models into a particular packaged format. The format is called NVFP4 GGUF, and it is designed to run efficiently on NVIDIA Blackwell-class GPUs. Rather than holding the converted model files themselves, the repo holds the methodology: a written report, the local code patches needed to do the conversion, the benchmark scripts, the raw benchmark numbers, and the plots derived from those numbers.
Two technical ideas come up repeatedly in the README. NVFP4 is a four-bit numerical format that NVIDIA's newer hardware can compute on directly, which the project enables through a flag called BLACKWELL_NATIVE_FP4. MTP, or Multi-Token Prediction, is a head on top of the model that can predict several tokens at once, used here to speed up generation through what is called self-speculative decoding inside the llama.cpp runtime. The project keeps the MTP head intact during conversion so that this speed-up still works.
The actual model files live separately on Hugging Face under the Incarnas account, and each release links back to this repo for the methodology. The first release listed is Qwen3.5-122B-A10B-NVFP4-MTP-GGUF, dated 2026-05-16, which the README says delivers a 46 percent improvement in long-decode throughput against the same model without MTP on a Blackwell Pro card with 96 gigabytes of memory.
The patches folder holds local edits to llama.cpp that the methodology depends on. One patch fixes a bug specific to the Qwen3.5 MTP path. The README notes that this patch was later merged upstream and will be retired once the next release is built against a post-fix version of llama.cpp. The report itself lives at paper/REPORT.md with a PDF rendering alongside.
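To make the four-bit format mentioned above more concrete, here is a minimal sketch of block-scaled FP4 quantization in Python. It is not the repository's conversion code. It assumes the commonly described NVFP4 layout of 4-bit E2M1 elements grouped into blocks of 16 that share one scale, and it simplifies by keeping that scale as an ordinary Python float (the real format stores it in a compact FP8 encoding). The names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Assumed NVFP4 block size: 16 elements share one scale factor.
BLOCK_SIZE = 16


def quantize_block(values):
    """Quantize one block; returns (scale, codes).

    `codes` holds signed E2M1 values. The scale is kept as a plain float
    here, which is a simplification of the real on-disk encoding.
    """
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Round to the nearest representable magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(mag, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and E2M1 codes."""
    return [c * scale for c in codes]
```

A full tensor would be split into chunks of `BLOCK_SIZE` and each chunk passed through `quantize_block`. Values that do not land on a representable magnitude after scaling are rounded, which is where the quantization error comes from.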
The benchmark rig is documented in detail: an RTX PRO 6000 Blackwell Max-Q workstation card with 96 gigabytes of GDDR7, a 24-core Threadripper, 128 gigabytes of DDR5 ECC memory, and a Linux kernel with a current NVIDIA driver. The bench scripts call the llama-server HTTP endpoint directly without depending on vendor harnesses. The repository is released under the MIT license, and a BibTeX citation block is included for academic reference.
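As an illustration of benchmarking against the llama-server HTTP endpoint without a vendor harness, the sketch below posts a prompt and derives decode throughput from the reply. It is not the repository's bench script. It assumes the server exposes the `/completion` route and returns a `timings` object with `predicted_n` and `predicted_ms` fields, which should be checked against the llama.cpp build in use. The function names and the `base_url` parameter are hypothetical.

```python
import json
import urllib.request


def build_request(prompt, n_predict):
    """Build the JSON payload for one completion request."""
    # temperature 0 keeps runs comparable between configurations.
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}


def decode_tokens_per_second(response):
    """Compute decode throughput from the server's reported timings.

    Assumes response["timings"] carries predicted_n (generated tokens)
    and predicted_ms (generation time in milliseconds).
    """
    timings = response["timings"]
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def relative_gain_percent(baseline_tps, candidate_tps):
    """Percentage improvement of candidate throughput over the baseline."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


def run_once(base_url, prompt, n_predict, timeout=600):
    """POST one request to a running llama-server and return the parsed reply."""
    body = json.dumps(build_request(prompt, n_predict)).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A comparison like the one the README reports would run `run_once` against one server started with the MTP head active and one without, pass each reply through `decode_tokens_per_second`, and feed the two figures to `relative_gain_percent`. Averaging several runs per configuration would reduce noise.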
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.