Serve a long-context LLM with NVFP4 prefilling and BF16 decoding for agent workflows
Reproduce the paper benchmarks on math500, aime24, and LongBench-v2
Build the forked vLLM from source and run the disaggregated proxy
Grade LongMemEval runs with an OpenAI-API judge model
You need a recent NVIDIA GPU that supports NVFP4 plus a custom vLLM fork built from source as a Git submodule.
Mix-Quant is research code that comes with a paper from a lab at the National University of Singapore. It targets a specific problem in how large language models are run when they are used as agents. Agentic workflows feed the model very long context windows full of tool output, memory, retrieved documents, and earlier reasoning. Reading that context, called the prefilling phase, is slow. The natural fix is to use lower-precision arithmetic to speed it up, but if you use low precision everywhere the small errors pile up and the answers get worse. The project's idea is to split the two phases of running a model and treat them differently. During prefilling, when the model is just reading the input, Mix-Quant uses a low-precision number format called NVFP4 to go faster. During decoding, when the model is actually writing words one at a time, it switches back to a higher-precision format called BF16 so the output stays stable. The README claims this preserves task performance while accelerating long-context inference. Installation is done in a Python conda environment. The project depends on a modified fork of vLLM, an open source serving system for large language models, which is included as a Git submodule. You can either install vLLM with a pre-compiled wheel or build it from source. After that a requirements.txt file pulls in the rest. Using it follows a disaggregated serving pattern. A shell script launches three processes at once: one server running the model with NVFP4 quantization for prefilling, a second server running the same model in BF16 for decoding, and a small proxy that sits in front of both. You then send normal OpenAI-style chat requests to the proxy, picking which GPU each server runs on with command-line flags. The README also lists evaluation scripts. You can benchmark on reasoning datasets like math500, aime24, aime25, and gsm8k, on LongBench-v2, and on LongMemEval. The LongMemEval flow includes an optional step that uses a separate OpenAI-API judge model to grade the answers automatically.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.