explaingit

haiquanlu/mix-quant

21PythonAudience · researcherComplexity · 5/5ActiveSetup · hard

TLDR

Research code that splits LLM serving into NVFP4 prefilling and BF16 decoding via a forked vLLM and a proxy, aiming to speed long-context agent inference without losing accuracy.

Mindmap

mindmap
  root((Mix-Quant))
    Inputs
      Long context prompts
      OpenAI-style requests
      Model checkpoint
    Outputs
      Generated tokens
      Benchmark scores
      Throughput numbers
    Use Cases
      Speed up agent prefilling
      Run mixed-precision serving
      Benchmark long context
    Tech Stack
      Python
      vLLM
      CUDA
      NVFP4

Things people build with this

USE CASE 1

Serve a long-context LLM with NVFP4 prefilling and BF16 decoding for agent workflows

USE CASE 2

Reproduce the paper benchmarks on math500, aime24, and LongBench-v2

USE CASE 3

Build the forked vLLM from source and run the disaggregated proxy

USE CASE 4

Grade LongMemEval runs with an OpenAI-API judge model

Tech stack

PythonvLLMCUDANVFP4BF16

Getting it running

Difficulty · hard Time to first run · 1day+

You need a recent NVIDIA GPU that supports NVFP4 plus a custom vLLM fork built from source as a Git submodule.

In plain English

Mix-Quant is research code that comes with a paper from a lab at the National University of Singapore. It targets a specific problem in how large language models are run when they are used as agents. Agentic workflows feed the model very long context windows full of tool output, memory, retrieved documents, and earlier reasoning. Reading that context, called the prefilling phase, is slow. The natural fix is to use lower-precision arithmetic to speed it up, but if you use low precision everywhere the small errors pile up and the answers get worse. The project's idea is to split the two phases of running a model and treat them differently. During prefilling, when the model is just reading the input, Mix-Quant uses a low-precision number format called NVFP4 to go faster. During decoding, when the model is actually writing words one at a time, it switches back to a higher-precision format called BF16 so the output stays stable. The README claims this preserves task performance while accelerating long-context inference. Installation is done in a Python conda environment. The project depends on a modified fork of vLLM, an open source serving system for large language models, which is included as a Git submodule. You can either install vLLM with a pre-compiled wheel or build it from source. After that a requirements.txt file pulls in the rest. Using it follows a disaggregated serving pattern. A shell script launches three processes at once: one server running the model with NVFP4 quantization for prefilling, a second server running the same model in BF16 for decoding, and a small proxy that sits in front of both. You then send normal OpenAI-style chat requests to the proxy, picking which GPU each server runs on with command-line flags. The README also lists evaluation scripts. You can benchmark on reasoning datasets like math500, aime24, aime25, and gsm8k, on LongBench-v2, and on LongMemEval. The LongMemEval flow includes an optional step that uses a separate OpenAI-API judge model to grade the answers automatically.

Copy-paste prompts

Prompt 1
Install Mix-Quant in a conda env with the forked vLLM submodule and run the example proxy
Prompt 2
Launch the three-process server with NVFP4 prefilling on GPU 0 and BF16 decoding on GPU 1
Prompt 3
Benchmark Mix-Quant on LongBench-v2 against plain BF16 vLLM and compare throughput
Prompt 4
Swap the proxy with a custom router that picks decoding precision per request length
Prompt 5
Adapt the disaggregated pattern to use INT8 prefilling instead of NVFP4
Open on GitHub → Explain another repo

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.