haiquanlu/mix-quant

Analysis updated 2026-06-24

★ 19PythonAudience · researcherComplexity · 5/5Setup · hard

Mindmap

mindmap
  root((Mix-Quant))
    Inputs
      Long context prompts
      OpenAI-style requests
      Model checkpoint
    Outputs
      Generated tokens
      Benchmark scores
      Throughput numbers
    Use Cases
      Speed up agent prefilling
      Run mixed-precision serving
      Benchmark long context
    Tech Stack
      Python
      vLLM
      CUDA
      NVFP4

mindmap root((Mix-Quant)) Inputs Long context prompts OpenAI-style requests Model checkpoint Outputs Generated tokens Benchmark scores Throughput numbers Use Cases Speed up agent prefilling Run mixed-precision serving Benchmark long context Tech Stack Python vLLM CUDA NVFP4

Click or tap to explore — scroll the page freely

What do people build with it?

USE CASE 1

Serve a long-context LLM with NVFP4 prefilling and BF16 decoding for agent workflows

USE CASE 2

Reproduce the paper benchmarks on math500, aime24, and LongBench-v2

USE CASE 3

Build the forked vLLM from source and run the disaggregated proxy

USE CASE 4

Grade LongMemEval runs with an OpenAI-API judge model

What is it built with?

PythonvLLMCUDANVFP4BF16

How does it compare?

	haiquanlu/mix-quant	16nic/comfyui-agnes-ai	6c696e68/gpt_signup_hybrid
Stars	19	19	19
Language	Python	Python	Python
Setup difficulty	hard	moderate	hard
Complexity	5/5	2/5	4/5
Audience	researcher	vibe coder	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard Time to first run · 1day+

You need a recent NVIDIA GPU that supports NVFP4 plus a custom vLLM fork built from source as a Git submodule.

In plain English

Mix-Quant is research code that comes with a paper from a lab at the National University of Singapore. It targets a specific problem in how large language models are run when they are used as agents. Agentic workflows feed the model very long context windows full of tool output, memory, retrieved documents, and earlier reasoning. Reading that context, called the prefilling phase, is slow. The natural fix is to use lower-precision arithmetic to speed it up, but if you use low precision everywhere the small errors pile up and the answers get worse. The project's idea is to split the two phases of running a model and treat them differently. During prefilling, when the model is just reading the input, Mix-Quant uses a low-precision number format called NVFP4 to go faster. During decoding, when the model is actually writing words one at a time, it switches back to a higher-precision format called BF16 so the output stays stable. The README claims this preserves task performance while accelerating long-context inference. Installation is done in a Python conda environment. The project depends on a modified fork of vLLM, an open source serving system for large language models, which is included as a Git submodule. You can either install vLLM with a pre-compiled wheel or build it from source. After that a requirements.txt file pulls in the rest. Using it follows a disaggregated serving pattern. A shell script launches three processes at once: one server running the model with NVFP4 quantization for prefilling, a second server running the same model in BF16 for decoding, and a small proxy that sits in front of both. You then send normal OpenAI-style chat requests to the proxy, picking which GPU each server runs on with command-line flags. The README also lists evaluation scripts. You can benchmark on reasoning datasets like math500, aime24, aime25, and gsm8k, on LongBench-v2, and on LongMemEval. The LongMemEval flow includes an optional step that uses a separate OpenAI-API judge model to grade the answers automatically.

Copy-paste prompts

Prompt 1

Install Mix-Quant in a conda env with the forked vLLM submodule and run the example proxy

Prompt 2

Launch the three-process server with NVFP4 prefilling on GPU 0 and BF16 decoding on GPU 1

Prompt 3

Benchmark Mix-Quant on LongBench-v2 against plain BF16 vLLM and compare throughput

Prompt 4

Swap the proxy with a custom router that picks decoding precision per request length

Prompt 5

Adapt the disaggregated pattern to use INT8 prefilling instead of NVFP4

Frequently asked questions

What is mix-quant?

Research code that splits LLM serving into NVFP4 prefilling and BF16 decoding via a forked vLLM and a proxy, aiming to speed long-context agent inference without losing accuracy.

What language is mix-quant written in?

Mainly Python. The stack also includes Python, vLLM, CUDA.

How hard is mix-quant to set up?

Setup difficulty is rated hard, with roughly 1day+ to a first successful run.

Who is mix-quant for?

Mainly researcher.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Verify against the repo before relying on details.