Raise serving throughput by up to 2x on a Qwen or Mixtral MoE deployment
Test MoE placement policies without forking vLLM
Compare stock vLLM and Lynx on the same hardware with one env flag
Register a custom MoE policy for an in-house model family
Needs a working multi-GPU vLLM environment and an MoE checkpoint; Triton kernels JIT-compile on first run so the first request is slow.
Lynx is a small Python add-on for vLLM, the popular open-source server for running large language models. It speeds up a specific kind of model called a Mixture of Experts (MoE). In those models, only a few internal experts are activated for each token, and moving work between experts can become a bottleneck. According to the README, Lynx rearranges how those experts are mapped across the hardware and can raise throughput by up to 2x while keeping the same accuracy, on both reasoning and multi-modal MoE models.

The main selling point in the README is how little you have to change to try it. Lynx ships as a vLLM plugin. There is no patching of vLLM source code, no compiling of GPU kernels at install time, and no change to the application that calls the server. The wheel is pure Python, so installing it does not need a CUDA toolkit, a host compiler, or CMake. The fast GPU code is written in Triton and is compiled the first time you run the server.

Installation is two pip commands: install a specific version of vLLM, then install Lynx from its Git repo. To turn the plugin on, set the environment variable VLLM_LYNX_ENABLED to 1, then start vLLM the usual way. The README shows an example that serves a Qwen MoE model with tensor parallelism across two GPUs. You then send requests to the normal vLLM completions endpoint as if Lynx were not there. If the environment variable is not set, the plugin stays out of the way and the server behaves exactly like stock vLLM.

Lynx works with any MoE model that uses vLLM's built-in FusedMoE layer. The package includes policies for several popular model families, including Qwen, Mixtral, DeepSeek, GPT-OSS, and Llama 4. A separate documentation file lists every supported model and explains how to register a new one. There is also an optional variable for pointing at a custom policy file. The project is released under the Apache 2.0 license.
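The enable-and-serve flow described above can be sketched in Python. This is a minimal illustration, not code from the Lynx README: only the VLLM_LYNX_ENABLED variable, the two-GPU tensor-parallel setup, and the standard vLLM completions endpoint come from the description. The model ID, port, and helper names (lynx_env, serve_command, completion_request) are assumptions chosen for the example.

```python
import json
import os
import urllib.request

# Assumption: an illustrative MoE checkpoint ID; substitute the model you serve.
MODEL = "Qwen/Qwen1.5-MoE-A2.7B"
# Assumption: vLLM's default local address; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def lynx_env(enabled, base_env=None):
    """Return a copy of the environment with VLLM_LYNX_ENABLED set or removed."""
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env["VLLM_LYNX_ENABLED"] = "1"
    else:
        # Without the variable the server should behave like stock vLLM.
        env.pop("VLLM_LYNX_ENABLED", None)
    return env


def serve_command(model, tensor_parallel_size=2):
    """Build the usual `vllm serve` command line; Lynx adds no flags of its own here."""
    return ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]


def completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for vLLM's OpenAI-compatible completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To compare the two configurations on the same hardware, one would start the server with subprocess.Popen(serve_command(MODEL), env=lynx_env(True)), send the same prompts with urllib.request.urlopen(completion_request(BASE_URL, MODEL, "Hello")), then repeat with lynx_env(False). The client request is identical in both runs, which matches the claim that the calling application does not change.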
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.