explaingit

gt-nexs/lynx

9 | Python | Audience · researcher | Complexity · 4/5 | Active | License | Setup · hard

TLDR

vLLM plugin that remaps Mixture of Experts placement to lift MoE throughput up to 2x without source patches or CUDA recompilation.

Mindmap

mindmap
  root((lynx))
    Inputs
      MoE model
      vLLM server
      Env flag
    Outputs
      Faster tokens per second
      Same accuracy
    Use Cases
      Serve Qwen MoE faster
      Serve Mixtral DeepSeek faster
      Drop-in vLLM speedup
    Tech Stack
      Python
      Triton
      vLLM
      CUDA

Things people build with this

USE CASE 1

Double serving throughput on a Qwen or Mixtral MoE deployment

USE CASE 2

Test MoE placement policies without forking vLLM

USE CASE 3

Compare stock vLLM and Lynx on the same hardware with one env flag

USE CASE 4

Register a custom MoE policy for an in-house model family
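For the comparison use case, the arithmetic behind a "stock vs Lynx" result can be sketched as below. This is only an illustration: the `Run` class, the `speedup` helper, and the numbers in the demo are hypothetical and are not measurements or code from the Lynx repo.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run: how many tokens were generated and how long it took."""
    label: str
    generated_tokens: int
    wall_seconds: float

    @property
    def tokens_per_second(self) -> float:
        return self.generated_tokens / self.wall_seconds


def speedup(baseline: Run, candidate: Run) -> float:
    """Ratio of candidate throughput to baseline throughput (above 1.0 means faster)."""
    return candidate.tokens_per_second / baseline.tokens_per_second


if __name__ == "__main__":
    # Placeholder numbers chosen only to show the arithmetic; they are NOT results.
    stock = Run("stock vLLM", generated_tokens=12_000, wall_seconds=60.0)
    lynx = Run("VLLM_LYNX_ENABLED=1", generated_tokens=18_000, wall_seconds=60.0)
    print(f"{lynx.label} vs {stock.label}: {speedup(stock, lynx):.2f}x")
```

For a fair comparison, both runs should use the same checkpoint, hardware, prompts, and generation settings, so that the environment flag is the only difference.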

Tech stack

Python, Triton, vLLM, CUDA

Getting it running

Difficulty · hard | Time to first run · 1h+

Needs a working multi-GPU vLLM environment and an MoE checkpoint; Triton kernels JIT-compile on first run so the first request is slow.
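Because the first request pays the one-time JIT compilation cost, it helps to time it separately from later requests. A minimal sketch, assuming you supply your own `send_request` callable; the helper name `time_with_warmup` and the fake request in the demo are hypothetical and not part of Lynx or vLLM.

```python
import time
from typing import Callable, List, Tuple


def time_with_warmup(
    send_request: Callable[[], object], steady_runs: int = 5
) -> Tuple[float, List[float]]:
    """Time the first call on its own, then time `steady_runs` further calls."""
    start = time.perf_counter()
    send_request()
    first = time.perf_counter() - start

    steady = []
    for _ in range(steady_runs):
        start = time.perf_counter()
        send_request()
        steady.append(time.perf_counter() - start)
    return first, steady


if __name__ == "__main__":
    state = {"first": True}

    def fake_request():
        # Stand-in for an HTTP call; sleeps longer once to mimic one-time compilation.
        time.sleep(0.05 if state["first"] else 0.005)
        state["first"] = False

    first, steady = time_with_warmup(fake_request, steady_runs=3)
    print(f"first request: {first:.3f}s")
    print(f"steady-state mean: {sum(steady) / len(steady):.3f}s")
```

Reporting the first-request time and the steady-state times separately keeps the compilation cost from skewing a throughput comparison.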

Apache 2.0 lets anyone use, modify, and ship the code commercially as long as the license and notices stay.

In plain English

Lynx is a small Python add-on for vLLM, the popular open-source server that runs large language models. Its job is to speed up a specific kind of model called a Mixture of Experts, or MoE. In those models, only a few internal experts are used for each token, and shuffling work between experts can become a bottleneck. According to the README, Lynx rearranges how those experts are mapped across the hardware and can raise throughput by up to 2x while keeping the same accuracy, on both reasoning and multi-modal MoE models.

The big selling point in the README is how little you have to change to try it. Lynx ships as a vLLM plugin. There is no patching of vLLM source code, no recompiling of GPU kernels at install time, and no changes to the application that calls the server. The Python wheel is pure Python, so installing it does not need a CUDA toolkit, a host compiler, or CMake. The fast GPU code is written in Triton and is compiled the first time you run the server.

Installation is two pip commands: install a specific version of vLLM, then install Lynx from its Git repo. To turn the plugin on, you set one environment variable named VLLM_LYNX_ENABLED to 1, then start vLLM the usual way. The README shows an example serving a Qwen MoE model with tensor parallelism across two GPUs. You then send requests to the normal vLLM completions endpoint as if Lynx were not there. If the environment variable is not set, the plugin stays out of the way and the server behaves exactly like stock vLLM.

Lynx works with any MoE model that uses vLLM's built-in FusedMoE layer. The package comes with built-in policies for several popular model families, including Qwen, Mixtral, DeepSeek, GPT-OSS, and Llama 4. A separate documentation file lists every supported model and explains how to register a new one. There is also an optional variable for pointing at a custom policy file. The project is released under the Apache 2.0 license.
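The enable-and-serve flow described above can be sketched in Python. This is an illustration under stated assumptions, not the README's own example: the model name is a placeholder, the helper names `build_launch` and `build_completion_request` are hypothetical, and the URL assumes vLLM's default port 8000 with its OpenAI-compatible `/v1/completions` route. Only `VLLM_LYNX_ENABLED` and the two-GPU tensor-parallel setup come from the description above.

```python
import json
import os
import urllib.request

MODEL = "your-org/your-moe-model"  # placeholder; substitute a real MoE checkpoint


def build_launch(model, tensor_parallel=2, lynx_enabled=True, base_env=None):
    """Return (argv, env) for starting vLLM with the Lynx flag set or cleared."""
    env = dict(os.environ if base_env is None else base_env)
    if lynx_enabled:
        env["VLLM_LYNX_ENABLED"] = "1"
    else:
        # With the variable unset, the server is described as behaving like stock vLLM.
        env.pop("VLLM_LYNX_ENABLED", None)
    argv = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel)]
    return argv, env


def build_completion_request(prompt, model=MODEL,
                             base_url="http://localhost:8000", max_tokens=64):
    """Build a POST to the completions endpoint; the client code is the same either way."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    argv, env = build_launch(MODEL)
    print("VLLM_LYNX_ENABLED=" + env["VLLM_LYNX_ENABLED"], " ".join(argv))
    # To start the server for real: subprocess.Popen(argv, env=env)
    # Once it is up: urllib.request.urlopen(build_completion_request("Hello"))
```

Calling `build_launch` with `lynx_enabled=False` yields the same command with the variable removed, which is the one-flag switch between stock vLLM and Lynx.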

Copy-paste prompts

Prompt 1
Install Lynx on top of my vLLM and serve a Qwen MoE model across 2 GPUs
Prompt 2
Benchmark requests per second on stock vLLM vs Lynx for the same Mixtral checkpoint
Prompt 3
Walk me through registering a new MoE model family with Lynx
Prompt 4
Explain what VLLM_LYNX_ENABLED actually changes inside the FusedMoE layer
Prompt 5
Write a docker-compose that boots vLLM with Lynx and a Llama 4 MoE checkpoint

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.