gt-nexs/lynx

Analysis updated 2026-06-24

★ 8 | Python | Audience · researcher | Complexity · 4/5 | License · Apache 2.0 | Setup · hard

Mindmap

mindmap
  root((lynx))
    Inputs
      MoE model
      vLLM server
      Env flag
    Outputs
      Faster tokens per second
      Same accuracy
    Use Cases
      Serve Qwen MoE faster
      Serve Mixtral and DeepSeek faster
      Drop-in vLLM speedup
    Tech Stack
      Python
      Triton
      vLLM
      CUDA


What do people build with it?

USE CASE 1

Raise serving throughput by up to 2x on a Qwen or Mixtral MoE deployment

USE CASE 2

Test MoE placement policies without forking vLLM

USE CASE 3

Compare stock vLLM and Lynx on the same hardware with one env flag

What is it built with?

Python, Triton, vLLM, CUDA

How does it compare?

	gt-nexs/lynx	adam-s/car-diagnosis	bongobongo2020/krea2-character-lora-trainer
Stars	8	8	8
Language	Python	Python	Python
Setup difficulty	hard	moderate	moderate
Complexity	4/5	3/5	3/5
Audience	researcher	researcher	vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1h+

Needs a working multi-GPU vLLM environment and an MoE checkpoint. Triton kernels JIT-compile on first run, so the first request is slow.
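Because the first request pays the Triton JIT cost, a throughput comparison is easier to read when warm-up requests are left out. The sketch below shows one way to do that. The function names and the `(tokens, seconds)` sample format are assumptions made for illustration, not part of Lynx or vLLM, and the calculation assumes requests are sent one at a time.

```python
def steady_state_throughput(samples, warmup=1):
    """Return tokens per second, ignoring the first `warmup` requests.

    `samples` is a list of (generated_tokens, elapsed_seconds) tuples,
    one per request, in the order the requests were sent (hypothetical format).
    """
    if warmup < 0:
        raise ValueError("warmup must be non-negative")
    steady = samples[warmup:]  # drop requests that include JIT compile time
    if not steady:
        raise ValueError("need at least one sample after warm-up")
    total_tokens = sum(tokens for tokens, _ in steady)
    total_seconds = sum(seconds for _, seconds in steady)
    if total_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / total_seconds


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps
```

With made-up timings such as `[(100, 30.0), (100, 2.0), (100, 2.0)]` and `warmup=1`, `steady_state_throughput` returns 50.0 tokens per second. Running the same measurement once with the plugin off and once with it on gives the two inputs for `speedup`.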

Apache 2.0 lets anyone use, modify, and ship the code commercially as long as the license and notices stay.

In plain English

Lynx is a small Python add-on for vLLM, the popular open-source server that runs large language models. Its job is to speed up a specific kind of model called a Mixture of Experts, or MoE. In those models, only a few internal experts are activated for each token, and shuffling work between experts can become a bottleneck. According to the README, Lynx rearranges how those experts are mapped across the hardware and can raise throughput by up to 2x while keeping the same accuracy, on both reasoning and multi-modal MoE models.

The big selling point in the README is how little you have to change to try it. Lynx ships as a vLLM plugin. There is no patching of vLLM source code, no recompiling of GPU kernels at install time, and no changes to the application that calls the server. The Python wheel is pure Python, so installing it does not need a CUDA toolkit, a host compiler, or CMake. The fast GPU code is written in Triton and is compiled the first time you run the server.

Installation is two pip commands: install a specific version of vLLM, then install Lynx from its Git repo. To turn the plugin on, you set one environment variable, VLLM_LYNX_ENABLED, to 1, then start vLLM the usual way. The README shows an example serving a Qwen MoE model with tensor parallelism across two GPUs. You then send requests to the normal vLLM completions endpoint as if Lynx were not there. If the environment variable is not set, the plugin stays out of the way and the server behaves exactly like stock vLLM.

Lynx works with any MoE model that uses vLLM's built-in FusedMoE layer. The package comes with built-in policies for several popular model families, including Qwen, Mixtral, DeepSeek, GPT-OSS, and Llama 4. A separate documentation file lists every supported model and explains how to register a new one. There is also an optional variable for pointing at a custom policy file. The project is released under the Apache 2.0 license.
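The enable-and-serve flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the README's own example: the helper names are hypothetical, the model identifier is whatever you pass in, and the host assumes vLLM's default port of 8000. Only `VLLM_LYNX_ENABLED`, the `vllm serve` command with `--tensor-parallel-size`, and the `/v1/completions` endpoint come from the text or from stock vLLM.

```python
import json
import os


def build_launch(model, tensor_parallel_size=2, lynx_enabled=True, base_env=None):
    """Return (command, environment) for starting vLLM with or without Lynx.

    Hypothetical helper: it only assembles the values, it does not start anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    if lynx_enabled:
        env["VLLM_LYNX_ENABLED"] = "1"  # the single switch named in the README
    else:
        env.pop("VLLM_LYNX_ENABLED", None)  # unset means stock vLLM behavior
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    return cmd, env


def build_completion_request(model, prompt, max_tokens=64,
                             host="http://localhost:8000"):
    """Return (url, body, headers) for the normal vLLM completions endpoint.

    The host is an assumption (vLLM's default port); adjust for your setup.
    """
    url = f"{host}/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers
```

The command and environment can be handed to `subprocess.Popen(cmd, env=env)`, and the request pieces to `urllib.request.Request(url, data=body, headers=headers)`. Calling `build_launch` with `lynx_enabled=False` gives the stock vLLM baseline with everything else unchanged, which is what makes a side-by-side comparison straightforward.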

Copy-paste prompts

Prompt 1

Install Lynx on top of my vLLM and serve a Qwen MoE model across 2 GPUs

Prompt 2

Benchmark requests per second on stock vLLM vs Lynx for the same Mixtral checkpoint

Prompt 3

Walk me through registering a new MoE model family with Lynx

Prompt 4

Explain what VLLM_LYNX_ENABLED actually changes inside the FusedMoE layer

Prompt 5

Write a docker-compose that boots vLLM with Lynx and a Llama 4 MoE checkpoint

Frequently asked questions

What is lynx?

Lynx is a vLLM plugin that remaps Mixture of Experts placement to lift MoE throughput by up to 2x without source patches or CUDA recompilation.

What language is lynx written in?

Mainly Python. The stack also includes Triton, vLLM, and CUDA.

What license does lynx use?

Apache 2.0 lets anyone use, modify, and ship the code commercially as long as the license and notices stay.

How hard is lynx to set up?

Setup difficulty is rated hard, with roughly 1h+ to a first successful run.

Who is lynx for?

Mainly researchers.

Verify against the repo before relying on details.