Host many customized AI model variants for different customers from a single GPU instead of one GPU each
Serve per-request LoRA adapters without blocking other users or reloading the base model
Redirect existing OpenAI client code to a self-hosted LoRAX endpoint with minimal changes
Build a multi-tenant AI product where each customer gets their own fine-tuned model at a fraction of normal serving cost
Requires an Nvidia Ampere-generation GPU or newer with CUDA 11.8+ and Docker installed.
LoRAX, short for LoRA eXchange, is a server for running large language model inference when you have many different fine-tuned versions of the same base model. The core problem it addresses is cost: if you wanted to serve a thousand slightly different customized models, you would normally need a thousand separate GPU deployments. LoRAX instead keeps one copy of the base model in memory and loads small fine-tuning add-ons called LoRA adapters on a per-request basis, making it possible to handle many different fine-tuned variants from a single GPU.

LoRA is a technique for customizing a pretrained model by training a small set of additional weight matrices rather than retraining the whole model. These adapter files are much smaller than the base model itself, typically a few megabytes versus several gigabytes. LoRAX loads the appropriate adapter just-in-time when a request arrives, without blocking other requests. Multiple adapters can be packed into the same processing batch, so the server stays busy and the throughput stays high even when requests are destined for different fine-tuned variants.

Using the server involves pulling a Docker image, pointing it at a base model stored on Hugging Face or a local path, and then sending HTTP requests with a standard REST API. You specify which adapter to use in the request body, or omit it to use the base model. LoRAX also exposes an OpenAI-compatible API, so code written for the OpenAI client library can be redirected to LoRAX with minimal changes. A Python client library is available as a separate pip install.

Base model support covers Llama, Mistral, Qwen, and several other popular architectures. Adapters must be in the LoRA format produced by libraries like PEFT. The server requires an Nvidia GPU from the Ampere generation or newer, with CUDA 11.8 or later. LoRAX is released under the Apache 2.0 license.
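As a rough illustration of choosing an adapter per request, the sketch below builds the JSON body for a generation call and posts it using only the Python standard library. The server address, the `/generate` route, and the `inputs`, `parameters`, `max_new_tokens`, and `adapter_id` field names are assumptions based on how the project's REST API is commonly described, so check them against the repo. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: a LoRAX server is listening locally on port 8080.
LORAX_URL = "http://127.0.0.1:8080"


def build_generate_payload(prompt, adapter_id=None, max_new_tokens=64):
    """Build the JSON body for a text-generation request.

    When adapter_id is None the field is omitted entirely, which is
    meant to route the request to the plain base model.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Assumed field name for selecting a LoRA adapter per request.
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


def generate(prompt, adapter_id=None, base_url=LORAX_URL, timeout=60):
    """POST the payload to the assumed /generate route and return parsed JSON."""
    body = json.dumps(build_generate_payload(prompt, adapter_id)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, calling `generate("Summarize LoRA in one sentence.")` would target the base model, while passing `adapter_id="some-org/some-adapter"` (a placeholder name) would ask for that tenant's fine-tuned variant. Only the request body changes between tenants, which is the point of the per-request adapter design.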
Verify against the repo before relying on details.