predibase/lorax

★ 3,780PythonAudience · developerComplexity · 4/5LicenseSetup · hard

Mindmap

mindmap
  root((LoRAX))
    What It Does
      Multi-adapter inference
      Single GPU efficiency
    How It Works
      Base model in memory
      LoRA adapters on demand
      Batched mixed requests
    Setup
      Docker image
      Hugging Face models
      REST or OpenAI API
    Use Cases
      Multi-tenant AI products
      Custom model serving
      Cost reduction at scale

mindmap root((LoRAX)) What It Does Multi-adapter inference Single GPU efficiency How It Works Base model in memory LoRA adapters on demand Batched mixed requests Setup Docker image Hugging Face models REST or OpenAI API Use Cases Multi-tenant AI products Custom model serving Cost reduction at scale

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Host many customized AI model variants for different customers from a single GPU instead of one GPU each

USE CASE 2

Serve per-request LoRA adapters without blocking other users or reloading the base model

USE CASE 3

Redirect existing OpenAI client code to a self-hosted LoRAX endpoint with minimal changes

USE CASE 4

Build a multi-tenant AI product where each customer gets their own fine-tuned model at a fraction of normal serving cost

Tech stack

PythonDockerCUDAREST APIOpenAI-compatible APIPEFT

Getting it running

Difficulty · hard Time to first run · 1h+

Requires an Nvidia Ampere-generation GPU or newer with CUDA 11.8+ and Docker installed.

Use freely for any purpose including commercial, modify and redistribute with attribution under Apache 2.0.

In plain English

LoRAX, short for LoRA eXchange, is a server for running large language model inference when you have many different fine-tuned versions of the same base model. The core problem it addresses is cost: if you wanted to serve a thousand slightly different customized models, you would normally need a thousand separate GPU deployments. LoRAX instead keeps one copy of the base model in memory and loads small fine-tuning add-ons called LoRA adapters on a per-request basis, making it possible to handle many different fine-tuned variants from a single GPU. LoRA is a technique for customizing a pretrained model by training a small set of additional weight matrices rather than retraining the whole model. These adapter files are much smaller than the base model itself, typically a few megabytes versus several gigabytes. LoRAX loads the appropriate adapter just-in-time when a request arrives, without blocking other requests. Multiple adapters can be packed into the same processing batch, so the server stays busy and the throughput stays high even when requests are destined for different fine-tuned variants. Using the server involves pulling a Docker image, pointing it at a base model stored on Hugging Face or a local path, and then sending HTTP requests with a standard REST API. You specify which adapter to use in the request body, or omit it to use the base model. LoRAX also exposes an OpenAI-compatible API, so code written for the OpenAI client library can be redirected to LoRAX with minimal changes. A Python client library is available as a separate pip install. Base model support covers Llama, Mistral, Qwen, and several other popular architectures. Adapters must be in the LoRA format produced by libraries like PEFT. The server requires an Nvidia GPU from the Ampere generation or newer, with CUDA 11.8 or later. LoRAX is released under the Apache 2.0 license.

Copy-paste prompts

Prompt 1

Help me set up a LoRAX server with Docker pointed at a Llama-3 base model on Hugging Face, and show me how to send a request that specifies a custom LoRA adapter

Prompt 2

I have a base Mistral model and 50 LoRA adapters fine-tuned for different use cases. Walk me through deploying LoRAX to serve them all simultaneously from one GPU

Prompt 3

Show me how to use LoRAX with the OpenAI Python client library so I can reuse my existing OpenAI code but route requests to my local LoRAX server

Prompt 4

I want to benchmark LoRAX throughput with 10 concurrent LoRA adapters, help me write a Python load test script using the LoRAX REST API

Prompt 5

Explain the difference between sending a request to the base model versus a specific LoRA adapter in LoRAX and show me both request formats

Open on GitHub → Explain another repo

← predibase on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.