fminference/flexllmgen

★ 9,366
Python
Audience · researcher
Complexity · 4/5
Setup · hard

Mindmap

mindmap
  root((FlexLLMGen))
    What it does
      Runs large models
      Single consumer GPU
      High batch throughput
    Memory tiers
      GPU VRAM
      CPU RAM
      Disk SSD
    Best for
      Overnight batch jobs
      Document classification
      Benchmark evaluation
    Tradeoffs
      Slower per request
      Not for live chat


Things people build with this

USE CASE 1

Run a 175-billion-parameter language model on a single consumer GPU overnight to classify or summarize thousands of documents.

USE CASE 2

Evaluate a large AI model on standard benchmarks without access to an expensive multi-GPU cluster.

USE CASE 3

Process large text datasets for information extraction on budget hardware by offloading model weights to an SSD.

Tech stack

Python, PyTorch, CUDA, Hugging Face, pip

Getting it running

Difficulty · hard
Time to first run · 1h+

Requires a CUDA-capable GPU. Very large models also need a fast SSD with tens to hundreds of GB free for weight offloading, depending on model size and compression.
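A rough estimate shows why the SSD requirement matters. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts the weights only, ignoring the KV cache and activations. The function name is hypothetical and not part of FlexLLMGen, and the 0.5 bytes per parameter figure is used purely to illustrate a compressed format.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Parameter counts taken from the model names (30 billion, 175 billion).
for name, params in [("OPT-30B", 30e9), ("OPT-175B", 175e9)]:
    full = weight_footprint_gb(params)            # assumed 16-bit weights
    small = weight_footprint_gb(params, 0.5)      # illustrative compressed format
    print(f"{name}: ~{full:.0f} GB at 16-bit, ~{small:.1f} GB at 0.5 bytes/param")
```

Comparing these figures with the free space on the GPU, in RAM, and on the SSD gives a first idea of how much of the model has to live on each tier.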

License terms not stated in the explanation.

In plain English

FlexLLMGen is a Python tool for running large language models on a single, ordinary GPU by spreading the model's data across GPU memory, regular computer memory (RAM), and even disk. Large language models are AI systems trained to understand and generate text. The challenge is that the most capable models are too large to fit in a typical GPU's memory, which is the fastest of these three tiers but also the smallest.

The project is designed specifically for batch processing jobs rather than live conversations. Examples include running a model across thousands of documents to extract information, classifying large collections of text, or evaluating a model against a standard benchmark. In these situations, the speed of a single response matters less than how many responses the system can produce per hour across the whole job.

FlexLLMGen trades some response latency for much higher overall throughput. It achieves this by offloading parts of the model that are not immediately needed to slower storage, either CPU memory or a fast SSD, then pulling them back onto the GPU just in time for processing. It also supports compressing the model's internal data to fit more into the same space, and it processes large batches of inputs together to keep the GPU busy.

Installing FlexLLMGen is a single pip command. Running it is also straightforward: you point it at a model name (it downloads weights from Hugging Face automatically for supported models), specify how much of the model should live on the GPU versus in CPU memory versus on disk, and set a batch size. For very large models like the 175-billion-parameter OPT model, weights can be offloaded entirely to an SSD.

The project comes with documented examples for running benchmark evaluations and data processing tasks. It can also scale across multiple GPUs on separate machines using pipeline parallelism.

The README states clearly that FlexLLMGen will be slower than a setup that keeps the full model on powerful GPUs, so it is best suited to budget hardware and overnight batch workloads. The project was developed through a collaboration among Stanford, UC Berkeley, Carnegie Mellon, and other research groups.
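The three settings described above (model name, memory split, batch size) can be sketched as a small helper that assembles a command line. This is a sketch under stated assumptions, not the project's documented interface: the module path `flexllmgen.flex_opt` and the flags `--model`, `--percent`, `--gpu-batch-size`, and `--offload-dir` are written from memory of the README, and the six-number `--percent` layout (GPU and CPU share for weights, KV cache, and activations, with disk taking the remainder) is likewise an assumption. The helper names and the SSD path are hypothetical. Check everything against the repo before running.

```python
import shlex


def tier_split(gpu_pct: int, cpu_pct: int) -> dict:
    """Return GPU/CPU/disk percentages; disk receives whatever is left over."""
    if gpu_pct < 0 or cpu_pct < 0 or gpu_pct + cpu_pct > 100:
        raise ValueError("GPU and CPU percentages must be >= 0 and sum to at most 100")
    return {"gpu": gpu_pct, "cpu": cpu_pct, "disk": 100 - gpu_pct - cpu_pct}


def build_command(model, weights, cache, activations, gpu_batch_size, offload_dir=None):
    """Assemble an argv list. Flag names are assumptions recalled from the README."""
    pairs = (weights, cache, activations)
    for pair in pairs:
        tier_split(*pair)  # validate each (gpu_pct, cpu_pct) pair
    percents = [str(p) for pair in pairs for p in pair]
    argv = ["python3", "-m", "flexllmgen.flex_opt",   # assumed module path
            "--model", model,
            "--percent", *percents,                    # assumed six-number layout
            "--gpu-batch-size", str(gpu_batch_size)]
    if offload_dir is not None:
        argv += ["--offload-dir", offload_dir]         # assumed flag for the SSD folder
    return argv


# Illustrative values only: all weights on disk, cache and activations in CPU RAM.
argv = build_command("facebook/opt-175b", weights=(0, 0), cache=(0, 100),
                     activations=(0, 100), gpu_batch_size=8,
                     offload_dir="/path/to/ssd")
print(shlex.join(argv))
print(tier_split(0, 0))
```

The command is only printed, never executed, so the sketch can be run safely on a machine without a GPU while the flags are being verified.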

Copy-paste prompts

Prompt 1

Using FlexLLMGen, run OPT-30B to summarize 10,000 customer support tickets overnight on a single 16GB GPU. Show me the command with the right GPU, CPU, and disk memory split settings and batch size.

Prompt 2

I want to evaluate OPT-175B on a benchmark. I have a 24GB GPU, 64GB RAM, and a fast SSD. What FlexLLMGen memory offload settings should I use to fit the model?

Prompt 3

Show me how to use FlexLLMGen pipeline parallelism to spread inference across two GPUs on separate machines for a large batch job.


Verify against the repo before relying on details.