Run a 175-billion-parameter language model on a single consumer GPU overnight to classify or summarize thousands of documents.
Evaluate a large AI model on standard benchmarks without access to an expensive multi-GPU cluster.
Process large text datasets for information extraction on budget hardware by offloading model weights to an SSD.
Requires a CUDA-capable GPU, very large models also need a fast SSD with tens of GB free for weight offloading.
FlexLLMGen is a Python tool for running large language models on a single, ordinary GPU by intelligently spreading the model's data across the GPU, regular computer memory (RAM), and even the disk. Large language models are AI systems trained to understand and generate text. The challenge is that the most capable models are too large to fit inside a typical GPU's memory, which is the fastest but most limited type of storage a computer has. The project is specifically designed for batch processing jobs rather than live conversations. Examples include running a model across thousands of documents to extract information, classifying large collections of text, or evaluating a model against a standard benchmark. In these situations, raw speed for a single response matters less than how many responses the system can produce per hour across the whole job. FlexLLMGen trades some response latency for much higher overall throughput. It achieves this by offloading parts of the model that are not immediately needed to slower storage, either CPU memory or a fast SSD, then pulling them back onto the GPU just in time for processing. It also supports compression of the model's internal data to fit more into the same space, and it processes large batches of inputs together to keep the GPU busy. Installing FlexLLMGen is a single pip command. Running it is also straightforward: you point it at a model name (it downloads weights from Hugging Face automatically for supported models), specify how much of the model should live on the GPU versus in CPU memory versus on disk, and set a batch size. For very large models like the 175 billion parameter OPT model, weights can be offloaded entirely to an SSD. The project comes with documented examples for running benchmark evaluations and data processing tasks. It can also scale across multiple GPUs on separate machines using pipeline parallelism. The README notes clearly that FlexLLMGen will be slower than a setup that keeps the full model on powerful GPUs, so it is best suited to budget hardware and overnight batch workloads. The project was developed through a collaboration among Stanford, UC Berkeley, Carnegie Mellon, and other research groups.
← fminference on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.