Run powerful AI chat models locally on a gaming PC GPU without paying for cloud API access.
Fit large language models into limited GPU memory using smart compression that preserves output quality.
Build a local AI API server with TabbyAPI so existing ChatGPT-compatible apps work with your own hardware.
Generate text at high speed for scripts, prototypes, or personal AI tools entirely offline.
Install via pip from PyPI or prebuilt wheels. Requires a CUDA-capable NVIDIA GPU. Note: the project is archived; consider ExLlamaV3 for new projects.
ExLlamaV2 is a library for running large language models on consumer-grade GPUs, meaning the kinds of graphics cards you might find in a gaming PC rather than in a data center. The goal is to make local AI inference fast and memory-efficient, so you can run capable models on hardware you already own. Note that this project is now archived and development has moved to a successor called ExLlamaV3.
The library introduces its own model format called EXL2, which compresses model weights into fewer bits (anywhere from 2 to 8 bits per weight) to reduce how much GPU memory the model needs. Unlike simpler compression approaches, EXL2 can apply different levels of compression to different parts of the model, spending more bits on the layers that matter most for accuracy. This lets you fit large models into limited memory while minimizing quality loss. It also supports the older GPTQ 4-bit format used by many publicly shared models.
For generating text, ExLlamaV2 has a dynamic generation engine that can run multiple requests at once, cache repeated prompt sections to avoid reprocessing them, and stream output token by token as it is generated. You can use it directly in Python scripts or pair it with the recommended server companion TabbyAPI, which wraps it in a web API compatible with OpenAI-style clients. Other frontends like text-generation-webui and ExUI also support it.
Performance numbers in the README show speeds ranging from roughly 33 tokens per second for a 70-billion-parameter model up to more than 700 tokens per second for a small 1.1-billion-parameter model, depending on GPU and compression settings. Installation is via pip: from source, from prebuilt wheels, or from PyPI.
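
To make the bits-per-weight trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It is not part of ExLlamaV2; the function name `weight_memory_gib` is hypothetical, and the estimate assumes that the stated bits per weight is an average over the whole model. It counts only the quantized weights and ignores the cache, activations, and any framework overhead, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the quantized weights alone, in GiB.

    Hypothetical helper for illustration only. Assumes bits_per_weight is
    the model-wide average; ignores cache, activations, and overhead.
    """
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("n_params and bits_per_weight must be positive")
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30                    # bytes -> GiB


if __name__ == "__main__":
    # Compare a 70-billion-parameter model at several average bit widths
    # within the 2 to 8 bits-per-weight range described above.
    for bpw in (2.5, 4.0, 6.0, 8.0):
        print(f"{bpw:>4} bpw: {weight_memory_gib(70e9, bpw):6.1f} GiB of weights")
```

By this arithmetic, a 70-billion-parameter model needs about 32.6 GiB for weights at an average of 4 bits per weight and about 20.4 GiB at 2.5 bits per weight, which illustrates why lower bit widths matter on consumer cards.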
Verify against the repo before relying on details.