explaingit

turboderp-org/exllamav2

4,520PythonAudience · developerComplexity · 3/5Setup · moderate

TLDR

Run large AI language models fast on your own gaming PC GPU. ExLlamaV2 compresses models to fit in limited memory while keeping quality high, so you can chat with powerful AI locally without cloud costs. Now archived, development continues in ExLlamaV3.

Mindmap

mindmap
  root((ExLlamaV2))
    Model Formats
      EXL2 compression
      GPTQ 4-bit support
      2 to 8 bits per weight
    Performance
      700 tokens per second
      Multi-request batching
      Prompt caching
    Installation
      pip install
      Prebuilt wheels
      PyPI package
    Frontends
      TabbyAPI server
      text-generation-webui
      ExUI
    Hardware
      Consumer GPUs
      Gaming PC cards
      Memory efficient
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

Things people build with this

USE CASE 1

Run powerful AI chat models locally on a gaming PC GPU without paying for cloud API access.

USE CASE 2

Fit large language models into limited GPU memory using smart compression that preserves output quality.

USE CASE 3

Build a local AI API server with TabbyAPI so existing ChatGPT-compatible apps work with your own hardware.

USE CASE 4

Generate text at high speed for scripts, prototypes, or personal AI tools entirely offline.

Tech stack

PythonCUDAEXL2GPTQpipOpenAI API

Getting it running

Difficulty · moderate Time to first run · 30min

Install via pip from PyPI or prebuilt wheels. Requires a CUDA-capable NVIDIA GPU. Note: project is archived, consider ExLlamaV3 for new projects.

License not mentioned in the explanation.

In plain English

ExLlamaV2 is a library for running large language models on consumer-grade GPUs, meaning the kinds of graphics cards you might find in a gaming PC rather than a server data center. The goal is to make local AI inference fast and memory-efficient, so you can run capable models on hardware you already own. Note that this project is now archived and development has moved to a successor called ExLlamaV3. The library introduces its own model format called EXL2, which compresses model weights into fewer bits (anywhere from 2 to 8 bits per weight) to reduce how much GPU memory the model needs. Unlike simpler compression approaches, EXL2 can apply different levels of compression to different parts of the model, spending more bits on the layers that matter most for accuracy. This lets you fit large models into limited memory while minimizing quality loss. It also supports the older GPTQ 4-bit format used by many publicly shared models. For generating text, ExLlamaV2 has a dynamic generation engine that supports running multiple requests at once, caching repeated prompt sections to avoid reprocessing them, and streaming output token by token as it is generated. You can use it directly in Python scripts or pair it with the recommended server companion TabbyAPI, which wraps it in a web API compatible with OpenAI-style clients. Other frontends like text-generation-webui and ExUI also support it. Performance numbers in the README show speeds ranging from roughly 33 tokens per second for a 70-billion-parameter model down to over 700 tokens per second for a small 1.1-billion-parameter model, depending on GPU and compression settings. Installation is via pip, either from source, prebuilt wheels, or PyPI.

Copy-paste prompts

Prompt 1
I have ExLlamaV2 installed and a model in EXL2 format. Write me a Python script that loads the model and streams a response to a user prompt, printing tokens as they arrive.
Prompt 2
Explain the difference between EXL2 and GPTQ formats in ExLlamaV2. Which should I use for a 13B model on a GPU with 12GB of VRAM?
Prompt 3
How do I set up TabbyAPI with ExLlamaV2 so I can use it as a drop-in replacement for the OpenAI API in my existing app?
Prompt 4
What compression level (bits per weight) should I choose in EXL2 if I want the best balance of speed, memory use, and output quality for a 7B model?
Prompt 5
ExLlamaV2 is now archived and replaced by ExLlamaV3. What are the main differences, and how do I migrate my existing Python code that uses ExLlamaV2?
Open on GitHub → Explain another repo

← turboderp-org on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.