artidoro/qlora

★ 10,905 Jupyter Notebook Audience · researcher Complexity · 4/5 License Setup · hard

Mindmap

mindmap
  root((QLoRA))
    What it does
      4-bit quantization
      LoRA adapter training
      Single GPU fine-tuning
    Tech Stack
      Python PyTorch
      Hugging Face
      Jupyter notebooks
    Use Cases
      Custom chatbots
      Domain fine-tuning
      Research experiments
    Models
      Guanaco family
      LLaMA base models

Things people build with this

USE CASE 1

Fine-tune a 65B-parameter language model on your own dataset using a single 48GB GPU

USE CASE 2

Train a custom chatbot on domain-specific text without access to expensive multi-GPU clusters

USE CASE 3

Run QLoRA fine-tuning experiments in Google Colab using the included Jupyter notebooks

USE CASE 4

Use the Guanaco models as a starting point for building a chat assistant approaching ChatGPT quality

Tech stack

Python, PyTorch, Jupyter Notebook

Getting it running

Difficulty · hard Time to first run · 1h+

Requires a CUDA-compatible GPU. Fine-tuning a 65B model needs 48 GB of VRAM, though smaller models work on 24 GB cards.
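
To see roughly where those VRAM figures come from, here is a back-of-the-envelope sketch in Python. It counts only the memory needed to store the model weights, and the helper name `weight_memory_gb` is made up for this example. It ignores activations, the LoRA adapter weights, optimizer state, and quantization constants, so real usage will be higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Illustrative model sizes; only the weights are counted here.
for name, params in [("7B", 7e9), ("13B", 13e9), ("65B", 65e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Under these assumptions a 65B model drops from about 130.0 GB of weights at 16 bits to about 32.5 GB at 4 bits, which leaves headroom on a 48 GB card for everything the estimate leaves out.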

The code is MIT-licensed, so you can use it freely for any purpose, including commercial use. Note: the Guanaco model weights inherit separate usage restrictions from the underlying LLaMA models.

In plain English

QLoRA is a research technique developed at the University of Washington that lets you customize (or "fine-tune") very large AI language models on hardware that would normally be far too small to handle them. Language models are software systems trained on huge amounts of text that can answer questions, summarize content, write code, and more. Fine-tuning means taking one of these already-trained models and teaching it to behave differently, usually by training it further on a smaller dataset you choose.

The core problem QLoRA addresses is that large models require enormous amounts of GPU memory to train. A model with 65 billion parameters would normally need multiple high-end GPUs working together. QLoRA shrinks the model's memory footprint by compressing its stored numbers from 16-bit values down to 4-bit values, a process called quantization. This compression alone would degrade quality, but QLoRA adds a second technique: it attaches small trainable modules called Low-Rank Adapters (LoRA) to the compressed model, and trains only those small modules rather than the entire model. The result is that fine-tuning a 65B-parameter model fits on a single GPU with 48 gigabytes of memory, and the fine-tuned model performs comparably to one trained the full expensive way.

The repository also includes Guanaco, a family of chatbot models that the authors produced using QLoRA on the OpenAssistant dataset. The README reports that Guanaco 65B reached 99.3% of ChatGPT's performance on the Vicuna benchmark after 24 hours of fine-tuning on one GPU. Those models are available separately on Hugging Face.

The code integrates with widely used tools from Hugging Face, a popular platform for AI model hosting and training utilities. Installation requires Python, PyTorch, and a few supporting libraries. The repository includes example scripts, Jupyter notebooks for running experiments in Google Colab, and configuration options for single-GPU and multi-GPU setups.

The codebase is released under the MIT license, though the Guanaco models inherit restrictions from the underlying LLaMA models they were built on.
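
The quantize-then-adapt idea described above can be sketched in a few lines of plain Python. This is a toy illustration, not the repository's code. It uses a simple uniform absmax quantizer with one scale per row, whereas QLoRA itself uses the NF4 data type with double quantization. All names and sizes (`quantize_block`, `d_in`, `rank`, `alpha`, and so on) are assumptions chosen for the example.

```python
import random


def quantize_block(values, bits=4):
    """Toy absmax quantizer: map floats to signed integer codes plus one scale."""
    levels = 2 ** (bits - 1) - 1          # 7 for 4-bit codes in [-7, 7]
    peak = max(abs(v) for v in values)
    scale = peak / levels if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and their scale."""
    return [c * scale for c in codes]


def matvec(matrix, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


random.seed(0)
d_in, d_out, rank, alpha = 8, 8, 2, 4     # hypothetical toy sizes

# Frozen base weights, kept only in quantized form (each row is one block).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
W_q = [quantize_block(row) for row in W]

# Trainable low-rank adapter: A starts small and random, B starts at zero,
# so the adapter contributes nothing until training updates B.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]


def forward(x):
    """Dequantize the frozen weights on the fly and add the scaled adapter output."""
    W_deq = [dequantize_block(codes, scale) for codes, scale in W_q]
    base = matvec(W_deq, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


frozen = d_in * d_out                     # parameters in the base layer
trainable = rank * (d_in + d_out)         # parameters in A and B
print(f"frozen: {frozen}, trainable: {trainable}")
```

In this toy layer the adapter holds 32 trainable values against 64 frozen ones. The ratio becomes far more favorable at real model sizes, because the adapter grows with `rank * (d_in + d_out)` while the base layer grows with `d_in * d_out`.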

Copy-paste prompts

Prompt 1

Using QLoRA, help me write a fine-tuning script that trains a 7B LLaMA model on a custom dataset of customer support conversations on a single RTX 4090 GPU

Prompt 2

Set up the QLoRA training pipeline with bitsandbytes 4-bit quantization and PEFT LoRA adapters for a text classification task on a medical dataset

Prompt 3

Help me adapt the QLoRA Colab notebook to fine-tune a model on my own CSV dataset and evaluate it with a held-out validation split

Prompt 4

Using the QLoRA multi-GPU config, help me distribute fine-tuning across two A100 GPUs with the correct FSDP settings

Prompt 5

Explain QLoRA's NF4 quantization and double quantization, and help me choose the right rank and alpha hyperparameters for fine-tuning a 13B model

Verify against the repo before relying on details.