Fit a 3B-parameter model into 6 GB of VRAM by offloading alternating layers
Measure perplexity impact of training-free layer skipping
Benchmark KV cache versus simple disk offload for inference speed
Reproduce the RTX 4050 results on a different small transformer
Needs a CUDA-capable GPU and the NF4-quantized weights pipeline.
This repository is an experiment in running large language models on a graphics card with limited memory. The author's goal is to reduce the amount of GPU memory (VRAM) that a model occupies while sitting idle on a personal computer. The technique is described as training-free: the model is never retrained or fine-tuned, and the change happens only at inference time.

The core idea is to take a transformer model and move every other layer out to disk instead of keeping it in GPU memory. Normally a transformer passes data through layer one, then layer two, then layer three, and so on. With this method, the even-numbered layers are the skipped ones. When the program reaches a skipped layer, it loads that layer from disk just long enough to compute what the author calls its residual contribution: the difference between what the layer would output and what it received. That difference is then added, with a scale factor, to the input of the next active layer.

The README reports results for a Qwen2.5-3B model in NF4 quantization running on an RTX 4050 with 6 GB of VRAM. Compared to the baseline, standby VRAM dropped by about 31 percent, perplexity stayed essentially the same, and the variant that keeps a key-value (KV) cache ran about 22 percent faster, going from 13.6 to 16.6 tokens per second. A simpler disk-offload variant without a KV cache saves the same idle memory but is much slower.

The author is open about the limitations. Tests covered only short sequences of around 100 tokens, long-context behavior is untested, and the VRAM savings disappear during inference because the offloaded weights are reloaded.

Requirements are Python 3.12, PyTorch 2, transformers, bitsandbytes, and a CUDA-capable GPU.
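The residual-contribution step described above can be illustrated with a toy sketch. This is not the repository's code: it uses plain Python lists instead of tensors, and the names `make_toy_layer`, `forward_with_offload`, `resident`, and `loaders` are hypothetical. It also assumes that the scale factor multiplies the residual contribution before it is added, which the summary does not spell out.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(weight: float) -> Layer:
    """Hypothetical stand-in for a transformer block: x -> x + weight * x."""
    def layer(x: Vector) -> Vector:
        return [xi + weight * xi for xi in x]
    return layer


def forward_with_offload(
    x: Vector,
    resident: Dict[int, Layer],
    loaders: Dict[int, Callable[[], Layer]],
    num_layers: int,
    scale: float = 1.0,
) -> Vector:
    """Run layers 1..num_layers; layers missing from `resident` are loaded on demand."""
    hidden = list(x)
    for n in range(1, num_layers + 1):
        if n in resident:
            hidden = resident[n](hidden)
            continue
        layer = loaders[n]()  # stands in for reading the layer's weights from disk
        output = layer(hidden)
        # Residual contribution: what the layer would output minus what it received.
        contribution = [o - h for o, h in zip(output, hidden)]
        del layer  # drop the temporary copy once the contribution is known
        # Assumption: the scale factor multiplies the contribution before it is added.
        hidden = [h + scale * c for h, c in zip(hidden, contribution)]
    return hidden


NUM_LAYERS = 4
WEIGHT = 0.5
# Odd-numbered layers stay resident; even-numbered layers are loaded when reached.
resident_layers = {n: make_toy_layer(WEIGHT) for n in range(1, NUM_LAYERS + 1) if n % 2 == 1}
disk_loaders = {n: (lambda: make_toy_layer(WEIGHT)) for n in range(1, NUM_LAYERS + 1) if n % 2 == 0}

full_scale = forward_with_offload([1.0, 2.0], resident_layers, disk_loaders, NUM_LAYERS, scale=1.0)
half_scale = forward_with_offload([1.0, 2.0], resident_layers, disk_loaders, NUM_LAYERS, scale=0.5)
print(full_scale)  # [5.0625, 10.125]
print(half_scale)  # [3.515625, 7.03125]
```

In this toy setup a scale of 1.0 reproduces the ordinary four-layer forward pass, while a smaller scale damps the offloaded layers' effect. How the actual repository chooses the scale factor should be checked against its README.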
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.