explaingit

dazhix030-stack/llm-layer-pruning

1 · Python · Audience: researcher · Complexity: 4/5 · Active · Setup: hard

TLDR

Training-free experiment that offloads every other transformer layer to disk and adds back a scaled residual, cutting idle VRAM about 31 percent on a Qwen2.5-3B model.

Mindmap

mindmap
  root((llm-layer-pruning))
    Inputs
      Qwen2.5-3B NF4
      Short token sequences
    Outputs
      Lower standby VRAM
      Perplexity report
      Throughput numbers
    Use Cases
      Run LLMs on 6 GB GPU
      Test layer skipping
      KV cache benchmarks
    Tech Stack
      Python
      PyTorch
      Transformers
      bitsandbytes
      CUDA

Things people build with this

USE CASE 1

Fit a 3B-parameter model into 6 GB of VRAM by offloading alternating layers

USE CASE 2

Measure perplexity impact of training-free layer skipping

USE CASE 3

Benchmark KV cache versus simple disk offload for inference speed

USE CASE 4

Reproduce the RTX 4050 results on a different small transformer
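
For the perplexity and benchmark use cases above, the arithmetic is small enough to sketch directly. The helpers below are illustrative and not taken from the repository: `perplexity` assumes you already have per-token natural-log probabilities from a model, and `relative_change` is a hypothetical name for the usual percentage comparison between a baseline and a variant.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_change(baseline, variant):
    """Fractional change of variant relative to baseline (0.10 means +10%)."""
    return (variant - baseline) / baseline
```

Applied to the throughput figures quoted on this page, `relative_change(13.6, 16.6)` comes out at roughly 0.22, which matches the reported 22 percent speedup. The same helper can compare a baseline perplexity against the layer-skipping variant.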

Tech stack

Python · PyTorch · Transformers · bitsandbytes · CUDA

Getting it running

Difficulty: hard · Time to first run: 1 day+

Needs a CUDA-capable GPU and the NF4-quantized weights pipeline.

In plain English

This repository is an experiment in running large language models on a graphics card with limited memory. The author's goal is to reduce the amount of GPU memory, called VRAM, that a model needs while sitting idle on a personal computer. The technique is described as training-free, meaning the model itself is never retrained or fine-tuned; the change happens only at inference time.

The core idea is to take a transformer model and move every other layer out to disk instead of keeping it in GPU memory. Normally a transformer runs by passing data through layer one, then layer two, then layer three, and so on. With this method, the even-numbered layers are not kept resident. When the program reaches one of those offloaded layers, it loads the layer from disk just long enough to compute what the author calls its residual contribution, which is the difference between what the layer would output and what it received. That difference is then multiplied by a scale factor and added to the input of the next active layer.

The README reports results on a Qwen2.5-3B model in NF4 quantization running on an RTX 4050 with 6 GB of VRAM. Compared to the baseline, standby VRAM dropped by about 31 percent, perplexity stayed essentially the same, and the variant that keeps a key-value (KV) cache ran about 22 percent faster, going from 13.6 tokens per second to 16.6. A simpler disk offload variant without a KV cache saves the same idle memory but is much slower.

The author is open about limitations. Tests covered only short sequences of around 100 tokens, long-context behavior is untested, and the VRAM savings disappear during inference because the offloaded weights get reloaded.

Requirements are Python 3.12, PyTorch 2, transformers, bitsandbytes, and a CUDA-capable GPU.
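
The offload-and-add-back step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the repository's code: each "layer" is a single float weight applied as `x + weight * x`, layers are numbered from 1, the offloaded weights are stored with `pickle`, and the names `apply_layer`, `offload_even_layers`, `forward`, and `alpha` (the scale factor) are all hypothetical. It also assumes that a contribution left over after the last active layer is added to the final output.

```python
import pickle
import tempfile
from pathlib import Path


def apply_layer(weight, x):
    """Toy stand-in for a transformer block in residual form."""
    return [xi + weight * xi for xi in x]


def offload_even_layers(weights, directory):
    """Write even-numbered layers to disk; return the layers kept resident."""
    resident = {}
    for number, weight in enumerate(weights, start=1):
        if number % 2 == 0:
            with open(Path(directory) / f"layer_{number}.pkl", "wb") as fh:
                pickle.dump(weight, fh)
        else:
            resident[number] = weight
    return resident


def forward(x, resident, directory, num_layers, alpha):
    """Run all layers, loading offloaded ones only to get their scaled delta."""
    pending = [0.0] * len(x)  # scaled residual waiting for the next active layer
    for number in range(1, num_layers + 1):
        if number in resident:
            # Add the pending contribution into this active layer's input.
            x = [xi + p for xi, p in zip(x, pending)]
            pending = [0.0] * len(x)
            x = apply_layer(resident[number], x)
        else:
            # Load the layer briefly, compute output minus input, then drop it.
            with open(Path(directory) / f"layer_{number}.pkl", "rb") as fh:
                weight = pickle.load(fh)
            out = apply_layer(weight, x)
            pending = [p + alpha * (o - xi) for p, o, xi in zip(pending, out, x)]
            del weight
    # Assumption: a trailing contribution is applied to the final output.
    return [xi + p for xi, p in zip(x, pending)]


if __name__ == "__main__":
    layer_weights = [0.5, 0.25, 0.5, 0.25]
    with tempfile.TemporaryDirectory() as tmp:
        kept = offload_even_layers(layer_weights, tmp)
        print(forward([1.0, 2.0], kept, tmp, len(layer_weights), alpha=1.0))
```

In this toy setup, `alpha=1.0` reproduces the output of running every layer in order, because exactly one offloaded layer sits between consecutive active layers, while `alpha=0.0` reduces to plain layer skipping. Which scale factor the author actually uses, and how it was chosen, should be checked against the repository.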

Copy-paste prompts

Prompt 1
Set up the environment with Python 3.12, PyTorch 2, transformers, and bitsandbytes and reproduce the Qwen2.5-3B numbers
Prompt 2
Explain how the residual contribution is computed and added back into the next active layer
Prompt 3
Help me adapt the offload code path to a Llama-3 8B in NF4
Prompt 4
Profile why VRAM savings disappear during active inference

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.