dazhix030-stack/llm-layer-pruning

Analysis updated 2026-06-24

★ 1 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((llm-layer-pruning))
    Inputs
      Qwen2.5-3B NF4
      Short token sequences
    Outputs
      Lower standby VRAM
      Perplexity report
      Throughput numbers
    Use Cases
      Run LLMs on 6 GB GPU
      Test layer skipping
      KV cache benchmarks
    Tech Stack
      Python
      PyTorch
      Transformers
      bitsandbytes
      CUDA


What do people build with it?

USE CASE 1

Fit a 3B-parameter model into 6 GB of VRAM by offloading alternating layers

USE CASE 2

Measure perplexity impact of training-free layer skipping

USE CASE 3

Benchmark KV cache versus simple disk offload for inference speed

USE CASE 4

Reproduce the RTX 4050 results on a different small transformer

What is it built with?

Python, PyTorch, Transformers, bitsandbytes, CUDA

How does it compare?

	dazhix030-stack/llm-layer-pruning	a-bissell/unleash-lite	abhiinnovates/whatsapp-hr-assistant
Stars	1	1	1
Language	Python	Python	Python
Setup difficulty	hard	hard	hard
Complexity	4/5	4/5	3/5
Audience	researcher	researcher	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA-capable GPU and the NF4-quantized weights pipeline.

In plain English

This repository is an experiment in running large language models on a graphics card with limited memory. The author's goal is to reduce the amount of GPU memory (VRAM) that a model needs while sitting idle on a personal computer. The technique is described as training-free: the model itself is never retrained or fine-tuned, and the change happens only at inference time.

The core idea is to take a transformer model and move every other layer out to disk instead of keeping it in GPU memory. Normally a transformer runs by passing data through layer one, then layer two, then layer three, and so on. With this method, the even-numbered layers are skipped. When the program reaches one of those skipped layers, it loads the layer from disk just long enough to compute what the author calls its residual contribution: the difference between what the layer would output and what it received. That difference is then multiplied by a scale factor and added into the input of the next active layer.

The README reports results on a Qwen2.5-3B model in NF4 quantization running on an RTX 4050 with 6 GB of VRAM. Compared to the baseline, standby VRAM dropped by about 31 percent, perplexity stayed essentially the same, and the variant that keeps a key-value (KV) cache ran about 22 percent faster, going from 13.6 to 16.6 tokens per second. A simpler disk offload variant without a KV cache saves the same idle memory but is much slower.

The author is open about limitations. Tests covered only short sequences of around 100 tokens, long-context behavior is untested, and the VRAM savings disappear during inference because the offloaded weights get reloaded.

Requirements are Python 3.12, PyTorch 2, transformers, bitsandbytes, and a CUDA-capable GPU.
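The residual-contribution step described above can be illustrated with a toy sketch. This is not the repository's code: it uses plain Python lists in place of GPU tensors, and the names `run_layers`, `make_layer`, `resident`, `load_from_disk`, and `scale` are all hypothetical. It only shows the arithmetic as the summary describes it, assuming an offloaded layer is loaded, its output minus its input is taken as the delta, and the scaled delta is added to the hidden state that the next resident layer receives.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_layers(hidden: Vector,
               resident: Dict[int, Layer],
               load_from_disk: Callable[[int], Layer],
               num_layers: int,
               scale: float = 1.0) -> Vector:
    """Toy forward pass over layers numbered 1..num_layers.

    Resident layers run normally. Every other layer is loaded on demand,
    and only its scaled delta (output minus input) is added to the state.
    """
    for number in range(1, num_layers + 1):
        if number in resident:
            hidden = resident[number](hidden)
            continue
        layer = load_from_disk(number)                 # bring the layer back temporarily
        output = layer(hidden)
        delta = [o - h for o, h in zip(output, hidden)]  # residual contribution
        hidden = [h + scale * d for h, d in zip(hidden, delta)]
        del layer                                      # drop the reference again
    return hidden


def make_layer(number: int) -> Layer:
    """Hypothetical stand-in for a transformer block: adds `number` to each value."""
    return lambda vec: [x + number for x in vec]


# Odd-numbered layers stay resident; even-numbered ones live "on disk".
resident = {n: make_layer(n) for n in (1, 3)}
disk = {n: make_layer(n) for n in (2, 4)}

full = run_layers([1.0, 2.0], resident, disk.__getitem__, num_layers=4, scale=1.0)
half = run_layers([1.0, 2.0], resident, disk.__getitem__, num_layers=4, scale=0.5)
print(full)  # [11.0, 12.0]
print(half)  # [8.0, 9.0]
```

In this toy, `scale=1.0` reproduces the ordinary sequential pass, while a smaller value damps what the offloaded layers contribute. Which scale value the author uses, and how layers are actually stored and reloaded, should be checked in the repository.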

Copy-paste prompts

Prompt 1

Set up the environment with Python 3.12, PyTorch 2, transformers, and bitsandbytes and reproduce the Qwen2.5-3B numbers

Prompt 2

Explain how the residual contribution is computed and added back into the next active layer

Prompt 3

Help me adapt the offload code path to a Llama-3 8B in NF4

Prompt 4

Profile why VRAM savings disappear during active inference

Frequently asked questions

What is llm-layer-pruning?

A training-free experiment that offloads every other transformer layer to disk and adds back a scaled residual, cutting idle VRAM by about 31 percent on a Qwen2.5-3B model.

What language is llm-layer-pruning written in?

Mainly Python. The stack also includes PyTorch, Transformers, bitsandbytes, and CUDA.

How hard is llm-layer-pruning to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is llm-layer-pruning for?

Mainly researchers.


Verify against the repo before relying on details.