lyogavin/airllm

★ 17,848 | Jupyter Notebook | Audience · researcher | Complexity · 3/5 | License · Apache 2.0 | Setup · moderate

Mindmap

mindmap
  root((AirLLM))
    What it does
      Large model on small GPU
      Layer streaming from disk
      No full quantization needed
    Tech stack
      Python
      PyTorch
      Hugging Face
      CUDA
    Supported models
      Llama 70B and 405B
      Mistral
      Qwen and Baichuan
    Use cases
      Local experimentation
      Consumer hardware
      Apple Silicon
    Features
      4-bit and 8-bit compression
      Prefetch overlap
      CPU inference

Things people build with this

USE CASE 1

Run a 70B Llama model locally on a single consumer GPU with 4GB of VRAM without quantization or quality loss.

USE CASE 2

Experiment with large open-source models like Mistral or Qwen on a Mac with Apple Silicon.

USE CASE 3

Generate text from a 405B Llama 3.1 model on a machine with only 8GB of VRAM.

Tech stack

Python, PyTorch, Hugging Face, CUDA

Getting it running

Difficulty · moderate | Time to first run · 30 min

Targets a CUDA GPU or Apple Silicon (the README also notes CPU inference). Inference is disk-speed-bound, so an SSD is strongly recommended for usable performance.
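A back-of-envelope estimate shows why disk speed dominates. The sketch below assumes that each forward pass has to read every weight from disk once, and the read speeds are illustrative assumptions, not measurements. The helper name seconds_per_pass is hypothetical and not part of airllm.

```python
def seconds_per_pass(params_billion, bits_per_weight, disk_gb_per_s):
    """Idealized lower bound: time to read every weight from disk once."""
    # 1e9 parameters * (bits / 8) bytes each = that many gigabytes.
    model_gb = params_billion * bits_per_weight / 8
    return model_gb / disk_gb_per_s


# Assumed sequential read speeds in GB/s (illustrative, not measured).
DISKS = {"hdd": 0.15, "sata_ssd": 0.5, "nvme_ssd": 3.0}

for name, speed in DISKS.items():
    for bits in (16, 8, 4):
        t = seconds_per_pass(70, bits, speed)
        print(f"{name:9s} {bits:2d}-bit: {t:8.1f} s per full pass over the weights")
```

Under these assumptions a 70B model at 16 bits per weight is about 140 GB, so a full pass takes tens of seconds on a fast NVMe drive and many minutes on a spinning disk. The estimate ignores caching, file-format overhead, and compute time, so treat it as a rough bound only.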

Free to use for any purpose including commercial use, as long as you include the license notice (Apache 2.0).

In plain English

AirLLM is a Python package that lets you run very large language models on a modest GPU. Normally a 70-billion-parameter model does not fit in the memory of a small graphics card, so people fall back on multi-GPU servers, hosted APIs, or shrinking the model with techniques such as quantization, which can hurt quality. AirLLM's pitch is that it can run a 70B model on a single 4GB GPU without quantization, distillation, or pruning, and the README adds that it can now run the 405B Llama 3.1 model on as little as 8GB of VRAM.

It works by reorganising how the model is held in memory. Instead of loading the whole model at once, AirLLM splits it into its transformer layers, saves them to disk one layer at a time, and streams them through the GPU one by one during inference, prefetching so that loading the next layer overlaps with computing the current one. The 2.0 release added an optional block-wise quantization mode that compresses weights to 4-bit or 8-bit for up to a 3x speedup; this helps because the bottleneck is disk loading rather than arithmetic.

Inference looks much like using a normal Hugging Face Transformers model: install with pip, call AutoModel.from_pretrained with a Hugging Face repo ID or a local path, tokenize an input, and call generate.

You would reach for AirLLM when you want to experiment with a large open model locally but have only a single consumer GPU or a Mac with Apple Silicon, for example a 70B Llama variant or one of the supported families such as ChatGLM, Qwen, Baichuan, Mistral, or InternLM. The README also notes CPU inference and Mixtral support. The package is written in Python, distributed on PyPI as airllm, and licensed under Apache 2.0.
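The layer-streaming idea described above can be sketched in a few lines. This is a toy illustration using only the Python standard library, not AirLLM's actual code: the layers are tiny affine maps stored as JSON files, everything runs on the CPU, and all function names (save_layers, load_layer, apply_layer, streamed_forward) are hypothetical. A single background thread loads the next layer while the current one is being computed.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer to its own file so it can be loaded independently."""
    paths = []
    for i, layer in enumerate(layers):
        path = Path(directory) / f"layer_{i:03d}.json"
        path.write_text(json.dumps(layer))
        paths.append(path)
    return paths


def load_layer(path):
    """Stand-in for reading one layer's weights from disk."""
    return json.loads(Path(path).read_text())


def apply_layer(layer, x):
    """Toy layer: y = W @ x + b, using plain Python lists."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(layer["weight"], layer["bias"])
    ]


def streamed_forward(paths, x):
    """Run the layers in order, holding at most two of them in memory."""
    if not paths:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, paths[0])
        for i in range(len(paths)):
            layer = pending.result()  # wait for the current layer
            if i + 1 < len(paths):
                # Start loading the next layer while this one is computed.
                pending = pool.submit(load_layer, paths[i + 1])
            x = apply_layer(layer, x)
            del layer  # release the weights before the next iteration
    return x


layers = [
    {"weight": [[1.0, 2.0], [0.0, 1.0]], "bias": [0.5, -0.5]},
    {"weight": [[2.0, 0.0], [1.0, 1.0]], "bias": [0.0, 1.0]},
    {"weight": [[1.0, -1.0], [0.5, 0.5]], "bias": [0.0, 0.0]},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    streamed = streamed_forward(paths, [1.0, 2.0])

# Reference result with every layer resident in memory at once.
reference = [1.0, 2.0]
for layer in layers:
    reference = apply_layer(layer, reference)

print(streamed, reference)
```

The streamed result matches the all-in-memory reference, which is the point of the design: peak memory is bounded by roughly two layers (the one being computed and the one being prefetched) instead of the whole model, at the cost of re-reading the weights from disk on every pass.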

Copy-paste prompts

Prompt 1

Show me the minimal Python code to load a 70B Hugging Face model with airllm and generate text from a prompt on a 4GB GPU.

Prompt 2

How does airllm's layer-by-layer disk streaming work, and how do I configure it to overlap loading with computation for faster inference?

Prompt 3

Set up airllm with 4-bit block quantization to speed up inference on a machine with a slow hard drive.

Prompt 4

Use airllm to run a Mistral model on Apple Silicon and generate a 200-token response from a text prompt.

Prompt 5

Write a benchmark script that compares airllm inference speed versus a fully in-memory model on the same GPU.

Verify against the repo before relying on details.