tiiny-ai/powerinfer

★ 9,451 | C++ | Audience · researcher | Complexity · 5/5 | License · MIT | Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Local LLM inference
      GPU-CPU neuron split
      Fast token generation
    Hot cold neurons
      Hot on GPU
      Cold on CPU
      Fits large models
    Supported models
      Falcon 40B
      Llama 2 family
      Bamboo 7B
    Platforms
      Linux and Windows
      NVIDIA or AMD GPU
      macOS CPU only


Things people build with this

USE CASE 1

Run large language models like Falcon-40B or Llama 2 locally on a single consumer NVIDIA RTX GPU at usable generation speeds.

USE CASE 2

Set up a fast local AI inference server on a gaming PC without needing cloud GPU credits or server hardware.

USE CASE 3

Compare inference speed between PowerInfer's neuron-level GPU-CPU split and other local inference tools like llama.cpp on the same consumer hardware.

Tech stack

C++ · Python · CMake · CUDA · NVIDIA GPU · AMD GPU

Getting it running

Difficulty · hard | Time to first run · 1 day+

Must be compiled from source using CMake and Python. GPU acceleration requires an NVIDIA GPU with CUDA drivers or an AMD GPU with ROCm drivers. macOS on Apple Silicon runs CPU-only, without the speed gains.

MIT licensed: use freely for any purpose including commercial use, as long as you keep the copyright notice.

In plain English

PowerInfer is a C++ engine for running large language models on a personal computer with a consumer-grade GPU, rather than on expensive server hardware. It was developed by researchers at Shanghai Jiao Tong University and Tiiny AI.

The core idea comes from a property of how large language models work internally. When a model generates text, it activates only a relatively small portion of its internal components (called neurons) on any given step. A small fraction of neurons are activated frequently across many inputs; the researchers call these "hot" neurons. The vast majority are activated rarely and vary by input; these are "cold" neurons. PowerInfer exploits this pattern by keeping hot neurons loaded on the GPU (which is fast but has limited memory) while computing cold neurons on the CPU (which has more memory but is slower). This split lets it run much larger models on a single consumer GPU than would otherwise fit.

The project reports that on a single NVIDIA RTX 4090 GPU, PowerInfer achieves an average generation speed of about 13 tokens per second across several large models, with peaks above 29 tokens per second. The README states this is up to about 11 times faster than llama.cpp (another popular local inference tool) on the same hardware, while being only about 18% slower than the rate achieved on a server-grade data center GPU.

Supported models include Falcon-40B, the Llama 2 family, ProSparse Llama 2, and Bamboo-7B. The engine runs on Linux and Windows with NVIDIA or AMD GPUs, and on macOS with Apple Silicon chips in CPU-only mode (without the speed gains from the GPU optimization). Building requires CMake and Python, and the build flags differ depending on whether you have an NVIDIA, AMD, or CPU-only setup.

The project is licensed under MIT. Related work from the same team includes PowerInfer-2, an optimized version for smartphones, and a line of SmallThinker models designed for on-device inference.
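The hot/cold split described above can be illustrated with a toy sketch. This is not PowerInfer's code: the real engine is written in C++ and its placement policy is more involved. The names `split_hot_cold`, `gpu_hit_rate`, and `gpu_budget`, along with the activation counts, are hypothetical and exist only to show the idea of ranking neurons by how often they fire and giving the GPU the most active ones.

```python
from dataclasses import dataclass


@dataclass
class NeuronPlacement:
    hot: list[int]   # neuron indices kept resident on the GPU
    cold: list[int]  # neuron indices left in CPU memory


def split_hot_cold(activation_counts: list[int], gpu_budget: int) -> NeuronPlacement:
    """Place the most frequently activated neurons on the GPU, up to gpu_budget."""
    if gpu_budget < 0:
        raise ValueError("gpu_budget must be non-negative")
    # Rank neuron indices by activation count, most active first.
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    return NeuronPlacement(
        hot=sorted(ranked[:gpu_budget]),
        cold=sorted(ranked[gpu_budget:]),
    )


def gpu_hit_rate(activation_counts: list[int], placement: NeuronPlacement) -> float:
    """Fraction of all recorded activations that land on GPU-resident neurons."""
    total = sum(activation_counts)
    if total == 0:
        return 0.0
    return sum(activation_counts[i] for i in placement.hot) / total


# Made-up profile: 8 neurons, two of which fire far more often than the rest.
counts = [950, 12, 7, 880, 3, 20, 5, 9]
placement = split_hot_cold(counts, gpu_budget=2)

print("hot (GPU): ", placement.hot)
print("cold (CPU):", placement.cold)
print("share of activations served by the GPU:", gpu_hit_rate(counts, placement))
```

In this made-up profile, putting only two of the eight neurons on the GPU covers the large majority of activations, which is the intuition behind keeping a small hot set in limited GPU memory and leaving the rest to the CPU.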

Copy-paste prompts

Prompt 1

Help me build PowerInfer on Linux with an NVIDIA RTX 4090. Give me the exact CMake flags and make commands for CUDA support.

Prompt 2

I have a consumer GPU with 16GB VRAM. Which models in the PowerInfer supported list can I run at full speed, and how do I load them?

Prompt 3

Show me how to run Llama 2 inference using PowerInfer and measure tokens per second output.

Prompt 4

Explain the hot and cold neuron split in PowerInfer in plain terms. Why does keeping frequent neurons on the GPU speed things up so much?


Verify against the repo before relying on details.