Run large language models like Falcon-40B or Llama 2 locally on a single consumer NVIDIA RTX GPU at usable generation speeds.
Set up a fast local AI inference server on a gaming PC without needing cloud GPU credits or server hardware.
Compare inference speed between PowerInfer's GPU-CPU split approach and other local inference tools like llama.cpp on the same consumer hardware.
You must compile from source using CMake and Python. GPU acceleration requires an NVIDIA or AMD GPU with the matching CUDA or ROCm drivers. macOS on Apple Silicon runs CPU-only, without the speed gains.
PowerInfer is a C++ engine for running large language models on a personal computer with a consumer-grade GPU, rather than on expensive server hardware. It was developed by researchers at Shanghai Jiao Tong University and Tiiny AI.

The core idea comes from a property of how large language models work internally. When a model generates text, it activates only a relatively small portion of its internal components (called neurons) on any given step. A small fraction of neurons are activated frequently across many inputs; the researchers call these "hot" neurons. The vast majority are activated rarely and vary by input; these are "cold" neurons. PowerInfer exploits this pattern by keeping hot neurons loaded on the GPU (which is fast but has limited memory) while computing cold neurons on the CPU (which has more memory but is slower). This split lets it run much larger models on a single consumer GPU than would otherwise fit.

The project reports that on a single NVIDIA RTX 4090 GPU, PowerInfer achieves an average generation speed of about 13 tokens per second across several large models, with peaks above 29 tokens per second. The README states this is up to about 11 times faster than llama.cpp (another popular local inference tool) on the same hardware, while being only about 18% slower than a server-grade data center GPU.

Supported models include Falcon-40B, the Llama 2 family, ProSparse Llama 2, and Bamboo-7B. The engine runs on Linux and Windows with NVIDIA or AMD GPUs, and on macOS with Apple Silicon chips in CPU-only mode (without the speed gains from the GPU optimization). Building requires CMake and Python, and the build flags differ depending on whether you have an NVIDIA, AMD, or CPU-only setup. The project is licensed under MIT.

Related work from the same team includes PowerInfer-2, an optimized version for smartphones, and a line of SmallThinker models designed for on-device inference.
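
The hot/cold neuron split described above can be illustrated with a toy sketch. This is not PowerInfer's code or API: every function name, the `gpu_budget` parameter, and the tiny ReLU layer are assumptions made up for illustration. It profiles how often each neuron fires on sample inputs, assigns the most frequently active neurons to a "GPU" group and the rest to a "CPU" group, and checks that evaluating the two groups separately gives the same result as evaluating the layer in one pass.

```python
# Toy illustration of a hot/cold neuron split.
# All names are hypothetical; this is not PowerInfer's implementation.
import random


def pre_activation(weights, biases, i, x):
    """Dot product of neuron i's weights with x, plus its bias."""
    return sum(w * v for w, v in zip(weights[i], x)) + biases[i]


def profile_activation_counts(weights, biases, inputs):
    """Count, per neuron, how many inputs give a positive pre-activation."""
    counts = [0] * len(weights)
    for x in inputs:
        for i in range(len(weights)):
            if pre_activation(weights, biases, i, x) > 0:
                counts[i] += 1
    return counts


def split_hot_cold(counts, gpu_budget):
    """Assign the `gpu_budget` most frequently active neurons to the hot set."""
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return sorted(ranked[:gpu_budget]), sorted(ranked[gpu_budget:])


def relu_subset(weights, biases, x, neuron_ids):
    """Compute ReLU outputs for the listed neurons only; returns {id: value}."""
    return {i: max(0.0, pre_activation(weights, biases, i, x)) for i in neuron_ids}


def split_forward(weights, biases, x, hot, cold):
    """Evaluate hot and cold neurons separately, then merge by neuron index."""
    gpu_part = relu_subset(weights, biases, x, hot)   # stands in for GPU work
    cpu_part = relu_subset(weights, biases, x, cold)  # stands in for CPU work
    merged = {**gpu_part, **cpu_part}
    return [merged[i] for i in range(len(weights))]


if __name__ == "__main__":
    rng = random.Random(0)
    n_neurons, dim = 8, 4
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_neurons)]
    # Assumed skew: two neurons get a positive bias (often active),
    # the rest get a negative bias (rarely active).
    biases = [1.0 if i < 2 else -1.0 for i in range(n_neurons)]
    samples = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(200)]

    counts = profile_activation_counts(weights, biases, samples)
    hot, cold = split_hot_cold(counts, gpu_budget=2)

    x = samples[0]
    dense = relu_subset(weights, biases, x, range(n_neurons))
    assert split_forward(weights, biases, x, hot, cold) == [dense[i] for i in range(n_neurons)]
    print("activation counts:", counts)
    print("hot (GPU) neurons:", hot, "| cold (CPU) neurons:", cold)
```

The sketch only shows the placement decision and that the split preserves the layer's output. It still computes every cold neuron on every step, so it does not model any speedup or the memory limits that motivate the split in the first place.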
Verify against the repo before relying on details.