Analysis updated 2026-05-18
Run a small language model on an AMD Strix Halo NPU about 3.2x faster than CPU-only inference on Linux
Contribute kernel patches to fold amdxdna NPU support into the amdgpu driver
Study the root causes blocking INT8 and BF16 precision on AMD XDNA2 NPU hardware
| | bong-water-water-bong/npu-gpu-cpu | dahorg/wlameshot | fatehmtd/gradiumpp |
|---|---|---|---|
| Stars | 3 | 3 | 3 |
| Language | C++ | C++ | C++ |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 5/5 | 3/5 | 3/5 |
| Audience | researcher | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires AMD Strix Halo hardware, Linux with amdxdna driver, and the XRT + torch2aie IRON toolchain. Research-quality, not production-ready.
Modern AMD laptops and desktops contain three different compute processors: the CPU (central processor), the GPU (graphics processor), and the NPU (a specialized chip for AI calculations). On Linux today, each runs on a separate software driver and has its own memory space. A program that wants to use all three has to manage three different programming models, copy data between them, and work with three different APIs.
This project is an effort to merge all three under a single unified driver and a single memory manager, using AMD's existing ROCm software stack as the foundation. The specific hardware target is the AMD Strix Halo chip family. The goal is to fold the NPU driver (called amdxdna) into the GPU driver (called amdgpu), so that from a programmer's perspective the NPU is just another compute engine accessible through the same API used for the GPU. A single memory allocation would work across all three processors without copying.
The repository documents both the long-term architecture goal and concrete results already achieved. The team successfully ran a real language model (Qwen3-0.6B, a small AI text-generation model) on the NPU at 4.8 tokens per second, which is 3.2 times faster than running the same model on the CPU, at roughly one-third the power consumption. This required writing custom C++ inference code, compiling specialized binary files for the NPU's AI processing tiles, and fixing several compiler bugs in the upstream toolchain.
The README is candid about current blockers: INT8 precision (which would double throughput further) is blocked by a parser limitation in the compiler, and BF16 precision causes the hardware DMA controller to hang. Both issues are documented with root causes and proposed fixes. This is active research work, not a finished product.
A research project to unify AMD's CPU, GPU, and NPU under a single Linux driver and memory model. Already runs a small language model on the AMD Strix Halo NPU at 4.8 tok/s, 3.2x faster than CPU-only with 3x better power efficiency.
Mainly C++. The stack also includes ROCm and MLIR.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Aimed mainly at researchers.
Verify against the repo before relying on details.