explaingit

bong-water-water-bong/npu-gpu-cpu

Analysis updated 2026-05-18

3C++Audience · researcherComplexity · 5/5Setup · hard

TLDR

A research project to unify AMD's CPU, GPU, and NPU under a single Linux driver and memory model. Already runs a small language model on the AMD Strix Halo NPU at 4.8 tok/s, 3.2x faster than CPU-only with 3x better power efficiency.

Mindmap

mindmap
  root((npu-gpu-cpu))
    Goal
      Single amdgpu driver
      Unified memory manager
      One ROCm API for all
    Hardware
      AMD Strix Halo
      XDNA2 NPU tiles
      RDNA GPU
      Zen 5 CPU
    Results so far
      Qwen3 on NPU
      4.8 tok per second
      3x power efficiency
    Blockers
      INT8 parser bug
      BF16 DMA hang
      Needs upstream fix
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

What do people build with it?

USE CASE 1

Run a small language model on an AMD Strix Halo NPU at 3x faster speed than CPU-only inference on Linux

USE CASE 2

Contribute kernel patches to fold amdxdna NPU support into the amdgpu driver

USE CASE 3

Study the root causes blocking INT8 and BF16 precision on AMD XDNA2 NPU hardware

What is it built with?

C++ROCmMLIRLinuxCMake

How does it compare?

bong-water-water-bong/npu-gpu-cpudahorg/wlameshotfatehmtd/gradiumpp
Stars333
LanguageC++C++C++
Setup difficultyhardmoderatemoderate
Complexity5/53/53/5
Audienceresearcherdeveloperdeveloper

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard Time to first run · 1day+

Requires AMD Strix Halo hardware, Linux with amdxdna driver, and the XRT + torch2aie IRON toolchain. Research-quality, not production-ready.

In plain English

Modern AMD laptops and desktops contain three different compute processors: the CPU (central processor), the GPU (graphics processor), and the NPU (a specialized chip for AI calculations). On Linux today, each runs on a separate software driver and has its own memory space. A program that wants to use all three has to manage three different programming models, copy data between them, and work with three different APIs. This project is an effort to merge all three under a single unified driver and a single memory manager, using AMD's existing ROCm software stack as the foundation. The specific hardware target is the AMD Strix Halo chip family. The goal is to fold the NPU driver (called amdxdna) into the GPU driver (called amdgpu), so that from a programmer's perspective the NPU is just another compute engine accessible through the same API used for the GPU. A single memory allocation would work across all three processors without copying. The repository documents both the long-term architecture goal and concrete results already achieved. The team successfully ran a real language model (Qwen3-0.6B, a small AI text-generation model) on the NPU at 4.8 tokens per second, which is 3.2 times faster than running the same model on the CPU, at roughly one-third the power consumption. This required writing custom C++ inference code, compiling specialized binary files for the NPU's AI processing tiles, and fixing several compiler bugs in the upstream toolchain. The README is candid about current blockers: INT8 precision (which would double throughput further) is blocked by a parser limitation in the compiler, and BF16 precision causes the hardware DMA controller to hang. Both issues are documented with root causes and proposed fixes. This is active research work, not a finished product.

Copy-paste prompts

Prompt 1
Walk me through the npu-gpu-cpu inference stack for running Qwen3-0.6B on an AMD Strix Halo NPU. What are xclbins, how are they built with IRON, and what does the C++ engine do with them?
Prompt 2
What is blocking INT8 inference on the AMD XDNA2 NPU and what patches does npu-gpu-cpu apply to the MLIR aiecc compiler to fix it?
Prompt 3
Explain the unified memory goal in npu-gpu-cpu: what is the DRM file descriptor, what is a shared page table, and why does unifying them matter for AI workloads across CPU, GPU, and NPU?
Prompt 4
I have an AMD Strix Halo laptop. What is the minimum setup to try the NPU inference work from this project on Linux?

Frequently asked questions

What is npu-gpu-cpu?

A research project to unify AMD's CPU, GPU, and NPU under a single Linux driver and memory model. Already runs a small language model on the AMD Strix Halo NPU at 4.8 tok/s, 3.2x faster than CPU-only with 3x better power efficiency.

What language is npu-gpu-cpu written in?

Mainly C++. The stack also includes C++, ROCm, MLIR.

How hard is npu-gpu-cpu to set up?

Setup difficulty is rated hard, with roughly 1day+ to a first successful run.

Who is npu-gpu-cpu for?

Mainly researcher.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Scan in gitsafehub Deploy in gitdeployhub bong-water-water-bong on gitmyhub

Verify against the repo before relying on details.