Speed up LLM inference on NVIDIA H100/H800 hardware by replacing standard matrix multiplication with optimized FP8 kernels
Run Mixture-of-Experts model inference faster using grouped GEMM kernels designed for MoE variable batch layouts
Study NVIDIA GPU optimization techniques by reading the small, well-structured FP8 kernel implementations
Requires an NVIDIA H100 or H800 GPU; it will not run on consumer GPUs or older data center hardware.
DeepGEMM is a low-level CUDA library from DeepSeek that speeds up the matrix multiplications inside large language models on NVIDIA GPUs. General matrix multiplication (GEMM) is the dominant computation in these models: when a model processes text, the bulk of the work is multiplying large matrices of numbers, so GEMM speed largely determines how quickly the model responds.

The library focuses on FP8, a reduced-precision number format that trades a small amount of numerical accuracy for significantly faster computation and lower memory use. NVIDIA's H100 and H800 GPUs have dedicated hardware for FP8 operations, and DeepGEMM is written to get close to the peak throughput of that hardware. The README reports up to 1550 TFLOPS on an H800.

Beyond dense matrix multiplication, the library includes specialized kernels for Mixture-of-Experts (MoE), an architecture used in models like DeepSeek V3 in which different subnetworks handle different inputs. These grouped GEMM kernels are designed around the data layouts that MoE inference and training produce. A Mega MoE kernel goes further by fusing and overlapping communication between GPUs with the tensor core computation, so the GPU does not sit idle waiting for data to move.

All kernels are compiled at runtime by a lightweight just-in-time (JIT) compilation module, so there is no CUDA compilation step during installation. The library is designed to be small and readable, with a limited number of core functions, which makes it accessible to GPU programmers who want to study NVIDIA hardware optimization techniques. It is a highly technical library intended for AI infrastructure engineers and researchers working on large model training or inference at scale.
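To make the grouped GEMM idea concrete, here is a minimal CPU reference sketch in plain Python. It is not DeepGEMM's API: the names matmul and grouped_gemm_reference are hypothetical, the arithmetic is ordinary floating point rather than FP8, and the layout (tokens sorted so that each expert's rows are contiguous) is an assumption chosen for illustration. It only shows what result such a kernel is expected to produce: each group of token rows is multiplied by the weight matrix of the expert it was routed to.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain reference GEMM: (m x k) @ (k x n) -> (m x n)."""
    k = len(b)
    n = len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_gemm_reference(tokens: Matrix,
                           expert_ids: List[int],
                           expert_weights: List[Matrix]) -> Matrix:
    """Multiply each contiguous group of token rows by its expert's weights.

    Assumption (illustrative only): rows are already sorted by expert, so
    every expert owns one contiguous, variable-length slice of `tokens`.
    """
    assert len(tokens) == len(expert_ids)
    out: Matrix = []
    start = 0
    while start < len(tokens):
        expert = expert_ids[start]
        # Find the end of the contiguous run of rows routed to this expert.
        end = start
        while end < len(tokens) and expert_ids[end] == expert:
            end += 1
        # One ordinary GEMM per group; a grouped kernel batches these.
        out.extend(matmul(tokens[start:end], expert_weights[expert]))
        start = end
    return out


# Three tokens: the first two go to expert 0, the last one to expert 1.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expert_ids = [0, 0, 1]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles its input
]
result = grouped_gemm_reference(tokens, expert_ids, expert_weights)
print(result)
```

The groups have different sizes (two rows for expert 0, one for expert 1), which is the variable batch layout that makes a single dense GEMM a poor fit and motivates kernels specialized for MoE.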
Verify against the repo before relying on details.