deepseek-ai/deepgemm

★ 7,249 | Cuda | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((DeepGEMM))
    What it does
      Fast matrix multiply
      LLM acceleration
      Near-peak TFLOPS
    Kernel Types
      Dense FP8 GEMM
      Grouped MoE GEMM
      Mega MoE fused kernel
    Technical Design
      FP8 precision
      JIT compilation
      No install build step
    Target Hardware
      NVIDIA H100
      NVIDIA H800
    Audience
      AI infra engineers
      GPU researchers


Things people build with this

USE CASE 1

Speed up LLM inference on NVIDIA H100/H800 hardware by replacing standard matrix multiplication with optimized FP8 kernels

USE CASE 2

Run Mixture-of-Experts model inference faster using grouped GEMM kernels designed for MoE variable batch layouts

USE CASE 3

Study NVIDIA GPU optimization techniques by reading the small, well-structured FP8 kernel implementations
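To make the "variable batch layouts" in use case 2 concrete, here is a minimal pure-Python reference for what a grouped GEMM computes. Token rows are stored contiguously and sorted by expert, and each group of rows is multiplied by that expert's own weight matrix. The names `grouped_gemm_reference` and `group_sizes` are hypothetical and chosen for illustration; this is not DeepGEMM's API, and the real kernels run in FP8 on the GPU.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (rows x k) @ (k x cols) matrix multiply."""
    k = len(b)
    cols = len(b[0])
    return [
        [sum(row[i] * b[i][j] for i in range(k)) for j in range(cols)]
        for row in a
    ]


def grouped_gemm_reference(
    tokens: Matrix,
    expert_weights: List[Matrix],
    group_sizes: List[int],
) -> Matrix:
    """Multiply each contiguous group of token rows by its expert's weights.

    Hypothetical helper for illustration only (not part of DeepGEMM).
    group_sizes[e] is the number of rows routed to expert e; it may be zero.
    """
    if len(expert_weights) != len(group_sizes):
        raise ValueError("need exactly one group size per expert")
    if sum(group_sizes) != len(tokens):
        raise ValueError("group sizes must cover every token row")

    out: Matrix = []
    start = 0
    for weight, size in zip(expert_weights, group_sizes):
        # Rows start .. start + size - 1 all belong to the same expert.
        out.extend(matmul(tokens[start:start + size], weight))
        start += size
    return out


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles its input
]
# One token goes to expert 0, the remaining two go to expert 1.
result = grouped_gemm_reference(tokens, experts, [1, 2])
print(result)
```

Because routing decides the group sizes at run time, they differ from batch to batch, which is why a single fixed-shape matrix multiply does not fit this workload well.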

Tech stack

CUDA | C++ | Python

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires an NVIDIA H100 or H800 GPU; it will not run on consumer GPUs or older data center hardware.
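Before spending time on setup, it can help to confirm that the machine has a suitable GPU. The sketch below assumes PyTorch is installed and relies on the fact that H100 and H800 both report CUDA compute capability 9.0. The helper names `is_hopper` and `check_current_gpu` are hypothetical, and the check reflects only the hardware requirement stated above, not anything DeepGEMM itself enforces.

```python
def is_hopper(major: int, minor: int) -> bool:
    """Return True for compute capability 9.0, which H100 and H800 report."""
    return (major, minor) == (9, 0)


def check_current_gpu() -> str:
    """Describe whether the first visible CUDA device meets the requirement."""
    try:
        import torch  # assumption: PyTorch is the way the GPU is inspected
    except ImportError:
        return "PyTorch is not installed, so the GPU could not be inspected."

    if not torch.cuda.is_available():
        return "No CUDA device is visible."

    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name()
    if is_hopper(major, minor):
        return f"{name} (sm_{major}{minor}) matches the stated requirement."
    return f"{name} (sm_{major}{minor}) is not an H100/H800-class GPU."


if __name__ == "__main__":
    print(check_current_gpu())
```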

License not mentioned in the explanation.

In plain English

DeepGEMM is a low-level CUDA library from DeepSeek that makes the matrix multiplication operations inside large language models run faster on NVIDIA GPUs. Matrix multiplication, usually abbreviated GEMM (general matrix multiply), is the dominant computation in these models: when a model processes text, the bulk of the work is multiplying large tables of numbers together. How fast this happens largely determines how quickly the model responds.

The library focuses on FP8, an 8-bit floating-point format that trades a small amount of numerical accuracy for significantly faster computation and lower memory use. NVIDIA's H800 and H100 GPUs have dedicated hardware for FP8 operations, and DeepGEMM is written to get close to the theoretical peak throughput of that hardware. The README reports up to 1550 TFLOPS on an H800.

Beyond basic dense matrix multiplication, the library includes specialized kernels for Mixture-of-Experts (MoE), an architecture used in models like DeepSeek V3 in which different subnetworks handle different inputs. These grouped GEMM kernels are designed around the specific data layouts that MoE inference and training produce. A Mega MoE kernel goes further by fusing network communication between GPUs with the tensor core computation and overlapping the two, so the GPU does not sit idle waiting for data to move.

All kernels are compiled at runtime by a lightweight just-in-time compilation module, so there is no CUDA compilation step during installation. The library is designed to be small and readable, with a limited number of core functions, which makes it accessible to GPU programmers who want to study NVIDIA hardware optimization techniques.

This is a highly technical library intended for AI infrastructure engineers and researchers working on large model training or inference at scale.
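The accuracy trade-off of FP8 described above can be made concrete with a small pure-Python simulation. The sketch assumes the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448) and a single scale factor per vector. Both are illustrative assumptions: this summary does not say which FP8 variant or scaling granularity DeepGEMM uses, and the function names are hypothetical.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 value
E4M3_MIN_EXP = -6       # exponent of the smallest normal value
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    # Clamping the exponent makes tiny inputs land on the subnormal grid.
    exp = max(math.frexp(a)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    # Python's round() sends ties to the even neighbour.
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize(values):
    """Scale so the largest magnitude maps to E4M3_MAX, then round each value."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def fp8_dot(a, b):
    """Dot product of two vectors after simulated FP8 quantization."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    # Products are accumulated in full precision, then both scales are undone.
    return sum(x * y for x, y in zip(qa, qb)) * scale_a * scale_b


a = [0.1, 0.25, 0.7, 1.3]
b = [2.0, 0.5, 1.5, 0.9]
exact = sum(x * y for x, y in zip(a, b))
approx = fp8_dot(a, b)
print(f"exact={exact:.4f} fp8={approx:.4f} rel_err={abs(approx - exact) / exact:.2%}")
```

With only 3 mantissa bits, each normal value carries a relative rounding error of up to about 6%, which is why FP8 pipelines pair the low-precision inputs with scale factors and higher-precision accumulation.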

Copy-paste prompts

Prompt 1

Show me how to call DeepGEMM's FP8 dense GEMM function from Python to multiply two matrices on an H100 GPU and verify the result against a PyTorch reference computation.

Prompt 2

How does DeepGEMM's just-in-time compilation work? What gets compiled at runtime, and why does that mean there is no slow CUDA build step during pip install?

Prompt 3

I'm building a Mixture-of-Experts inference pipeline on H100 GPUs. How do I use DeepGEMM's grouped GEMM kernels to handle the variable expert batch sizes that MoE routing produces?

Prompt 4

Explain the Mega MoE fused kernel in DeepGEMM: what GPU-to-GPU communication does it overlap with tensor core computation and why does overlapping improve end-to-end throughput?

Prompt 5

Walk me through the FP8 quantization scheme DeepGEMM uses. What precision format is used for weights versus activations, and what numerical accuracy trade-off should I expect versus BF16?


Verify against the repo before relying on details.