explaingit

deepseek-ai/flashmla

12,648 · C++ · Audience · researcher · Complexity · 5/5 · Setup · hard

TLDR

Highly optimized GPU attention kernels from DeepSeek that speed up one of the most expensive parts of running large language models, targeting NVIDIA H800 and B200 hardware specifically.

Mindmap

mindmap
  root((flashmla))
    What it does
      Attention kernels
      LLM inference speed
    Attention Types
      Dense attention
      Sparse attention
    Inference Phases
      Prefill phase
      Decode phase
    Hardware
      NVIDIA H800
      NVIDIA B200
      CUDA required
    Tech Stack
      C++
      CUDA
      PyTorch
    Audience
      AI researchers
      LLM engineers

Things people build with this

USE CASE 1

Speed up large language model inference by replacing standard attention with FlashMLA's hardware-optimized GPU kernels.

USE CASE 2

Build a high-throughput DeepSeek-V3 inference pipeline using separate prefill and decode phase attention routines.

USE CASE 3

Benchmark attention operation throughput in teraflops on H800 or B200 GPUs to compare against baseline implementations.

USE CASE 4

Reduce per-token latency during the decode phase of LLM inference using sparse attention to skip less relevant tokens.
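To make the dense/sparse distinction in these use cases concrete, here is a minimal pure-Python sketch of attention for a single decode-step query. It is a conceptual illustration only, not FlashMLA's kernel or API: the function name `attend` and the `indices` parameter are hypothetical, and how the relevant token positions get chosen is left outside the sketch.

```python
import math


def attend(query, keys, values, indices=None):
    """Softmax attention for one query vector (illustrative, not FlashMLA).

    indices=None reads every cached token (dense attention); otherwise only
    the listed token positions are read (sparse attention).
    """
    if indices is None:
        indices = range(len(keys))
    indices = list(indices)

    # Scaled dot-product score for each selected token only.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]

    # Numerically stable softmax over the selected positions.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)

    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]
```

With `indices` omitted, the work grows with the full cache length; with a short index list, it grows only with the number of selected tokens, which is where the per-token latency saving comes from.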

Tech stack

C++ · CUDA · Python · PyTorch

Getting it running

Difficulty · hard · Time to first run · 1h+

Requires specific NVIDIA H800 or B200 GPUs with compatible CUDA and PyTorch versions; no CPU fallback is available.
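Because there is no CPU fallback, it can help to check the GPU before installing anything. The sketch below is one way to do that with standard PyTorch calls. It assumes the H800 reports compute capability 9.0 (Hopper) and the B200 reports 10.0 (Blackwell); whether FlashMLA accepts exactly this set is an assumption to verify against the repo. The names `SUPPORTED_CAPABILITIES`, `describe_gpu`, and `check_environment` are hypothetical helpers, not part of FlashMLA.

```python
# Assumption: H800 is a Hopper part (compute capability 9.0) and B200 is a
# Blackwell part (compute capability 10.0). Verify against the repo.
SUPPORTED_CAPABILITIES = {
    (9, 0): "Hopper (H800 class)",
    (10, 0): "Blackwell (B200 class)",
}


def describe_gpu(capability):
    """Return the architecture name for a (major, minor) pair, or None."""
    return SUPPORTED_CAPABILITIES.get(tuple(capability))


def check_environment():
    """Return a one-line status string describing the first CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible (there is no CPU fallback)"
    capability = torch.cuda.get_device_capability(0)
    name = describe_gpu(capability)
    if name is None:
        return f"Unsupported compute capability {capability}"
    return f"OK: {torch.cuda.get_device_name(0)} ({name}), CUDA {torch.version.cuda}"


if __name__ == "__main__":
    print(check_environment())
```

Keeping the lookup in a plain function makes the capability table easy to test on a machine without a GPU.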

In plain English

FlashMLA is a collection of highly optimized low-level GPU kernels released by DeepSeek, the company behind the DeepSeek-V3 series of AI models. The kernels implement attention, one of the most computationally expensive operations in running or training large language models, and are written to extract maximum performance from specific NVIDIA GPU hardware.

The library provides two broad categories of attention computation. Dense attention processes every token in a sequence, while sparse attention processes only the most relevant tokens, reducing computation without sacrificing much accuracy. Both categories include variants optimized for the two main phases of inference: the prefill phase, which processes the initial input prompt, and the decode phase, which generates output tokens one at a time.

FlashMLA is intended for AI researchers and engineers who run or build large language model inference systems, particularly those working with DeepSeek models. It is not a general-purpose library: it requires specific high-end NVIDIA GPUs (the H800 or B200 class) along with recent versions of CUDA and PyTorch. The performance numbers cited in the documentation are measured in teraflops (one teraflop is a trillion floating-point operations per second), with figures in the hundreds, which gives a sense of how specialized this code is.

Installation involves cloning the repository and running a standard Python package install command. Usage involves calling a small set of Python-facing functions that wrap the underlying GPU kernels. The full README is longer than what was shown.
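As a rough guide to where a teraflops figure for attention comes from, the sketch below counts the floating-point operations in the two matrix multiplications of an attention call and divides by elapsed time. It uses the common convention of two FLOPs per multiply-add and ignores causal masking and the softmax itself. The function names and parameters are hypothetical, and this is not necessarily how the repo's own benchmarks count.

```python
def attention_flops(batch, heads, q_len, kv_len, qk_dim, v_dim):
    """Approximate FLOPs for one attention call.

    Counts the two matmuls (scores = Q @ K^T, output = P @ V), with each
    multiply-add counted as 2 FLOPs. Softmax and masking are ignored.
    """
    return 2 * batch * heads * q_len * kv_len * (qk_dim + v_dim)


def tflops(flops, seconds):
    """Convert a FLOP count and a wall-clock time into teraflops."""
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical shapes and timing, purely for illustration.
    work = attention_flops(batch=8, heads=64, q_len=1, kv_len=8192,
                           qk_dim=128, v_dim=128)
    print(f"{tflops(work, seconds=1e-4):.2f} TFLOPS")
```

To compare two implementations fairly, the same FLOP count should be used for both, so that only the measured time differs.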

Copy-paste prompts

Prompt 1
Show me how to replace the standard attention call in a PyTorch transformer model with FlashMLA's optimized kernel and measure the throughput improvement on an H800 GPU.
Prompt 2
I'm building a DeepSeek-V3 inference pipeline. Walk me through integrating FlashMLA's decode-phase attention kernel to reduce per-token latency.
Prompt 3
Explain the difference between FlashMLA's dense and sparse attention variants, and help me decide which to use for a 128K-token context window inference job.
Prompt 4
Write a benchmark script that compares FlashMLA prefill attention throughput against a vanilla PyTorch attention implementation on the same H800 GPU.
Prompt 5
How do I install FlashMLA from source and verify it is using the correct CUDA version and GPU before running inference?

Verify against the repo before relying on details.