nvidia/megatron-lm

★ 16,322 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((Megatron-LM))
    What It Does
      Trains large LLMs
      Multi-GPU scaling
      Research and production
    Parallelism Types
      Tensor parallelism
      Pipeline parallelism
      Data parallelism
    Precision Support
      FP8 and BF16
      Mixed precision
    Components
      Megatron-LM scripts
      Megatron Core library
    Audience
      ML researchers
      LLM engineers
      GPU cluster teams


Things people build with this

USE CASE 1

Train a custom large language model with hundreds of billions of parameters across a multi-GPU cluster

USE CASE 2

Fine-tune an existing large model using Megatron Core's composable pipeline designed for framework developers

USE CASE 3

Benchmark GPU cluster efficiency for LLM training using tensor, pipeline, and data parallelism together

USE CASE 4

Test FP8 and BF16 mixed-precision training to speed up compute on H100 GPU hardware

Tech stack

Python, PyTorch, CUDA

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a multi-GPU cluster with CUDA. The benchmarks use H100 and A100 hardware, and the project is not suitable for single-GPU or CPU-only setups.

In plain English

Megatron-LM is a GPU-optimized Python library from NVIDIA for training very large transformer models, the class of AI architectures that powers modern large language models. It is designed for research teams and ML engineers who need to train models ranging from 2 billion to hundreds of billions of parameters across thousands of GPUs simultaneously.

The repository contains two main components. Megatron-LM is the higher-level reference implementation with pre-configured training scripts, useful for learning or experimentation. Megatron Core is the lower-level, composable library that framework developers can use to build custom training pipelines.

The core technical challenge it solves is distributing model training across many GPUs efficiently. It does this by combining three parallelism strategies: tensor parallelism (splitting individual operations across GPUs), pipeline parallelism (splitting model layers across GPUs), and data parallelism (running copies of the same model on different data batches in parallel). It also supports mixed-precision training, which uses lower-precision number formats such as FP8 and BF16 to speed up computation.

According to the benchmarks, it achieves up to 47% Model FLOP Utilization (MFU, a measure of how much of the hardware's peak compute the training run actually uses) on H100 GPU clusters, tested up to a 462-billion-parameter model on 6,144 GPUs. You would use Megatron-LM if you are training or fine-tuning large language models at research or production scale and need tooling designed to work across large GPU clusters.
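The arithmetic behind the parallelism strategies and the MFU figure described above can be sketched in a few lines of Python. This is an illustrative sketch, not Megatron-LM code: the function names `data_parallel_size` and `estimate_mfu` are hypothetical, the example numbers are made up, and the "6 × parameters" FLOPs-per-token figure is a common rule of thumb for dense transformers rather than the exact formula behind the repository's benchmarks.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas left once tensor and pipeline parallelism are fixed.

    Each replica occupies tensor_parallel * pipeline_parallel GPUs, so the
    total GPU count must be a multiple of that product.
    """
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split evenly into replicas of "
            f"{gpus_per_replica} GPUs"
        )
    return world_size // gpus_per_replica


def estimate_mfu(
    num_params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Rough Model FLOP Utilization: achieved model FLOP/s over peak FLOP/s.

    Assumption: a training step costs about 6 FLOPs per parameter per token
    (forward plus backward), which ignores attention-specific terms.
    """
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical cluster: 64 GPUs, tensor parallel 8, pipeline parallel 4.
replicas = data_parallel_size(world_size=64, tensor_parallel=8, pipeline_parallel=4)
print(f"data-parallel replicas: {replicas}")  # 64 / (8 * 4) = 2

# Hypothetical throughput and peak numbers; take the real peak from the
# hardware datasheet and the real throughput from your training logs.
mfu = estimate_mfu(
    num_params=7e9,
    tokens_per_second=2.0e5,
    num_gpus=64,
    peak_flops_per_gpu=1.0e15,
)
print(f"estimated MFU: {mfu:.1%}")
```

The first function reflects that the three degrees multiply: tensor-parallel size times pipeline-parallel size times data-parallel size equals the total number of GPUs. The second is only an estimate, so compare it against the throughput the training run itself reports before drawing conclusions.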

Copy-paste prompts

Prompt 1

How do I set up Megatron-LM to fine-tune a language model across 8 A100 GPUs using tensor parallelism?

Prompt 2

What is the difference between Megatron-LM and Megatron Core, and which should I use to build a custom training pipeline?

Prompt 3

Walk me through configuring tensor parallelism and pipeline parallelism in Megatron-LM for a 65B parameter model

Prompt 4

How do I enable FP8 mixed precision training in Megatron-LM on an H100 GPU cluster?

Prompt 5

What does Model FLOP Utilization mean in the Megatron-LM benchmarks and how do I measure it for my own training run?


Verify against the repo before relying on details.