Brief Summary: Megatron-LM

Full title: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Authors: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro (NVIDIA)
Year: 2020 (arXiv:1909.08053v4)
Venue: arXiv / SC-adjacent systems work


Problem

Very large transformer language models (GPT-2, BERT) exceed the memory capacity of a single GPU. Existing model-parallelism approaches (GPipe, Mesh-TensorFlow) either require custom compilers or significant code rewrites, or introduce pipeline bubbles and optimizer instabilities.

Core Insight

The inherent structure of a transformer layer (an MLP block and a self-attention block, each expressible as a sequence of GEMM operations) allows intra-layer (tensor) model parallelism to be implemented with only two all-reduce operations in the forward pass and two in the backward pass per layer, requiring no new compiler and no changes to the optimizer.
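The MLP half of that claim can be checked in a few lines. Below is a minimal single-process sketch (NumPy standing in for torch.distributed; the shapes, the `world` size, and the helper names are illustrative, not Megatron-LM's code): splitting the first GEMM's weights column-wise and the second GEMM's weights row-wise leaves only a summation of per-rank partial outputs, i.e. the single forward-pass all-reduce.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU; elementwise, so it commutes with a column split
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
batch, d_model, d_ff, world = 4, 8, 32, 2   # hypothetical sizes, 2-way tensor parallel

X = rng.standard_normal((batch, d_model))
A = rng.standard_normal((d_model, d_ff))    # first MLP GEMM (expansion)
B = rng.standard_normal((d_ff, d_model))    # second MLP GEMM (projection)

# Serial reference: Z = GeLU(X A) B
Z_ref = gelu(X @ A) @ B

# Tensor parallel: split A column-wise and B row-wise across "GPUs";
# each rank computes a partial Z, and one sum (the all-reduce) recovers Z.
A_shards = np.split(A, world, axis=1)
B_shards = np.split(B, world, axis=0)
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
Z_par = np.sum(partials, axis=0)            # stands in for the forward all-reduce

assert np.allclose(Z_ref, Z_par)
```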

Method
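
The same column/row split carries over to the self-attention block: each GPU holds the Q/K/V projection columns for a subset of attention heads and the matching rows of the output projection, so the block again needs only one forward-pass all-reduce; in the paper the splits are wired up with a pair of conjugate operators (identity in forward / all-reduce in backward, and vice versa). A sketch of the attention half under the same assumptions as above (NumPy stand-in, illustrative names, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
seq, d_model, heads, world = 6, 16, 4, 2
d_head = d_model // heads

X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Wo = rng.standard_normal((d_model, d_model))

def attention(X, Wq, Wk, Wv, Wo, heads):
    # multi-head attention followed by the output projection
    d_head = Wq.shape[1] // heads
    out_parts = []
    for h in range(heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        out_parts.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    return np.concatenate(out_parts, axis=1) @ Wo

# Serial reference
Y_ref = attention(X, Wq, Wk, Wv, Wo, heads)

# Tensor parallel: each "GPU" owns heads // world heads (column shards of Wq/Wk/Wv)
# and the matching row shard of Wo; partial outputs sum via one all-reduce.
partials = []
heads_per_rank = heads // world
for r in range(world):
    cols = slice(r * heads_per_rank * d_head, (r + 1) * heads_per_rank * d_head)
    partials.append(
        attention(X, Wq[:, cols], Wk[:, cols], Wv[:, cols], Wo[cols, :], heads_per_rank)
    )
Y_par = np.sum(partials, axis=0)   # stands in for the forward all-reduce

assert np.allclose(Y_ref, Y_par)
```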

Key Results

| Config | GPUs | Sustained throughput | Weak-scaling efficiency |
| --- | --- | --- | --- |
| 1.2B (single-GPU baseline) | 1 | 39 TFLOP/s | 100% |
| 8.3B, model parallel | 8 | n/a | 77% |
| 8.3B, model + data parallel | 512 | 15.1 PFLOP/s | 74% |
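As a rough sanity check on the scaling numbers (assuming weak-scaling efficiency is the achieved aggregate throughput divided by GPU count times the single-GPU baseline; the paper's own weak-scaling baseline differs slightly, so the table's 74% is not exactly this ratio):

```python
baseline_tflops = 39        # single-GPU sustained throughput (1.2B model)
achieved_pflops = 15.1      # 512-GPU model + data parallel run (8.3B model)
efficiency = achieved_pflops * 1000 / (512 * baseline_tflops)
print(f"{efficiency:.0%}")  # ~76%, consistent with the 74-77% range reported
```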

Limitations

Relevance to DynamICCL

Megatron-LM is a primary workload driver for DynamICCL. Its training loop generates the exact collective pattern DynamICCL must optimize: