Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia | NVIDIA / Stanford / Microsoft Research | SC '21 | DOI: 10.1145/3458817.3476209


Problem

Transformer language models grow faster than any single accelerator can hold or compute on. A 1-trillion-parameter GPT model needs roughly 16 TB just for fp16 parameters and optimizer state — far beyond a single 80 GB A100 — and GPT-3 alone would take ~288 years on a single V100. Pure data parallelism is insufficient because per-GPU minibatches collapse and model state still does not fit. Tensor (intra-layer) parallelism works inside one server but suffers across servers as all-reduces hit InfiniBand and GEMMs shrink. Pipeline (inter-layer) parallelism scales across servers but introduces "pipeline bubble" idle time that can reach 50%. Practitioners had no end-to-end recipe for composing all three at the 3000-GPU scale.


Core Insight

Compose Pipeline + Tensor + Data parallelism along the natural hardware boundaries — tensor parallelism inside a server (NVLink/NVSwitch), pipeline parallelism across servers (InfiniBand), data parallelism on top — and add a novel interleaved 1F1B schedule that shrinks the pipeline bubble by factor v at the cost of v-fold communication, made affordable by a scatter/gather optimization that cuts inter-stage IB volume by factor t.


Method

PTD-P composes three orthogonal parallelism dimensions and four implementation optimizations:

+-------------------------------------------------------------+
| PTD-P: Pipeline (p) x Tensor (t) x Data (d), n = p*t*d      |
+-------------------------------------------------------------+
| Tensor (t):  intra-server, NVLink, 4 all-reduces / layer    |
| Pipeline (p): inter-server, IB P2P, microbatched 1F1B       |
| Data (d):    cluster-wide, IB all-reduce once per batch     |
+-------------------------------------------------------------+
| Optimizations:                                              |
|   - Interleaved 1F1B schedule  (bubble /= v)                |
|   - Scatter/gather across t IB cards  (IB volume /= t)      |
|   - Operator fusion (bias+GeLU, bias+dropout+add, attn)     |
|   - Activation recomputation every 1-2 layers               |
+-------------------------------------------------------------+

Three configuration takeaways guide (p, t, d, b) selection:

Closed-form sizing equations are provided:


Experimental Setup

Component Value
Cluster NVIDIA Selene Supercomputer
Nodes up to 384 DGX A100
GPUs up to 3072 A100 80 GB
Intra-node NVLink + NVSwitch (8 GPUs/node)
Inter-node 8 × 200 Gb/s Mellanox HDR InfiniBand
Topology 3-level fat-tree, 850 switches
Storage All-NVMe shared parallel filesystem
Framework PyTorch 1.8.0a0
Collectives NCCL 2.8.4
Models GPT, 1.7 B – 1.008 T parameters
Batch sizes 512 – 3072

Headline Quantitative Results

Workload Best config Result
1 T model, 3072 GPUs t=8, p=64, d=6, B=3072 502 pFLOP/s aggregate, 163 tFLOP/s/GPU = 52% peak
530 B model, 2520 GPUs t=8, p=35 410.2 pFLOP/s, 163 tFLOP/s/GPU = 52% peak
175 B model, 1024 GPUs (per Table 1) 140 tFLOP/s/GPU = 45% peak; 34 days for 300 B tokens
1.7 B model, 32 GPUs t=1, p=1 137 tFLOP/s/GPU = 44% peak

PTD vs. ZeRO-3:

Optimization-by-optimization wins:

Inter-node bandwidth (1 T on 3072 GPUs):

Checkpoint I/O (1 T model):


Limitations


Open Problems

The paper itself flags four future directions: (1) automated search over (p, t, d, b) configurations rather than heuristic choice; (2) extending PTD-P to asymmetric / heterogeneous architectures; (3) bounded-staleness or asynchronous pipeline schedules that preserve convergence; and (4) reducing the data-parallel all-reduce cost at extreme scale, since at 1 T parameters the data-parallel collective already drives 12.9 TB/s of bisection traffic and would dominate further scale-out.


Note on NCCL Tuning

PTD-P puts three NCCL collective patterns on one fabric simultaneously: tiny intra-node tensor-parallel all-reduces (every microbatch, latency-bound), inter-node pipeline P2P sends (every microbatch), and cluster-wide data-parallel all-reduces (once per batch, ~12.9 TB/s bisection). The scatter/gather optimization deliberately reshapes inter-stage messages from b·s·h once to b·s·h/t per IB card — a t-fold change in the size regime NCCL sees. This is direct evidence that the right algorithm/protocol/chunkSize for a given collective depends on which phase is in flight, not on a single global setting.