Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia | NVIDIA / Stanford / Microsoft Research | SC '21 | DOI: 10.1145/3458817.3476209

Problem

Transformer language models grow faster than any single accelerator can hold or compute on. A 1-trillion-parameter GPT model needs roughly 16 TB just for fp16 parameters and optimizer state — far beyond a single 80 GB A100 — and GPT-3 alone would take ~288 years on a single V100. Pure data parallelism is insufficient because per-GPU minibatches collapse and model state still does not fit. Tensor (intra-layer) parallelism works inside one server but suffers across servers as all-reduces hit InfiniBand and GEMMs shrink. Pipeline (inter-layer) parallelism scales across servers but introduces "pipeline bubble" idle time that can reach 50%. Practitioners had no end-to-end recipe for composing all three at the 3000-GPU scale.
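
The 16 TB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: the per-parameter byte breakdown (mixed-precision Adam with fp16 weights and gradients plus fp32 master weights and two fp32 moments) is an assumption not spelled out above, and the helper names are hypothetical.

```python
import math

# Assumed bytes per parameter for mixed-precision Adam training.
# This breakdown is an illustrative assumption, not taken from the summary.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def model_state_bytes(num_params: float) -> float:
    """Bytes of weights, gradients, and optimizer state (activations excluded)."""
    return num_params * sum(BYTES_PER_PARAM.values())


def min_gpus_for_state(num_params: float, gpu_mem_bytes: float = 80e9) -> int:
    """Lower bound on the GPU count needed just to hold the model state."""
    return math.ceil(model_state_bytes(num_params) / gpu_mem_bytes)


if __name__ == "__main__":
    print(model_state_bytes(1e12) / 1e12, "TB of model state")
    print(min_gpus_for_state(1e12), "x 80 GB GPUs at minimum")
```

Under these assumptions the state alone spans at least 200 GPUs before any activation memory is counted, which is why data parallelism by itself cannot hold the model.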

Core Insight

Compose Pipeline + Tensor + Data parallelism along the natural hardware boundaries — tensor parallelism inside a server (NVLink/NVSwitch), pipeline parallelism across servers (InfiniBand), data parallelism on top — and add a novel interleaved 1F1B schedule that shrinks the pipeline bubble by factor v at the cost of v-fold communication, made affordable by a scatter/gather optimization that cuts inter-stage IB volume by factor t.

Method

PTD-P composes three orthogonal parallelism dimensions and four implementation optimizations:

+-------------------------------------------------------------+
| PTD-P: Pipeline (p) x Tensor (t) x Data (d), n = p*t*d      |
+-------------------------------------------------------------+
| Tensor (t):  intra-server, NVLink, 4 all-reduces / layer    |
| Pipeline (p): inter-server, IB P2P, microbatched 1F1B       |
| Data (d):    cluster-wide, IB all-reduce once per batch     |
+-------------------------------------------------------------+
| Optimizations:                                              |
|   - Interleaved 1F1B schedule  (bubble /= v)                |
|   - Scatter/gather across t IB cards  (IB volume /= t)      |
|   - Operator fusion (bias+GeLU, bias+dropout+add, attn)     |
|   - Activation recomputation every 1-2 layers               |
+-------------------------------------------------------------+
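
One way to picture the n = p*t*d decomposition is as a mapping from a global GPU rank to (pipeline, data, tensor) coordinates. The sketch below assumes a rank ordering with tensor parallelism innermost, so that each tensor-parallel group occupies consecutive ranks inside one 8-GPU node; that ordering, and the names `Coord`, `rank_to_coord`, `tensor_group`, and `same_node`, are illustrative assumptions rather than the paper's stated implementation.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    pipeline: int  # pipeline stage index, 0 .. p-1
    data: int      # data-parallel replica index, 0 .. d-1
    tensor: int    # tensor-parallel shard index, 0 .. t-1


def rank_to_coord(rank: int, t: int, p: int, d: int) -> Coord:
    """Map a global rank to coordinates, assuming tensor-innermost ordering."""
    if not 0 <= rank < t * p * d:
        raise ValueError("rank out of range for n = p*t*d")
    return Coord(pipeline=rank // (t * d), data=(rank // t) % d, tensor=rank % t)


def tensor_group(rank: int, t: int) -> List[int]:
    """Ranks that share this rank's tensor-parallel group (consecutive ranks)."""
    base = rank - rank % t
    return list(range(base, base + t))


def same_node(ranks: List[int], gpus_per_node: int = 8) -> bool:
    """True if every rank lives on the same server."""
    return len({r // gpus_per_node for r in ranks}) == 1


if __name__ == "__main__":
    t, p, d = 8, 64, 6  # the 1 T configuration: 8 * 64 * 6 = 3072 GPUs
    print(rank_to_coord(3071, t, p, d))
    print(same_node(tensor_group(13, t)))
```

With t equal to the number of GPUs per node, every tensor-parallel group stays on NVLink, while pipeline and data-parallel neighbours sit on other nodes.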

Three configuration takeaways guide (p, t, d, b) selection:

Takeaway #1: Tensor parallelism up to degree g (GPUs/server), then pipeline parallelism across servers.
Takeaway #2: Pick model-parallel size M = t·p so the model fits in memory, then scale with d.
Takeaway #3: Microbatch size b trades GEMM efficiency vs. bubble fraction; optimum depends on (p, d, B).
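
Takeaways #1 and #2 can be read as a simple selection rule. The sketch below is a hypothetical helper (`choose_layout` is not a Megatron-LM function): it takes the model-parallel size M as given, assumes M was already chosen so the model fits in memory, and leaves the microbatch size of Takeaway #3 to empirical tuning.

```python
from typing import Tuple


def choose_layout(n_gpus: int, gpus_per_node: int, model_parallel_size: int) -> Tuple[int, int, int]:
    """Return (t, p, d) following Takeaways #1 and #2.

    t grows up to the number of GPUs per server, the rest of the
    model-parallel size goes to pipeline stages, and the remaining
    GPUs become data-parallel replicas.
    """
    t = min(gpus_per_node, model_parallel_size)
    if model_parallel_size % t != 0:
        raise ValueError("model-parallel size must be divisible by t")
    if n_gpus % model_parallel_size != 0:
        raise ValueError("GPU count must be divisible by the model-parallel size")
    p = model_parallel_size // t
    d = n_gpus // model_parallel_size
    return t, p, d


if __name__ == "__main__":
    print(choose_layout(3072, 8, 512))  # 1 T model
    print(choose_layout(2520, 8, 280))  # 530 B model
    print(choose_layout(32, 8, 1))      # 1.7 B model
```

For the 1 T model this reproduces the t=8, p=64, d=6 layout reported in the results table.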

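Closed-form sizing equations are provided (p = pipeline stages, m = microbatches per batch, v = model chunks per device, l = layers, h = hidden size, V = vocabulary size, s = sequence length, B = batch size, T = training tokens, P = parameters, n = GPUs, X = achieved FLOP/s per GPU):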

Bubble fraction: (p−1)/m default, (1/v)·(p−1)/m interleaved
Parameters: P = 12 l h² (1 + 13/(12h) + (V+s)/(12 l h))
FLOPs/iter: F = 96 B s l h² (1 + s/(6h) + V/(16 l h))
Train time: t_train ≈ 8 T P / (n X)
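
The four equations above translate directly into code. The sketch below is a plain transcription; the function names are hypothetical, and the GPT-3-like shape used in the example (96 layers, hidden size 12288, vocabulary 51200, sequence length 2048) is an assumption supplied for illustration.

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Pipeline bubble as a fraction of ideal compute time; v > 1 is interleaved."""
    return (p - 1) / (v * m)


def num_parameters(l: int, h: int, V: int, s: int) -> float:
    """P = 12 l h^2 (1 + 13/(12h) + (V+s)/(12 l h))."""
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))


def flops_per_iteration(B: int, s: int, l: int, h: int, V: int) -> float:
    """F = 96 B s l h^2 (1 + s/(6h) + V/(16 l h))."""
    return 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))


def training_days(T: float, P: float, n: int, X: float) -> float:
    """t_train ~ 8 T P / (n X), converted from seconds to days."""
    return 8 * T * P / (n * X) / 86400


if __name__ == "__main__":
    P = num_parameters(l=96, h=12288, V=51200, s=2048)
    print(f"parameters: {P / 1e9:.1f} B")
    print(f"days for 300 B tokens on 1024 GPUs at 140 tFLOP/s: "
          f"{training_days(300e9, 175e9, 1024, 140e12):.1f}")
    print(f"bubble, p=8, m=32: {bubble_fraction(8, 32):.3f} default, "
          f"{bubble_fraction(8, 32, v=2):.3f} interleaved with v=2")
```

The factor 8 in the training-time estimate follows from the leading terms of the other two equations: 96 B s l h² divided by 12 l h² leaves 8 FLOPs per parameter per token, with B·s tokens per iteration.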

Experimental Setup

Component	Value
Cluster	NVIDIA Selene Supercomputer
Nodes	up to 384 DGX A100
GPUs	up to 3072 A100 80 GB
Intra-node	NVLink + NVSwitch (8 GPUs/node)
Inter-node	8 × 200 Gb/s Mellanox HDR InfiniBand
Topology	3-level fat-tree, 850 switches
Storage	All-NVMe shared parallel filesystem
Framework	PyTorch 1.8.0a0
Collectives	NCCL 2.8.4
Models	GPT, 1.7 B – 1.008 T parameters
Batch sizes	512 – 3072

Headline Quantitative Results

Workload	Best config	Result
1 T model, 3072 GPUs	t=8, p=64, d=6, B=3072	502 pFLOP/s aggregate, 163 tFLOP/s/GPU = 52% peak
530 B model, 2520 GPUs	t=8, p=35	410.2 pFLOP/s, 163 tFLOP/s/GPU = 52% peak
175 B model, 1024 GPUs	(per Table 1)	140 tFLOP/s/GPU = 45% peak; 34 days for 300 B tokens
1.7 B model, 32 GPUs	t=1, p=1	137 tFLOP/s/GPU = 44% peak
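
The percent-of-peak column can be sanity-checked against the A100's dense 16-bit tensor-core peak of 312 tFLOP/s. The sketch below assumes that denominator and uses hypothetical helper names; the aggregate it computes from the rounded per-GPU figure lands slightly off the reported totals, as expected from rounding.

```python
A100_PEAK_TFLOPS = 312.0  # assumed denominator: dense fp16/bf16 tensor-core peak

# (label, GPUs, reported tFLOP/s per GPU) copied from the table above
ROWS = [
    ("1 T", 3072, 163),
    ("530 B", 2520, 163),
    ("175 B", 1024, 140),
    ("1.7 B", 32, 137),
]


def percent_of_peak(tflops_per_gpu: float) -> float:
    """Achieved per-GPU throughput as a percentage of the assumed peak."""
    return 100.0 * tflops_per_gpu / A100_PEAK_TFLOPS


def aggregate_pflops(tflops_per_gpu: float, n_gpus: int) -> float:
    """Cluster-wide throughput in pFLOP/s from a per-GPU figure."""
    return tflops_per_gpu * n_gpus / 1000.0


if __name__ == "__main__":
    for label, gpus, tflops in ROWS:
        print(f"{label:>6}: {percent_of_peak(tflops):.0f}% of peak, "
              f"{aggregate_pflops(tflops, gpus):.1f} pFLOP/s aggregate")
```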

PTD vs. ZeRO-3:

175 B model: PTD beats ZeRO-3 by 6% at base scale.
530 B model: PTD beats ZeRO-3 by 24% at base scale.
Doubling GPUs at fixed batch: PTD pulls 70% ahead of ZeRO-3 because ZeRO-3's all-gather and reduce-scatter cross all GPUs every microbatch.

Optimization-by-optimization wins:

Interleaved schedule: throughput gain of 10% or more.
Scatter/gather: up to 11% throughput gain on communication-intensive schedules (large batches with interleaving).
Operator fusion: +19% on 175 B (113 → 135 tFLOP/s/GPU); +11% on 530 B (133 → 148 tFLOP/s/GPU).
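Tensor/pipeline balance at fixed model-parallel size (162.2 B model, t·p = 64): t=8, p=8 peaks at ≈ 170 tFLOP/s/GPU, while t=2, p=32 drops to ≈ 100 tFLOP/s/GPU because of the much deeper pipeline. Raising t above 8 pushes tensor-parallel all-reduces across nodes and is also slower than t=8.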
Optimal microbatch b = 2 for the 91 B model, (t,p) = (8,8), 64 GPUs.
Activation recomputation lowers throughput by up to 33% at small batch sizes, but the memory it frees allows larger batches whose throughput is up to 2× the best achievable without recomputation.

Inter-node bandwidth (1 T on 3072 GPUs):

Pipeline-parallel P2P: 892 GB/s effective bisection.
Data-parallel all-reduce: 12.9 TB/s effective bisection (dominant network user).

Checkpoint I/O (1 T model):

Checkpoint size: 13.8 TB.
Peak read: 1 TB/s.
Peak write: 273 GB/s (40% of peak filesystem bandwidth).
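
These figures imply rough lower bounds on checkpoint load and save time. The sketch below assumes decimal units (1 TB = 10^12 bytes) and that the reported peak bandwidths are sustained for the whole transfer, which is optimistic; the names are hypothetical.

```python
CKPT_BYTES = 13.8e12       # reported checkpoint size, decimal TB assumed
READ_BW = 1.0e12           # reported peak read bandwidth, bytes/s
WRITE_BW = 273e9           # reported peak write bandwidth, bytes/s
WRITE_FRACTION = 0.40      # reported share of peak filesystem write bandwidth
NUM_PARAMS = 1.008e12      # largest model in the study


def transfer_seconds(size_bytes: float, bandwidth: float) -> float:
    """Best-case transfer time if the bandwidth is sustained throughout."""
    return size_bytes / bandwidth


if __name__ == "__main__":
    print(f"load:  >= {transfer_seconds(CKPT_BYTES, READ_BW):.1f} s")
    print(f"save:  >= {transfer_seconds(CKPT_BYTES, WRITE_BW):.1f} s")
    print(f"implied filesystem write peak: {WRITE_BW / WRITE_FRACTION / 1e9:.1f} GB/s")
    print(f"checkpoint bytes per parameter: {CKPT_BYTES / NUM_PARAMS:.1f}")
```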

Limitations

Assumes a homogeneous stack of identical transformer blocks; layer-to-stage assignment for asymmetric models (encoder–decoder, MoE) is left to other work.
Strict synchronous semantics only — asynchronous and bounded-staleness pipeline schedules (PipeDream-2BW, PipeMare) are deferred to keep convergence identical to single-GPU SGD.
Configuration is manual: (p, t, d, b) are picked from Takeaways #1–#3 and closed-form analysis, not by an automated search.
Convergence-throughput trade-offs from relaxed-semantics techniques are deliberately avoided.

Open Problems

The paper itself flags four future directions: (1) automated search over (p, t, d, b) configurations rather than heuristic choice; (2) extending PTD-P to asymmetric / heterogeneous architectures; (3) bounded-staleness or asynchronous pipeline schedules that preserve convergence; and (4) reducing the data-parallel all-reduce cost at extreme scale, since at 1 T parameters the data-parallel collective already drives 12.9 TB/s of bisection traffic and would dominate further scale-out.

Note on NCCL Tuning

PTD-P puts three NCCL communication patterns on one fabric simultaneously: small intra-node tensor-parallel all-reduces (every microbatch, latency-bound), inter-node pipeline P2P sends (every microbatch), and cluster-wide data-parallel all-reduces (once per batch, ~12.9 TB/s bisection). The scatter/gather optimization deliberately reshapes inter-stage messages from a full b·s·h tensor sent by every tensor-parallel rank to a b·s·h/t chunk per IB card, a t-fold change in the message size NCCL sees. This suggests that the best algorithm, protocol, and chunk size for a given collective depend on which phase is in flight rather than on a single global setting.
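
To make the size regimes concrete, the sketch below computes the inter-stage message size per IB card with and without scatter/gather. The shape used (microbatch size 1, sequence length 2048, hidden size 25600, fp16 activations) is an assumption chosen to resemble the 1 T model, and `interstage_bytes_per_card` is a hypothetical helper.

```python
def interstage_bytes_per_card(b: int, s: int, h: int, t: int,
                              scatter_gather: bool, bytes_per_elem: int = 2) -> int:
    """Bytes each IB card sends per microbatch between adjacent pipeline stages.

    Without scatter/gather every tensor-parallel rank sends the full
    b*s*h activation tensor; with it, each rank sends a 1/t slice and the
    receiver reassembles the tensor over NVLink.
    """
    full = b * s * h * bytes_per_elem
    return full // t if scatter_gather else full


if __name__ == "__main__":
    b, s, h, t = 1, 2048, 25600, 8  # assumed shape, fp16 activations
    before = interstage_bytes_per_card(b, s, h, t, scatter_gather=False)
    after = interstage_bytes_per_card(b, s, h, t, scatter_gather=True)
    print(f"without scatter/gather: {before / 2**20:.1f} MiB per card")
    print(f"with scatter/gather:    {after / 2**20:.1f} MiB per card")
```

Under these assumptions the per-card message drops from 100 MiB to 12.5 MiB, so any tuning done for one regime would not automatically carry over to the other.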