Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia | NVIDIA / Stanford / Microsoft Research | SC '21 | DOI: 10.1145/3458817.3476209
Problem
Transformer language models grow faster than any single accelerator can hold or compute on. A 1-trillion-parameter GPT model needs roughly 16 TB just for fp16 parameters and optimizer state — far beyond a single 80 GB A100 — and GPT-3 alone would take ~288 years on a single V100. Pure data parallelism is insufficient because per-GPU minibatches collapse and model state still does not fit. Tensor (intra-layer) parallelism works inside one server but suffers across servers as all-reduces hit InfiniBand and GEMMs shrink. Pipeline (inter-layer) parallelism scales across servers but introduces "pipeline bubble" idle time that can reach 50%. Practitioners had no end-to-end recipe for composing all three at the 3000-GPU scale.
Core Insight
Compose Pipeline + Tensor + Data parallelism along the natural hardware boundaries — tensor parallelism inside a server (NVLink/NVSwitch), pipeline parallelism across servers (InfiniBand), data parallelism on top — and add a novel interleaved 1F1B schedule that shrinks the pipeline bubble by factor v at the cost of v-fold communication, made affordable by a scatter/gather optimization that cuts inter-stage IB volume by factor t.
Method
PTD-P composes three orthogonal parallelism dimensions and four implementation optimizations:
+-------------------------------------------------------------+
| PTD-P: Pipeline (p) x Tensor (t) x Data (d), n = p*t*d |
+-------------------------------------------------------------+
| Tensor (t): intra-server, NVLink, 4 all-reduces / layer |
| Pipeline (p): inter-server, IB P2P, microbatched 1F1B |
| Data (d): cluster-wide, IB all-reduce once per batch |
+-------------------------------------------------------------+
| Optimizations: |
| - Interleaved 1F1B schedule (bubble /= v) |
| - Scatter/gather across t IB cards (IB volume /= t) |
| - Operator fusion (bias+GeLU, bias+dropout+add, attn) |
| - Activation recomputation every 1-2 layers |
+-------------------------------------------------------------+
Three configuration takeaways guide (p, t, d, b) selection:
- Takeaway #1: Tensor parallelism up to degree g (GPUs/server), then pipeline parallelism across servers.
- Takeaway #2: Pick model-parallel size M = t·p so the model fits in memory, then scale with d.
- Takeaway #3: Microbatch size b trades GEMM efficiency vs. bubble fraction; optimum depends on (p, d, B).
Closed-form sizing equations are provided:
- Bubble fraction: (p−1)/m default, (1/v)·(p−1)/m interleaved
- Parameters: P = 12 l h² (1 + 13/(12h) + (V+s)/(12 l h))
- FLOPs/iter: F = 96 B s l h² (1 + s/(6h) + V/(16 l h))
- Train time: t_train ≈ 8 T P / (n X)
Experimental Setup
| Component | Value |
|---|---|
| Cluster | NVIDIA Selene Supercomputer |
| Nodes | up to 384 DGX A100 |
| GPUs | up to 3072 A100 80 GB |
| Intra-node | NVLink + NVSwitch (8 GPUs/node) |
| Inter-node | 8 × 200 Gb/s Mellanox HDR InfiniBand |
| Topology | 3-level fat-tree, 850 switches |
| Storage | All-NVMe shared parallel filesystem |
| Framework | PyTorch 1.8.0a0 |
| Collectives | NCCL 2.8.4 |
| Models | GPT, 1.7 B – 1.008 T parameters |
| Batch sizes | 512 – 3072 |
Headline Quantitative Results
| Workload | Best config | Result |
|---|---|---|
| 1 T model, 3072 GPUs | t=8, p=64, d=6, B=3072 | 502 pFLOP/s aggregate, 163 tFLOP/s/GPU = 52% peak |
| 530 B model, 2520 GPUs | t=8, p=35 | 410.2 pFLOP/s, 163 tFLOP/s/GPU = 52% peak |
| 175 B model, 1024 GPUs | (per Table 1) | 140 tFLOP/s/GPU = 45% peak; 34 days for 300 B tokens |
| 1.7 B model, 32 GPUs | t=1, p=1 | 137 tFLOP/s/GPU = 44% peak |
PTD vs. ZeRO-3:
- 175 B model: PTD beats ZeRO-3 by 6% at base scale.
- 530 B model: PTD beats ZeRO-3 by 24% at base scale.
- Doubling GPUs at fixed batch: PTD pulls 70% ahead of ZeRO-3 because ZeRO-3's all-gather and reduce-scatter cross all GPUs every microbatch.
Optimization-by-optimization wins:
- Interleaved schedule: up to 10% throughput.
- Scatter/gather: up to 11% throughput on communication-bound schedules.
- Operator fusion: +19% on 175 B (113 → 135 tFLOP/s/GPU); +11% on 530 B (133 → 148 tFLOP/s/GPU).
- Cross-node tensor parallelism (t > 8) is fatal: t=2,p=32 ≈ 100 tFLOP/s/GPU vs. t=8,p=8 ≈ 170 tFLOP/s/GPU on 162.2 B model.
- Optimal microbatch b = 2 for the 91 B model, (t,p) = (8,8), 64 GPUs.
- Activation recomputation costs 33% locally but enables larger batches that win up to 2× end-to-end.
Inter-node bandwidth (1 T on 3072 GPUs):
- Pipeline-parallel P2P: 892 GB/s effective bisection.
- Data-parallel all-reduce: 12.9 TB/s effective bisection (dominant network user).
Checkpoint I/O (1 T model):
- Checkpoint size: 13.8 TB.
- Peak read: 1 TB/s.
- Peak write: 273 GB/s (40% of peak filesystem bandwidth).
Limitations
- Assumes a homogeneous stack of identical transformer blocks; layer-to-stage assignment for asymmetric models (encoder–decoder, MoE) is left to other work.
- Strict synchronous semantics only — asynchronous and bounded-staleness pipeline schedules (PipeDream-2BW, PipeMare) are deferred to keep convergence identical to single-GPU SGD.
- Configuration is manual: (p, t, d, b) are picked from Takeaways #1–#3 and closed-form analysis, not by an automated search.
- Convergence-throughput trade-offs from relaxed-semantics techniques are deliberately avoided.
Open Problems
The paper itself flags four future directions: (1) automated search over (p, t, d, b) configurations rather than heuristic choice; (2) extending PTD-P to asymmetric / heterogeneous architectures; (3) bounded-staleness or asynchronous pipeline schedules that preserve convergence; and (4) reducing the data-parallel all-reduce cost at extreme scale, since at 1 T parameters the data-parallel collective already drives 12.9 TB/s of bisection traffic and would dominate further scale-out.
Note on NCCL Tuning
PTD-P puts three NCCL collective patterns on one fabric
simultaneously: tiny intra-node tensor-parallel all-reduces (every
microbatch, latency-bound), inter-node pipeline P2P sends (every
microbatch), and cluster-wide data-parallel all-reduces (once per batch,
~12.9 TB/s bisection). The scatter/gather optimization deliberately
reshapes inter-stage messages from b·s·h once to
b·s·h/t per IB card — a t-fold change in the size regime
NCCL sees. This is direct evidence that the right
algorithm/protocol/chunkSize for a given collective depends on which
phase is in flight, not on a single global setting.