Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM — Detailed Summary

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia | NVIDIA / Stanford / Microsoft Research | SC '21 | DOI: 10.1145/3458817.3476209 | github.com/nvidia/megatron-lm

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction


2. Modes of Parallelism

2.1 Data Parallelism

2.2 Pipeline Model Parallelism

2.2.1 Default Schedule (GPipe and PipeDream-Flush / 1F1B)

2.2.2 Schedule with Interleaved Stages (Novel)

2.3 Tensor Model Parallelism


3. Performance Analysis of Parallelization Configurations

3.1 Notation

3.2 Tensor and Pipeline Model Parallelism

3.3 Data and Model Parallelism

3.3.1 Pipeline Model Parallelism

3.3.2 Data and Tensor Model Parallelism

3.4 Microbatch Size

3.5 Activation Recomputation

3.6 Closed-form Sizes


4. Implementation

4.1 Communication Optimizations

4.2 Computation Optimizations


5. Evaluation

5.1 End-to-End Performance

5.2 Comparison to ZeRO-3

5.3 Pipeline Parallelism

5.3.1 Weak Scaling

5.3.2 Interleaved vs. Non-Interleaved (Figure 12)

5.4 Comparison of Parallel Configurations

5.4.1 Tensor vs. Pipeline (Figure 13)

5.4.2 Pipeline vs. Data (Figure 14)

5.4.3 Tensor vs. Data

5.5 Microbatch Size

5.6 Activation Recomputation

5.7 Scatter-Gather Optimization

5.8 Fused Operators

5.9 Inter-Node Communication Bandwidth (1 T model on 3072 GPUs)

5.10 Checkpoint Loading and Saving (1 T model)



7. Discussion and Conclusion


8. Major Named Methods, Systems, and Equations

Name Role
PTD-P The combined Pipeline + Tensor + Data parallelism recipe
1F1B (PipeDream-Flush) Default pipeline schedule, one forward / one backward
Interleaved 1F1B Novel schedule: each device holds v non-contiguous chunks, bubble shrinks by 1/v
Scatter/Gather Inter-stage communication optimization (IB volume / t reduction)
Megatron-LM Underlying tensor-parallel transformer library
ZeRO-3 DeepSpeed comparison baseline
Fused (bias + GeLU), (bias + dropout + add) PyTorch JIT operator fusion
Fused scale-mask-softmax Custom CUDA kernel for attention
Equation Meaning
t_pb / t_id = (p − 1)/m Bubble fraction, default schedule
t_pb_int / t_id = (1/v) · (p − 1)/m Bubble fraction, interleaved schedule
P = 12 l h² (1 + 13/(12h) + (V+s)/(12 l h)) Parameter count (transformer)
F = 96 B s l h² (1 + s/(6h) + V/(16 l h)) FLOPs per iteration
t_train ≈ 8 T P / (n X) Estimated training time

9. Quantitative Tables (Verbatim)

Table 1: Weak-scaling throughput for GPT models (1 B – 1 T parameters)

Params (B) Heads Hidden Layers t p GPUs Batch tFLOP/s/GPU % peak pFLOP/s
1.7 24 2304 24 1 1 32 512 137 44% 4.4
3.6 32 3072 30 2 1 64 512 138 44% 8.8
7.5 32 4096 36 4 1 128 512 142 46% 18.2
18.4 48 6144 40 8 1 256 1024 135 43% 34.6
39.1 64 8192 48 8 2 512 1536 138 44% 70.8
76.1 80 10240 60 8 4 1024 1792 140 45% 143.8
145.6 96 12288 80 8 8 1536 2304 148 47% 227.1
310.1 128 16384 96 8 16 1920 2160 155 50% 297.4
529.6 128 20480 105 8 35 2520 2520 163 52% 410.2
1008.0 160 25600 128 8 64 3072 3072 163 52% 502.0

Table 2: PTD vs. ZeRO-3

Scheme Params (B) Model-parallel size Batch GPUs Microbatch tFLOP/s/GPU Train time / 300 B tok (days)
ZeRO-3 174.6 1 1536 384 4 144 90
768 2 88 74
1536 1 44 74
529.6 1 2560* 640 4 138 169
2240 1120 2 98 137
2240 1 48 140
PTD 174.6 96 1536 384 1 153 84
768 1 149 43
1536 1 141 23
529.6 280 2240 560 1 171 156
1120 1 167 80
2240 1 159 42

10. Experimental Setup

Component Value
Cluster NVIDIA Selene Supercomputer
Nodes up to 384 DGX A100 nodes
GPUs up to 3072 NVIDIA A100 80 GB
Intra-node interconnect NVLink + NVSwitch (8 GPUs per node)
Inter-node interconnect 8 × 200 Gb/s Mellanox HDR InfiniBand per node
Network topology 3-level fat-tree, 850 switches
Storage All-NVMe shared parallel filesystem
Framework PyTorch 1.8.0a0
Collective library NCCL 2.8.4
Container nvcr.io/nvidia/pytorch:20.12-py3
Models GPT, 1.7 B – 1.008 T parameters
Batch sizes 512 – 3072 (per Table 1)
Headline runs 1 T model on 3072 GPUs; 530 B on 2520; 175 B on 1024

11. Limitations and Future Work


12. Discussion of NCCL and Collective Communication


13. Cross-Cutting Empirical Take-Aways

Take-away Quantitative evidence
Tensor parallelism within a server, pipeline across servers Table 1 (best configs all use t = 8 = GPUs/node); Figure 13 (t = 8, p = 8 wins on 64 GPUs)
Larger batch amortizes the pipeline bubble Sec 5.3.1 (B = 128 ≫ B = 8); Eq (p−1)/m
Interleaved schedule + scatter/gather buys ~10% Sec 5.3.2 + 5.7
Cross-node tensor parallelism is fatal Sec 5.4.1 (t = 2, p = 32 = ~100 tFLOP/s)
Fused kernels recover 11–19% throughput Sec 5.8
Activation recomputation enables 2× via larger batch despite 33% local cost Sec 5.6
1 T model is feasible at 52% peak on 3072 A100 Table 1 last row
ZeRO-3 stalls when sharded all-gather/reduce-scatter cross all GPUs Sec 5.2 (PTD-P 70% faster at 2× scale)

14. Figures Referenced


Note on NCCL Tuning

PTD-P composes three different NCCL collective patterns on the same fabric — small intra-node tensor-parallel all-reduces (every microbatch, latency-sensitive), inter-node pipeline P2P sends (every microbatch, scattered across t IB cards), and cluster-wide data-parallel all-reduces (once per batch, bandwidth-bound at 12.9 TB/s effective bisection). The paper's scatter/gather optimization explicitly chooses to send b·s·h/t per IB card rather than b·s·h once, which changes the message-size regime that NCCL sees per-call. This is a concrete demonstration that the right algorithm/protocol/chunkSize choice depends on which of the three collective phases is in flight, since each phase has a very different (size, topology, deadline) profile on the same hardware.