MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs — Detailed Summary

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu | ByteDance, Peking University | NSDI '24 (USENIX Symposium on Networked Systems Design and Implementation, 2024)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction


2. Background


3. Efficient Training at Scale

3.1 Algorithmic Optimizations

3.2 Communication Overlapping in 3D Parallelism

3.3 Efficient Operators

3.4 Data Pipeline

3.5 Collective Communication Group Initialization

3.6 Network Performance Tuning


4. Fault Tolerance


5. Training Troubleshooting


6. Experience

6.1 Scalability — Strong Scaling on the 175B Model

Batch Size Method GPUs Iter Time (s) Throughput (tokens/s) Training Time (days, 300B tokens) MFU Aggregate PFlops/s
768 Megatron-LM 256 40.0 39.3k 88.35 53.0% 43.3
768 Megatron-LM 512 21.2 74.1k 46.86 49.9% 77.6
768 Megatron-LM 768 15.2 103.8k 33.45 46.7% 111.9
768 Megatron-LM 1024 11.9 132.7k 26.17 44.7% 131.9
768 MegaScale 256 32.0 49.0k 70.86 65.3% (1.23x) 52.2
768 MegaScale 512 16.5 95.1k 36.51 63.5% (1.27x) 101.4
768 MegaScale 768 11.5 136.7k 25.40 61.3% (1.31x) 146.9
768 MegaScale 1024 8.9 176.9k 19.62 59.0% (1.32x) 188.5
6144 Megatron-LM 3072 29.02 433.6k 8.01 48.7% 466.8
6144 Megatron-LM 6144 14.78 851.6k 4.08 47.8% 916.3
6144 Megatron-LM 8192 12.24 1027.9k 3.38 43.3% 1106.7
6144 Megatron-LM 12288 8.57 1466.8k 2.37 41.2% 1579.5
6144 MegaScale 3072 23.66 531.9k 6.53 59.1% (1.21x) 566.5
6144 MegaScale 6144 12.21 1030.9k 3.37 57.3% (1.19x) 1098.4
6144 MegaScale 8192 9.56 1315.6k 2.64 54.9% (1.26x) 1400.6
6144 MegaScale 12288 6.34 1984.0k 55.2% (1.34x) 2166.3

6.2 Scalability — Weak Scaling on the 530B Model

6.3 Ablation — Where the MFU Comes From

Training the 175B model with 256 GPUs and batch size 256:

Idx Method MFU Delta
1 Baseline (vanilla Megatron-LM) 47.7%
2 + Parallel Transformer Block (PTB) 52.3% +4.6%
3 + Sliding Window Attention (SWA) 53.3% +1.0% (cum +5.6)
4 + TP overlap 55.5% +2.2% (cum +7.8)
5 + PP overlap 58.0% +2.5% (cum +10.3)
6 + DP overlap 59.5% +1.5% (cum +11.8)
7 + efficient operators (FlashAttn-2, fused LN/GeLU) 61.2% +1.7% (cum +13.5)
8 + misc optimizations 62.3% +1.1% (cum +14.6)
9 + LAMB (batch x 3) 65.3% +3.0% (cum +17.6)

6.4 Production Run



8. Conclusion


9. Limitations / Future Work


10. Cross-Cutting Take-Aways

Take-away Source
Algorithm-system co-design beats pure system optimization Ablation table: ops-only delta is +1.7%; algorithmic/overlap deltas total +14.6%
Communication overlap dominates at scale DP+PP+TP overlap together = +6.2% MFU; the overlap gain grows with GPU count
Init-time scales worse than O(n^2) without care 2048-GPU init: 1047 s -> <5 s after Redis + O(n) barriers
Tail latency, not mean latency, sets cluster throughput 0.5% slow nodes throttled the whole cluster; removing them recovered ~0.7% MFU
Two-stage checkpointing makes per-checkpoint cost negligible Stage 1 = several seconds (blocking); Stage 2 = async to HDFS
Network co-tuning (Swift+DCQCN, ECMP, multi-rail) is required Default DCQCN suffers PFC HOL blocking at LLM traffic patterns
In-depth observability turns weeks-long mysteries into hours-long bugs Mid-training MFU drift root-caused to GC and PyTorch ops on critical path

Note on NCCL Tuning

MegaScale tunes NCCL itself rather than just scheduling on top of it: the authors enabled NIC adap_retrans, raised NCCL retransmit timer and retry count to survive multi-second link flapping, and tuned NCCL timeout thresholds explicitly to avoid spurious cluster-wide failures. The paper also notes that the last reduce-scatter in the DP step was the collective whose tail blew up under cross-rank time skew, identifying collective tail latency — not aggregate bandwidth — as the failure mode that matters at 10k+ GPUs. Both observations are direct evidence that NCCL parameter selection (timeouts, retransmit, and the chunking that governs reduce-scatter tail behaviour) deserves to be a first-class tuning dimension at scale.