MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs — Detailed Summary

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu | ByteDance, Peking University | NSDI '24 (USENIX Symposium on Networked Systems Design and Implementation, 2024)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

MegaScale is a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs, deployed inside ByteDance.
Training LLMs at this scale brings unprecedented challenges to two intertwined goals: training efficiency (sustained MFU) and training stability (failure-free wall-clock time).
The authors take a full-stack co-design approach across (i) model block and optimizer design, (ii) computation/communication overlap, (iii) operator optimization, (iv) data pipeline, and (v) network performance tuning.
Stability is critical because LLM training jobs run for weeks; many hard stability issues only emerge at large scale, so in-depth observability is the second pillar of the design.
A custom diagnostic tool stack monitors system components and events deep in the stack to identify root causes of failures and stragglers and to drive fault-tolerance and straggler-mitigation policies.
Headline result: 55.2% MFU training a 175B model on 12,288 GPUs, a 1.34x improvement over Megatron-LM at the same configuration.
Operational experience identifying and fixing failures and stragglers is shared so that the systems community can replicate and extend the work.

1. Introduction

LLMs (GPT, Claude, Gemini, LLaMA) have become transformative AI technology; training them is resource-intensive (GPT-3 = 175B params, PaLM = 540B params), and the demand for ever-larger models is pushing training scale to

10,000 GPUs.
Two main scaling challenges: achieving high training efficiency measured by Model FLOPs Utilization (MFU = observed throughput / theoretical max throughput), and achieving high training stability so that failures and stragglers do not erase weeks of progress.
Devastating cost: a single failure on a 12k-GPU cluster wastes proportionally more wall-clock time than on a small cluster, so even rare faults dominate end-to-end training duration.
MegaScale's design rests on two principles:
1. Algorithm-system co-design — modify the model architecture and optimizer (parallel transformer block, sliding window attention, LAMB) in concert with the system (3D-parallel overlap, fused operators, data pipeline, network tuning).
2. In-depth observability — granular monitoring (heartbeat messages, diagnostic tests, CUDA event monitor) so root causes can be located and remediated.
Co-design contributions enumerated: parallel transformer blocks, sliding window attention (SWA), LAMB optimizer, customized 3D-parallelism overlap.
Observability contributions enumerated: heartbeat-driven driver, lightweight diagnostic tests, CUDA event monitor, distributed tracer, 3D-parallel visualization layer.
MegaScale is deployed in ByteDance datacenters; the paper reports a production run of multi-trillion tokens for several weeks on 10,000+ GPUs.

2. Background

LLM training is an iterative process; every iteration runs forward, backward, optimizer step, gradient sync, and parameter update across all replicas.
Data parallelism (DP): each worker holds a model replica, processes a micro-batch slice, and synchronizes gradients globally. ZeRO (Zero Redundancy Optimizer) stages 1/2/3 shard optimizer states / gradients / parameters; the all-reduce is decomposed into reduce-scatter (after backward) plus all-gather (before forward of next step).
Pipeline parallelism (PP): model layers are partitioned into stages; the global batch is split into micro-batches that flow through the stages. Megatron-LM uses interleaved 1F1B scheduling with model chunks and virtual pipeline stages to reduce bubble overhead.
Tensor parallelism (TP): individual operators (e.g., MLP and attention GEMMs) are sharded along the hidden dimension; sequence parallelism shards along the sequence dimension to reduce activation memory.
3D parallelism: in production, TP is confined within a node (NVLink), while DP and PP communication crosses nodes (RDMA over Ethernet/IB). This hierarchical assignment matches the 3D parallelism strategy to the cluster topology.

3. Efficient Training at Scale

The efficiency story is divided across algorithmic optimizations, communication overlap inside 3D parallelism, fused operators, the data pipeline, collective-communication initialization, and network tuning.

3.1 Algorithmic Optimizations

Parallel Transformer Block (PTB):
- Standard formulation: y = x + MLP(LN(x + Attention(LN(x)))) — Attention and MLP are serially dependent through one LayerNorm.
- Parallel formulation: y = x + MLP(LN(x)) + Attention(LN(x)) — Attention and MLP read the same normalized input and run in parallel; one LayerNorm instead of two on the critical path.
- Reduces compute on the critical path and exposes more parallelism, with no measurable accuracy degradation at the scales studied.
Sliding Window Attention (SWA):
- Replaces dense O(s^2) attention with windowed attention of complexity O(s * w) where s is sequence length and w is window size.
- Stacking layers expands the effective receptive field, recovering the long-range dependency capacity that dense attention provides.
LAMB Optimizer:
- Enables scaling the global batch size by 4x without accuracy loss.
- Larger batch reduces the number of pipeline micro-batches that the pipeline must absorb, cutting the pipeline bubble from 4(p-1) / (v * m) to (p-1) / (v * 4m) — an 87.5% reduction in pipeline bubble overhead.
The three algorithmic optimizations together change the work itself, not just the schedule, so they compose multiplicatively with the system-level overlap optimizations that follow.

3.2 Communication Overlapping in 3D Parallelism

Data parallelism overlap (Fig. 1):
- All-gather and reduce-scatter are scheduled on a per-model-chunk basis so each chunk's gradient sync overlaps with the backward of subsequent chunks.
- Initial all-gather of the first iteration is pre-fetched at iteration start so the model never blocks on parameter retrieval.
Pipeline parallelism overlap (Fig. 4):
- Send (S) and Receive (R) are decoupled in the warm-up and cool-down phases of the 1F1B schedule.
- In the warm-up, S overlaps with the next forward; in the steady phase, S and R are launched asynchronously to overlap with forward/backward compute.
- This converts pipeline-stage transitions from synchronous bubble events into asynchronously-overlapped events.
Tensor / sequence parallelism overlap (Fig. 3):
- All-gather and reduce-scatter on the FFN path are fused with the parallel Linear (column- and row-parallel GEMMs).
- GEMM kernels are broken into chunks A0, A1, ..., An; while chunk A1 computes, the partial result C0 is fed into reduce-scatter on a separate CUDA stream, pipelining compute and communication across chunks.

3.3 Efficient Operators

Adopted FlashAttention-2 (improved work partitioning over FlashAttention) for the attention kernel — better SM occupancy and fewer redundant reads/writes from HBM.
Custom kernel fusion for LayerNorm and GeLU — many small kernels collapsed into one launch, reducing CUDA-launch latency and memory traffic.
These op-level optimizations deliver bandwidth-bound speedups on the attention and norm hotspots that the algorithmic and 3D-parallelism layers cannot eliminate.

3.4 Data Pipeline

Asynchronous data preprocessing: the next iteration's tokenization / shuffling / batching runs in the background during the current iteration's gradient synchronization, so preprocessing never appears on the critical path.
Redundant dataloader elimination: instead of one dataloader per worker process pulling from disk independently, a single dedicated dataloader per machine fetches data and exposes it through shared memory to all workers on that node. This removes per-worker redundant disk reads, which can otherwise saturate the local disk bandwidth at the 8-GPU-per-node density.

3.5 Collective Communication Group Initialization

The default PyTorch initialization (which uses TCPStore) is single- threaded and synchronous; bringing up 2,048 GPUs took ~1,047 seconds in Megatron-LM's stock setup.
Two fixes:
1. Replace TCPStore with Redis as the key-value backing store — non-blocking and asynchronous, removing a serial bottleneck.
2. Optimize the barrier order during NCCL group construction so that redundant pairwise barriers are eliminated; complexity drops from O(n^2) to O(n).
Result: initialization time falls to <5 seconds on 2,048 GPUs and <30 seconds on >10,000 GPUs — a >35x speedup.

3.6 Network Performance Tuning

Topology: three-layer CLOS-like topology built around Broadcom Tomahawk 4 chips (25.6 Tbps, 64x400 Gbps) with 1:1 oversubscription.
ECMP hashing: 400G ports are split into 2x200G to reduce hash collision rate; multi-rail connectivity wires 8 NICs to 8 separate ToR switches; data-intensive nodes (same DP group) are scheduled under the same ToR to localize the heaviest traffic.
Congestion control: in-house algorithm fusing Swift (RTT-based measurement) and DCQCN (ECN-based notification). Swift's RTT signal reduces reliance on PFC, which suffers from head-of-line blocking; DCQCN's ECN response provides fast reaction to incipient congestion.
Retransmit: NCCL retransmit timer and retry count are tuned for fast recovery from transient link issues; the NIC's adap_retrans feature is enabled to shorten recovery from link flapping.
These four levers together push effective network utilization on the LLM training traffic pattern — bursty, many-flow, often elephant — close to line rate without PFC storms.

4. Fault Tolerance

Robust Training Workflow (Fig. 5):
- A central driver process orchestrates the training job; one executor per node runs the workers; Kubernetes evicts failed pods and replenishes them with healthy ones.
- A robust training daemon runs alongside each executor and emits heartbeats to the driver; the driver runs a Log Analyzer / Checker / User API to interpret these and decide policy.
Heartbeat data collection: each heartbeat carries IP, Pod name, hardware information, recent stdout/stderr logs, and RDMA traffic metrics. The driver compares these across the cluster to detect outliers.
Diagnostic tests: lightweight self-checks include intra-host loopback tests, RNIC-to-RNIC RDMA reachability tests, and NCCL all-to-all / all-reduce sanity tests. These run on suspicion of failure to localize the fault before training resumes.
Fast checkpointing (two-stage):
- Stage 1 — GPU to host RAM: on-chip optimizer/parameter states are written to host pinned memory in several seconds. This is the only blocking phase — training pauses just for this.
- Stage 2 — host RAM to HDFS: a separate background process asynchronously transfers the host-RAM snapshot to the durable distributed file system. Training has already resumed by this point.
Recovery flow:
1. Driver detects anomaly via heartbeats and RDMA-traffic metrics.
2. Driver suspends the task and triggers self-check diagnostics on suspect nodes.
3. Faulty nodes are identified, evicted by Kubernetes, replaced with healthy nodes.
4. Optimized data retrieval: instead of every worker reading the checkpoint from HDFS (which would saturate HDFS bandwidth), one worker per data-parallel group reads the shard and broadcasts it to the remaining group members via RDMA.
5. The job catches up to the latest checkpoint within ~15 minutes.

5. Training Troubleshooting

CUDA Event Monitor: records timestamps of CUDA events without forcing global synchronization; outputs heat-maps (per-rank per-step duration matrices) and trace files that can be ingested by standard visualization tools.
Heat-map case study — computational stragglers: revealed that 0.5% of machines were systematically ~10% slower in the forward pass. These outliers throttled the entire cluster because every iteration's tail latency is set by the slowest rank. Removing them lifted MFU by ~0.7%.
Distributed Tracer: visualizes execution order, idle bubbles, and cross-rank synchronization points; used to debug both pipeline-stage imbalance and pathological wait patterns inside collective communication.
3D-parallel visualization: maps log entries to the logical (TP, PP, DP) topology so that an apparent "cluster-wide timeout" can be traced back to the single faulty node whose hang propagated through the parallelism hierarchy.
Case study — cascading timeouts (defective GPUs): defective GPUs probabilistically hung inside NCCL operations, producing a swarm of cluster-wide timeouts. The diagnostic tool logs the in-flight operation on every rank that timed out; faulty nodes hang and log nothing, so they surface immediately.
Case study — mid-training MFU drop: during a long run, MFU drifted downward over hours/days. Step-by-step analysis with the CUDA event timer showed that forward / backward / optimizer compute were stable, but the last reduce-scatter in DP synchronization grew. Root cause: time skew across ranks where some ranks initiated reduce-scatter later than others, traced to fluctuations in the forward stage caused by irregular garbage collection and certain PyTorch operations on the critical path. Removing those code paths stabilized MFU.
Case study — network interface flapping: occasional multi-second stalls were caused by NIC links going down and up. Two lessons: (1) NCCL timeout thresholds must be set to a larger explicit value to avoid spurious failure declarations; (2) the deeper root cause was poor link quality between NIC, AOC cable, and switch — addressed via lower-level signal-strength quality control.

6. Experience

6.1 Scalability — Strong Scaling on the 175B Model

Batch Size	Method	GPUs	Iter Time (s)	Throughput (tokens/s)	Training Time (days, 300B tokens)	MFU	Aggregate PFlops/s
768	Megatron-LM	256	40.0	39.3k	88.35	53.0%	43.3
768	Megatron-LM	512	21.2	74.1k	46.86	49.9%	77.6
768	Megatron-LM	768	15.2	103.8k	33.45	46.7%	111.9
768	Megatron-LM	1024	11.9	132.7k	26.17	44.7%	131.9
768	MegaScale	256	32.0	49.0k	70.86	65.3% (1.23x)	52.2
768	MegaScale	512	16.5	95.1k	36.51	63.5% (1.27x)	101.4
768	MegaScale	768	11.5	136.7k	25.40	61.3% (1.31x)	146.9
768	MegaScale	1024	8.9	176.9k	19.62	59.0% (1.32x)	188.5
6144	Megatron-LM	3072	29.02	433.6k	8.01	48.7%	466.8
6144	Megatron-LM	6144	14.78	851.6k	4.08	47.8%	916.3
6144	Megatron-LM	8192	12.24	1027.9k	3.38	43.3%	1106.7
6144	Megatron-LM	12288	8.57	1466.8k	2.37	41.2%	1579.5
6144	MegaScale	3072	23.66	531.9k	6.53	59.1% (1.21x)	566.5
6144	MegaScale	6144	12.21	1030.9k	3.37	57.3% (1.19x)	1098.4
6144	MegaScale	8192	9.56	1315.6k	2.64	54.9% (1.26x)	1400.6
6144	MegaScale	12288	6.34	1984.0k	—	55.2% (1.34x)	2166.3

At every scale tested, MegaScale outperforms Megatron-LM in MFU; the gap widens at larger GPU counts, because MegaScale's communication-overlap and stragglers-mitigation techniques pay off more when the network and tail-latency burdens grow.

6.2 Scalability — Weak Scaling on the 530B Model

Weak-scaling experiments (where global batch grows with GPU count) on the 530B-parameter configuration show MegaScale maintaining 54.3% MFU at 11,200 GPUs versus Megatron-LM's 48.2% at the same scale — a 6.1 percentage-point absolute MFU advantage.
530B model configuration: 160 attention heads, 20,480 hidden size, 105 layers; sequence length 2,048; vocab 64,000.

6.3 Ablation — Where the MFU Comes From

Training the 175B model with 256 GPUs and batch size 256:

Idx	Method	MFU	Delta
1	Baseline (vanilla Megatron-LM)	47.7%	—
2	+ Parallel Transformer Block (PTB)	52.3%	+4.6%
3	+ Sliding Window Attention (SWA)	53.3%	+1.0% (cum +5.6)
4	+ TP overlap	55.5%	+2.2% (cum +7.8)
5	+ PP overlap	58.0%	+2.5% (cum +10.3)
6	+ DP overlap	59.5%	+1.5% (cum +11.8)
7	+ efficient operators (FlashAttn-2, fused LN/GeLU)	61.2%	+1.7% (cum +13.5)
8	+ misc optimizations	62.3%	+1.1% (cum +14.6)
9	+ LAMB (batch x 3)	65.3%	+3.0% (cum +17.6)

Every component contributes; the largest single jumps are PTB (+4.6%), the PP overlap step (+2.5%), and LAMB-driven batch scaling (+3.0%).
Op-level optimizations alone (row 7) deliver only +1.7%, showing that algorithm-system co-design is doing most of the heavy lifting.

6.4 Production Run

A proprietary model was trained on multi-trillion tokens for several weeks on more than 10,000 GPUs.
The job recovered automatically more than 100 times over the run.
Effective training time rate >90% (i.e., <10% wasted to failures and recoveries).
Average failure detection plus diagnostic time is <10 minutes; recovery (catch-up to last checkpoint) is <15 minutes.

Existing public reports on giant-model training (GPT-4, PaLM, LLaMA) focus on model performance and capabilities but seldom share infrastructure details at the >10k-GPU scale.
MegaScale is positioned as filling that gap: a systems paper that shares end-to-end pre-training experience, observability tools, and operational lessons.
Adjacent literature touched on but not exhaustively re-surveyed: Megatron-LM (the baseline), DeepSpeed/ZeRO (DP sharding), PyTorch FSDP (which inspired the all-gather pre-fetching), GShard / GSPMD, Alpa (auto-parallelism). Communication libraries: NCCL, MSCCL. Fault-tolerant training: CheckFreq, Bamboo, Oobleck.

8. Conclusion

MegaScale demonstrates that algorithm-system co-design plus deep observability are mandatory ingredients for scaling LLM training to

10,000 GPUs both efficiently and stably.
Headline numbers restated: 55.2% MFU for 175B on 12,288 GPUs, a 1.34x improvement over Megatron-LM; >90% effective training time; recovery in <15 minutes; cluster-init in <30 seconds even at 10k+ GPUs.
The authors hope that by articulating the problems and operational experience from a systems perspective, future LLM systems research can build on these foundations.

9. Limitations / Future Work

Reactive, not predictive, fault tolerance. Failures are detected after they occur; predicting failures in real large-scale distributed systems remains hard due to system complexity.
Staleness avoided. Some prior overlap techniques exploit gradient staleness for additional concurrency; MegaScale avoids staleness to preserve accuracy. Balanced staleness is left for future exploration.
Hardware-specific tuning. Network and operator optimizations are tuned for ByteDance's specific topology (Tomahawk-4, 8x400G NICs); some of the gains may not transfer to clusters with different switch radix or NIC counts.
No public open-source release of the full system at publication time; partial components were planned to be released via the veScale project.

10. Cross-Cutting Take-Aways

Take-away	Source
Algorithm-system co-design beats pure system optimization	Ablation table: ops-only delta is +1.7%; algorithmic/overlap deltas total +14.6%
Communication overlap dominates at scale	DP+PP+TP overlap together = +6.2% MFU; the overlap gain grows with GPU count
Init-time scales worse than O(n^2) without care	2048-GPU init: 1047 s -> <5 s after Redis + O(n) barriers
Tail latency, not mean latency, sets cluster throughput	0.5% slow nodes throttled the whole cluster; removing them recovered ~0.7% MFU
Two-stage checkpointing makes per-checkpoint cost negligible	Stage 1 = several seconds (blocking); Stage 2 = async to HDFS
Network co-tuning (Swift+DCQCN, ECMP, multi-rail) is required	Default DCQCN suffers PFC HOL blocking at LLM traffic patterns
In-depth observability turns weeks-long mysteries into hours-long bugs	Mid-training MFU drift root-caused to GC and PyTorch ops on critical path

Note on NCCL Tuning

MegaScale tunes NCCL itself rather than just scheduling on top of it: the authors enabled NIC adap_retrans, raised NCCL retransmit timer and retry count to survive multi-second link flapping, and tuned NCCL timeout thresholds explicitly to avoid spurious cluster-wide failures. The paper also notes that the last reduce-scatter in the DP step was the collective whose tail blew up under cross-rank time skew, identifying collective tail latency — not aggregate bandwidth — as the failure mode that matters at 10k+ GPUs. Both observations are direct evidence that NCCL parameter selection (timeouts, retransmit, and the chunking that governs reduce-scatter tail behaviour) deserves to be a first-class tuning dimension at scale.