Brief Summary: Megatron-LM
Full title: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Authors: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro (NVIDIA)
Year: 2020 (arXiv:1909.08053v4)
Venue: arXiv / SC-adjacent systems work
Problem
Very large transformer language models (GPT-2, BERT) exceed the memory capacity of a single GPU. Existing model-parallel approaches (GPipe, Mesh-TensorFlow) require custom compilers or significant code rewrites, or they introduce pipeline bubbles and optimizer modifications that can destabilize training.
Core Insight
The inherent structure of transformer layers (an MLP block and a self-attention block, each expressible as a sequence of GEMM operations) allows intra-layer (tensor) model parallelism to be implemented with just two all-reduces in the forward pass and two in the backward pass per transformer layer, requiring no new compiler and no changes to the optimizer.
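Concretely, with the MLP's first GEMM written as Y = GeLU(XA) (notation from the paper), the column split keeps the nonlinearity local, while a row split would force a synchronization before it:

```latex
% Column partition A = [A_1, A_2]: GeLU commutes with the split,
% so each partition computes its GeLU with no communication.
\mathrm{GeLU}(X[A_1, A_2]) = [\,\mathrm{GeLU}(XA_1),\ \mathrm{GeLU}(XA_2)\,]

% A row partition (with X = [X_1, X_2] split to match) would need an
% all-reduce before the nonlinearity, because GeLU is not linear:
\mathrm{GeLU}(X_1 A_1 + X_2 A_2) \neq \mathrm{GeLU}(X_1 A_1) + \mathrm{GeLU}(X_2 A_2)
```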
Method
- MLP parallelism: Split the first GEMM's weight matrix along columns (A = [A1, A2]) and the second along rows. GeLU then applies independently on each partition with no synchronization. One all-reduce in the forward pass (the g operator) and one in the backward pass (the f operator) suffice for the whole block; see the sketch after this list.
- Self-attention parallelism: Split the Q, K, V projections column-wise along attention heads, so per-head computation is fully local; split the output linear projection row-wise. As in the MLP block, one forward and one backward all-reduce suffice, for a total of four all-reduces per transformer layer (two forward, two backward).
- Embedding: The vocabulary embedding matrix is partitioned along the vocabulary dimension (column-wise); the output-logit GEMM is fused with the cross-entropy loss so that per-token scalar losses (b × s values) are communicated instead of the full b × s × v logit tensor, greatly reducing bandwidth.
- Hybrid parallelism: Combined with data parallelism: 8-way model-parallel groups × 64-way data parallelism = 512 GPUs total for the 8.3B model (group construction sketched below).
- Implemented in PyTorch with native NCCL collectives; no custom compiler.
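A minimal PyTorch sketch of the two conjugate operators and the parallel MLP built from them, assuming a tensor-parallel process group has been created elsewhere; class and variable names here are illustrative, not Megatron's exact API:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# Hypothetical handle to the tensor (model) parallel process group,
# assumed to be created elsewhere via dist.new_group(...).
mp_group = None

class _CopyToModelParallel(torch.autograd.Function):
    """The paper's f operator: identity in forward, all-reduce in backward."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        dist.all_reduce(grad, group=mp_group)  # sum input gradients across ranks
        return grad

class _ReduceFromModelParallel(torch.autograd.Function):
    """The paper's g operator: all-reduce in forward, identity in backward."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x, group=mp_group)  # sum row-parallel partial outputs
        return x
    @staticmethod
    def backward(ctx, grad):
        return grad

class ParallelMLP(nn.Module):
    """MLP with the first weight split by columns, the second by rows.

    Each rank holds 4*hidden/world of the intermediate dimension, so GeLU
    applies locally; one forward all-reduce (g) recombines the output.
    """
    def __init__(self, hidden, world):
        super().__init__()
        self.fc1 = nn.Linear(hidden, 4 * hidden // world, bias=False)  # column shard
        self.fc2 = nn.Linear(4 * hidden // world, hidden, bias=False)  # row shard
        self.act = nn.GELU()

    def forward(self, x):
        x = _CopyToModelParallel.apply(x)         # f: no-op forward
        y = self.act(self.fc1(x))                 # local GEMM + GeLU, no sync
        z = self.fc2(y)                           # partial sums of the output
        return _ReduceFromModelParallel.apply(z)  # g: the one forward all-reduce
```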
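The hybrid scheme partitions ranks into orthogonal groups. A hedged sketch of that construction with torch.distributed, assuming the contiguous-rank layout commonly used (model-parallel groups within a node); the paper's actual mapping onto DGX-2H servers may differ in detail:

```python
import torch.distributed as dist

def init_groups(world_size=512, mp_size=8):
    """Build orthogonal model-parallel and data-parallel process groups.

    Ranks [i*mp_size, (i+1)*mp_size) share a model-parallel group
    (intra-node, NVLink); ranks at the same offset within their group form
    a data-parallel group of size world_size // mp_size (64 here).
    Every rank must create every group, in the same order.
    """
    dp_size = world_size // mp_size
    rank = dist.get_rank()
    mp_group, dp_group = None, None
    for i in range(dp_size):  # 64 model-parallel groups of 8 ranks each
        ranks = list(range(i * mp_size, (i + 1) * mp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            mp_group = g
    for j in range(mp_size):  # 8 data-parallel groups of 64 ranks each
        ranks = list(range(j, world_size, mp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return mp_group, dp_group
```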
Key Results
| Configuration | GPUs | Sustained throughput | Weak-scaling efficiency |
|---|---|---|---|
| 1.2B (single-GPU baseline) | 1 | 39 TFLOPs | 100% |
| 8.3B, model parallel | 8 | — | 77% |
| 8.3B, model + data parallel | 512 | 15.1 PFLOPs | 74% |
- GPT-2 8.3B: WikiText103 perplexity 10.81 (SOTA vs. prior 15.79); LAMBADA accuracy 66.51% (SOTA vs. 63.24%)
- BERT 3.9B: RACE test accuracy 90.9% (SOTA vs. 89.4%)
- Rearranging layer normalization and the residual connections (moving LayerNorm to the sublayer input, i.e. pre-LN) found to be critical for stable scaling of BERT-style models beyond 336M parameters; see the sketch below.
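A minimal sketch of the two orderings, where the sublayer stands for attention or the MLP; this illustrates the rearrangement only, not BERT's full block:

```python
import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Original BERT ordering: LayerNorm after the residual add.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Rearranged ordering: LayerNorm on the sublayer input; the residual
    # path stays an identity, which the paper found necessary for stable
    # training beyond 336M parameters.
    return x + sublayer(norm(x))

# Example usage: both orderings share the same parameters.
h = 1024
block, norm = nn.Linear(h, h), nn.LayerNorm(h)  # stand-in for attention/MLP
x = torch.randn(2, 16, h)
y_post, y_pre = post_ln(x, block, norm), pre_ln(x, block, norm)
```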
Limitations
- The intra-layer parallel degree is bounded by the number of attention heads; the approach cannot partition beyond the head count.
- Communication overhead grows with model parallel degree; above 8 GPUs the approach requires inter-node bandwidth, which degrades efficiency.
- Does not address pipeline parallelism; models must still fit across the model-parallel group within a single node.
- Strong scaling (fixed batch) shows diminishing returns above 2–4 GPUs.
Relevance to DynamICCL
Megatron-LM is a primary workload driver for DynamICCL. Its training loop generates the exact collective pattern DynamICCL must optimize:
- AllReduce dominates: Data-parallel gradient synchronization produces large AllReduce collectives after each backward pass. With 64-way data parallelism across 512 GPUs, these are high-volume, bandwidth-bound operations.
- Intra-layer all-reduces: The four tensor-parallel all-reduces per transformer layer (two forward, two backward) are small, frequent, and latency-sensitive, exactly the regime where NCCL algorithm/protocol selection matters most.
- nChannels and numThreads sensitivity: At 8-way model parallelism on NVLink-connected intra-node GPUs, the optimal NCCL configuration differs from that of the inter-node data-parallel AllReduces. DynamICCL's Config Agent must distinguish these two traffic classes; a toy classifier is sketched after this list.
- Congestion trigger: Mixed intra/inter-node traffic from hybrid parallelism creates bursty congestion patterns on the inter-node fabric, precisely the signal DynamICCL's Trigger Agent (LSTM+CUSUM) is designed to detect; a minimal CUSUM sketch closes this note.
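A hypothetical sketch of how a config agent might separate the two traffic classes from the process-group placement and message size alone; the function name, class labels, and the 16 MiB cutoff are illustrative assumptions, not DynamICCL's API. The NCCL knobs referenced in the comments (NCCL_ALGO, NCCL_PROTO, NCCL_MIN_NCHANNELS/NCCL_MAX_NCHANNELS, NCCL_NTHREADS) are real environment variables, though NCCL reads them at communicator initialization:

```python
def classify_allreduce(intra_node: bool, message_bytes: int) -> str:
    """Tag an AllReduce so per-class NCCL settings can be chosen.

    Tensor-parallel activations (8 ranks, NVLink, small messages) favor
    low-latency settings, e.g. NCCL_PROTO=LL with fewer channels;
    data-parallel gradients (64 ranks, inter-node, large messages) favor
    bandwidth, e.g. NCCL_PROTO=Simple, NCCL_ALGO=Ring, more channels/threads.
    """
    if intra_node and message_bytes < 16 * 2**20:  # 16 MiB cutoff: an assumption
        return "tensor_parallel"  # latency-bound class
    return "data_parallel"        # bandwidth-bound class
```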
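On the detection side, a minimal one-sided CUSUM over a residual stream (for instance, observed minus LSTM-predicted collective latency); the drift and threshold values are placeholders, and this illustrates the principle rather than DynamICCL's implementation:

```python
def cusum_detect(residuals, drift=0.5, threshold=5.0):
    """One-sided CUSUM change-point detector.

    Accumulates positive deviations of the residuals beyond `drift` and
    returns the index at which the cumulative sum crosses `threshold`
    (e.g., the onset of inter-node congestion), or None if no change.
    """
    s = 0.0
    for i, r in enumerate(residuals):
        s = max(0.0, s + r - drift)  # reset to zero when evidence recedes
        if s > threshold:
            return i
    return None
```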