Brief Summary: Megatron-LM
Full title: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Authors: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro (NVIDIA)
Year: 2020 (arXiv:1909.08053v4)
Venue: arXiv / SC-adjacent systems work
Problem
Very large transformer language models (GPT-2, BERT) exceed the memory capacity of a single GPU. Existing model-parallel approaches (GPipe, Mesh-TensorFlow) require custom compilers or significant code rewrites, or they introduce pipeline bubbles and optimizer modifications that can destabilize training.
Core Insight
The inherent structure of transformer layers (an MLP block and a self-attention block, each expressible as a sequence of GEMM operations) allows intra-layer (tensor) model parallelism to be implemented with just two all-reduces in the forward pass and two in the backward pass per transformer layer, requiring no new compiler and no changes to the optimizer.
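Concretely, with the MLP's first GEMM written as Y = GeLU(XA) (notation from the paper), the column split keeps the nonlinearity local, while a row split would force a synchronization before it:

```latex
% Column partition A = [A_1, A_2]: GeLU commutes with the split,
% so each partition computes its GeLU with no communication.
\mathrm{GeLU}(X[A_1, A_2]) = [\,\mathrm{GeLU}(XA_1),\ \mathrm{GeLU}(XA_2)\,]

% A row partition (with X = [X_1, X_2] split to match) would need an
% all-reduce before the nonlinearity, because GeLU is not linear:
\mathrm{GeLU}(X_1 A_1 + X_2 A_2) \neq \mathrm{GeLU}(X_1 A_1) + \mathrm{GeLU}(X_2 A_2)
```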
Method
- MLP parallelism: Split the first GEMM's weight matrix along columns (A = [A1, A2]) and the second along rows. GeLU then applies independently on each partition with no synchronization. One all-reduce in the forward pass (the g operator) and one in the backward pass (the f operator) suffice for the whole block; see the sketch after this list.
- Self-attention parallelism: Split the Q, K, V projections column-wise along attention heads, so per-head computation is fully local; split the output linear projection row-wise. As in the MLP block, one forward and one backward all-reduce suffice, for a total of four all-reduces per transformer layer (two forward, two backward).
- Embedding: The vocabulary embedding matrix is partitioned along the vocabulary dimension (column-wise); the output-logit GEMM is fused with the cross-entropy loss so that per-token scalar losses (b × s values) are communicated instead of the full b × s × v logit tensor, greatly reducing bandwidth.
- Hybrid parallelism: Combined with data parallelism: 8-way model-parallel groups × 64-way data parallelism = 512 GPUs total for the 8.3B model (group construction sketched below).
- Implemented in PyTorch with native NCCL collectives; no custom compiler.
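A minimal PyTorch sketch of the two conjugate operators and the parallel MLP built from them, assuming a tensor-parallel process group has been created elsewhere; class and variable names here are illustrative, not Megatron's exact API:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# Hypothetical handle to the tensor (model) parallel process group,
# assumed to be created elsewhere via dist.new_group(...).
mp_group = None

class _CopyToModelParallel(torch.autograd.Function):
    """The paper's f operator: identity in forward, all-reduce in backward."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        dist.all_reduce(grad, group=mp_group)  # sum input gradients across ranks
        return grad

class _ReduceFromModelParallel(torch.autograd.Function):
    """The paper's g operator: all-reduce in forward, identity in backward."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x, group=mp_group)  # sum row-parallel partial outputs
        return x
    @staticmethod
    def backward(ctx, grad):
        return grad

class ParallelMLP(nn.Module):
    """MLP with the first weight split by columns, the second by rows.

    Each rank holds 4*hidden/world of the intermediate dimension, so GeLU
    applies locally; one forward all-reduce (g) recombines the output.
    """
    def __init__(self, hidden, world):
        super().__init__()
        self.fc1 = nn.Linear(hidden, 4 * hidden // world, bias=False)  # column shard
        self.fc2 = nn.Linear(4 * hidden // world, hidden, bias=False)  # row shard
        self.act = nn.GELU()

    def forward(self, x):
        x = _CopyToModelParallel.apply(x)         # f: no-op forward
        y = self.act(self.fc1(x))                 # local GEMM + GeLU, no sync
        z = self.fc2(y)                           # partial sums of the output
        return _ReduceFromModelParallel.apply(z)  # g: the one forward all-reduce
```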
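The hybrid scheme partitions ranks into orthogonal groups. A hedged sketch of that construction with torch.distributed, assuming the contiguous-rank layout commonly used (model-parallel groups within a node); the paper's actual mapping onto DGX-2H servers may differ in detail:

```python
import torch.distributed as dist

def init_groups(world_size=512, mp_size=8):
    """Build orthogonal model-parallel and data-parallel process groups.

    Ranks [i*mp_size, (i+1)*mp_size) share a model-parallel group
    (intra-node, NVLink); ranks at the same offset within their group form
    a data-parallel group of size world_size // mp_size (64 here).
    Every rank must create every group, in the same order.
    """
    dp_size = world_size // mp_size
    rank = dist.get_rank()
    mp_group, dp_group = None, None
    for i in range(dp_size):  # 64 model-parallel groups of 8 ranks each
        ranks = list(range(i * mp_size, (i + 1) * mp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            mp_group = g
    for j in range(mp_size):  # 8 data-parallel groups of 64 ranks each
        ranks = list(range(j, world_size, mp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return mp_group, dp_group
```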
Key Results
| Configuration | GPUs | Sustained throughput | Weak-scaling efficiency |
|---|---|---|---|
| 1.2B (single-GPU baseline) | 1 | 39 TFLOPs | 100% |
| 8.3B, model parallel | 8 | — | 77% |
| 8.3B, model + data parallel | 512 | 15.1 PFLOPs | 74% |
- GPT-2 8.3B: WikiText103 perplexity 10.81 (SOTA vs. prior 15.79); LAMBADA accuracy 66.51% (SOTA vs. 63.24%)
- BERT 3.9B: RACE test accuracy 90.9% (SOTA vs. 89.4%)
- Rearranging layer normalization and the residual connections (moving LayerNorm to the sublayer input, i.e. pre-LN) found to be critical for stable scaling of BERT-style models beyond 336M parameters; see the sketch below.
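A minimal sketch of the two orderings, where the sublayer stands for attention or the MLP; this illustrates the rearrangement only, not BERT's full block:

```python
import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Original BERT ordering: LayerNorm after the residual add.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Rearranged ordering: LayerNorm on the sublayer input; the residual
    # path stays an identity, which the paper found necessary for stable
    # training beyond 336M parameters.
    return x + sublayer(norm(x))

# Example usage: both orderings share the same parameters.
h = 1024
block, norm = nn.Linear(h, h), nn.LayerNorm(h)  # stand-in for attention/MLP
x = torch.randn(2, 16, h)
y_post, y_pre = post_ln(x, block, norm), pre_ln(x, block, norm)
```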
Limitations
- The intra-layer parallel degree is bounded by the number of attention heads; the approach cannot partition beyond the head count.
- Communication overhead grows with model parallel degree; above 8 GPUs the approach requires inter-node bandwidth, which degrades efficiency.
- Does not address pipeline parallelism; models must still fit across the model-parallel group within a single node.
- Strong scaling (fixed batch) shows diminishing returns above 2–4 GPUs.
Relevance to DynamICCL
Megatron-LM is a primary workload driver for DynamICCL. Its training loop generates the exact collective pattern DynamICCL must optimize:
- AllReduce dominates: Data-parallel gradient synchronization produces large AllReduce collectives after each backward pass. With 64-way data parallelism across 512 GPUs, these are high-volume, bandwidth-bound operations.
- Intra-layer all-reduces: The four tensor-parallel all-reduces per transformer layer (two forward, two backward) are small, frequent, and latency-sensitive, exactly the regime where NCCL algorithm/protocol selection matters most.
- nChannels and numThreads sensitivity: At 8-way model parallelism on NVLink-connected intra-node GPUs, the optimal NCCL configuration differs from that of the inter-node data-parallel AllReduces. DynamICCL's Config Agent must distinguish these two traffic classes; a toy classifier is sketched after this list.
- Congestion trigger: Mixed intra/inter-node traffic from hybrid parallelism creates bursty congestion patterns on the inter-node fabric, precisely the signal DynamICCL's Trigger Agent (LSTM+CUSUM) is designed to detect; a minimal CUSUM sketch closes this note.
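A hypothetical sketch of how a config agent might separate the two traffic classes from the process-group placement and message size alone; the function name, class labels, and the 16 MiB cutoff are illustrative assumptions, not DynamICCL's API. The NCCL knobs referenced in the comments (NCCL_ALGO, NCCL_PROTO, NCCL_MIN_NCHANNELS/NCCL_MAX_NCHANNELS, NCCL_NTHREADS) are real environment variables, though NCCL reads them at communicator initialization:

```python
def classify_allreduce(intra_node: bool, message_bytes: int) -> str:
    """Tag an AllReduce so per-class NCCL settings can be chosen.

    Tensor-parallel activations (8 ranks, NVLink, small messages) favor
    low-latency settings, e.g. NCCL_PROTO=LL with fewer channels;
    data-parallel gradients (64 ranks, inter-node, large messages) favor
    bandwidth, e.g. NCCL_PROTO=Simple, NCCL_ALGO=Ring, more channels/threads.
    """
    if intra_node and message_bytes < 16 * 2**20:  # 16 MiB cutoff: an assumption
        return "tensor_parallel"  # latency-bound class
    return "data_parallel"        # bandwidth-bound class
```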
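On the detection side, a minimal one-sided CUSUM over a residual stream (for instance, observed minus LSTM-predicted collective latency); the drift and threshold values are placeholders, and this illustrates the principle rather than DynamICCL's implementation:

```python
def cusum_detect(residuals, drift=0.5, threshold=5.0):
    """One-sided CUSUM change-point detector.

    Accumulates positive deviations of the residuals beyond `drift` and
    returns the index at which the cumulative sum crosses `threshold`
    (e.g., the onset of inter-node congestion), or None if no change.
    """
    s = 0.0
    for i, r in enumerate(residuals):
        s = max(0.0, s + r - drift)  # reset to zero when evidence recedes
        if s > threshold:
            return i
    return None
```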