Detailed Summary: Megatron-LM

Full title: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Authors: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro (NVIDIA)
Year: 2020 (arXiv:1909.08053v4, submitted Sep 2019)
Venue: arXiv preprint


Abstract (paraphrased)

The paper presents a simple, efficient intra-layer model parallel approach for training transformer language models with billions of parameters. The approach requires no new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and is implemented in native PyTorch with a small number of communication primitives. The authors train GPT-2 models up to 8.3B parameters and BERT models up to 3.9B parameters on 512 NVIDIA V100 GPUs, achieving 15.1 PetaFLOPs/s at 76% weak scaling efficiency. New SOTA results are established on WikiText103, LAMBADA, and RACE.


Motivation

Natural language processing has shifted toward large pretrained transformer models (BERT, GPT-2) that benefit from scale: more parameters yield better downstream task performance. However, the memory required by model weights, activations, and optimizer state (Adam stores two additional moment tensors per parameter) quickly exceeds the capacity of a single 32 GB GPU. Activation checkpointing alleviates activation memory but not weight/optimizer memory. Parameter sharing (ALBERT) reduces memory but limits model capacity. The only general solution is distributing the model across multiple devices.

Existing model-parallel approaches have significant drawbacks: frameworks such as GPipe (pipeline parallelism) and Mesh-TensorFlow (tensor parallelism) require model rewrites, custom compilers, or new libraries rather than working in native PyTorch.

The authors observe that the transformer architecture has a specific repeating structure that can be exploited for model parallelism with minimal intervention.


Background

Transformer Architecture

A transformer layer consists of:

  1. A self-attention block: multi-head attention with Q, K, V projections, scaled dot-product attention, and an output linear projection.
  2. A two-layer MLP block: GEMM → GeLU → GEMM → Dropout, with residual connections and layer normalization surrounding both blocks.

Both GPT-2 (decoder-only, left-to-right) and BERT (encoder-only, bidirectional) use GeLU nonlinearities and layer normalization; the original transformer used ReLU. The nonlinearity matters for how parallelism is applied: like any element-wise nonlinearity, GeLU cannot be applied independently to the partial sums of a row-split GEMM, hence a column-split is preferred for the first GEMM.

Data vs. Model Parallelism

Data parallelism replicates the full model on every worker and splits the minibatch across workers, synchronizing gradients each step; it scales throughput but not model size. Model parallelism partitions the model itself across workers. Megatron uses intra-layer (tensor) model parallelism — splitting individual weight matrices — as opposed to inter-layer (pipeline) parallelism, which assigns whole layers to different devices.


System Design

Core Principle: Minimal All-Reduce Injection

The paper identifies that both the MLP and self-attention blocks can be made model-parallel by splitting GEMM operations, such that the result is mathematically equivalent to the unpartitioned computation with only one all-reduce in the forward pass and one in the backward pass per block — four all-reduces per transformer layer in total.

MLP Block (column-split first GEMM, row-split second GEMM):
┌────────────────────────────────────────────────────────────────┐
│  GPU 0:  X → GEMM(X, A1) → GeLU → Y1 → GEMM(Y1, B1) → Z1    │
│  GPU 1:  X → GEMM(X, A2) → GeLU → Y2 → GEMM(Y2, B2) → Z2    │
│                                          ↓                      │
│                              g = AllReduce(Z1 + Z2)             │
│                              → Dropout → output                 │
└────────────────────────────────────────────────────────────────┘
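The equivalence shown in the box can be checked numerically. A minimal sketch (numpy stands in for two GPUs; the g all-reduce becomes an explicit sum; shapes are illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (Hendrycks & Gimpel)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
b, h, ffn = 4, 8, 32                     # batch, hidden, FFN size
X = rng.standard_normal((b, h))
A = rng.standard_normal((h, ffn))        # first GEMM weight
B = rng.standard_normal((ffn, h))        # second GEMM weight

# Unpartitioned reference: Z = GeLU(X A) B
Z_ref = gelu(X @ A) @ B

# "GPU 0" / "GPU 1": column-split A, row-split B
A1, A2 = A[:, :ffn // 2], A[:, ffn // 2:]
B1, B2 = B[:ffn // 2, :], B[ffn // 2:, :]
Z1 = gelu(X @ A1) @ B1                   # local partial result on GPU 0
Z2 = gelu(X @ A2) @ B2                   # local partial result on GPU 1
Z = Z1 + Z2                              # the g operator: AllReduce of partials

assert np.allclose(Z, Z_ref)             # exact up to float rounding
```

The column-split of A is what lets GeLU run locally on each GPU with no synchronization before the second GEMM.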

Self-Attention Block (column-split Q,K,V, row-split output proj):
┌────────────────────────────────────────────────────────────────┐
│  GPU 0: heads [0..H/2] — local attention → output proj rows   │
│  GPU 1: heads [H/2..H] — local attention → output proj rows   │
│                                          ↓                      │
│                              g = AllReduce(outputs)             │
│                              → Dropout → residual               │
└────────────────────────────────────────────────────────────────┘

Per transformer layer: 2 all-reduces in forward + 2 in backward = 4 total

f and g Operators

The model-parallel regions are delimited by two conjugate operators. f is the identity in the forward pass and an all-reduce in the backward pass; g is an all-reduce in the forward pass and the identity in the backward pass. Placing f before each parallel block and g after it gives every GPU the full input (and reduces its input gradient) while summing the partial outputs, and can be implemented in a few lines of PyTorch autograd code.

Embedding Parallelism

The input embedding matrix E (H × vocab_size) is split column-wise along the vocabulary dimension: E = [E1, E2]. Rather than all-gathering the full logit tensor (dimension b × s × vocab_size) produced by the parallel output GEMM [Y1, Y2], the cross-entropy loss is fused with that GEMM, so only scalar per-sample losses (dimension b × s) are reduced across GPUs. This is a critical bandwidth optimization since vocab_size ≈ 50,000.
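A hedged sketch of how a vocab-sharded cross-entropy can be computed exactly while only per-sample scalars cross GPUs (function names and shapes are mine, not Megatron's; the per-shard reductions stand in for b-element all-reduces):

```python
import numpy as np

def sharded_cross_entropy(logit_shards, targets, shard_offsets):
    """Cross-entropy over vocab-sharded logits (each shard is b x v_i).

    Only per-sample scalars (max, sum-exp, target logit) would cross
    GPUs -- b values each -- instead of the full b x vocab logits.
    """
    # Global max per sample (an all-reduce of b scalars in real code).
    gmax = np.max([s.max(axis=1) for s in logit_shards], axis=0)
    # Sum of exp over the full vocab (another b-scalar all-reduce).
    sumexp = sum(np.exp(s - gmax[:, None]).sum(axis=1) for s in logit_shards)
    # Logit of the target token, found on whichever shard owns it.
    tgt = np.zeros(len(targets))
    for s, off in zip(logit_shards, shard_offsets):
        local = (targets >= off) & (targets < off + s.shape[1])
        tgt[local] = s[local, targets[local] - off]
    return -(tgt - gmax) + np.log(sumexp)

rng = np.random.default_rng(1)
b, v = 5, 10
logits = rng.standard_normal((b, v))
targets = rng.integers(0, v, size=b)
# Reference: standard log-softmax cross-entropy on the full logits.
ref = -(logits[np.arange(b), targets]
        - logits.max(axis=1)
        - np.log(np.exp(logits - logits.max(axis=1, keepdims=True)).sum(axis=1)))
loss = sharded_cross_entropy([logits[:, :v // 2], logits[:, v // 2:]],
                             targets, [0, v // 2])
assert np.allclose(loss, ref)
```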

Layer Normalization Placement (BERT)

The original BERT architecture applies layer normalization after the residual addition (post-LN). The paper empirically finds this causes training instability and accuracy degradation as BERT model size increases beyond 336M. Rearranging to pre-LN (layer norm applied before self-attention and MLP, inside the residual path) eliminates instabilities and allows monotonically improving accuracy up to 3.9B parameters.
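The two orderings can be sketched side by side (toy stand-ins; `layer_norm` here omits the learned gain/bias, and `sublayer` is a placeholder for attention or the MLP):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original BERT ordering: LN applied after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Megatron rearrangement: LN inside the residual branch, so the
    # residual path is a clean identity from input to output.
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(2).standard_normal((3, 16))
f = lambda h: 0.1 * h                      # toy sublayer
assert post_ln_block(x, f).shape == x.shape
assert pre_ln_block(x, f).shape == x.shape
```

In the pre-LN form the residual stream is never normalized away, which is the property the paper credits for stable training at 3.9B parameters.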

Hybrid Model + Data Parallelism

┌──────────────────────────────────────────────────────────────┐
│  Model parallel group (intra-node, NVLink):                  │
│  GPU-1, GPU-2, ..., GPU-8  → 1 model replica                │
│                                                              │
│  Data parallel group (inter-node, InfiniBand):               │
│  One GPU from each model-parallel group per rank             │
│  64 such groups → 64-way data parallelism                    │
│                                                              │
│  Total: 8 × 64 = 512 GPUs for 8.3B model                    │
└──────────────────────────────────────────────────────────────┘

Gradient AllReduces for data parallelism are performed in parallel across all 64 data-parallel groups using NCCL, with each group containing one GPU from each model-parallel group.
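The rank-to-group bookkeeping behind this layout can be sketched in pure Python (in real code each list would become a torch.distributed process group; the function name is mine):

```python
def parallel_groups(world_size, mp_size):
    """Partition ranks into model-parallel and data-parallel groups.

    Model-parallel groups are mp_size consecutive ranks (same server,
    NVLink); data-parallel groups take one rank from each MP group at
    the same local position (InfiniBand). Mirrors the 8 x 64 = 512 layout.
    """
    assert world_size % mp_size == 0
    dp_degree = world_size // mp_size
    mp_groups = [list(range(i * mp_size, (i + 1) * mp_size))
                 for i in range(dp_degree)]
    dp_groups = [list(range(j, world_size, mp_size))
                 for j in range(mp_size)]
    return mp_groups, dp_groups

mp, dp = parallel_groups(512, 8)
assert len(mp) == 64                     # 64 model replicas
assert len(dp[0]) == 64                  # 64-way data parallelism
assert mp[0] == [0, 1, 2, 3, 4, 5, 6, 7]
assert dp[0][:3] == [0, 8, 16]           # one GPU per model-parallel group
```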

Random Number Generation

Dropout within model-parallel regions must produce different random patterns per GPU (for correctness), while residual-connection dropout outside model-parallel regions must be identical across GPUs (for consistency). This is achieved by maintaining a separate per-GPU RNG for intra-model-parallel dropout, seeded differently per worker, while seeding the global RNG identically across the model-parallel group.
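A stdlib sketch of the dual-RNG scheme (class and seed offset are illustrative; Megatron's actual tracker swaps CUDA RNG states rather than Python's):

```python
import random

class RNGTracker:
    """Keep a named RNG stream separate from the default stream."""
    def __init__(self):
        self._states = {}

    def add(self, name, seed):
        saved = random.getstate()
        random.seed(seed)
        self._states[name] = random.getstate()
        random.setstate(saved)

    def draw(self, name):
        # Run one draw under the named stream, then restore the default.
        saved = random.getstate()
        random.setstate(self._states[name])
        value = random.random()
        self._states[name] = random.getstate()
        random.setstate(saved)
        return value

def simulate_gpu(rank, global_seed=1234):
    tracker = RNGTracker()
    random.seed(global_seed)                            # same on every GPU
    tracker.add("model-parallel", global_seed + 2718 + rank)  # per-GPU seed
    return random.random(), tracker.draw("model-parallel")

g0, mp0 = simulate_gpu(rank=0)
g1, mp1 = simulate_gpu(rank=1)
assert g0 == g1        # residual dropout pattern identical across GPUs
assert mp0 != mp1      # intra-model-parallel dropout differs per GPU
```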


Training Setup

Hardware: 32 DGX-2H servers, 512 Tesla V100 SXM3 32 GB GPUs. 300 GB/s NVSwitch bandwidth intra-node; 100 GB/s InfiniBand inter-node (8 adapters per server).

Precision: Mixed precision (FP16 compute, FP32 master weights) with dynamic loss scaling.

Optimizer: Adam (β1=0.9, β2=0.999), weight decay λ=0.01, global gradient norm clipping at 1.0.

Initialization: W ~ N(0, 0.02); residual layer weights scaled by 1/√(2N) where N = number of transformer layers.
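The scaling rule is easy to state in code (a sketch; the helper name is mine and N = 72 is only an illustrative layer count):

```python
import numpy as np

def init_std(base_std=0.02, num_layers=None):
    # Weights feeding residual additions are scaled by 1/sqrt(2N), since
    # each layer contributes two residual branches (attention and MLP)
    # and the accumulated variance should stay bounded over 2N additions.
    if num_layers is None:
        return base_std
    return base_std / np.sqrt(2 * num_layers)

rng = np.random.default_rng(3)
N = 72                                   # illustrative transformer depth
W = rng.normal(0.0, init_std(), size=(1024, 1024))
W_out = rng.normal(0.0, init_std(num_layers=N), size=(1024, 1024))
assert abs(W.std() - 0.02) < 1e-3
assert abs(W_out.std() - 0.02 / np.sqrt(144)) < 1e-4
```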

GPT-2 training: Sequence length 1024, batch size 512, 300K iterations, cosine LR decay from 1.5e-4 to 1e-5 with 3K warmup steps.

BERT training: Sequence length 1024, batch size 1024, 2M iterations (336M, 1.3B), 1.5M iterations (3.9B), linear LR decay from 1e-4, 10K warmup steps. Whole-word n-gram masking (Joshi et al., 2019); sentence order prediction head (Lan et al., 2019).

Dataset: 174 GB deduplicated text from Wikipedia, CC-Stories, RealNews, OpenWebtext. LSH deduplication at Jaccard similarity > 0.7.
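The Jaccard-similarity test that LSH approximates can be sketched directly (exact set Jaccard over character shingles; the real pipeline uses MinHash/LSH to avoid pairwise comparison, and the shingle size here is my choice):

```python
def shingles(text, k=5):
    """Character k-shingles of a whitespace-normalized document."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
doc3 = "an entirely different piece of training text"

# Pairs above the 0.7 threshold are treated as duplicates and dropped.
assert jaccard(shingles(doc1), shingles(doc2)) > 0.7   # near-duplicate
assert jaccard(shingles(doc1), shingles(doc3)) < 0.7   # distinct: keep both
```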


Key Algorithm / Formulation

The MLP block parallel computation:

Given input X, weight matrices A (split by columns: A = [A1, A2]) and B (split by rows: B = [B1; B2]):

GPU 0: Y1 = GeLU(X A1),  Z1 = Y1 B1
GPU 1: Y2 = GeLU(X A2),  Z2 = Y2 B2
Output: Z = AllReduce(Z1 + Z2)  [g operator]

This is exact because column-splitting A allows GeLU to be applied independently (no synchronization point), and the row-split of B means the partial sums can be accumulated via AllReduce. The original approach of row-splitting A would require a synchronization before GeLU since GeLU(XA1 + XA2) ≠ GeLU(XA1) + GeLU(XA2).
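The non-additivity of GeLU is easy to confirm numerically (tanh approximation; shapes are illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(4)
X1, X2 = rng.standard_normal((2, 3, 6))          # row-split input halves
A1, A2 = rng.standard_normal((2, 6, 5))          # row-split weight halves

# Row-splitting A forces a sync before GeLU:
exact = gelu(X1 @ A1 + X2 @ A2)                  # needs the full sum first
naive = gelu(X1 @ A1) + gelu(X2 @ A2)            # what skipping the sync gives
assert not np.allclose(exact, naive)             # GeLU is not additive
```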


Evaluation Methodology

Scaling metric: Weak scaling — baseline is the 1.2B model on 1 GPU achieving 39 TFLOPs (30% of theoretical peak). Larger models are trained on proportionally more GPUs, keeping parameters-per-GPU roughly constant.
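The headline throughput and the abstract's 76% weak-scaling efficiency are consistent with this 39 TFLOPs baseline, which makes a quick sanity check:

```python
# Aggregate 15.1 PFLOPs over 512 GPUs vs. the 39 TFLOPs single-GPU baseline.
total_pflops = 15.1
gpus = 512
baseline_tflops = 39.0

per_gpu_tflops = total_pflops * 1000 / gpus      # per-GPU sustained throughput
efficiency = per_gpu_tflops / baseline_tflops    # fraction of the 1-GPU baseline

assert abs(per_gpu_tflops - 29.5) < 0.1          # ~29.5 TFLOPs per GPU
assert abs(efficiency - 0.76) < 0.01             # the quoted 76%
```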

Strong scaling: Fixed batch size of 8, increase GPU count from 1 to 8 for the 1.2B model. Measures pure parallelization speedup.

Language modeling: Zero-shot evaluation on WikiText103 (perplexity) and LAMBADA (cloze accuracy).

BERT downstream: Fine-tuning on MNLI, QQP, SQuAD 1.1, SQuAD 2.0, RACE. Hyperparameter search on dev set; median over 5 random seeds reported.


Results

Scaling Efficiency (Weak Scaling)

Config                  GPUs   Weak Scaling
1.2B  model parallel       1   100%
2.5B  model parallel       2    95%
4.2B  model parallel       4    82%
8.3B  model parallel       8    77%
1.2B  model + data        64    91%
4.2B  model + data       256    79%
8.3B  model + data       512    74%

Sustained throughput: 15.1 PetaFLOPs/s for the 8.3B model+data parallel configuration on 512 GPUs.

Strong Scaling (1.2B model, fixed batch size 8)

GPUs   Speedup
   1   1.00×
   2   1.64×
   4   2.34×
   8   2.98×

Language Modeling (GPT-2, zero-shot)

Model           WikiText103 PPL ↓   LAMBADA Acc ↑
355M            19.31               45.18%
2.5B            12.76               61.73%
8.3B            10.81               66.51%
Previous SOTA   15.79               63.24%

BERT Downstream (fine-tuned, single model)

Model                   MNLI m/mm   QQP    SQuAD 1.1 F1/EM   RACE
Megatron-336M           90.9/91.0   92.3   89.7/83.0
Megatron-1.3B           90.9/91.0   92.6   94.9/89.1         87.3
Megatron-3.9B           91.4/91.4   92.7   91.2/88.5         90.9
Previous SOTA (XLNet)   90.8/90.8   92.3   95.1/89.7         85.4

Limitations

  1. Attention head bottleneck: Model parallelism degree is bounded by the number of attention heads. Cannot parallelize an 8.3B model across more than 32 GPUs intra-layer with 32 heads.
  2. Inter-node communication for large model parallelism: Extending model parallelism beyond a single DGX-2H server (16 GPUs) requires crossing InfiniBand, which is ~3× slower than NVSwitch intra-node — efficiency drops sharply.
  3. No pipeline parallelism: This work is purely tensor-parallel. Pipeline parallelism would be needed to scale significantly beyond ~16B parameters, the limit reachable with 16-way intra-layer splitting within one server.
  4. Fixed architecture: The technique is tailored to transformers. Non-transformer architectures would require different splitting strategies.
  5. Large batch training: The hybrid approach uses very large global batch sizes (512 for GPT-2), which can hurt generalization. The paper notes this is a trade-off but does not address it.
  6. Activation memory: Activation checkpointing is used but the paper does not deeply analyze the activation memory / recomputation trade-off at billion-parameter scale.


RL Formulation Table

This paper contains no reinforcement learning. Not applicable.


Relevance to DynamICCL

Megatron-LM is one of the most directly relevant papers to DynamICCL, both as a workload source and as a communication pattern reference.

Collective communication pattern generated by Megatron-LM training:

Per-layer (intra-model-parallel group, NVLink, small messages):
  Forward:  f_identity → [compute] → g_AllReduce  (×2 per layer: MLP + attention)
  Backward: f_AllReduce → [compute] → g_identity   (×2 per layer)

Per-step (inter-node data-parallel AllReduce, large messages):
  AllReduce(gradients) across 64 data-parallel workers via InfiniBand
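The per-layer f/g pattern above (f: identity forward / all-reduce backward; g: all-reduce forward / identity backward) can be sketched without any distributed runtime, simulating the two ranks as lists (function names are mine):

```python
import numpy as np

def all_reduce(tensors):
    """Elementwise sum across simulated ranks; every rank gets the total."""
    total = np.sum(tensors, axis=0)
    return [total.copy() for _ in tensors]

def f_forward(x_per_rank):          # f: identity in the forward pass
    return x_per_rank

def f_backward(grads_per_rank):     # f: all-reduce in the backward pass
    return all_reduce(grads_per_rank)

def g_forward(z_per_rank):          # g: all-reduce in the forward pass
    return all_reduce(z_per_rank)

def g_backward(grads_per_rank):     # g: identity in the backward pass
    return grads_per_rank

x = [np.ones(4), 2 * np.ones(4)]
out = g_forward(f_forward(x))
assert np.allclose(out[0], out[1])            # both ranks hold the full sum
assert np.allclose(out[0], 3 * np.ones(4))

grads = [np.full(4, 0.5), np.full(4, 0.5)]
back = f_backward(g_backward(grads))
assert np.allclose(back[0], np.ones(4))       # gradients reduced in backward
```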

Implications for DynamICCL's Config Agent:

  1. The intra-node all-reduces (model-parallel) are small (activation tensors of size batch × seq × hidden_size), occur at high frequency within the layer loop, and traverse NVLink. These benefit from NCCL's LL or LL128 protocol and do not require many channels.

  2. The inter-node gradient AllReduces (data-parallel) are large (sum of all parameter gradients across 8.3B parameters), bandwidth-bound, and traverse InfiniBand. These benefit from NCCL's simple protocol with higher nChannels.

  3. DynamICCL's Trigger Agent must detect when the inter-node fabric becomes congested due to simultaneous gradient AllReduces from multiple nodes — a realistic scenario at 512 GPUs where 64 data-parallel groups fire gradient syncs near-simultaneously.

  4. The paper's hardware setup (DGX-2H with 300 GB/s NVSwitch + 100 GB/s InfiniBand with 8 adapters per server) represents the high-end regime. DynamICCL's Chameleon Cloud 1 GbE environment is the low-bandwidth analogue where protocol/algorithm selection is even more consequential.

  5. Megatron's code is publicly available at https://github.com/NVIDIA/Megatron-LM and serves as a realistic benchmark workload for DynamICCL evaluation.
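The two regimes in points 1–2 map onto standard NCCL environment variables; an illustrative sketch (the variable names are real NCCL knobs, but the values are examples, not settings from the paper):

```shell
# Small, latency-sensitive intra-node model-parallel all-reduces:
export NCCL_PROTO=LL128
export NCCL_MIN_NCHANNELS=2

# Large, bandwidth-bound inter-node gradient all-reduces would instead
# favor something like:
# export NCCL_PROTO=Simple
# export NCCL_ALGO=Ring
# export NCCL_MIN_NCHANNELS=16
```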