Detailed Summary: ZeRO
Full title: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)
Year: 2020 (arXiv:1910.02054v3, submitted Oct 2019)
Venue: arXiv preprint / SC 2020
Open source: https://github.com/microsoft/deepspeed
Abstract (paraphrased)
ZeRO (Zero Redundancy Optimizer) eliminates memory redundancies present in both data-parallel and model-parallel training without sacrificing communication efficiency or compute throughput. By partitioning optimizer states, gradients, and parameters across data-parallel processes — rather than replicating them — ZeRO scales model size linearly with the number of devices. The ZeRO-100B implementation (a subset of full ZeRO) trains 170B parameter models on 400 V100 GPUs at 15 PetaFLOPs sustained throughput, an 8× model size increase and 10× throughput increase over prior SOTA (Megatron-LM). ZeRO powered the Turing-NLG 17B model, the world's largest at the time of publication.
Motivation
The memory consumed during DNN training falls into two categories:
Model states (the dominant consumer):
- Optimizer states: with Adam, two fp32 tensors (momentum m, variance v) per parameter → 8 bytes/parameter
- Gradients: fp16 during forward/backward, fp32 for update → 2–4 bytes/parameter
- Parameters: fp16 for compute, fp32 master copy → 6 bytes/parameter
- Total for mixed-precision Adam: 16Ψ bytes in all: 4Ψ for the fp16 parameters and gradients, plus KΨ with K=12 for the fp32 optimizer states (master parameters, momentum, variance)
- A 1.5B parameter model: ~24 GB just for model states — exceeds 32 GB GPU capacity when combined with activations.
Residual states:
- Activations: for GPT-2 style model, activation memory ≈ 12 × hidden_dim × batch × seq_len × transformer_layers. A 1.5B model with batch=32, seq=1K requires ~60 GB; activation checkpointing reduces this to ~8 GB (at 33% recompute cost).
- Temporary buffers: gradient all-reduce fuses all gradients into a single fp32 buffer — for a 1.5B model this is 6 GB.
- Fragmented memory: interleaving of short-lived (recomputed activations, gradient buckets) and long-lived (checkpointed activations, parameter gradients) tensors causes fragmentation; training can fail with out-of-memory errors even when >30% of GPU memory is nominally free.
Why existing approaches fall short:
- DP: good compute/communication efficiency but replicates all model states → poor memory efficiency.
- MP: good memory efficiency but fine-grained computation reduces GPU utilization, and inter-node communication (Megatron tested a 40B model across two DGX-2 nodes at <5 TFLOPs/GPU) causes severe degradation.
- PP (GPipe): requires large batch sizes proportional to pipeline depth to hide bubbles; hurts convergence. PipeDream keeps stale parameters → memory and correctness trade-offs.
Background
Memory Accounting for Mixed-Precision Adam
For a model with Ψ parameters:
fp16 parameters: 2Ψ bytes
fp16 gradients: 2Ψ bytes
fp32 parameters: 4Ψ bytes (master copy)
fp32 momentum: 4Ψ bytes
fp32 variance: 4Ψ bytes
────────────────────────────
Total model states: 16Ψ bytes (K=12 for optimizer states: 12Ψ bytes)
For Ψ = 7.5B: 120 GB per GPU in standard DP.
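The accounting above can be captured in a few lines (a sketch; `model_state_bytes` is a name invented here):

```python
def model_state_bytes(psi):
    """Baseline mixed-precision Adam model-state memory for psi parameters."""
    fp16_params   = 2 * psi
    fp16_grads    = 2 * psi
    fp32_params   = 4 * psi   # master copy
    fp32_momentum = 4 * psi
    fp32_variance = 4 * psi
    return fp16_params + fp16_grads + fp32_params + fp32_momentum + fp32_variance

# psi = 7.5e9 gives 120 GB per GPU under standard data parallelism
```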
Standard DP Communication Volume
AllReduce = ReduceScatter + AllGather. Each operation moves Ψ elements per data-parallel process. Total data movement per training step: 2Ψ elements.
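As a sanity check, the decomposition can be simulated with plain Python lists (a toy single-process model; each rank moves ~Ψ elements per phase for large N_d):

```python
def allreduce_via_rs_ag(per_rank_grads):
    """Simulate AllReduce as ReduceScatter followed by AllGather.

    per_rank_grads: one flat gradient list per data-parallel rank.
    Returns the post-AllReduce gradients (identical on every rank).
    """
    n = len(per_rank_grads)
    size = len(per_rank_grads[0])
    chunk = size // n                    # assume size divisible by n
    # ReduceScatter: rank i ends up owning the reduced chunk i
    owned = [
        [sum(g[i * chunk + j] for g in per_rank_grads) for j in range(chunk)]
        for i in range(n)
    ]
    # AllGather: concatenate the owned chunks back into the full vector
    return [x for part in owned for x in part]
```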
System Design
ZeRO has two major components: ZeRO-DP (model state memory) and ZeRO-R (residual state memory).
ZeRO-DP Architecture
Standard DP (N_d workers, each stores full model):
┌─────────┐ ┌─────────┐ ┌─────────┐
│GPU 0 │ │GPU 1 │ │GPU N-1 │
│params Ψ │ │params Ψ │ │params Ψ │
│grads Ψ │ │grads Ψ │ │grads Ψ │
│opt 12Ψ │ │opt 12Ψ │ │opt 12Ψ │
└─────────┘ └─────────┘ └─────────┘
ZeRO-DP P_os+g+p (all states partitioned across N_d workers):
┌──────────────────┐ ┌──────────────────┐
│GPU 0 │ │GPU N-1 │
│params Ψ/N_d │ │params Ψ/N_d │
│grads Ψ/N_d │ │grads Ψ/N_d │
│opt 12Ψ/N_d │ │opt 12Ψ/N_d │
└──────────────────┘ └──────────────────┘
↑↑ reconstructed via AllGather on demand ↑↑
Stage 1: P_os — Optimizer State Partitioning
Each data-parallel rank i stores only the optimizer states for its partition (Ψ/N_d parameters). After gradients are reduced, each rank updates only its partition. An AllGather at the end of each step collects all updated parameters.
Memory: 4Ψ + (KΨ/N_d). Communication: same as baseline DP (2Ψ).
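A toy single-process simulation of this flow (plain SGD stands in for Adam to keep the sketch short; the AllGather is simulated by concatenation; `zero1_step` is a name invented here):

```python
def zero1_step(full_grads, master_shards, lr=0.1):
    """One P_os step: each rank updates only its optimizer-state shard,
    then an AllGather rebuilds the full parameter vector everywhere."""
    n = len(master_shards)
    chunk = len(full_grads) // n
    for rank, shard in enumerate(master_shards):
        g = full_grads[rank * chunk:(rank + 1) * chunk]  # this rank's slice
        for j in range(chunk):
            shard[j] -= lr * g[j]        # update happens only on the owner
    # AllGather: every rank receives the concatenation of all shards
    return [w for shard in master_shards for w in shard]
```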
Stage 2: P_g — Gradient Partitioning
Each rank only needs the reduced gradients for its own parameter partition. Rather than an AllReduce of all gradients, ZeRO performs a ReduceScatter: each gradient is reduced only on the rank responsible for that partition. After the rank updates its parameters, it can free the gradients.
Implementation uses a bucket strategy: gradients are bucketed per partition boundary, and reduction is triggered when a bucket is full — overlapping communication with backward computation, similar to NVIDIA AMP's all-reduce overlap.
Memory: 2Ψ + ((2+K)Ψ/N_d). Communication: still 2Ψ (ReduceScatter Ψ + AllGather Ψ).
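The bucketing logic can be sketched as follows (a simplification: reductions are recorded rather than issued as asynchronous ReduceScatter calls; `GradientBucketer` is a hypothetical name):

```python
class GradientBucketer:
    """Bucketed gradient reduction sketch (P_g). Gradients produced during
    the backward pass accumulate into a bucket; when the bucket fills,
    a reduction fires, overlapping communication with remaining compute."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.bucket = []
        self.reductions = []             # stand-in for async ReduceScatter

    def add_grad(self, grad):
        self.bucket.extend(grad)
        while len(self.bucket) >= self.bucket_size:
            self.reductions.append(self.bucket[:self.bucket_size])
            self.bucket = self.bucket[self.bucket_size:]

    def flush(self):
        """Reduce any remainder at the end of the backward pass."""
        if self.bucket:
            self.reductions.append(self.bucket)
            self.bucket = []
```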
Stage 3: P_p — Parameter Partitioning
Each rank stores only Ψ/N_d parameters. During forward pass, parameters for each layer are obtained via AllGather before that layer's computation and discarded immediately after. The AllGather is pipelined across the forward pass, spreading communication over time.
During backward pass, the same AllGather precedes gradient computation for each layer, and ReduceScatter replaces AllReduce.
Communication: AllGather in forward (Ψ) + AllGather in backward (Ψ) + ReduceScatter in backward (Ψ) = 3Ψ total, i.e. 1.5× baseline DP.
Memory: (2+2+K)Ψ/N_d = 16Ψ/N_d. For N_d=64: 1.9 GB for a 7.5B model.
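A minimal sketch of the per-layer gather-compute-free cycle, simulating AllGather by concatenating shards (function names are invented here):

```python
def zero3_forward(layer_shards, forward_fns, x):
    """P_p forward sketch: each layer's parameters live as one shard per
    rank. Before a layer runs, an AllGather (here: concatenation) rebuilds
    its full weights; they are dropped immediately after the layer."""
    for shards, fn in zip(layer_shards, forward_fns):
        full_weights = [w for s in shards for w in s]    # AllGather
        x = fn(full_weights, x)
        del full_weights                                 # free immediately
    return x
```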
ZeRO-R Architecture
P_a: Partitioned Activation Checkpointing
In model parallelism, each MP worker replicates the full activation checkpoints (needed by all workers during backward). ZeRO-R partitions these checkpoints across the MP group, storing only 1/N_m of the activations per GPU. An AllGather reconstructs the full activation just before it is needed for recomputation.
Communication overhead < 10% of baseline MP communication (one AllGather per transformer block vs. 4 AllReduces per block in Megatron-LM). For a 100B model with MP=16 and batch=32: reduces activation memory from 33 GB/GPU to ~2 GB/GPU.
Optional CPU offload (P_a+cpu): moves even the partitioned checkpoints to CPU memory, reducing activation memory to near zero at the cost of 2× added PCIe data movement.
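The partition/gather cycle for activation checkpoints can be sketched as a single-process simulation (function names are invented here):

```python
def partition_checkpoint(activation, mp_rank, mp_size):
    """P_a sketch: each model-parallel rank keeps only its 1/N_m slice
    of an activation checkpoint instead of a full replica."""
    chunk = len(activation) // mp_size   # assume divisibility
    return activation[mp_rank * chunk:(mp_rank + 1) * chunk]

def gather_checkpoint(slices):
    """AllGather the slices just before recomputation needs them."""
    return [x for s in slices for x in s]
```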
C_B: Constant-Size Buffers
Instead of fusing all gradients into a single buffer (which grows with model size — 12 GB for a 3B model), ZeRO-R uses a fixed-size fused buffer. The buffer is large enough for efficiency but does not scale with model size.
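The constant-buffer idea can be sketched by staging elements through a fixed-size buffer regardless of total input size (names invented; a real implementation would launch a fused kernel per drained chunk):

```python
def fused_op_in_chunks(tensors, op, buffer_elems):
    """C_B sketch: apply an elementwise op through a constant-size staging
    buffer rather than one fused buffer that grows with model size."""
    out, buf = [], []

    def drain():
        out.extend(op(x) for x in buf)   # stand-in for the fused kernel
        buf.clear()

    for t in tensors:
        for x in t:
            buf.append(x)
            if len(buf) == buffer_elems:
                drain()
    drain()                              # handle the final partial chunk
    return out
```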
M_D: Memory Defragmentation
Forward pass: checkpointed activations (long-lived) interleave with recomputed activations (short-lived). Backward pass: parameter gradients (long-lived) interleave with activation gradients and intermediate buffers (short-lived).
ZeRO-R pre-allocates contiguous memory chunks for long-lived objects (checkpoints and parameter gradients) and copies tensors into them as they are produced, eliminating fragmentation. This enables training with smaller effective memory margin.
Key Algorithm / Formulation
Communication Volume Analysis
Standard DP AllReduce total data movement per step: 2Ψ (Ψ for ReduceScatter + Ψ for AllGather).
ZeRO-DP stages:
- P_os: AllGather at step end = Ψ. ReduceScatter of gradients = Ψ. Total = 2Ψ (same as DP).
- P_os+g: Same as above — ReduceScatter during backward already counted.
- P_os+g+p: Add AllGather forward (Ψ) + AllGather backward (Ψ) = 3Ψ (1.5× DP).
ZeRO-R P_a communication overhead in Megatron-LM context:
- Baseline MP: 4 AllReduces per transformer block, each on a message of batch × seq_len × hidden_dim elements.
- P_a: 1 AllGather per activation checkpoint per block, also on a message of batch × seq_len × hidden_dim elements.
- Ratio: one AllGather replaces four (costlier) AllReduces, so P_a adds < 10% to baseline MP communication volume.
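The per-stage accounting can be expressed as a small helper (a sketch; stage numbering follows the list above, with stage 0 as baseline DP):

```python
def zero_dp_comm_volume(psi, stage):
    """Per-step communication volume (elements) for ZeRO-DP stages,
    following the paper's accounting (baseline AllReduce = 2 * psi)."""
    if stage in (0, 1, 2):       # DP, P_os, P_os+g
        return 2 * psi           # ReduceScatter(psi) + AllGather(psi)
    if stage == 3:               # P_os+g+p
        return 3 * psi           # + AllGathers in forward and backward
    raise ValueError(f"unknown stage: {stage}")
```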
Super-linear Scalability Mechanism
As DP degree N_d increases (more GPUs), per-GPU memory decreases proportionally (with P_os+g+p). This enables fitting a larger local batch size per GPU, which increases arithmetic intensity (operations per memory access). Higher arithmetic intensity → better GPU utilization → throughput per GPU increases even as N_d grows. This creates super-linear aggregate scaling.
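A back-of-envelope illustration of the mechanism, under the simplifying assumption (mine, not the paper's) that all memory left after model states goes to activations at a fixed cost per sample:

```python
def max_local_batch(gpu_mem, psi, n_d, act_bytes_per_sample):
    """Super-linear mechanism sketch: with P_os+g+p, model states shrink
    as 16 * psi / n_d bytes, so the memory left for activations (and
    hence the local batch size) grows with the number of GPUs."""
    model_states = 16 * psi / n_d
    free = gpu_mem - model_states
    return max(0, int(free // act_bytes_per_sample))
```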
Evaluation Methodology
Hardware: 400 NVIDIA V100 32 GB GPUs, 25 DGX-2 nodes, 800 Gbps inter-node bandwidth.
Baseline: PyTorch DDP (no MP) and Megatron-LM (with MP). Megatron-LM open-source version (Sep 2019).
ZeRO-100B: Implements P_os+g (ZeRO-DP stage 2) + ZeRO-R (C_B + M_D + P_a). Does not include P_p (parameter partitioning).
Model architecture: GPT-2-style transformers with varying hidden dimensions and layer counts (Table 4 in paper).
Five ZeRO configurations tested (Table 3):

| Config | ZeRO-DP | ZeRO-R |
|--------|---------|--------|
| C1 | P_os | C_B + M_D |
| C2 | P_os | C_B + M_D + P_a |
| C3 | P_os+g | C_B + M_D |
| C4 | P_os+g | C_B + M_D + P_a |
| C5 | P_os+g | C_B + M_D + P_a+cpu |
Results
Model Size and Speed
- ZeRO-100B trains up to 170B parameter models on 400 GPUs.
- Megatron-LM with MP alone cannot scale beyond ~40B parameters on this hardware without extreme efficiency degradation.
- For 8B–100B models, ZeRO-100B sustains >30% of peak hardware throughput, averaging ~38 TFLOPs/GPU (over 15 PetaFLOPs aggregate on 400 GPUs).
- 10× throughput improvement over baseline (Megatron-MP) for the same model size class.
Super-linear Scaling (60B model)
| GPUs | Total TFLOPs | TFLOPs/GPU |
|---|---|---|
| 64 | ~1100 | ~17 |
| 128 | ~3000 | ~23 |
| 256 | ~7000 | ~27 |
| 400 | ~14000 | ~35 |
Performance per GPU increases as GPU count increases — super-linear aggregate scaling.
Memory Analysis (7.5B model, N_d=64, K=12)
| Stage | Memory/device |
|---|---|
| Baseline DP | 120 GB |
| P_os | 31.4 GB |
| P_os+g | 16.6 GB |
| P_os+g+p | 1.9 GB |
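The table's formulas can be reproduced directly (a sketch; `per_device_memory_gb` is a name invented here):

```python
def per_device_memory_gb(psi, n_d, stage, k=12):
    """Per-device model-state memory (GB) by ZeRO-DP stage:
    0 = baseline DP, 1 = P_os, 2 = P_os+g, 3 = P_os+g+p."""
    params, grads, opt = 2 * psi, 2 * psi, k * psi
    if stage >= 1:
        opt /= n_d       # optimizer states partitioned
    if stage >= 2:
        grads /= n_d     # gradients partitioned
    if stage >= 3:
        params /= n_d    # parameters partitioned
    return (params + grads + opt) / 1e9
```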
Max Model Size vs. ZeRO Config (MP=16, fixed batch)
| Config | Max Model Size |
|---|---|
| C1 | 40B |
| C2 | 60B |
| C3 | 50B |
| C4 | 140B |
| C5 | 150B |
Turing-NLG (17B parameters)
- Trained with ZeRO-100B.
- Converges to lower validation perplexity than Megatron-LM 8.3B.
- WikiText-103 perplexity: 10.21 (new SOTA at time of publication).
- Throughput: 41.4 TFLOPs/GPU sustained.
Limitations
P_p not in ZeRO-100B: Full parameter partitioning (stage 3, which achieves 1/N_d memory reduction) is described analytically but not included in the evaluated implementation. The 1.5× communication overhead of P_p may not be justified on bandwidth-constrained links.
CPU offloading cost: P_a+cpu reduces memory at the cost of 2× activation data movement through PCIe. Performance degrades for models that do not need it (Figure 8, C5 worse than C4 for 60B model). Only beneficial when DP communication is the bottleneck at small batch sizes.
Trillion-parameter compute gap: ZeRO provides the memory capacity to fit 1T parameters on 1024 GPUs with full P_os+g+p, but training time would be ~140 days assuming BERT-Large-equivalent compute density. Exa-FLOP compute infrastructure would be needed.
Batch size dependency for super-linear speedup: The super-linear scaling property relies on increasing batch size with GPU count. Very large batch sizes can degrade convergence quality.
Memory fragmentation fix is heuristic: Pre-allocation of contiguous buffers assumes the lifetime patterns of activations and gradients remain consistent. Irregular model architectures or variable-length sequences may undermine this.
No dynamic adaptation: ZeRO selects a configuration statically. There is no mechanism to switch between stages during training based on observed memory pressure or communication congestion.
Related Work Highlights
- Megatron-LM (Shoeybi et al., 2019): Best prior MP implementation; degrades severely beyond a single DGX-2 node. ZeRO combines with Megatron-LM's intra-node MP for the best combined approach.
- GPipe (Huang et al., 2018): Pipeline parallelism; large batch requirement and bubble overhead limit flexibility.
- PipeDream (Harlap et al., 2018): Asynchronous PP; stale parameters reduce memory efficiency and introduce convergence complications.
- Activation checkpointing (Chen et al., 2016): Reduces activation memory at 33% recompute cost. Complementary to ZeRO-R, which further partitions checkpoints.
- CPU offloading (Pudipeddi et al.; L-Zero): Moves model states to CPU; up to 50% training time spent on GPU-CPU transfers. ZeRO differs by only offloading activations when beneficial.
RL Formulation Table
This paper contains no reinforcement learning. Not applicable.
Relevance to DynamICCL
ZeRO is highly relevant to DynamICCL because it directly determines the collective communication pattern generated by large-scale training jobs that DynamICCL must optimize.
Changed collective pattern vs. standard DP:
Standard DP (per training step):
Backward ends → AllReduce(gradients, size=Ψ) → optimizer step
ZeRO P_os+g (per training step):
During backward → ReduceScatter(gradient_bucket_i, size≈Ψ/N_buckets) × N_buckets [overlapped]
Before forward → AllGather(parameters, size=Ψ) [may be pipelined layer by layer with P_p]
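For illustration, the stage-2 step above can be summarized as a list of collective events, a toy model (names invented here) of the traffic a tuner like DynamICCL would observe:

```python
def zero2_step_schedule(psi, n_buckets):
    """Hypothetical per-step collective schedule under ZeRO P_os+g,
    as (phase, op, elements) events. Total volume stays at 2 * psi,
    but it is spread across many smaller ReduceScatter messages."""
    bucket = psi / n_buckets
    events = [("backward", "ReduceScatter", bucket) for _ in range(n_buckets)]
    events.append(("pre_forward", "AllGather", psi))
    return events
```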
Implications for DynamICCL's Config Agent:
Two new collective types: ReduceScatter and AllGather appear explicitly in the training loop. NCCL's optimal algorithm/protocol for these may differ from AllReduce. DynamICCL must tune all three (or configure NCCL to treat them uniformly through the tuner API).
Smaller, more frequent messages: ZeRO's bucketed ReduceScatter fires smaller messages during backward computation rather than one large AllReduce at the end. This changes the message-size distribution seen by NCCL. Small messages favor LL/LL128 protocols; DynamICCL's Config Agent should account for this.
Reduced peak bandwidth demand: By spreading collectives across the backward pass (overlap with compute), ZeRO reduces peak network utilization versus a single end-of-backward AllReduce. This changes congestion onset timing — DynamICCL's Trigger Agent (LSTM+CUSUM) may need to adapt its detection window to the more distributed traffic pattern.
Communication volume budget: ZeRO P_os+g maintains 2Ψ total communication volume (same as standard DP), meaning DynamICCL's throughput optimization targets remain the same in aggregate but the temporal distribution of that volume changes significantly.
ZeRO + Megatron combined workloads: In production, ZeRO-DP (inter-node, large-message ReduceScatter/AllGather) co-exists with Megatron-style tensor-parallel AllReduces (intra-node, small-message). DynamICCL must handle this multi-class collective environment, where the two traffic streams have fundamentally different optimal NCCL configurations.