Detailed Summary: ZeRO

Full title: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)
Year: 2020 (arXiv:1910.02054v3, submitted Oct 2019)
Venue: arXiv preprint / SC 2020
Open source: https://github.com/microsoft/deepspeed


Abstract (paraphrased)

ZeRO (Zero Redundancy Optimizer) eliminates memory redundancies present in both data-parallel and model-parallel training without sacrificing communication efficiency or compute throughput. By partitioning optimizer states, gradients, and parameters across data-parallel processes — rather than replicating them — ZeRO scales model size linearly with the number of devices. The ZeRO-100B implementation (a subset of full ZeRO) trains 170B parameter models on 400 V100 GPUs at 15 PetaFLOPs sustained throughput, an 8× model size increase and 10× throughput increase over prior SOTA (Megatron-LM). ZeRO powered the Turing-NLG 17B model, the world's largest at the time of publication.


Motivation

The memory consumed during DNN training falls into two categories:

Model states (the dominant consumer):
  - Optimizer states (e.g., Adam momentum and variance), gradients, and parameters.

Residual states:
  - Activations, temporary buffers, and memory lost to fragmentation.

Why existing approaches fall short:
  - Data parallelism (DP) is compute- and communication-efficient but memory-inefficient: every device holds a full replica of all model states, so adding GPUs does not reduce per-GPU memory.
  - Model parallelism (MP) partitions model states but requires fine-grained computation and communication that scales poorly across node boundaries, so in practice it is confined to a single node.
  - Neither approach addresses residual memory (activations, buffers, fragmentation).


Background

Memory Accounting for Mixed-Precision Adam

For a model with Ψ parameters:

fp16 parameters:     2Ψ bytes
fp16 gradients:      2Ψ bytes
fp32 parameters:     4Ψ bytes  (master copy)
fp32 momentum:       4Ψ bytes
fp32 variance:       4Ψ bytes
────────────────────────────
Total model states: 16Ψ bytes  (K=12 for optimizer states: 12Ψ bytes)

For Ψ = 7.5B: 120 GB per GPU in standard DP.
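
As a quick arithmetic check, a minimal Python sketch of this accounting (the function name and layout are just illustrative):

```python
# Model-state memory for mixed-precision Adam, per GPU under standard DP.
# Constants follow the breakdown above: 2 bytes each for fp16 params and grads,
# plus K = 12 bytes/param of fp32 optimizer state (master params, momentum, variance).
def model_state_bytes(psi: float, k: int = 12) -> float:
    fp16_params = 2 * psi
    fp16_grads = 2 * psi
    optimizer_states = k * psi            # 4Ψ master params + 4Ψ momentum + 4Ψ variance
    return fp16_params + fp16_grads + optimizer_states

print(model_state_bytes(7.5e9) / 1e9)     # -> 120.0 GB for Ψ = 7.5B
```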

Standard DP Communication Volume

AllReduce = ReduceScatter + AllGather. Each phase moves approximately Ψ elements per data-parallel process (more precisely, Ψ·(N_d−1)/N_d for ring-based implementations), for a total data movement of ~2Ψ elements per training step.


System Design

ZeRO has two major components: ZeRO-DP, which targets model-state memory, and ZeRO-R, which targets residual-state memory.

ZeRO-DP Architecture

Standard DP (N_d workers, each stores full model):
┌─────────┐  ┌─────────┐  ┌─────────┐
│GPU 0    │  │GPU 1    │  │GPU N-1  │
│params Ψ │  │params Ψ │  │params Ψ │
│grads  Ψ │  │grads  Ψ │  │grads  Ψ │
│opt 12Ψ  │  │opt 12Ψ  │  │opt 12Ψ  │
└─────────┘  └─────────┘  └─────────┘

ZeRO-DP P_os+g+p (all states partitioned across N_d workers):
┌──────────────────┐  ┌──────────────────┐
│GPU 0             │  │GPU N-1           │
│params   Ψ/N_d    │  │params   Ψ/N_d    │
│grads    Ψ/N_d    │  │grads    Ψ/N_d    │
│opt     12Ψ/N_d   │  │opt     12Ψ/N_d   │
└──────────────────┘  └──────────────────┘
  ↑↑ reconstructed via AllGather on demand ↑↑

Stage 1: P_os — Optimizer State Partitioning

Each data-parallel rank i stores only the optimizer states for its partition (Ψ/N_d parameters). After gradients are reduced, each rank updates only its partition. An AllGather at the end of each step collects all updated parameters.

Memory: 4Ψ + (KΨ/N_d). Communication: same as baseline DP (2Ψ).

Stage 2: P_g — Gradient Partitioning

Each rank only needs the reduced gradients for its own parameter partition. Rather than an AllReduce of all gradients, ZeRO performs a ReduceScatter: each gradient is reduced only on the rank responsible for that partition. After the rank updates its parameters, it can free the gradients.

Implementation uses a bucket strategy: gradients are bucketed per partition boundary, and reduction is triggered when a bucket is full — overlapping communication with backward computation, similar to NVIDIA AMP's all-reduce overlap.
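
A minimal sketch of the bucketed ReduceScatter pattern using torch.distributed directly (an illustration of the pattern, not DeepSpeed's actual bucket manager; it assumes a torchrun launch with one GPU per rank and a PyTorch version that provides reduce_scatter_tensor):

```python
import torch
import torch.distributed as dist

# Assumes a torchrun-style launch with a NCCL backend and one GPU per rank.
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

def flush_bucket(bucket: torch.Tensor) -> torch.Tensor:
    """Reduce a full gradient bucket and keep only this rank's 1/N_d shard."""
    shard = torch.empty(bucket.numel() // world, device=bucket.device, dtype=bucket.dtype)
    # Each rank receives fully reduced values for its own partition only.
    dist.reduce_scatter_tensor(shard, bucket, op=dist.ReduceOp.SUM)
    return shard / world                      # average, as DDP's AllReduce would

# Toy stand-in for a gradient bucket filled during the backward pass.
bucket = torch.randn(world * 1_000_000, device="cuda", dtype=torch.float16)
my_grad_shard = flush_bucket(bucket)          # only Ψ/N_d gradients stay resident
```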

Memory: 2Ψ + ((2+K)Ψ/N_d). Communication: still 2Ψ (ReduceScatter Ψ + AllGather Ψ).

Stage 3: P_p — Parameter Partitioning

Each rank stores only Ψ/N_d parameters. During the forward pass, the parameters for each layer are obtained via an AllGather immediately before that layer's computation and discarded as soon as it completes. The AllGather is pipelined across the forward pass, spreading communication over time.

During the backward pass, the same per-layer AllGather precedes gradient computation, and a ReduceScatter replaces the AllReduce.

Communication overhead: AllGather forward (Ψ) + AllGather backward (Ψ) + ReduceScatter backward (Ψ) = 3Ψ total = 1.5× baseline DP.

Memory: (2+2+K)Ψ/N_d = 16Ψ/N_d. For N_d=64: 1.9 GB for a 7.5B model.
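
The stage-by-stage formulas can be checked with a few lines of Python; this sketch reproduces the numbers quoted here and in the memory table under Results (decimal GB, K = 12):

```python
# Per-GPU model-state memory for each ZeRO-DP stage (bytes), following the
# formulas above: params/grads at 2 bytes each, optimizer states at K bytes/param.
def zero_dp_memory(psi: float, n_d: int, k: int = 12) -> dict:
    return {
        "baseline":  (2 + 2 + k) * psi,
        "P_os":      2 * psi + 2 * psi + k * psi / n_d,
        "P_os+g":    2 * psi + (2 + k) * psi / n_d,
        "P_os+g+p":  (2 + 2 + k) * psi / n_d,
    }

for stage, b in zero_dp_memory(psi=7.5e9, n_d=64).items():
    print(f"{stage:10s} {b / 1e9:6.1f} GB")
# baseline 120.0, P_os 31.4, P_os+g 16.6, P_os+g+p 1.9
```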

ZeRO-R Architecture

P_a: Partitioned Activation Checkpointing

In model parallelism, each MP worker replicates the full set of activation checkpoints, since all workers need them during the backward pass. ZeRO-R instead partitions these checkpoints across the MP group, storing only 1/N_m of each checkpoint per GPU (N_m = MP degree). An AllGather reconstructs the full activation just before it is needed for recomputation.

Communication overhead < 10% of baseline MP communication (one AllGather per transformer block vs. 4 AllReduces per block in Megatron-LM). For a 100B model with MP=16 and batch=32: reduces activation memory from 33 GB/GPU to ~2 GB/GPU.
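
A rough sketch of the partition/gather mechanics described above, using plain torch.distributed (an illustration only, not Megatron's or DeepSpeed's checkpointing code; it assumes an already-initialized MP process group and checkpoint sizes divisible by the MP degree):

```python
import torch
import torch.distributed as dist

def partition_checkpoint(act: torch.Tensor, mp_group=None) -> torch.Tensor:
    """Keep only this rank's 1/N_m slice of an activation checkpoint.

    Assumes act.numel() is divisible by the MP degree.
    """
    n_m = dist.get_world_size(mp_group)
    rank = dist.get_rank(mp_group)
    return act.flatten().chunk(n_m)[rank].clone()   # clone so the full tensor can be freed

def gather_checkpoint(shard: torch.Tensor, full_shape, mp_group=None) -> torch.Tensor:
    """AllGather the shards just before recomputation needs the full activation."""
    full = torch.empty(shard.numel() * dist.get_world_size(mp_group),
                       device=shard.device, dtype=shard.dtype)
    dist.all_gather_into_tensor(full, shard, group=mp_group)
    return full.view(full_shape)
```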

Optional CPU offload (P_a+cpu): moves even the partitioned checkpoints to CPU memory, reducing activation memory to near zero at the cost of 2× added PCIe data movement.

C_B: Constant-Size Buffers

Instead of fusing all gradients into a single buffer (which grows with model size — 12 GB for a 3B model), ZeRO-R uses a fixed-size fused buffer. The buffer is large enough for efficiency but does not scale with model size.

M_D: Memory Defragmentation

Forward pass: checkpointed activations (long-lived) interleave with non-checkpointed activations that are quickly discarded (short-lived). Backward pass: parameter gradients (long-lived) interleave with activation gradients, recomputed activations, and intermediate buffers (short-lived).

ZeRO-R pre-allocates contiguous memory chunks for long-lived objects (checkpoints and parameter gradients) and copies tensors into them as they are produced, eliminating fragmentation. This enables training with smaller effective memory margin.
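
A minimal sketch of the pre-allocation idea (the pool size and management policy here are hypothetical; DeepSpeed manages these pools internally):

```python
import torch

class ContiguousPool:
    """Pre-allocated contiguous storage for long-lived tensors (e.g., activation
    checkpoints or parameter gradients), so short-lived allocations never
    fragment the space around them."""

    def __init__(self, total_elems: int, device: str = "cuda", dtype=torch.float16):
        self.buffer = torch.empty(total_elems, device=device, dtype=dtype)
        self.offset = 0

    def store(self, t: torch.Tensor) -> torch.Tensor:
        """Copy a freshly produced tensor into the pool and return the long-lived view."""
        n = t.numel()
        dst = self.buffer[self.offset:self.offset + n]
        dst.copy_(t.flatten())
        self.offset += n
        return dst.view(t.shape)

    def reset(self) -> None:
        """Reclaim the pool once per training step (e.g., after the backward pass)."""
        self.offset = 0
```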


Key Algorithm / Formulation

Communication Volume Analysis

Standard DP AllReduce total data movement per step: Ψ (ReduceScatter) + Ψ (AllGather) = 2Ψ.

ZeRO-DP stages (summarized in the sketch below):
  - P_os:     2Ψ (unchanged: gradients reduced, updated parameters AllGathered)
  - P_os+g:   2Ψ (ReduceScatter of gradients Ψ + AllGather of parameters Ψ)
  - P_os+g+p: 3Ψ (AllGather of parameters in forward Ψ + in backward Ψ + ReduceScatter of gradients Ψ), i.e. 1.5× baseline

ZeRO-R P_a communication overhead in Megatron-LM context:
  - One AllGather of the partitioned activation checkpoint per transformer block before recomputation, adding less than 10% to the baseline MP communication volume.
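
The ZeRO-DP budget above, captured as a small lookup (volumes in multiples of Ψ; a sketch of the analysis, not measured traffic):

```python
# Per-step collective volume for each configuration, in multiples of Ψ.
COMM_VOLUME = {
    "baseline DP": 2.0,  # AllReduce = ReduceScatter (Ψ) + AllGather (Ψ)
    "P_os":        2.0,  # reduce gradients + AllGather updated parameters
    "P_os+g":      2.0,  # ReduceScatter gradients + AllGather parameters
    "P_os+g+p":    3.0,  # AllGather fwd + AllGather bwd + ReduceScatter bwd
}

def overhead_vs_baseline(stage: str) -> float:
    return COMM_VOLUME[stage] / COMM_VOLUME["baseline DP"]

print(overhead_vs_baseline("P_os+g+p"))  # -> 1.5
```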

Super-linear Scalability Mechanism

As DP degree N_d increases (more GPUs), per-GPU memory decreases proportionally (with P_os+g+p). This enables fitting a larger local batch size per GPU, which increases arithmetic intensity (operations per memory access). Higher arithmetic intensity → better GPU utilization → throughput per GPU increases even as N_d grows. This creates super-linear aggregate scaling.
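
A toy sketch of the mechanism, with explicitly made-up constants (gpu_mem_gb and act_gb_per_sample are illustrative assumptions, not numbers from the paper), using the 60B/MP=16 setup from the evaluation:

```python
# Toy illustration of the super-linear mechanism: as the DP degree N_d grows,
# P_os+g model-state memory per GPU shrinks, freeing room for a larger local
# batch and hence higher arithmetic intensity per GPU.
def max_local_batch(psi_per_mp_rank: float, n_d: int, gpu_mem_gb: float = 32.0,
                    act_gb_per_sample: float = 1.0, k: int = 12) -> int:
    model_state_gb = (2 * psi_per_mp_rank + (2 + k) * psi_per_mp_rank / n_d) / 1e9
    free_gb = max(gpu_mem_gb - model_state_gb, 0.0)
    return int(free_gb // act_gb_per_sample)

# 60B model split 16-way with MP -> 3.75B parameters per GPU; the DP degree
# grows from 4 (64 GPUs) to 25 (400 GPUs), as in the scaling experiment.
for gpus in (64, 128, 256, 400):
    print(gpus, max_local_batch(psi_per_mp_rank=3.75e9, n_d=gpus // 16))
```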


Evaluation Methodology

Hardware: 400 NVIDIA V100 32 GB GPUs across 25 DGX-2 nodes (16 GPUs per node), 800 Gbps inter-node bandwidth.

Baselines: PyTorch DDP (DP only, no MP) and Megatron-LM (with MP); the Megatron-LM open-source version from Sep 2019.

ZeRO-100B: Implements P_os+g (ZeRO-DP stage 2) + ZeRO-R (C_B + M_D + P_a). Does not include P_p (parameter partitioning).

Model architecture: GPT-2-style transformers with varying hidden dimensions and layer counts (Table 4 in paper).

Five ZeRO configurations tested (Table 3):

| Config | ZeRO-DP | ZeRO-R              |
|--------|---------|---------------------|
| C1     | P_os    | C_B + M_D           |
| C2     | P_os    | C_B + M_D + P_a     |
| C3     | P_os+g  | C_B + M_D           |
| C4     | P_os+g  | C_B + M_D + P_a     |
| C5     | P_os+g  | C_B + M_D + P_a+cpu |


Results

Model Size and Speed

ZeRO-100B trains models of up to 170B parameters, roughly 8× larger than the prior SOTA (Megatron-LM) could handle efficiently, at up to 15 PFLOPS aggregate throughput, up to 10× the throughput of Megatron-LM for comparably large models.

Super-linear Scaling (60B model)

| GPUs | Aggregate TFLOPS | TFLOPS per GPU |
|------|------------------|----------------|
| 64   | ~1100            | ~17            |
| 128  | ~3000            | ~23            |
| 256  | ~7000            | ~27            |
| 400  | ~14000           | ~35            |

Performance per GPU increases as GPU count increases — super-linear aggregate scaling.

Memory Analysis (7.5B model, N_d=64, K=12)

| Stage       | Memory per device |
|-------------|-------------------|
| Baseline DP | 120 GB            |
| P_os        | 31.4 GB           |
| P_os+g      | 16.6 GB           |
| P_os+g+p    | 1.9 GB            |

Max Model Size vs. ZeRO Config (MP=16, fixed batch)

| Config | Max Model Size |
|--------|----------------|
| C1     | 40B            |
| C2     | 60B            |
| C3     | 50B            |
| C4     | 140B           |
| C5     | 150B           |

Turing-NLG (17B parameters)

Turing-NLG, a 17B-parameter language model and the largest published model at the time, was trained end to end with ZeRO-100B.

Limitations

  1. P_p not in ZeRO-100B: Full parameter partitioning (stage 3, which achieves 1/N_d memory reduction) is described analytically but not included in the evaluated implementation. The 1.5× communication overhead of P_p may not be justified on bandwidth-constrained links.

  2. CPU offloading cost: P_a+cpu reduces memory at the cost of 2× activation data movement through PCIe. Performance degrades for models that do not need it (Figure 8, C5 worse than C4 for 60B model). Only beneficial when DP communication is the bottleneck at small batch sizes.

  3. Trillion-parameter compute gap: ZeRO provides the memory capacity to fit 1T parameters on 1024 GPUs with full P_os+g+p, but training time would be ~140 days assuming BERT-Large-equivalent compute density. ExaFLOP-scale compute infrastructure would be needed.

  4. Batch size dependency for super-linear speedup: The super-linear scaling property relies on increasing batch size with GPU count. Very large batch sizes can degrade convergence quality.

  5. Memory fragmentation fix is probabilistic: Pre-allocation of contiguous buffers assumes the lifetime patterns of activations and gradients remain consistent. Irregular model architectures or variable-length sequences may undermine this.

  6. No dynamic adaptation: ZeRO selects a configuration statically. There is no mechanism to switch between stages during training based on observed memory pressure or communication congestion.



RL Formulation Table

This paper contains no reinforcement learning. Not applicable.


Relevance to DynamICCL

ZeRO is highly relevant to DynamICCL because it directly determines the collective communication pattern generated by large-scale training jobs that DynamICCL must optimize.

Changed collective pattern vs. standard DP:

Standard DP (per training step):
  Backward ends → AllReduce(gradients, size=Ψ) → optimizer step

ZeRO P_os+g (per training step):
  During backward → ReduceScatter(gradient_bucket_i, size≈Ψ/N_buckets) × N_buckets [overlapped]
  Before forward  → AllGather(parameters, size=Ψ) [may be pipelined layer by layer with P_p]

Implications for DynamICCL's Config Agent:

  1. Two new collective types: ReduceScatter and AllGather appear explicitly in the training loop. NCCL's optimal algorithm/protocol for these may differ from AllReduce. DynamICCL must tune all three (or configure NCCL to treat them uniformly through the tuner API).

  2. Smaller, more frequent messages: ZeRO's bucketed ReduceScatter fires smaller messages during backward computation rather than one large AllReduce at the end. This changes the message-size distribution seen by NCCL. Small messages favor LL/LL128 protocols; DynamICCL's Config Agent should account for this (see the sketch after this list).

  3. Reduced peak bandwidth demand: By spreading collectives across the backward pass (overlap with compute), ZeRO reduces peak network utilization versus a single end-of-backward AllReduce. This changes congestion onset timing — DynamICCL's Trigger Agent (LSTM+CUSUM) may need to adapt its detection window to the more distributed traffic pattern.

  4. Communication volume budget: ZeRO P_os+g maintains 2Ψ total communication volume (same as standard DP), meaning DynamICCL's throughput optimization targets remain the same in aggregate but the temporal distribution of that volume changes significantly.

  5. ZeRO + Megatron combined workloads: In production, ZeRO-DP (inter-node, large-message ReduceScatter/AllGather) co-exists with Megatron-style tensor-parallel AllReduces (intra-node, small-message). DynamICCL must handle this multi-class collective environment, where the two traffic streams have fundamentally different optimal NCCL configurations.