Brief Summary: ZeRO

Full title: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)
Year: 2020 (arXiv:1910.02054v3)
Venue: arXiv / SC 2020


Problem

Training very large deep learning models (billions to trillions of parameters) is blocked by GPU memory capacity. Data parallelism (DP) replicates all model states on every device, which is wasteful. Model parallelism (MP) reduces memory per device but degrades efficiency severely when crossing node boundaries because of the high inter-node communication it requires. A 1.5B-parameter GPT-2 model needs ~24 GB just for model states when trained with Adam in mixed precision, which already exceeds a 32 GB GPU's usable capacity once activations and temporary buffers are added on top.
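(Worked accounting, following the paper's cost model: mixed-precision Adam keeps fp16 parameters and gradients at 2 + 2 bytes per parameter, plus fp32 master parameters, momentum, and variance at 4 + 4 + 4 bytes per parameter, i.e. 16 bytes per parameter in total, so 1.5 × 10^9 parameters × 16 B ≈ 24 GB before any activation memory.)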

Core Insight

Memory redundancy in DP is the core problem, not DP's communication pattern. By partitioning model states (optimizer states, gradients, parameters) across data-parallel workers instead of replicating them, ZeRO achieves memory efficiency comparable to MP while retaining DP's favorable compute/communication characteristics. Crucially, not all model states are needed simultaneously — a dynamic communication schedule can reconstruct them on demand at minimal added communication cost.

Method: ZeRO-DP (Three Stages)

For a model with Ψ parameters, Adam optimizer (K=12 memory multiplier), and N_d data-parallel workers:

Stage 1, P_os: partitions optimizer states only. Memory per device: 4Ψ + (K·Ψ)/N_d. Extra communication vs. baseline DP: none.
Stage 2, P_os+g: partitions optimizer states and gradients. Memory per device: 2Ψ + (2+K)·Ψ/N_d. Extra communication vs. baseline DP: none.
Stage 3, P_os+g+p: partitions all model states (parameters, gradients, optimizer states). Memory per device: (2+2+K)·Ψ/N_d. Extra communication vs. baseline DP: 1.5×.

At N_d=64: baseline DP uses 120 GB per device for a 7.5B model; P_os+g+p reduces this to 1.9 GB.
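To sanity-check those numbers, here is a minimal plain-Python sketch (my own illustration, not DeepSpeed code; the function name and rounding are mine) that evaluates the per-stage formulas above:

```python
# Per-device model-state memory for the three ZeRO-DP stages, using the paper's
# accounting for mixed-precision Adam: 2 bytes fp16 params + 2 bytes fp16 grads
# + K = 12 bytes of fp32 optimizer states per parameter.

def zero_dp_memory_gb(psi, n_d, k=12):
    """Return per-device model-state memory (decimal GB) for baseline DP and each stage."""
    gb = 1e9
    baseline = (2 + 2 + k) * psi                 # everything replicated
    p_os     = (2 + 2) * psi + k * psi / n_d     # optimizer states partitioned
    p_os_g   = 2 * psi + (2 + k) * psi / n_d     # + gradients partitioned
    p_os_g_p = (2 + 2 + k) * psi / n_d           # + parameters partitioned
    return {name: round(v / gb, 2)
            for name, v in [("baseline DP", baseline), ("P_os", p_os),
                            ("P_os+g", p_os_g), ("P_os+g+p", p_os_g_p)]}

# Reproduces the 7.5B example: baseline ~120 GB -> P_os+g+p ~1.9 GB at N_d = 64.
print(zero_dp_memory_gb(psi=7.5e9, n_d=64))
```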

P_g is implemented as a Reduce-Scatter (rather than an AllReduce) of gradients during the backward pass, so each process ends up holding only the fully reduced gradients for its own parameter partition. P_p requires an AllGather of parameters before both the forward and backward passes of each layer (pipelined to overlap with compute); see the sketch below.
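A minimal sketch of those two collectives using torch.distributed (my illustration of the pattern, not DeepSpeed's implementation; it assumes an initialized process group, PyTorch's reduce_scatter_tensor / all_gather_into_tensor collectives, and flat tensors whose sizes divide evenly by the world size):

```python
import torch
import torch.distributed as dist

def backward_gradient_reduce(flat_grad: torch.Tensor) -> torch.Tensor:
    """Reduce-scatter the flat gradient so each rank keeps only the reduced
    shard corresponding to its own parameter partition (P_g)."""
    world = dist.get_world_size()
    shard = torch.empty(flat_grad.numel() // world,
                        dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
    return shard / world  # average across data-parallel ranks

def gather_params_for_layer(param_shard: torch.Tensor) -> torch.Tensor:
    """All-gather the parameter shards so every rank temporarily holds the full
    layer parameters for forward/backward (P_p); the full tensor can be freed
    again once the layer's compute is done."""
    world = dist.get_world_size()
    full = torch.empty(param_shard.numel() * world,
                       dtype=param_shard.dtype, device=param_shard.device)
    dist.all_gather_into_tensor(full, param_shard)
    return full
```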

ZeRO-R (Residual Memory)

Three components address activation, buffer, and fragmentation memory:

- P_a (partitioned activation checkpointing): activation checkpoints are split across model-parallel ranks instead of being replicated, with an optional CPU-offload variant (P_a+cpu).
- C_B (constant-size buffers): the temporary fused buffers used for collectives have a fixed size rather than growing with model size (a sketch of the bucketing idea follows below).
- M_D (memory defragmentation): long-lived tensors such as activation checkpoints and gradients are moved into pre-allocated contiguous memory to avoid fragmentation-induced allocation failures.
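A minimal sketch of the constant-size buffer (C_B) idea, my own illustration rather than DeepSpeed's code: tensors are copied into flat buckets of a fixed target size before each collective, so buffer memory stays roughly constant regardless of model size.

```python
import torch
from typing import Iterable, List

def bucket_tensors(tensors: Iterable[torch.Tensor],
                   bucket_numel: int = 2 * 10**8) -> List[torch.Tensor]:
    """Flatten tensors into fused buckets of roughly `bucket_numel` elements each;
    each bucket would be launched as one collective and then reused or freed."""
    buckets, current, filled = [], [], 0
    for t in tensors:
        if filled + t.numel() > bucket_numel and current:
            buckets.append(torch.cat([x.reshape(-1) for x in current]))
            current, filled = [], 0
        current.append(t)
        filled += t.numel()
    if current:
        buckets.append(torch.cat([x.reshape(-1) for x in current]))
    return buckets
```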

Key Results (ZeRO-100B implementation: P_os+g + ZeRO-R)

Limitations

Relevance to DynamICCL

ZeRO fundamentally changes the collective communication pattern of data-parallel training, which DynamICCL must handle: