Brief Summary: ZeRO
Full title: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)
Year: 2020 (arXiv:1910.02054v3)
Venue: arXiv / SC 2020
Problem
Training very large deep learning models (billions to trillions of parameters) is blocked by GPU memory capacity. Data parallelism (DP) replicates all model states on every device — wasteful. Model parallelism (MP) reduces memory per device but degrades efficiency severely when crossing node boundaries due to high inter-node communication. A 1.5B GPT-2 model requires ~24 GB to train with Adam in mixed precision, exceeding a 32 GB GPU's usable capacity when activations and buffers are included.
Core Insight
Memory redundancy in DP is the core problem, not DP's communication pattern. By partitioning model states (optimizer states, gradients, parameters) across data-parallel workers instead of replicating them, ZeRO achieves memory efficiency comparable to MP while retaining DP's favorable compute/communication characteristics. Crucially, not all model states are needed simultaneously — a dynamic communication schedule can reconstruct them on demand at minimal added communication cost.
Method: ZeRO-DP (Three Stages)
For a model with Ψ parameters, Adam optimizer (K=12 memory multiplier), and N_d data-parallel workers:
| Stage | What is partitioned | Memory per device | Communication vs. baseline DP |
|---|---|---|---|
| P_os | Optimizer states only | 4Ψ + K·Ψ/N_d | Same (1×) |
| P_os+g | Optimizer states + gradients | 2Ψ + (2+K)·Ψ/N_d | Same (1×) |
| P_os+g+p | All model states | (2+2+K)·Ψ/N_d | 1.5× |
At N_d=64: baseline DP uses 120 GB per device for a 7.5B model; P_os+g+p reduces this to 1.9 GB.
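The stage formulas above can be checked numerically. A minimal sketch (illustrative, not the paper's code) that reproduces the 7.5B-parameter example; `psi`, `nd`, and `k` follow the paper's Ψ, N_d, and K symbols, and mixed precision contributes 2Ψ bytes of fp16 parameters plus 2Ψ bytes of fp16 gradients:

```python
# Per-device memory for each ZeRO-DP stage, in decimal GB.
# k=12 is the mixed-precision Adam multiplier (fp32 params, momentum, variance).
def zero_dp_memory_gb(psi, nd, k=12):
    gb = 1e9  # the paper uses decimal gigabytes
    return {
        "baseline": (2 + 2 + k) * psi / gb,             # everything replicated
        "P_os":     (2 + 2) * psi / gb + k * psi / (nd * gb),
        "P_os+g":   2 * psi / gb + (2 + k) * psi / (nd * gb),
        "P_os+g+p": (2 + 2 + k) * psi / (nd * gb),      # everything partitioned
    }

mem = zero_dp_memory_gb(psi=7.5e9, nd=64)
# baseline -> 120.0 GB; P_os+g+p -> 1.875 GB, matching the ~1.9 GB figure above
```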
P_g is implemented as a ReduceScatter (not AllReduce) of gradients during the backward pass; each process reduces only the gradients for its own parameter partition. P_p additionally requires an AllGather of parameters during both the forward and backward passes (pipelined to overlap with compute).
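The partition semantics can be illustrated with a pure-Python sketch (assumed toy model, not NCCL or DeepSpeed code): ReduceScatter leaves each worker holding only the summed gradients for its own partition, and AllGather rebuilds the full vector on every worker:

```python
# Toy simulation of the two collectives underpinning ZeRO-DP.
def reduce_scatter(grads_per_worker):
    """Each worker r ends up with the element-wise sum of its chunk only."""
    nd = len(grads_per_worker)
    n = len(grads_per_worker[0])
    chunk = n // nd  # assume N_d divides the gradient length
    return [
        [sum(g[i] for g in grads_per_worker)
         for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(nd)
    ]

def all_gather(partitions):
    """Every worker receives the concatenation of all partitions."""
    full = [x for part in partitions for x in part]
    return [list(full) for _ in partitions]

grads = [[1.0] * 8, [2.0] * 8]   # 2 workers, 8 gradients each
parts = reduce_scatter(grads)    # each worker keeps 4 summed values
assert parts == [[3.0] * 4, [3.0] * 4]
assert all_gather(parts)[0] == [3.0] * 8
```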
ZeRO-R (Residual Memory)
Three components address activation, buffer, and fragmentation memory:
- P_a: Partition activation checkpoints across MP workers; use AllGather to reconstruct on demand. Reduces activation footprint by MP degree (up to 16×). Optionally offload to CPU (P_a+cpu).
- C_B: Fixed-size constant buffers for gradient all-reduce/norm computation, preventing O(model size) temporary tensor allocation.
- M_D: On-the-fly memory defragmentation by pre-allocating contiguous buffers for activation checkpoints and gradients.
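The C_B idea can be sketched as greedy packing into fixed-size buckets (illustrative only; `bucketize` and the tiny bucket size are assumptions, not DeepSpeed's implementation). The point is that temporary fused-buffer memory stays constant as the model grows, instead of scaling with model size:

```python
BUCKET_ELEMS = 4  # tiny for illustration; real fused buffers are far larger

def bucketize(tensor_sizes, bucket_elems=BUCKET_ELEMS):
    """Greedily pack per-tensor element counts into fixed-size buckets;
    returns a list of buckets, each a list of tensor indices."""
    buckets, current, used = [], [], 0
    for idx, size in enumerate(tensor_sizes):
        if used + size > bucket_elems and current:
            buckets.append(current)   # flush the full bucket, reuse the buffer
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        buckets.append(current)
    return buckets

# 5 gradient tensors packed into 4-element buckets
assert bucketize([2, 3, 1, 4, 2]) == [[0], [1, 2], [3], [4]]
```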
Key Results (ZeRO-100B implementation: P_os+g + ZeRO-R)
- Trains models up to 170B parameters on 400 V100 GPUs (8× larger than Megatron's 20B practical limit).
- Sustained throughput: 15 PetaFLOPs aggregate, 38 TFLOPs/GPU on 100B models (>30% peak).
- Super-linear scaling from 64→400 GPUs for a 60B model: performance more than doubles when the GPU count doubles, because a higher DP degree shrinks per-device model-state memory, freeing room for a larger per-GPU batch size and improving arithmetic intensity.
- Powered Turing-NLG (17B), the world's largest language model at the time, with WikiText-103 perplexity 10.21.
- Without MP: trains up to 13B parameters, democratizing access for researchers without custom parallelism code.
Limitations
- P_os+g+p (full parameter partitioning) increases communication by 1.5× — matters on bandwidth-constrained inter-node links.
- CPU offloading (P_a+cpu) adds 2× activation data movement overhead and is only beneficial when DP communication is the bottleneck.
- Compute capacity remains the binding constraint for trillion-parameter models; a 1T model would take ~140 days on 1K V100s.
- ZeRO-100B implements only P_os+g (not P_p) for engineering practicality; the full P_os+g+p stage is available in DeepSpeed.
- Memory fragmentation fixes require pre-allocation, which can reduce available memory for other uses.
Relevance to DynamICCL
ZeRO fundamentally changes the collective communication pattern of data-parallel training, which DynamICCL must handle:
- Replaces AllReduce with ReduceScatter + AllGather: P_os+g uses a ReduceScatter for gradients during backward pass and an AllGather for parameters before forward pass. DynamICCL's Config Agent must support both AllReduce and ReduceScatter/AllGather as distinct collective types with different optimal NCCL configurations.
- Communication volume stays ~2Ψ per step for P_os+g — same as standard DP's AllReduce. NCCL tuning insights from standard DP transfer directly.
- Burstiness changes: In standard DP, one large AllReduce fires after the full backward pass. In ZeRO, ReduceScatter fires bucket-by-bucket during backward (overlapping compute), and AllGather fires at the start of each forward pass — creating a more distributed, lower-peak-bandwidth traffic pattern that changes congestion dynamics.
- ZeRO + MP combined: Using ZeRO-DP with Megatron-LM style MP creates two distinct communication planes (intra-node model-parallel all-reduces + inter-node ZeRO scatter/gather), exactly the multi-collective environment DynamICCL must navigate.
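The burstiness contrast in the list above can be made concrete with a schedule sketch (an assumed toy model; the function names and byte counts are illustrative, not measurements). Standard DP emits one large AllReduce after backward, while ZeRO P_os+g emits a small ReduceScatter per gradient bucket during backward plus one parameter AllGather before the next forward:

```python
# Toy collective schedules as (when, collective, bytes) events.
def dp_schedule(num_buckets, bucket_bytes):
    """Standard DP: a single AllReduce over all gradients after backward."""
    return [("backward_end", "AllReduce", num_buckets * bucket_bytes)]

def zero_pos_g_schedule(num_buckets, bucket_bytes):
    """ZeRO P_os+g: per-bucket ReduceScatter during backward (overlapping
    compute), then one AllGather of updated parameters before forward."""
    events = [(f"backward_bucket_{i}", "ReduceScatter", bucket_bytes)
              for i in range(num_buckets)]
    events.append(("forward_start", "AllGather", num_buckets * bucket_bytes))
    return events

dp = dp_schedule(4, 25_000_000)
zr = zero_pos_g_schedule(4, 25_000_000)
# Same gradient volume, but ZeRO's largest single collective during the
# backward pass is 4x smaller here, spreading traffic across the step.
```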