A Quantitative Survey of Communication Optimizations in Distributed Deep Learning — Detailed Summary
Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li | HKUST / HKBU / Shenzhen Tech U. / HKUST | IEEE Network, Vol. 35, Issue 3, May/June 2021 | DOI: 10.1109/MNET.011.2000530
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- Modern DL models are increasingly trained in distributed settings to overcome the compute and memory limits of a single accelerator.
- Distributed training requires extensive cross-worker communication; the resulting communication overhead is now the dominant scaling bottleneck.
- The article surveys techniques for reducing communication cost in data-parallel training and conducts a head-to-head quantitative comparison of seven common lossless distributed-DL methods on a 32-GPU cluster connected by 100 Gb/s InfiniBand.
- Three families of optimizations are identified: communication-efficient learning algorithms (lossy), system-architecture and scheduling techniques (lossless), and infrastructure-level techniques (lossless).
- Empirical bake-off identifies which method-architecture pairings actually compose and which workloads scale to 32 GPUs.
1. Introduction
Background and motivation:
- DL compute requirements have doubled every 3.4 months since 2012, outpacing Moore's Law.
- Single-GPU training is insufficient for modern models such as BERT-Large and GPT-2; distributed training across many GPUs is now standard practice.
- The dominant pattern is data parallelism: each worker holds a replica of the model, computes gradients on a local mini-batch, and synchronizes gradients globally each step.
Communication wall:
- GPU compute throughput has scaled rapidly (V100, A100), but inter-GPU bandwidth has lagged; the result is a steep "communication wall."
- Amdahl's Law: non-parallel fraction (here, communication time) caps total speedup, regardless of how many GPUs are added.
Why a quantitative survey:
- Existing surveys are qualitative — they classify techniques but cannot guide a practitioner choosing between, say, BytePS and Horovod for a specific workload.
- This article fills that gap: it fixes the cluster (32 GPU, 100 Gb IB), fixes the framework (PyTorch 1.4 + NCCL 2.4.8), and empirically measures seven lossless methods on three real workloads.
2. Three-Level Taxonomy
The authors propose a unified taxonomy of communication optimizations arranged top-down from algorithm to wire:
+---------------------------------------------------------------+
| Level 1: Learning Algorithm (LOSSY) |
| - Synchronization: BSP | SSP | Local SGD | ASP |
| - Compression: Quantization | Sparsification |
| (e.g., 1-bit SGD, TernGrad, DGC) |
+---------------------------------------------------------------+
| Level 2: System Architecture / Scheduling (LOSSLESS) |
| - Architecture: Parameter Server (PS) | All-to-All (A2A) |
| - Scheduling: WFBP | Tensor Fusion (MG-WFBP) |
| Tensor Partition (ByteScheduler) |
+---------------------------------------------------------------+
| Level 3: Communication Infrastructure (LOSSLESS) |
| - Protocol: TCP/IP | RoCE | InfiniBand RDMA |
| - Topology: Tree | Fat-Tree | BCube | Torus |
+---------------------------------------------------------------+
2.1 Level 1 — Learning Algorithm
Synchronization variants:
- BSP (Bulk Synchronous Parallel): all workers must finish each step before the next; exact gradient match with single-GPU SGD; convergence is preserved but stragglers stall the cluster.
- SSP (Stale Synchronous Parallel): allows bounded staleness s; faster workers may proceed by up to s steps; convergence proven under bounded-staleness assumptions but slower per epoch.
- Local SGD: each worker performs h local steps before a single synchronization; communication is amortized 1/h-fold; convergence still holds for h up to model-dependent bounds.
- ASP (Asynchronous Parallel): no barrier; workers push-pull arbitrarily; highest throughput but worst convergence stability.
Compression variants:
- Quantization: reduce bits per gradient (e.g., 32 -> 1, 32 -> 8). Citing 1-bit SGD and TernGrad: up to 32x volume reduction, near-baseline accuracy on speech / image classification.
- Sparsification: select top-k or random-k gradient entries (DGC). Compression ratios up to 600x with <1% accuracy loss reported on ResNet-50 / ImageNet in cited works.
- Low-rank: send factored gradients UV^T instead of W. Less common in practice; trade-off between rank and accuracy is workload-dependent.
2.2 Level 2 — System Architecture and Scheduling
Architecture: Parameter Server (PS) vs. All-to-All (A2A):
- PS: a logically centralized server (or shard set) holds the master model copy; workers push gradients and pull updated weights. Pros: easy to implement, supports heterogeneous workers, allows fine-grained partitioned updates. Cons: PS bandwidth is the single bottleneck unless multi-PS load balancing (BytePS) is used.
- A2A: every worker exchanges gradients directly via collective operations (Ring/Tree AllReduce). Pros: no central bottleneck, NCCL highly optimized. Cons: latency-sensitive; small tensors waste bandwidth.
Scheduling techniques:
- WFBP (Wait-Free Backprop, Poseidon): each layer's gradient is sent as soon as it is produced during backward pass, overlapping comm with the remaining backward compute. Foundational technique in both PS and A2A.
- Tensor Fusion (MG-WFBP): small adjacent tensors are concatenated to amortize startup latency. Solves the small-message penalty critical for A2A.
- Tensor Partition (ByteScheduler): large tensors are split into sub- chunks so push and pull halves can overlap; combined with priority scheduling, gives near-linear scaling on PS for ResNet-50.
2.3 Level 3 — Communication Infrastructure
Protocols:
- TCP/IP over Ethernet: ubiquitous, but software stack overhead dominates at high speeds.
- RoCE (RDMA over Converged Ethernet): kernel bypass over Ethernet; needs PFC for losslessness.
- InfiniBand (RDMA): lowest latency, highest bandwidth in HPC; the survey's testbed uses 100 Gb/s IB.
Topologies:
- Tree: simplest, suffers from root bottleneck.
- Fat-Tree: standard datacenter topology; equal-cost multipath increases bisection bandwidth.
- BCube: server-centric topology used in some custom HPC settings.
- Torus: used by Google TPU pods; 2D / 3D torus reduces switch radix.
3. Network Microbenchmarks (Sec. III in the paper)
The survey first characterizes raw network performance to establish the hardware envelope.
Achieved one-way throughput at 1 MB message:
| Network | Protocol | Throughput |
|---|---|---|
| 10 GbE | TCP/IP | 8.2 Gb/s |
| 100 GbE | TCP/IP | 16.5 Gb/s |
| 100 Gb IB | RDMA | 83.2 Gb/s |
- 100 GbE TCP/IP achieves only ~16.5% of line rate due to kernel-stack overhead; RDMA reaches 83% of line rate.
Achieved one-way throughput at 16 KB message (latency-dominated regime):
| Network | Throughput |
|---|---|
| 10 GbE | 1.2 Gb/s |
| 100 GbE | 4.6 Gb/s |
| 100 Gb IB | 16.7 Gb/s (~16.7% of line rate) |
- Even on IB-RDMA, small-message efficiency collapses to ~16.7% of line rate — this is the wall that motivates Tensor Fusion.
Latency at small message sizes on IB:
| Message size | One-way latency |
|---|---|
| 16 KiB | 7.85 us |
| 32 KiB | 10.1 us |
- Sending 2x 16 KiB messages costs 15.7 us; sending one 32 KiB message costs 10.1 us. The 35% saving from fusion is the empirical motivation for MG-WFBP.
4. Quantitative Bake-off (Sec. IV in the paper)
4.1 Methodology
- Cluster: 8 nodes, each with 4x Nvidia RTX 2080Ti (11 GB), 2x Xeon Gold 6230, 512 GB RAM, PCIe 3.0 x16.
- Network: 100 Gb/s InfiniBand, RDMA.
- Software stack: PyTorch 1.4, CUDA 10.1, cuDNN 7.6, NCCL 2.4.8.
- PS framework: BytePS 0.2.0 (multi-PS with load balancing).
- A2A framework: Horovod 0.19.2 (NCCL backend).
- Methods evaluated:
| # | Method | Architecture |
|---|---|---|
| 1 | BSP-PS | PS, naive |
| 2 | BSP-A2A | A2A, naive |
| 3 | WFBP-PS | PS + Wait-Free Backprop |
| 4 | WFBP-A2A | A2A + Wait-Free Backprop |
| 5 | MG-WFBP | A2A + WFBP + Tensor Fusion |
| 6 | ByteScheduler-PS | PS + WFBP + Partition + Priority |
| 7 | ByteScheduler-A2A | A2A + WFBP + Partition + Priority |
- Workloads:
| Model | Local Batch Size | Model Intensity I = C/D |
|---|---|---|
| ResNet-50 | 64 | 470 |
| BERT-Base | 64 | 249 |
| BERT-Large | 8 (memory-limited) | 248 |
- Metric: throughput in samples/sec; speedup over single-GPU SGD.
4.2 ResNet-50
- Best method at 32 GPUs: ByteScheduler-PS with 31.6x speedup (near-linear).
- ResNet-50's high intensity (I = 470) means each backward step generates enough compute to cover communication of its many small filter tensors; partition + priority + WFBP + multi-PS together hide nearly all comm.
- A2A variants (Horovod + MG-WFBP) lag PS by a few percent because the PS architecture exploits cross-tensor priority scheduling more effectively than ring AllReduce.
4.3 BERT-Base
- Best method at 32 GPUs: MG-WFBP with 23.2x speedup.
- BERT-Base has many small attention/FFN tensors (intensity I = 249); fusion is dominant — small-message latency is the binding constraint.
- ByteScheduler's partition strategy actually hurts here because most tensors are already small; partitioning amplifies startup overhead.
4.4 BERT-Large
- BERT-Large (340 M params) fails to scale on RTX 2080Ti.
- Maximum measured speedup: only 1.2x at 4 GPUs, then degrades.
- Memory wall: 11 GB GPU memory forces LBS = 8, leaving little compute to hide AllReduce.
- Reference comparison on V100 32 GB + NVLink at LBS = 128: speedup of 3.82x at 4 GPUs, demonstrating that the bottleneck is intra-node bandwidth (PCIe 3.0 vs. NVLink) plus the small-batch problem.
- Compute / Communication breakdown for BERT-Large on RTX 2080Ti:
- Compute = 163 ms per iteration
- AllReduce = 441 ms per iteration
- C2C ratio ~2.7 — hopelessly comm-bound.
4.5 Cross-Cutting Findings
| Finding | Quantitative evidence |
|---|---|
| Small messages waste bandwidth | 16 KB on IB = 16.7% of line rate |
| Tensor fusion is essential for A2A | 2x16 KiB = 15.7 us vs. 1x32 KiB = 10.1 us (35% saving) |
| Tensor partition is essential for PS | ByteScheduler-PS reaches 31.6x; WFBP-PS lags by ~36% on ResNet-50 |
| Higher I scales better | ResNet-50 (I=470) reaches 31.6x; BERT-Base (I=249) reaches 23.2x |
| Memory + intra-node bw matter as much as inter-node | BERT-Large fails on PCIe 2080Ti, succeeds on NVLink V100 |
5. Major Focal Point Papers Cited
| System / Paper | Core technique | Headline result |
|---|---|---|
| BytePS | Multi-PS load balancing | Lifts PS to compete with A2A on small-tensor models |
| MG-WFBP | Optimal tensor fusion under WFBP | 23.2x on BERT-Base / 32 GPUs |
| ByteScheduler | Priority + partition + WFBP | 31.6x on ResNet-50 / 32 GPUs (near-linear) |
| Horovod | A2A wrapper using NCCL Ring/Tree AllReduce | Standard A2A baseline |
| Poseidon | Wait-Free Backprop, original | Foundation for all overlap-based work |
| 1-bit SGD / TernGrad / DGC | Quantization / sparsification | 32x to 600x compression at near-baseline accuracy |
| NCCL (referenced) | Double-binary-tree AllReduce | Logarithmic latency, bandwidth-optimal at large sizes |
6. Open Problems Explicitly Identified
The authors list four open questions in the survey's Discussion section:
- Convergence-aware extreme compression. Current sparsification reaches 600x at <1% accuracy loss; the question is whether 1000x or higher is reachable with formal convergence guarantees, especially when combined with momentum-aware optimizers.
- Automated PS-vs-A2A selection. No principled rule exists to choose architecture from (model, hardware) — a learned or analytical model would eliminate the need for hand-tuning.
- Generic schedulers for lossy DAGs. When compression itself adds compute (encode/decode kernels), the comp+comm DAG becomes irregular; existing WFBP-based schedulers do not handle this.
- Adaptive fuse-vs-partition. A run-time scheduler that decides per-tensor whether to merge or split would unify MG-WFBP and ByteScheduler — neither system does this dynamically.
7. Limitations of the Survey
- Data-parallelism only. ZeRO, Megatron-LM, Pipeline parallelism (PipeDream, GPipe), and hybrid 3D parallelism are out of the bake-off, although mentioned in passing.
- One hardware generation (RTX 2080Ti + PCIe 3.0); no V100/A100/H100 + NVLink/NVSwitch measurements (only a single reference data point for V100). H100 + NVSwitch + 400 Gb/s would invert several conclusions.
- Lossy methods are surveyed but not measured side-by-side with lossless methods — convergence cost vs. iteration savings is left implicit.
- Ethernet (TCP, RoCE) measurements are limited to the network-microbenchmark table; the bake-off is IB-only.
- NCCL 2.4.8 is the wire library; modern NCCL (2.18+) and MSCCL/MSCCL++ collectives are not evaluated.
8. Discussion of NCCL
- NCCL is the underlying transport for all A2A measurements (Horovod, ByteScheduler-A2A, MG-WFBP).
- The survey explicitly cites NCCL's double binary trees AllReduce as providing logarithmic-depth latency at bandwidth-optimal cost in dense GPU clusters.
- The 16 KB / 7.85 us small-message floor on IB is specifically attributed to NCCL + IB-verbs setup overhead; the survey treats this as a fixed property and motivates higher-layer fusion to hide it.
- Tensor Fusion (MG-WFBP) is presented as the right answer to NCCL's small-tensor inefficiency at the time of writing (2020-2021).
9. Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| 100 Gb IB-RDMA achieves ~83% line rate at 1 MB; ~17% at 16 KB | Sec. III network microbenchmarks |
| Small-message penalty is universal — fuse aggressively | 16 KiB latency 7.85 us, 32 KiB 10.1 us |
| PS+partition wins on tensor-rich, high-I models | ResNet-50 results |
| A2A+fusion wins on tensor-rich, mid-I models | BERT-Base results |
| Memory + intra-node bandwidth gate large-model scaling | BERT-Large failure on 2080Ti vs. V100 |
| NCCL Ring/Tree algorithms inherit a small-message wall that no scheduler can fully hide | Sec. III + Sec. IV combined |
10. Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters. This paper provides DynamICCL with empirically-grounded design priors — both for the action space and the reward function.
Direct mappings to DynamICCL design:
| Survey finding | DynamICCL design implication |
|---|---|
| 16 KB IB latency 7.85 us (16.7% of line rate); doubling msg saves 35% | Strong prior: at <64 KiB, prefer LL/LL128 + Tree; numChannels=1-2 is enough — don't waste channels on a latency-bound message. |
| 1 MB IB throughput 83 Gb/s (~83% line rate) | At >=1 MB, switch to Simple protocol + Ring + max numChannels. Use this as exploration anchor. |
| Tensor partitioning (chunkSize) is decisive on PS | DynamICCL's chunkSize/numPipeOps action axis is real and load-dependent — must be in the action space, not held constant. |
| Tensor fusion (concat) is decisive on A2A | DynamICCL should observe message-size-after-fusion (post-MG-WFBP) as state, since tuner runs after the scheduling layer. |
| Model intensity I = C/D predicts scaling | Add I as a state feature; policy generalizes across workloads with same I band. |
| Iteration time, not bandwidth, is the right metric | Reward = -collective_completion_time (or normalized iter time); avoid algBw/busBw proxy. |
| Hardware decisive: PCIe vs NVLink, IB vs Ethernet | Topology must be an explicit state feature; train per-topology heads or feed topology as embedding. |
Specific design priors for the RL agent:
Exploration prior on (algorithm, protocol):
- msg < 16 KiB: Tree + LL (latency-dominated; fewer hops, less overhead)
- 16 KiB - 1 MiB: Tree or Ring + LL128 (transition zone — let RL explore)
- msg >= 1 MiB: Ring + Simple (bandwidth-dominated; full pipeline)
Exploration prior on (nChannels, numThreads):
- At small message sizes (<64 KiB), nChannels=1, numThreads=128: extra channels add only setup overhead.
- At large message sizes (>=1 MiB), nChannels=4-8, numThreads=512: parallelism amortizes startup.
Reward shaping:
- Primary: r = -collective_wall_clock_us
- Optional secondary: penalize p99 latency to catch tail outliers (consistent with the survey's emphasis on iteration time, not throughput).
State features (per the survey's predictive variables):
- Message size (log-binned)
- Model intensity I (workload identity feature)
- Topology fingerprint: NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet
- Recent-collective timing window (LSTM-encoded, per Saraswati's notes)
Action-space duality (fuse-vs-partition):
- The survey's insight that fusion and partition are duals of the same parameter family supports DynamICCL's view of chunkSize and numPipeOps as a single "granularity" action axis. The RL agent should learn to pick coarse chunks for small messages (effectively fusion) and fine chunks for large messages (effectively partition).
Research positioning:
- This survey holds NCCL constant and varies the schedule above NCCL. DynamICCL inverts this: holds the schedule constant and varies NCCL itself via the tuner-plugin API. The survey's findings establish the baseline that DynamICCL should beat, and identify the workload regimes (BERT-Large on PCIe + 11 GB GPU) where the existing schedule layer gives up — exactly where tuner-level optimization has the most room.
Open-problem alignment:
- The survey's open problem #2 ("automated PS-vs-A2A selection") and open problem #4 ("adaptive fuse-vs-partition") are direct precursors of DynamICCL: replace hand-tuned heuristics with a learned policy that observes runtime state. DynamICCL's RL formulation is the natural answer to both, scoped to the NCCL configuration sub-problem.