A Quantitative Survey of Communication Optimizations in Distributed Deep Learning — Detailed Summary

Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li | HKUST / HKBU / Shenzhen Tech U. / HKUST | IEEE Network, Vol. 35, Issue 3, May/June 2021 | DOI: 10.1109/MNET.011.2000530

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

Modern DL models are increasingly trained in distributed settings to overcome the compute and memory limits of a single accelerator.
Distributed training requires extensive cross-worker communication; the resulting communication overhead is now the dominant scaling bottleneck.
The article surveys techniques for reducing communication cost in data-parallel training and conducts a head-to-head quantitative comparison of seven common lossless distributed-DL methods on a 32-GPU cluster connected by 100 Gb/s InfiniBand.
Three families of optimizations are identified: communication-efficient learning algorithms (lossy), system-architecture and scheduling techniques (lossless), and infrastructure-level techniques (lossless).
Empirical bake-off identifies which method-architecture pairings actually compose and which workloads scale to 32 GPUs.

1. Introduction

Background and motivation:

DL compute requirements have doubled every 3.4 months since 2012, outpacing Moore's Law.
Single-GPU training is insufficient for modern models such as BERT-Large and GPT-2; distributed training across many GPUs is now standard practice.
The dominant pattern is data parallelism: each worker holds a replica of the model, computes gradients on a local mini-batch, and synchronizes gradients globally each step.

Communication wall:

GPU compute throughput has scaled rapidly (V100, A100), but inter-GPU bandwidth has lagged; the result is a steep "communication wall."
Amdahl's Law: non-parallel fraction (here, communication time) caps total speedup, regardless of how many GPUs are added.

Why a quantitative survey:

Existing surveys are qualitative — they classify techniques but cannot guide a practitioner choosing between, say, BytePS and Horovod for a specific workload.
This article fills that gap: it fixes the cluster (32 GPU, 100 Gb IB), fixes the framework (PyTorch 1.4 + NCCL 2.4.8), and empirically measures seven lossless methods on three real workloads.

2. Three-Level Taxonomy

The authors propose a unified taxonomy of communication optimizations arranged top-down from algorithm to wire:

+---------------------------------------------------------------+
| Level 1: Learning Algorithm  (LOSSY)                          |
|   - Synchronization:  BSP | SSP | Local SGD | ASP             |
|   - Compression:      Quantization | Sparsification           |
|                       (e.g., 1-bit SGD, TernGrad, DGC)        |
+---------------------------------------------------------------+
| Level 2: System Architecture / Scheduling  (LOSSLESS)         |
|   - Architecture:  Parameter Server (PS) | All-to-All (A2A)   |
|   - Scheduling:    WFBP | Tensor Fusion (MG-WFBP)             |
|                    Tensor Partition (ByteScheduler)           |
+---------------------------------------------------------------+
| Level 3: Communication Infrastructure  (LOSSLESS)             |
|   - Protocol:  TCP/IP | RoCE | InfiniBand RDMA                |
|   - Topology:  Tree | Fat-Tree | BCube | Torus                |
+---------------------------------------------------------------+

2.1 Level 1 — Learning Algorithm

Synchronization variants:

BSP (Bulk Synchronous Parallel): all workers must finish each step before the next; exact gradient match with single-GPU SGD; convergence is preserved but stragglers stall the cluster.
SSP (Stale Synchronous Parallel): allows bounded staleness s; faster workers may proceed by up to s steps; convergence proven under bounded-staleness assumptions but slower per epoch.
Local SGD: each worker performs h local steps before a single synchronization; communication is amortized 1/h-fold; convergence still holds for h up to model-dependent bounds.
ASP (Asynchronous Parallel): no barrier; workers push-pull arbitrarily; highest throughput but worst convergence stability.

Compression variants:

Quantization: reduce bits per gradient (e.g., 32 -> 1, 32 -> 8). Citing 1-bit SGD and TernGrad: up to 32x volume reduction, near-baseline accuracy on speech / image classification.
Sparsification: select top-k or random-k gradient entries (DGC). Compression ratios up to 600x with <1% accuracy loss reported on ResNet-50 / ImageNet in cited works.
Low-rank: send factored gradients UV^T instead of W. Less common in practice; trade-off between rank and accuracy is workload-dependent.

2.2 Level 2 — System Architecture and Scheduling

Architecture: Parameter Server (PS) vs. All-to-All (A2A):

PS: a logically centralized server (or shard set) holds the master model copy; workers push gradients and pull updated weights. Pros: easy to implement, supports heterogeneous workers, allows fine-grained partitioned updates. Cons: PS bandwidth is the single bottleneck unless multi-PS load balancing (BytePS) is used.
A2A: every worker exchanges gradients directly via collective operations (Ring/Tree AllReduce). Pros: no central bottleneck, NCCL highly optimized. Cons: latency-sensitive; small tensors waste bandwidth.

Scheduling techniques:

WFBP (Wait-Free Backprop, Poseidon): each layer's gradient is sent as soon as it is produced during backward pass, overlapping comm with the remaining backward compute. Foundational technique in both PS and A2A.
Tensor Fusion (MG-WFBP): small adjacent tensors are concatenated to amortize startup latency. Solves the small-message penalty critical for A2A.
Tensor Partition (ByteScheduler): large tensors are split into sub- chunks so push and pull halves can overlap; combined with priority scheduling, gives near-linear scaling on PS for ResNet-50.

2.3 Level 3 — Communication Infrastructure

Protocols:

TCP/IP over Ethernet: ubiquitous, but software stack overhead dominates at high speeds.
RoCE (RDMA over Converged Ethernet): kernel bypass over Ethernet; needs PFC for losslessness.
InfiniBand (RDMA): lowest latency, highest bandwidth in HPC; the survey's testbed uses 100 Gb/s IB.

Topologies:

Tree: simplest, suffers from root bottleneck.
Fat-Tree: standard datacenter topology; equal-cost multipath increases bisection bandwidth.
BCube: server-centric topology used in some custom HPC settings.
Torus: used by Google TPU pods; 2D / 3D torus reduces switch radix.

3. Network Microbenchmarks (Sec. III in the paper)

The survey first characterizes raw network performance to establish the hardware envelope.

Achieved one-way throughput at 1 MB message:

Network	Protocol	Throughput
10 GbE	TCP/IP	8.2 Gb/s
100 GbE	TCP/IP	16.5 Gb/s
100 Gb IB	RDMA	83.2 Gb/s

100 GbE TCP/IP achieves only ~16.5% of line rate due to kernel-stack overhead; RDMA reaches 83% of line rate.

Achieved one-way throughput at 16 KB message (latency-dominated regime):

Network	Throughput
10 GbE	1.2 Gb/s
100 GbE	4.6 Gb/s
100 Gb IB	16.7 Gb/s (~16.7% of line rate)

Even on IB-RDMA, small-message efficiency collapses to ~16.7% of line rate — this is the wall that motivates Tensor Fusion.

Latency at small message sizes on IB:

Message size	One-way latency
16 KiB	7.85 us
32 KiB	10.1 us

Sending 2x 16 KiB messages costs 15.7 us; sending one 32 KiB message costs 10.1 us. The 35% saving from fusion is the empirical motivation for MG-WFBP.

4. Quantitative Bake-off (Sec. IV in the paper)

4.1 Methodology

Cluster: 8 nodes, each with 4x Nvidia RTX 2080Ti (11 GB), 2x Xeon Gold 6230, 512 GB RAM, PCIe 3.0 x16.
Network: 100 Gb/s InfiniBand, RDMA.
Software stack: PyTorch 1.4, CUDA 10.1, cuDNN 7.6, NCCL 2.4.8.
PS framework: BytePS 0.2.0 (multi-PS with load balancing).
A2A framework: Horovod 0.19.2 (NCCL backend).
Methods evaluated:

#	Method	Architecture
1	BSP-PS	PS, naive
2	BSP-A2A	A2A, naive
3	WFBP-PS	PS + Wait-Free Backprop
4	WFBP-A2A	A2A + Wait-Free Backprop
5	MG-WFBP	A2A + WFBP + Tensor Fusion
6	ByteScheduler-PS	PS + WFBP + Partition + Priority
7	ByteScheduler-A2A	A2A + WFBP + Partition + Priority

Workloads:

Model	Local Batch Size	Model Intensity I = C/D
ResNet-50	64	470
BERT-Base	64	249
BERT-Large	8 (memory-limited)	248

Metric: throughput in samples/sec; speedup over single-GPU SGD.

4.2 ResNet-50

Best method at 32 GPUs: ByteScheduler-PS with 31.6x speedup (near-linear).
ResNet-50's high intensity (I = 470) means each backward step generates enough compute to cover communication of its many small filter tensors; partition + priority + WFBP + multi-PS together hide nearly all comm.
A2A variants (Horovod + MG-WFBP) lag PS by a few percent because the PS architecture exploits cross-tensor priority scheduling more effectively than ring AllReduce.

4.3 BERT-Base

Best method at 32 GPUs: MG-WFBP with 23.2x speedup.
BERT-Base has many small attention/FFN tensors (intensity I = 249); fusion is dominant — small-message latency is the binding constraint.
ByteScheduler's partition strategy actually hurts here because most tensors are already small; partitioning amplifies startup overhead.

4.4 BERT-Large

BERT-Large (340 M params) fails to scale on RTX 2080Ti.
Maximum measured speedup: only 1.2x at 4 GPUs, then degrades.
Memory wall: 11 GB GPU memory forces LBS = 8, leaving little compute to hide AllReduce.
Reference comparison on V100 32 GB + NVLink at LBS = 128: speedup of 3.82x at 4 GPUs, demonstrating that the bottleneck is intra-node bandwidth (PCIe 3.0 vs. NVLink) plus the small-batch problem.
Compute / Communication breakdown for BERT-Large on RTX 2080Ti:
- Compute = 163 ms per iteration
- AllReduce = 441 ms per iteration
- C2C ratio ~2.7 — hopelessly comm-bound.

4.5 Cross-Cutting Findings

Finding	Quantitative evidence
Small messages waste bandwidth	16 KB on IB = 16.7% of line rate
Tensor fusion is essential for A2A	2x16 KiB = 15.7 us vs. 1x32 KiB = 10.1 us (35% saving)
Tensor partition is essential for PS	ByteScheduler-PS reaches 31.6x; WFBP-PS lags by ~36% on ResNet-50
Higher I scales better	ResNet-50 (I=470) reaches 31.6x; BERT-Base (I=249) reaches 23.2x
Memory + intra-node bw matter as much as inter-node	BERT-Large fails on PCIe 2080Ti, succeeds on NVLink V100

5. Major Focal Point Papers Cited

System / Paper	Core technique	Headline result
BytePS	Multi-PS load balancing	Lifts PS to compete with A2A on small-tensor models
MG-WFBP	Optimal tensor fusion under WFBP	23.2x on BERT-Base / 32 GPUs
ByteScheduler	Priority + partition + WFBP	31.6x on ResNet-50 / 32 GPUs (near-linear)
Horovod	A2A wrapper using NCCL Ring/Tree AllReduce	Standard A2A baseline
Poseidon	Wait-Free Backprop, original	Foundation for all overlap-based work
1-bit SGD / TernGrad / DGC	Quantization / sparsification	32x to 600x compression at near-baseline accuracy
NCCL (referenced)	Double-binary-tree AllReduce	Logarithmic latency, bandwidth-optimal at large sizes

6. Open Problems Explicitly Identified

The authors list four open questions in the survey's Discussion section:

Convergence-aware extreme compression. Current sparsification reaches 600x at <1% accuracy loss; the question is whether 1000x or higher is reachable with formal convergence guarantees, especially when combined with momentum-aware optimizers.
Automated PS-vs-A2A selection. No principled rule exists to choose architecture from (model, hardware) — a learned or analytical model would eliminate the need for hand-tuning.
Generic schedulers for lossy DAGs. When compression itself adds compute (encode/decode kernels), the comp+comm DAG becomes irregular; existing WFBP-based schedulers do not handle this.
Adaptive fuse-vs-partition. A run-time scheduler that decides per-tensor whether to merge or split would unify MG-WFBP and ByteScheduler — neither system does this dynamically.

7. Limitations of the Survey

Data-parallelism only. ZeRO, Megatron-LM, Pipeline parallelism (PipeDream, GPipe), and hybrid 3D parallelism are out of the bake-off, although mentioned in passing.
One hardware generation (RTX 2080Ti + PCIe 3.0); no V100/A100/H100 + NVLink/NVSwitch measurements (only a single reference data point for V100). H100 + NVSwitch + 400 Gb/s would invert several conclusions.
Lossy methods are surveyed but not measured side-by-side with lossless methods — convergence cost vs. iteration savings is left implicit.
Ethernet (TCP, RoCE) measurements are limited to the network-microbenchmark table; the bake-off is IB-only.
NCCL 2.4.8 is the wire library; modern NCCL (2.18+) and MSCCL/MSCCL++ collectives are not evaluated.

8. Discussion of NCCL

NCCL is the underlying transport for all A2A measurements (Horovod, ByteScheduler-A2A, MG-WFBP).
The survey explicitly cites NCCL's double binary trees AllReduce as providing logarithmic-depth latency at bandwidth-optimal cost in dense GPU clusters.
The 16 KB / 7.85 us small-message floor on IB is specifically attributed to NCCL + IB-verbs setup overhead; the survey treats this as a fixed property and motivates higher-layer fusion to hide it.
Tensor Fusion (MG-WFBP) is presented as the right answer to NCCL's small-tensor inefficiency at the time of writing (2020-2021).

9. Cross-Cutting Empirical Take-Aways

Take-away	Derived from
100 Gb IB-RDMA achieves ~83% line rate at 1 MB; ~17% at 16 KB	Sec. III network microbenchmarks
Small-message penalty is universal — fuse aggressively	16 KiB latency 7.85 us, 32 KiB 10.1 us
PS+partition wins on tensor-rich, high-I models	ResNet-50 results
A2A+fusion wins on tensor-rich, mid-I models	BERT-Base results
Memory + intra-node bandwidth gate large-model scaling	BERT-Large failure on 2080Ti vs. V100
NCCL Ring/Tree algorithms inherit a small-message wall that no scheduler can fully hide	Sec. III + Sec. IV combined

10. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters. This paper provides DynamICCL with empirically-grounded design priors — both for the action space and the reward function.

Direct mappings to DynamICCL design:

Survey finding	DynamICCL design implication
16 KB IB latency 7.85 us (16.7% of line rate); doubling msg saves 35%	Strong prior: at <64 KiB, prefer LL/LL128 + Tree; numChannels=1-2 is enough — don't waste channels on a latency-bound message.
1 MB IB throughput 83 Gb/s (~83% line rate)	At >=1 MB, switch to Simple protocol + Ring + max numChannels. Use this as exploration anchor.
Tensor partitioning (chunkSize) is decisive on PS	DynamICCL's chunkSize/numPipeOps action axis is real and load-dependent — must be in the action space, not held constant.
Tensor fusion (concat) is decisive on A2A	DynamICCL should observe message-size-after-fusion (post-MG-WFBP) as state, since tuner runs after the scheduling layer.
Model intensity I = C/D predicts scaling	Add I as a state feature; policy generalizes across workloads with same I band.
Iteration time, not bandwidth, is the right metric	Reward = -collective_completion_time (or normalized iter time); avoid algBw/busBw proxy.
Hardware decisive: PCIe vs NVLink, IB vs Ethernet	Topology must be an explicit state feature; train per-topology heads or feed topology as embedding.

Specific design priors for the RL agent:

Exploration prior on (algorithm, protocol):
- msg < 16 KiB: Tree + LL (latency-dominated; fewer hops, less overhead)
- 16 KiB - 1 MiB: Tree or Ring + LL128 (transition zone — let RL explore)
- msg >= 1 MiB: Ring + Simple (bandwidth-dominated; full pipeline)
Exploration prior on (nChannels, numThreads):
- At small message sizes (<64 KiB), nChannels=1, numThreads=128: extra channels add only setup overhead.
- At large message sizes (>=1 MiB), nChannels=4-8, numThreads=512: parallelism amortizes startup.
Reward shaping:
- Primary: r = -collective_wall_clock_us
- Optional secondary: penalize p99 latency to catch tail outliers (consistent with the survey's emphasis on iteration time, not throughput).
State features (per the survey's predictive variables):
- Message size (log-binned)
- Model intensity I (workload identity feature)
- Topology fingerprint: NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet
- Recent-collective timing window (LSTM-encoded, per Saraswati's notes)
Action-space duality (fuse-vs-partition):
- The survey's insight that fusion and partition are duals of the same parameter family supports DynamICCL's view of chunkSize and numPipeOps as a single "granularity" action axis. The RL agent should learn to pick coarse chunks for small messages (effectively fusion) and fine chunks for large messages (effectively partition).
Research positioning:
- This survey holds NCCL constant and varies the schedule above NCCL. DynamICCL inverts this: holds the schedule constant and varies NCCL itself via the tuner-plugin API. The survey's findings establish the baseline that DynamICCL should beat, and identify the workload regimes (BERT-Large on PCIe + 11 GB GPU) where the existing schedule layer gives up — exactly where tuner-level optimization has the most room.
Open-problem alignment:
- The survey's open problem #2 ("automated PS-vs-A2A selection") and open problem #4 ("adaptive fuse-vs-partition") are direct precursors of DynamICCL: replace hand-tuned heuristics with a learned policy that observes runtime state. DynamICCL's RL formulation is the natural answer to both, scoped to the NCCL configuration sub-problem.