A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li | HKUST / HKBU / Shenzhen Tech U. / HKUST | IEEE Network, Vol. 35, Issue 3, May/June 2021 | DOI: 10.1109/MNET.011.2000530

Problem

Deep learning model size and compute requirements have doubled every 3.4 months since 2012, far outpacing Moore's Law. To meet this demand, training is distributed across many GPUs, but inter-GPU and inter-node communication has become the dominant scaling bottleneck. Per Amdahl's Law, the non-parallelizable communication fraction caps achievable speedup. Prior surveys cataloged communication optimizations qualitatively but never measured them on a single hardware testbed with the same models and the same protocol — so practitioners had no way to know which optimizations actually compose, which help small vs. large messages, and which architecture (parameter server vs. all-to-all) wins for a given workload.

Core Insight

A unified three-level taxonomy of communication optimizations (learning-algorithm / system-architecture / infrastructure) plus a head-to-head quantitative bake-off of seven lossless data-parallel methods on a fixed 32-GPU 100Gb/s InfiniBand cluster reveals concrete compose-rules: tensor fusion dominates for all-to-all on small-tensor models, tensor partitioning dominates for parameter-server on large-tensor models, and model intensity I = C/D predicts which family wins.

Method

The survey organizes ~50 prior optimizations into three orthogonal levels:

Level 1: Learning Algorithm  (LOSSY)
  - Synchronization: BSP, SSP, Local SGD, ASP
  - Compression: Quantization, Sparsification, Low-rank

Level 2: System Architecture / Scheduling  (LOSSLESS)
  - Architecture: Parameter Server (PS) vs. All-to-All (A2A)
  - Scheduling:   Wait-Free Backprop (WFBP), Tensor Fusion, Tensor Partition

Level 3: Communication Infrastructure  (LOSSLESS)
  - Protocol: TCP/IP, RoCE, IB-RDMA
  - Topology: Tree, Fat-Tree, BCube, Torus

The empirical study fixes Level 1 and Level 3 (BSP synchronization, IB-RDMA), varies Level 2, and benchmarks seven combinations: {BSP, WFBP, MG-WFBP, ByteScheduler} x {PS via BytePS, A2A via Horovod}.

Experimental Setup

Component	Value
Nodes	8
GPUs per node	4 x RTX 2080Ti (11 GB)
Total GPUs	32
CPU/RAM per node	2x Xeon Gold 6230, 512 GB
Intra-node	PCIe 3.0 x16
Inter-node	100 Gb/s InfiniBand (RDMA)
Framework	PyTorch 1.4 + CUDA 10.1 + cuDNN 7.6 + NCCL 2.4.8
PS stack	BytePS 0.2.0
A2A stack	Horovod 0.19.2
Workloads	ResNet-50 (LBS=64, I=470), BERT-Base (LBS=64, I=249), BERT-Large (LBS=8, I=248)
Metric	Samples/sec; speedup over single-GPU SGD

Microbenchmarks include osu/iperf-style network probes at message sizes from 16 KB to 1 MB to characterize the latency/bandwidth wall.

Headline Quantitative Results

Network-layer baseline (raw bandwidth at 1 MB messages):

10 GbE (TCP/IP): 8.2 Gb/s
100 GbE (TCP/IP): 16.5 Gb/s
100 Gb IB (RDMA): 83.2 Gb/s

Latency penalty for small messages on IB:

16 KiB: 7.85 us (one round)
32 KiB: 10.1 us (one round; sending 2x16 KiB separately costs 15.7 us total)
This 30%+ overhead per fragmented small tensor motivates Tensor Fusion.

End-to-end speedup at 32 GPUs:

Workload	Best method	Speedup vs. 1 GPU
ResNet-50	ByteScheduler-PS	31.6x (near-linear)
BERT-Base	MG-WFBP (A2A)	23.2x
BERT-Large (RTX 2080Ti)	failed; max 1.2x at 4 GPUs	bottlenecked by 11 GB memory + PCIe
BERT-Large (V100 32 GB + NVLink reference)	3.82x at 4 GPUs	shows hardware sensitivity

Computation/Communication ratio for BERT-Large on RTX 2080Ti:

Compute = 163 ms per iter
AllReduce = 441 ms per iter (PCIe is the wall, not IB)

Composition rules empirically established:

Tensor Fusion (MG-WFBP) is essential for A2A — solves small-tensor latency.
Tensor Partitioning (ByteScheduler) wins for PS — overlaps push/pull halves.
Higher Model Intensity I = C/D scales better; ResNet-50 (I=470) scales nearly twice as well as BERT (I~248) at 32 GPUs.

Limitations

Data-parallelism only; model/pipeline/hybrid parallelism (ZeRO, Megatron-LM) are excluded from the bake-off.
Hardware-specific: RTX 2080Ti + PCIe 3.0; results would differ on A100/H100 + NVLink/NVSwitch.
Lossy compression methods (Quantization, Sparsification, Local SGD) are surveyed but not measured in the bake-off — convergence-vs-throughput trade-offs are deferred.
Compute and communication are still treated as separable phases; modern fine-grained tile-level overlap (DistFuse, MSCCL++) is out of scope.

Open Problems Called Out

Accuracy-aware compression at 1000x ratios with provable convergence.
Automated PS-vs-A2A architecture selection driven by hardware topology and model intensity.
Generic schedulers that handle the irregular DAGs created when lossy compression itself adds compute ops.
Adaptive run-time scheduling that decides fuse-vs-partition per tensor.

Relevance to DynamICCL

This survey provides DynamICCL with empirically-grounded design priors that should directly shape both the action space and the reward function:

Latency wall at small messages (16 KiB ~= 7.85 us; doubling messages costs 30%+): This quantifies the regime where Tree algorithms and LL/LL128 protocols matter most. DynamICCL's exploration prior should bias toward LL/LL128 for messages below ~64 KiB and toward Simple/Ring for messages above ~1 MB — matching the observed bandwidth crossover.
Compute-vs-comm ratio is the right reward signal: BERT-Large on PCIe 3.0 spends 441 ms in AllReduce per 163 ms of compute (2.7x). The survey shows iteration time and not raw AllReduce bandwidth is the true end-to-end metric. DynamICCL's reward should be normalized iteration time (or per-collective wall-clock) rather than a peak-bandwidth proxy.
Model Intensity I = C/D as a state feature: The survey establishes that I predicts scaling. DynamICCL's RL state should include a per-model intensity feature — this generalizes the policy across workloads with different compute-to-comm ratios.
Fuse-vs-partition is a real action dimension: ByteScheduler's tensor-partition wins by ~36% over WFBP-PS on ResNet-50, while MG-WFBP's fusion wins on BERT-Base. DynamICCL's action space should consider chunkSize / numPipeOps in coordination with algorithm choice — fusion and partition are duals of the same NCCL parameter family.
Hardware is decisive — generalization needs topology features: The same code goes from 1.2x speedup (PCIe 2080Ti) to 3.82x (NVLink V100) at 4 GPUs. DynamICCL's policy must observe interconnect identity (NVLink vs. PCIe vs. IB vs. Ethernet) as an explicit feature, not assume a fixed topology — confirms the need for topology-conditioned policy heads.
NCCL 2.4.8's small-tensor overhead is a measurement artifact DynamICCL directly inherits: the survey treats this as a fixed wall, but tuner- plugin-driven algorithm/protocol selection is precisely the lever that the survey did not pull. DynamICCL's contribution is to dynamize what this survey holds constant.