A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li | HKUST / HKBU / Shenzhen Tech U. / HKUST | IEEE Network, Vol. 35, Issue 3, May/June 2021 | DOI: 10.1109/MNET.011.2000530


Problem

Deep learning model size and compute requirements have doubled every 3.4 months since 2012, far outpacing Moore's Law. To meet this demand, training is distributed across many GPUs, but inter-GPU and inter-node communication has become the dominant scaling bottleneck. Per Amdahl's Law, the non-parallelizable communication fraction caps achievable speedup. Prior surveys cataloged communication optimizations qualitatively but never measured them on a single hardware testbed with the same models and the same protocol — so practitioners had no way to know which optimizations actually compose, which help small vs. large messages, and which architecture (parameter server vs. all-to-all) wins for a given workload.


Core Insight

A unified three-level taxonomy of communication optimizations (learning-algorithm / system-architecture / infrastructure) plus a head-to-head quantitative bake-off of seven lossless data-parallel methods on a fixed 32-GPU 100Gb/s InfiniBand cluster reveals concrete compose-rules: tensor fusion dominates for all-to-all on small-tensor models, tensor partitioning dominates for parameter-server on large-tensor models, and model intensity I = C/D predicts which family wins.


Method

The survey organizes ~50 prior optimizations into three orthogonal levels:

Level 1: Learning Algorithm  (LOSSY)
  - Synchronization: BSP, SSP, Local SGD, ASP
  - Compression: Quantization, Sparsification, Low-rank

Level 2: System Architecture / Scheduling  (LOSSLESS)
  - Architecture: Parameter Server (PS) vs. All-to-All (A2A)
  - Scheduling:   Wait-Free Backprop (WFBP), Tensor Fusion, Tensor Partition

Level 3: Communication Infrastructure  (LOSSLESS)
  - Protocol: TCP/IP, RoCE, IB-RDMA
  - Topology: Tree, Fat-Tree, BCube, Torus

The empirical study fixes Level 1 and Level 3 (BSP synchronization, IB-RDMA), varies Level 2, and benchmarks seven combinations: {BSP, WFBP, MG-WFBP, ByteScheduler} x {PS via BytePS, A2A via Horovod}.


Experimental Setup

Component Value
Nodes 8
GPUs per node 4 x RTX 2080Ti (11 GB)
Total GPUs 32
CPU/RAM per node 2x Xeon Gold 6230, 512 GB
Intra-node PCIe 3.0 x16
Inter-node 100 Gb/s InfiniBand (RDMA)
Framework PyTorch 1.4 + CUDA 10.1 + cuDNN 7.6 + NCCL 2.4.8
PS stack BytePS 0.2.0
A2A stack Horovod 0.19.2
Workloads ResNet-50 (LBS=64, I=470), BERT-Base (LBS=64, I=249), BERT-Large (LBS=8, I=248)
Metric Samples/sec; speedup over single-GPU SGD

Microbenchmarks include osu/iperf-style network probes at message sizes from 16 KB to 1 MB to characterize the latency/bandwidth wall.


Headline Quantitative Results

Network-layer baseline (raw bandwidth at 1 MB messages):

Latency penalty for small messages on IB:

End-to-end speedup at 32 GPUs:

Workload Best method Speedup vs. 1 GPU
ResNet-50 ByteScheduler-PS 31.6x (near-linear)
BERT-Base MG-WFBP (A2A) 23.2x
BERT-Large (RTX 2080Ti) failed; max 1.2x at 4 GPUs bottlenecked by 11 GB memory + PCIe
BERT-Large (V100 32 GB + NVLink reference) 3.82x at 4 GPUs shows hardware sensitivity

Computation/Communication ratio for BERT-Large on RTX 2080Ti:

Composition rules empirically established:


Limitations


Open Problems Called Out

  1. Accuracy-aware compression at 1000x ratios with provable convergence.
  2. Automated PS-vs-A2A architecture selection driven by hardware topology and model intensity.
  3. Generic schedulers that handle the irregular DAGs created when lossy compression itself adds compute ops.
  4. Adaptive run-time scheduling that decides fuse-vs-partition per tensor.

Relevance to DynamICCL

This survey provides DynamICCL with empirically-grounded design priors that should directly shape both the action space and the reward function:

  1. Latency wall at small messages (16 KiB ~= 7.85 us; doubling messages costs 30%+): This quantifies the regime where Tree algorithms and LL/LL128 protocols matter most. DynamICCL's exploration prior should bias toward LL/LL128 for messages below ~64 KiB and toward Simple/Ring for messages above ~1 MB — matching the observed bandwidth crossover.

  2. Compute-vs-comm ratio is the right reward signal: BERT-Large on PCIe 3.0 spends 441 ms in AllReduce per 163 ms of compute (2.7x). The survey shows iteration time and not raw AllReduce bandwidth is the true end-to-end metric. DynamICCL's reward should be normalized iteration time (or per-collective wall-clock) rather than a peak-bandwidth proxy.

  3. Model Intensity I = C/D as a state feature: The survey establishes that I predicts scaling. DynamICCL's RL state should include a per-model intensity feature — this generalizes the policy across workloads with different compute-to-comm ratios.

  4. Fuse-vs-partition is a real action dimension: ByteScheduler's tensor-partition wins by ~36% over WFBP-PS on ResNet-50, while MG-WFBP's fusion wins on BERT-Base. DynamICCL's action space should consider chunkSize / numPipeOps in coordination with algorithm choice — fusion and partition are duals of the same NCCL parameter family.

  5. Hardware is decisive — generalization needs topology features: The same code goes from 1.2x speedup (PCIe 2080Ti) to 3.82x (NVLink V100) at 4 GPUs. DynamICCL's policy must observe interconnect identity (NVLink vs. PCIe vs. IB vs. Ethernet) as an explicit feature, not assume a fixed topology — confirms the need for topology-conditioned policy heads.

  6. NCCL 2.4.8's small-tensor overhead is a measurement artifact DynamICCL directly inherits: the survey treats this as a fixed wall, but tuner- plugin-driven algorithm/protocol selection is precisely the lever that the survey did not pull. DynamICCL's contribution is to dynamize what this survey holds constant.