A Quantitative Survey of Communication Optimizations in Distributed Deep Learning — Detailed Summary

Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li | HKUST / HKBU / Shenzhen Tech U. / HKUST | IEEE Network, Vol. 35, Issue 3, May/June 2021 | DOI: 10.1109/MNET.011.2000530

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction

Background and motivation:

Communication wall:

Why a quantitative survey:


2. Three-Level Taxonomy

The authors propose a unified taxonomy of communication optimizations arranged top-down from algorithm to wire:

+---------------------------------------------------------------+
| Level 1: Learning Algorithm  (LOSSY)                          |
|   - Synchronization:  BSP | SSP | Local SGD | ASP             |
|   - Compression:      Quantization | Sparsification           |
|                       (e.g., 1-bit SGD, TernGrad, DGC)        |
+---------------------------------------------------------------+
| Level 2: System Architecture / Scheduling  (LOSSLESS)         |
|   - Architecture:  Parameter Server (PS) | All-to-All (A2A)   |
|   - Scheduling:    WFBP | Tensor Fusion (MG-WFBP)             |
|                    Tensor Partition (ByteScheduler)           |
+---------------------------------------------------------------+
| Level 3: Communication Infrastructure  (LOSSLESS)             |
|   - Protocol:  TCP/IP | RoCE | InfiniBand RDMA                |
|   - Topology:  Tree | Fat-Tree | BCube | Torus                |
+---------------------------------------------------------------+

2.1 Level 1 — Learning Algorithm

Synchronization variants:

Compression variants:

2.2 Level 2 — System Architecture and Scheduling

Architecture: Parameter Server (PS) vs. All-to-All (A2A):

Scheduling techniques:

2.3 Level 3 — Communication Infrastructure

Protocols:

Topologies:


3. Network Microbenchmarks (Sec. III in the paper)

The survey first characterizes raw network performance to establish the hardware envelope.

Achieved one-way throughput at 1 MB message:

Network Protocol Throughput
10 GbE TCP/IP 8.2 Gb/s
100 GbE TCP/IP 16.5 Gb/s
100 Gb IB RDMA 83.2 Gb/s

Achieved one-way throughput at 16 KB message (latency-dominated regime):

Network Throughput
10 GbE 1.2 Gb/s
100 GbE 4.6 Gb/s
100 Gb IB 16.7 Gb/s (~16.7% of line rate)

Latency at small message sizes on IB:

Message size One-way latency
16 KiB 7.85 us
32 KiB 10.1 us

4. Quantitative Bake-off (Sec. IV in the paper)

4.1 Methodology

# Method Architecture
1 BSP-PS PS, naive
2 BSP-A2A A2A, naive
3 WFBP-PS PS + Wait-Free Backprop
4 WFBP-A2A A2A + Wait-Free Backprop
5 MG-WFBP A2A + WFBP + Tensor Fusion
6 ByteScheduler-PS PS + WFBP + Partition + Priority
7 ByteScheduler-A2A A2A + WFBP + Partition + Priority
Model Local Batch Size Model Intensity I = C/D
ResNet-50 64 470
BERT-Base 64 249
BERT-Large 8 (memory-limited) 248

4.2 ResNet-50

4.3 BERT-Base

4.4 BERT-Large

4.5 Cross-Cutting Findings

Finding Quantitative evidence
Small messages waste bandwidth 16 KB on IB = 16.7% of line rate
Tensor fusion is essential for A2A 2x16 KiB = 15.7 us vs. 1x32 KiB = 10.1 us (35% saving)
Tensor partition is essential for PS ByteScheduler-PS reaches 31.6x; WFBP-PS lags by ~36% on ResNet-50
Higher I scales better ResNet-50 (I=470) reaches 31.6x; BERT-Base (I=249) reaches 23.2x
Memory + intra-node bw matter as much as inter-node BERT-Large fails on PCIe 2080Ti, succeeds on NVLink V100

5. Major Focal Point Papers Cited

System / Paper Core technique Headline result
BytePS Multi-PS load balancing Lifts PS to compete with A2A on small-tensor models
MG-WFBP Optimal tensor fusion under WFBP 23.2x on BERT-Base / 32 GPUs
ByteScheduler Priority + partition + WFBP 31.6x on ResNet-50 / 32 GPUs (near-linear)
Horovod A2A wrapper using NCCL Ring/Tree AllReduce Standard A2A baseline
Poseidon Wait-Free Backprop, original Foundation for all overlap-based work
1-bit SGD / TernGrad / DGC Quantization / sparsification 32x to 600x compression at near-baseline accuracy
NCCL (referenced) Double-binary-tree AllReduce Logarithmic latency, bandwidth-optimal at large sizes

6. Open Problems Explicitly Identified

The authors list four open questions in the survey's Discussion section:

  1. Convergence-aware extreme compression. Current sparsification reaches 600x at <1% accuracy loss; the question is whether 1000x or higher is reachable with formal convergence guarantees, especially when combined with momentum-aware optimizers.
  2. Automated PS-vs-A2A selection. No principled rule exists to choose architecture from (model, hardware) — a learned or analytical model would eliminate the need for hand-tuning.
  3. Generic schedulers for lossy DAGs. When compression itself adds compute (encode/decode kernels), the comp+comm DAG becomes irregular; existing WFBP-based schedulers do not handle this.
  4. Adaptive fuse-vs-partition. A run-time scheduler that decides per-tensor whether to merge or split would unify MG-WFBP and ByteScheduler — neither system does this dynamically.

7. Limitations of the Survey


8. Discussion of NCCL


9. Cross-Cutting Empirical Take-Aways

Take-away Derived from
100 Gb IB-RDMA achieves ~83% line rate at 1 MB; ~17% at 16 KB Sec. III network microbenchmarks
Small-message penalty is universal — fuse aggressively 16 KiB latency 7.85 us, 32 KiB 10.1 us
PS+partition wins on tensor-rich, high-I models ResNet-50 results
A2A+fusion wins on tensor-rich, mid-I models BERT-Base results
Memory + intra-node bandwidth gate large-model scaling BERT-Large failure on 2080Ti vs. V100
NCCL Ring/Tree algorithms inherit a small-message wall that no scheduler can fully hide Sec. III + Sec. IV combined

10. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters. This paper provides DynamICCL with empirically-grounded design priors — both for the action space and the reward function.

Direct mappings to DynamICCL design:

Survey finding DynamICCL design implication
16 KB IB latency 7.85 us (16.7% of line rate); doubling msg saves 35% Strong prior: at <64 KiB, prefer LL/LL128 + Tree; numChannels=1-2 is enough — don't waste channels on a latency-bound message.
1 MB IB throughput 83 Gb/s (~83% line rate) At >=1 MB, switch to Simple protocol + Ring + max numChannels. Use this as exploration anchor.
Tensor partitioning (chunkSize) is decisive on PS DynamICCL's chunkSize/numPipeOps action axis is real and load-dependent — must be in the action space, not held constant.
Tensor fusion (concat) is decisive on A2A DynamICCL should observe message-size-after-fusion (post-MG-WFBP) as state, since tuner runs after the scheduling layer.
Model intensity I = C/D predicts scaling Add I as a state feature; policy generalizes across workloads with same I band.
Iteration time, not bandwidth, is the right metric Reward = -collective_completion_time (or normalized iter time); avoid algBw/busBw proxy.
Hardware decisive: PCIe vs NVLink, IB vs Ethernet Topology must be an explicit state feature; train per-topology heads or feed topology as embedding.

Specific design priors for the RL agent:

  1. Exploration prior on (algorithm, protocol):

    • msg < 16 KiB: Tree + LL (latency-dominated; fewer hops, less overhead)
    • 16 KiB - 1 MiB: Tree or Ring + LL128 (transition zone — let RL explore)
    • msg >= 1 MiB: Ring + Simple (bandwidth-dominated; full pipeline)
  2. Exploration prior on (nChannels, numThreads):

    • At small message sizes (<64 KiB), nChannels=1, numThreads=128: extra channels add only setup overhead.
    • At large message sizes (>=1 MiB), nChannels=4-8, numThreads=512: parallelism amortizes startup.
  3. Reward shaping:

    • Primary: r = -collective_wall_clock_us
    • Optional secondary: penalize p99 latency to catch tail outliers (consistent with the survey's emphasis on iteration time, not throughput).
  4. State features (per the survey's predictive variables):

    • Message size (log-binned)
    • Model intensity I (workload identity feature)
    • Topology fingerprint: NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet
    • Recent-collective timing window (LSTM-encoded, per Saraswati's notes)
  5. Action-space duality (fuse-vs-partition):

    • The survey's insight that fusion and partition are duals of the same parameter family supports DynamICCL's view of chunkSize and numPipeOps as a single "granularity" action axis. The RL agent should learn to pick coarse chunks for small messages (effectively fusion) and fine chunks for large messages (effectively partition).
  6. Research positioning:

    • This survey holds NCCL constant and varies the schedule above NCCL. DynamICCL inverts this: holds the schedule constant and varies NCCL itself via the tuner-plugin API. The survey's findings establish the baseline that DynamICCL should beat, and identify the workload regimes (BERT-Large on PCIe + 11 GB GPU) where the existing schedule layer gives up — exactly where tuner-level optimization has the most room.
  7. Open-problem alignment:

    • The survey's open problem #2 ("automated PS-vs-A2A selection") and open problem #4 ("adaptive fuse-vs-partition") are direct precursors of DynamICCL: replace hand-tuned heuristics with a learned policy that observes runtime state. DynamICCL's RL formulation is the natural answer to both, scoped to the NCCL configuration sub-problem.