Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey — Detailed Summary

Zhenheng Tang (HKBU), Shaohuai Shi (HIT-Shenzhen), Wei Wang, Bo Li (HKUST), Xiaowen Chu (HKUST-GZ) | arXiv:2003.06307v2 | Sep 2023 | 35 pages | ACM CSUR-style submission

This summary is structured around the survey's central artifact: the four-dimension taxonomy of communication-efficient distributed DL. Within each branch I label each concept as either:

The final section maps the entire taxonomy to DynamICCL.


0. Paper-Level Summary

0.1 Abstract (verbatim distillation)

Distributed deep learning is bottlenecked by communication. The survey proposes a four-dimension taxonomy — communication synchronization, system architectures, compression techniques, and parallelism of communication and computing — investigates state-of-the-art works in each, compares convergence rates analytically and empirically, and offers extrapolated future directions.

0.2 What is new vs. prior surveys (Section 1.1)

This survey claims novelty in:

  1. Demystifying communication compression in depth (largely missing in priors).
  2. Comparing convergence bounds across all four taxonomy dimensions in a single table (Table 9).
  3. Running uniform empirical benchmarks (FedML + MPI framework) over many algorithms with 4–32 workers.

0.3 Benchmark framework and configuration (Section 1.3)


1. Background: The Optimization Problem (Section 2)

The base optimization is:

min_x  E_xi~D [ F(x; xi) ]

solved by mini-batch SGD:

G_t(x_t) = grad F_t(x_t; xi_t)
x_{t+1}  = x_t - gamma * G_t(x_t)

BSP-SGD distributes the computation:

G_{i,t}(x_t) = grad F_{i,t}(x_t; xi_{i,t})
x_{t+1}      = x_t - (gamma / n) * sum_{i=1..n} G_{i,t}(x_t)

Every algorithm in the survey is a deviation from this BSP-SGD baseline along one or more taxonomy dimensions.


2. Dimension 1 — Communication Synchronization (Section 3)

2.1 Taxonomy and timeline

                     | -- Computation --  | -- Communication -- | Update |
BSP   :  All workers wait at a barrier; identical models everywhere.
SSP   :  Bounded staleness s; faster workers may run ahead by up to s steps.
ASP   :  No barrier; PS updates whenever any worker arrives.
Local :  Each worker runs tau local steps, then averages.

(Reproducing Fig. 3 of the paper.)

2.2 Knobs vs. design choices in this dimension

Concept Type Rationale
Choice of BSP / SSP / ASP / Local [DESIGN] Set in the training script (e.g., DDP vs. async PS) — fixed before NCCL is invoked.
Staleness bound s in SSP [KNOB] Could be tuned at runtime (Chen et al. R2SP, backup workers).
Local steps tau in Local-SGD [KNOB] Survey shows tau in {2,4,8,16} — final accuracy roughly invariant; communication frequency varies linearly. Tunable.
Backup-worker count (Chen et al. [31]) [KNOB] Number of stragglers to drop.
FedAvg client-sampling fraction [KNOB] Federated-only setting; not relevant to HPC training.

2.3 Mathematical formulations

SSP update (Eq. 6):

x_{i,t+1} = x_0 - gamma * [ pre-window updates ] - gamma * [ in-window updates ] - gamma * [ read-my-writes ]

ASP update (Eq. 7):

x_{t+1} = x_t - gamma * sum_i G_{i, t-tau_{i,k}}(x_{i, t-tau_{k,i}})

Local-SGD (Eq. 8):

x_{i,t+1} = x_{i,t} - gamma * G_{i,t}(x_{i,t})                if t+1 not in I_T
            x_{i,t} - gamma * (1/n) * sum_i G_{i,t}(x_{i,t})  if t+1 in I_T

2.4 Empirical observations (Tables 2, 3, 4, 5)

Table 5 (relative-level summary):

Architecture Sync Model Consistency Comm. Frequency Comm. Congestion Convergence
PS BSP high high high stable
PS SSP normal high normal normal
PS ASP low high low unstable
PS Local normal low high unstable
All-Reduce BSP high high low easy
All-Reduce Local normal low low stable
Gossip BSP low high low stable
Gossip ASP low high low unstable
Gossip Local low low low stable

2.5 Open problems flagged


3. Dimension 2 — System Architecture (Section 4)

3.1 Three architectures

(a) Parameter Server          (b) All-Reduce             (c) Gossip
   +---------+                   +---+ -- +---+            +---+ ~ +---+
   | Servers |                   | W |    | W |            | W |   | W |
   +----+----+                   +-+-+    +-+-+            +-+-+   +-+-+
        |                          |        |                |       |
   +-+--+--+-+                     +--------+               (peer-to-peer
   |W|  |W| |W|                  (collective)                graph; symmetric
   +-+  +-+ +-+                                              doubly stochastic
                                                             matrix W)

3.2 Parameter Server (Section 4.1)

Concept Type
PS vs. All-Reduce vs. Gossip [DESIGN]
Number of parameter servers [KNOB]
Worker-relevance threshold (Wang) [KNOB]
In-network aggregation (switches) [DESIGN] (hardware feature)

3.3 All-Reduce (Section 4.2)

The survey's Table 6 — communication cost of representative All-Reduce algorithms for an N-dim vector across n nodes (alpha = latency, beta = inverse bandwidth):

Algorithm Latency Bandwidth
Binary tree 2 alpha log n 2 beta N log n
Recursive doubling alpha log n beta N log n
Ring 2(n-1) alpha 2(n-1)/n * beta * N

Algorithms covered:

Concept Type
Choice of Ring vs. Tree vs. Recursive [KNOB]
Choice of hierarchical vs. flat [KNOB]
Number of hierarchies / hierarchy mapping [KNOB]
2D-Torus vs. BCube vs. Fat-Tree topology [DESIGN] (physical)
Tree branching factor (binary vs. m-ary) [KNOB]

This is the cell where DynamICCL operates. The survey explicitly notes the trade-off:

"for some small messages or small-scale clusters, recursive doubling or ring-based algorithms would be better" which mirrors the algorithm-vs-message-size trade-off DynamICCL learns.

3.4 Gossip (Section 4.3)

Concept Type
Gossip vs. centralized arch [DESIGN]
Mixing matrix W [DESIGN]
Number of peers per round [KNOB]
Random vs. deterministic peer pick [KNOB]

3.5 Empirical comparison (Table 7)

BSP-SGD (PS) vs DP-SGD (Gossip) on ResNet-20:


4. Dimension 3 — Compression (Sections 5 and 6)

4.1 Quantization (Section 5)

Original 32-bit gradient  -> Quant() -> low-bit gradient -> Unquant() -> approximate gradient

Update rule (Eqs. 9–11):

G_quant_{i,t} = Quant( G_{i,t} + delta_{i,t} )
delta_{i,t}    = G_{i,t} - Unquant( G_quant_{i,t} )    (error feedback)
x_{t+1}        = x_t - gamma * (1/n) * sum_i G_quant_{i,t}

Quantization methods covered:

Method Bits Idea
1-bit SGD [Sei et al. 2014] 1 Sign + threshold; speech apps; 10x speedup
QSGD [Alistarh 2016] family Stochastic quantization unbiased estimator with bit-budget knob
TernGrad [Wen 2017] 2 bits Ternary {-1,0,+1} with scalar-sharing & layer-wise scalars
SignSGD / signSGD-MV [Bernstein] 1 Sign + majority vote; convergence proofs
DIANA [Mishchenko 2019] varies Block-wise quantization
SRQ + VLC [Suresh 2017] varies Random rotation + Huffman coding
Adaptive quant [Faghri / Jhunjhunwala] adaptive Adjust bits during training
Concept Type
Quantization vs. no compression [DESIGN] (chosen by user / framework)
Number of bits b [KNOB] (per-tensor or global)
Layer-wise vs. global scaling [KNOB]
Use of error feedback [KNOB]
Quantizer family (uniform, dither, ternary) [DESIGN]

Theoretical maximum compression is 32x (single-precision FP).

4.2 Sparsification (Section 6)

Goal: send only k of N coordinates. Compression up to 1000x reported.

Sub-taxonomy (Section 6 introduction):

Sparsification
   |
   +-- 6.1 Random sparsification (Random-k, Random Mask, Subsampling)
   +-- 6.2 Deterministic sparsification
   |       +-- 6.2.1 Fixed Threshold (Strom)
   |       +-- 6.2.2 Adaptive Threshold (Top-k, AdaComp, gTop-k, SBC, STC)
   +-- 6.3 Coordinate Descent (BCD, IBCD)
   +-- 6.4 Proximal methods (L0/L1 regularization)

Representative methods:

Method Idea
Random-k Pick k random indices each iteration
Random Mask Pre-defined random sparsity pattern, regenerated per iteration
Top-k [Aji&Heafield, Lin DGC] Pick k largest
gTop-k [Shi] Top-k applied a second time after global aggregation
AdaComp [Chen] Self-adapting compression rate per layer
SBC [Sattler] Sparse Binary Compression: sparsify + sign average + quantize
STC [Sattler] Sparse Ternary Compression — same idea for federated learning
Truncated grad [Langford] Threshold-based sparsity (online learning origin)
Concept Type
Sparsification vs. no compression [DESIGN]
k value or sparsity ratio [KNOB] (per-tensor or global)
Fixed-threshold value [KNOB]
Random-k vs. Top-k vs. Threshold [DESIGN]
Use of error feedback (EF-Top-k) [KNOB] (essential at high compression)
Layer-wise vs. global threshold [KNOB]

The survey notes (open problem 3): adaptive per-layer / per-peer compression ratios are an open research direction.

4.3 Empirical comparison (Table 8)

ResNet-20 / 32 workers / gamma=0.1:

Compression scheme Compression Ratio Final Accuracy
BSP, no compression 1 89.21%
BSP + quant (16-bit) 2 89.34%
BSP + quant (2-bit) 16 85.37%
BSP + Top-k 10 86.75%
BSP + Top-k 100 77.66%
BSP + Top-k 1000 61.98%
BSP + EF-Top-k 10 88.65%
BSP + EF-Top-k 100 88.08%
BSP + EF-Top-k 1000 87.76%
DPSGD (gossip, no comp) 1 88.99%
DCD-PSGD (gossip + comp 4) 4 85.78%
CHOCO-SGD (gossip + comp 100) 100 89.00%

Key empirical insight: error feedback rescues sparsification at extreme compression; gossip + compression (CHOCO-SGD) at 100x matches BSP at 1x.


5. Dimension 4 — Computation/Communication Parallelism (Section 7)

5.1 Pipelining

Backward layers:   |bwd_3|bwd_2|bwd_1|
Communication:           |c_3 |c_2 |c_1|   <- WFBP overlaps c_l with bwd_{l-1}

5.2 Scheduling

Concept Type
WFBP vs. blocking comm [DESIGN] (framework feature)
Merge-gradient (tensor fusion) threshold [KNOB]
Tensor partition size [KNOB]
Priority order of layers [KNOB]
Concurrent-collective scheduling [DESIGN]

5.3 Open issues


6. Convergence Analysis (Section 8)

The survey collects standard assumptions:

  1. L-Lipschitz gradient.
  2. Unbiased stochastic gradient.
  3. Bounded variance: E ||grad F - grad f||^2 <= sigma^2.
  4. Bounded second moment: E ||grad F||^2 <= M^2.
  5. (Optional) mu-strong convexity.

For gossip:

For compression:

Selected convergence rates (Table 9 of survey):

Architecture Sync Compression Convex Non-convex
PS / AR BSP None O(1/T) O(1/sqrt T)
PS / AR BSP Quant O(1/T) O(1/sqrt T)
PS / AR BSP Spars. O(1/T) O(1/sqrt T)
PS SSP None -- O(1/sqrt T)
PS ASP None O(1/T) O(1/sqrt T)
PS / AR LocalSGD None O(1/T) O(1/sqrt T)
Gossip BSP None -- O(1/sqrt T)
Gossip BSP Quant -- O(1/sqrt T)
Gossip ASP None -- O(1/sqrt T)

[INFO] All four taxonomy combinations recover the same asymptotic rate O(1/sqrt T) for non-convex problems — the difference is in the constant factors (variance, compression contraction d, staleness s, peer count, etc.) that govern the practically-important wall-clock time.


7. Auxiliary Technologies (Section 9)

These are orthogonal correctness/convergence helpers that plug into any compression scheme. All are [KNOB] or [DESIGN] choices that augment the base compression.

7.1 Error Accumulation (9.1)

Step recipe:

C_{i,t} = Sparse(v_{i,t-1} + grad_{i,t})
v_{i,t} = (v_{i,t-1} + grad_{i,t}) - C_{i,t}    [residual carried forward]
x_{t+1} = x_t - gamma * (1/n) * sum_i C_{i,t}

Used in 1-bit SGD, EF-SignSGD, ECQ-SGD, EF-Top-k, etc.

[KNOB]: turn on/off error feedback; controls whether high-compression schemes will converge.

7.2 Momentum Correction (9.2)

DGC-style momentum applied to the residual error vector:

u_{i,t} = m * u_{i,t-1} + grad_{i,t}
v_{i,t} = v_{i,t-1} + u_{i,t}
x_{t+1} = x_t - gamma * sum_i sparse(v_{i,t})

7.3 Low-rank Decomposition (9.3)

7.4 Local Gradient Clipping (9.4)

7.5 Warm-up Training (9.5)


8. Conclusion and Future Directions (Section 10)

The survey explicitly enumerates four open problems:

  1. Foundation model training: do current comm-efficient methods scale to GPT-3 / GShard / RecSys-class models?
  2. Higher compression level: above 1000x without accuracy loss?
  3. Adaptive compression: per-layer, per-tensor, or per-peer compression ratios chosen automatically.
  4. Fault-tolerant algorithms: handling stragglers, network congestion, and worker failures in heterogeneous deployments.

(3) and (4) are directly aligned with DynamICCL's research thesis: an RL agent that adapts collective configuration to heterogeneous, congested, runtime-variable conditions.


9. Knob-vs-Design Master Table (across all four dimensions)

This is the practical reference for DynamICCL design.

+---------------------+----------+------+--------------------------------+
| Decision            | Type     | Who  | Notes                          |
+---------------------+----------+------+--------------------------------+
| BSP/SSP/ASP/Local   | [DESIGN] | User | Set in training script (DDP=BSP)|
| s (staleness bound) | [KNOB]   | RL?  | Only in SSP regimes            |
| tau (local steps)   | [KNOB]   | RL?  | Only in Local-SGD regimes      |
| Architecture        | [DESIGN] | User | PS/AR/Gossip — DDP=AR          |
| AR algorithm        | [KNOB]   | RL   | Ring / Tree / Recursive        |
| AR protocol         | [KNOB]   | RL   | LL / LL128 / Simple            |
| nChannels           | [KNOB]   | RL   | NCCL channels                  |
| numThreads          | [KNOB]   | RL   | NCCL threads per channel       |
| chunkSize           | [KNOB]   | RL   | NCCL chunk granularity         |
| Hierarchical AR     | [KNOB]   | RL?  | Whether to use hierarchy       |
| numHierarchies      | [KNOB]   | RL?  | Levels of hierarchy            |
| Topology            | [DESIGN] | HW   | Torus / BCube / Fat-Tree       |
| Quantization on/off | [DESIGN] | User | Library / training script      |
| Quantization bits   | [KNOB]   | User | If on                          |
| Sparsification on/off| [DESIGN]| User | Library / training script      |
| Top-k k value       | [KNOB]   | User | If on                          |
| Error feedback      | [KNOB]   | User | If compression on              |
| WFBP overlap        | [DESIGN] | Frmwk| Built into Horovod / DDP       |
| Tensor fusion thresh| [KNOB]   | User | Horovod_FUSION_THRESHOLD       |
| Priority schedule   | [KNOB]   | User | BytePS / P3 / TicTac           |
+---------------------+----------+------+--------------------------------+

10. Mapping the Taxonomy to DynamICCL

DynamICCL's RL agent (Agent-2) outputs <algo, proto, nChannels, numThreads> for each NCCL collective call. In taxonomy terms:

DynamICCL operates ENTIRELY INSIDE this cell of the survey:

   Architecture  = All-Reduce         (FIXED by user)
   Synchronization = BSP              (FIXED by user)
   Compression  = None                (FIXED by user)
   Pipelining   = WFBP                (FIXED by framework)

   AR algorithm choice                <-- DynamICCL ACTION
   AR protocol choice                 <-- DynamICCL ACTION
   nChannels / numThreads             <-- DynamICCL ACTION
   chunkSize                          <-- adjacent action / next experiment

What DynamICCL gains from this survey

  1. Analytical reward shaping: Table 6 gives closed-form latency and bandwidth costs for Ring / Tree / Recursive. The agent's optimal policy should converge toward these analytically-predicted regions, with deviations capturing real runtime effects (congestion, contention) that the analytical model omits.

  2. Action-space justification: The survey explicitly notes Ring is bandwidth-optimal but latency-linear in n; Tree is logarithmic-latency. This justifies including BOTH in DynamICCL's action set rather than defaulting to one — exactly the point at which NCCL's heuristic switches between protocols and where the heuristic frequently mis-decides.

  3. Confirmed orthogonality: Synchronization (BSP), architecture (AR), compression (off), and pipelining (WFBP) are the user / framework's choice. DynamICCL does NOT need to model them — they are exogenous constants, simplifying the RL state space.

  4. State-feature ideas: Table 5 highlights "communication congestion" as a system-level signal that varies by architecture and sync. DynamICCL's state can include a real-time congestion estimate (per the LSTM detector in Saraswati's notes) without modeling the underlying architecture choice.

What this survey does NOT address (DynamICCL's research gap)

Future-extension targets within the taxonomy

If DynamICCL's action space expands beyond pure NCCL knobs:

Future axis Taxonomy cell unlocked
Tune tau for Local-SGD / FedAvg Sync dim — knob 2.2
Tune compression k per tensor Compression dim — open problem (3)
Tune tensor-fusion threshold Pipelining dim — knob 5.2
Tune number of AR hierarchies Architecture dim — knob 3.3
Switch BSP vs Local-SGD at runtime Sync dim (very ambitious)

The survey thus serves both as a map of where DynamICCL currently lives (the AR + BSP cell, with NCCL-specific sub-knobs the survey does not detail) and as a map of where DynamICCL could expand (the other taxonomy cells, each with their own knob-vs-design decomposition that the same RL framework could in principle absorb).