Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey — Detailed Summary

Zhenheng Tang (HKBU), Shaohuai Shi (HIT-Shenzhen), Wei Wang, Bo Li (HKUST), Xiaowen Chu (HKUST-GZ) | arXiv:2003.06307v2 | Sep 2023 | 35 pages | ACM CSUR-style submission

This summary is structured around the survey's central artifact: the four-dimension taxonomy of communication-efficient distributed DL. Within each branch I label each concept as either:

[KNOB] a parameter that an external optimizer (such as DynamICCL's RL agent) could plausibly set at runtime;
[DESIGN] a system-design choice baked into the framework / library that DynamICCL inherits and cannot change;
[INFO] a theoretical or empirical observation that informs the agent but is not actuated.

The final section maps the entire taxonomy to DynamICCL.

0. Paper-Level Summary

0.1 Abstract (verbatim distillation)

Distributed deep learning is bottlenecked by communication. The survey proposes a four-dimension taxonomy — communication synchronization, system architectures, compression techniques, and parallelism of communication and computing — investigates state-of-the-art works in each, compares convergence rates analytically and empirically, and offers extrapolated future directions.

0.2 What is new vs. prior surveys (Section 1.1)

Peteiro-Barral et al. (2013): general distributed ML for big data.
Xing et al. (2016): synchronization, scheduling, balancing, topologies.
Ben-Nun & Hoefler (2019): DNN operators and parallelism (concurrency-focused).
Guo (2018): quantized neural networks (quantization-only).
Zhang et al. (2020): brief overview of large-scale DDL systems.

This survey claims novelty in:

Demystifying communication compression in depth (largely missing in priors).
Comparing convergence bounds across all four taxonomy dimensions in a single table (Table 9).
Running uniform empirical benchmarks (FedML + MPI framework) over many algorithms with 4–32 workers.

0.3 Benchmark framework and configuration (Section 1.3)

Built on FedML and MPI for Python.
Tasks: CIFAR-10 / ResNet-20 (image classification) and Shakespeare / stacked character-level LSTM (NLP).
Worker count: 4, 8, 16, 32 RTX 2080 Ti, PyTorch v1.7.

1. Background: The Optimization Problem (Section 2)

The base optimization is:

min_x  E_xi~D [ F(x; xi) ]

solved by mini-batch SGD:

G_t(x_t) = grad F_t(x_t; xi_t)
x_{t+1}  = x_t - gamma * G_t(x_t)

BSP-SGD distributes the computation:

G_{i,t}(x_t) = grad F_{i,t}(x_t; xi_{i,t})
x_{t+1}      = x_t - (gamma / n) * sum_{i=1..n} G_{i,t}(x_t)

Every algorithm in the survey is a deviation from this BSP-SGD baseline along one or more taxonomy dimensions.

2. Dimension 1 — Communication Synchronization (Section 3)

2.1 Taxonomy and timeline

                     | -- Computation --  | -- Communication -- | Update |
BSP   :  All workers wait at a barrier; identical models everywhere.
SSP   :  Bounded staleness s; faster workers may run ahead by up to s steps.
ASP   :  No barrier; PS updates whenever any worker arrives.
Local :  Each worker runs tau local steps, then averages.

(Reproducing Fig. 3 of the paper.)

2.2 Knobs vs. design choices in this dimension

Concept	Type	Rationale
Choice of BSP / SSP / ASP / Local	[DESIGN]	Set in the training script (e.g., DDP vs. async PS) — fixed before NCCL is invoked.
Staleness bound `s` in SSP	[KNOB]	Could be tuned at runtime (Chen et al. R2SP, backup workers).
Local steps `tau` in Local-SGD	[KNOB]	Survey shows tau in {2,4,8,16} — final accuracy roughly invariant; communication frequency varies linearly. Tunable.
Backup-worker count (Chen et al. [31])	[KNOB]	Number of stragglers to drop.
FedAvg client-sampling fraction	[KNOB]	Federated-only setting; not relevant to HPC training.

2.3 Mathematical formulations

SSP update (Eq. 6):

x_{i,t+1} = x_0 - gamma * [ pre-window updates ] - gamma * [ in-window updates ] - gamma * [ read-my-writes ]

ASP update (Eq. 7):

x_{t+1} = x_t - gamma * sum_i G_{i, t-tau_{i,k}}(x_{i, t-tau_{k,i}})

Local-SGD (Eq. 8):

x_{i,t+1} = x_{i,t} - gamma * G_{i,t}(x_{i,t})                if t+1 not in I_T
            x_{i,t} - gamma * (1/n) * sum_i G_{i,t}(x_{i,t})  if t+1 in I_T

2.4 Empirical observations (Tables 2, 3, 4, 5)

BSP-SGD test accuracy on ResNet-20 (gamma=0.1): 4 workers 91.25 -> 32 workers 89.21 (mild large-batch generalization gap).
ASP-SGD diverges at 32 workers under most learning rates (drops to 0.00%).
Local-SGD with tau=2..16 is essentially indistinguishable in final accuracy.
FedAvg lags noticeably (62.41 -> 40.23 for gamma=0.001 / 4 -> 32 workers).

Table 5 (relative-level summary):

Architecture	Sync	Model Consistency	Comm. Frequency	Comm. Congestion	Convergence
PS	BSP	high	high	high	stable
PS	SSP	normal	high	normal	normal
PS	ASP	low	high	low	unstable
PS	Local	normal	low	high	unstable
All-Reduce	BSP	high	high	low	easy
All-Reduce	Local	normal	low	low	stable
Gossip	BSP	low	high	low	stable
Gossip	ASP	low	high	low	unstable
Gossip	Local	low	low	low	stable

2.5 Open problems flagged

Choosing s and tau adaptively under heterogeneous device speeds (the congestion/straggler problem).
ASP's accuracy collapse at scale is unresolved.

3. Dimension 2 — System Architecture (Section 4)

3.1 Three architectures

(a) Parameter Server          (b) All-Reduce             (c) Gossip
   +---------+                   +---+ -- +---+            +---+ ~ +---+
   | Servers |                   | W |    | W |            | W |   | W |
   +----+----+                   +-+-+    +-+-+            +-+-+   +-+-+
        |                          |        |                |       |
   +-+--+--+-+                     +--------+               (peer-to-peer
   |W|  |W| |W|                  (collective)                graph; symmetric
   +-+  +-+ +-+                                              doubly stochastic
                                                             matrix W)

3.2 Parameter Server (Section 4.1)

DistBelief, GeePS, ps-lite — extensively studied; main weakness is server congestion.
Server side accelerations: programmable switches doing in-network aggregation (Sapio et al.); user-defined filters (Li et al.).

Concept	Type
PS vs. All-Reduce vs. Gossip	[DESIGN]
Number of parameter servers	[KNOB]
Worker-relevance threshold (Wang)	[KNOB]
In-network aggregation (switches)	[DESIGN] (hardware feature)

3.3 All-Reduce (Section 4.2)

The survey's Table 6 — communication cost of representative All-Reduce algorithms for an N-dim vector across n nodes (alpha = latency, beta = inverse bandwidth):

Algorithm	Latency	Bandwidth
Binary tree	2 alpha log n	2 beta N log n
Recursive doubling	alpha log n	beta N log n
Ring	2(n-1) alpha	2(n-1)/n * beta * N

Algorithms covered:

Ring [Patarasuk & Yuan 2009]: bandwidth-optimal; latency O(n). Used in Gloo and earlier NCCL.
Double Binary Trees [Sanders 2009]: full bandwidth, O(log n) latency. Adopted in NCCL >= 2.4.
Hierarchical All-Reduce [Goyal 2017, Jia 2018]: reduces latency by factor equal to number of hierarchies.
2D-Torus All-Reduce [Jouppi et al., Mikami et al.]: latency reduction via torus topology.
BML [Wang 2018]: tailored to BCube.
BLink [Wang 2020], PLink [Luo 2020]: topology-aware adaptive All-Reduce for cloud / heterogeneous interconnects.

Concept	Type
Choice of Ring vs. Tree vs. Recursive	[KNOB]
Choice of hierarchical vs. flat	[KNOB]
Number of hierarchies / hierarchy mapping	[KNOB]
2D-Torus vs. BCube vs. Fat-Tree topology	[DESIGN] (physical)
Tree branching factor (binary vs. m-ary)	[KNOB]

This is the cell where DynamICCL operates. The survey explicitly notes the trade-off:

"for some small messages or small-scale clusters, recursive doubling or ring-based algorithms would be better" which mirrors the algorithm-vs-message-size trade-off DynamICCL learns.

3.4 Gossip (Section 4.3)

Each worker has a local model; communicates with neighbors per a graph W.
Trade-off: lower congestion at the cost of longer convergence ("consensus").
Methods: DPSGD [Lian et al.], SGP+PUSHSUM [Assran et al.], CHOCO-SGD [Koloskova], COLA [He et al.].
Asymmetric gossip (PUSHSUM) avoids deadlocks/symmetric-comm requirement.

Concept	Type
Gossip vs. centralized arch	[DESIGN]
Mixing matrix W	[DESIGN]
Number of peers per round	[KNOB]
Random vs. deterministic peer pick	[KNOB]

3.5 Empirical comparison (Table 7)

BSP-SGD (PS) vs DP-SGD (Gossip) on ResNet-20:

4 workers: 91.25 vs 91.08 (essentially equal).
32 workers: 89.21 vs 88.99 (gossip slightly behind).

4. Dimension 3 — Compression (Sections 5 and 6)

4.1 Quantization (Section 5)

Original 32-bit gradient  -> Quant() -> low-bit gradient -> Unquant() -> approximate gradient

Update rule (Eqs. 9–11):

G_quant_{i,t} = Quant( G_{i,t} + delta_{i,t} )
delta_{i,t}    = G_{i,t} - Unquant( G_quant_{i,t} )    (error feedback)
x_{t+1}        = x_t - gamma * (1/n) * sum_i G_quant_{i,t}

Quantization methods covered:

Method	Bits	Idea
1-bit SGD [Sei et al. 2014]	1	Sign + threshold; speech apps; 10x speedup
QSGD [Alistarh 2016]	family	Stochastic quantization unbiased estimator with bit-budget knob
TernGrad [Wen 2017]	2 bits	Ternary {-1,0,+1} with scalar-sharing & layer-wise scalars
SignSGD / signSGD-MV [Bernstein]	1	Sign + majority vote; convergence proofs
DIANA [Mishchenko 2019]	varies	Block-wise quantization
SRQ + VLC [Suresh 2017]	varies	Random rotation + Huffman coding
Adaptive quant [Faghri / Jhunjhunwala]	adaptive	Adjust bits during training

Concept	Type
Quantization vs. no compression	[DESIGN] (chosen by user / framework)
Number of bits `b`	[KNOB] (per-tensor or global)
Layer-wise vs. global scaling	[KNOB]
Use of error feedback	[KNOB]
Quantizer family (uniform, dither, ternary)	[DESIGN]

Theoretical maximum compression is 32x (single-precision FP).

4.2 Sparsification (Section 6)

Goal: send only k of N coordinates. Compression up to 1000x reported.

Sub-taxonomy (Section 6 introduction):

Sparsification
   |
   +-- 6.1 Random sparsification (Random-k, Random Mask, Subsampling)
   +-- 6.2 Deterministic sparsification
   |       +-- 6.2.1 Fixed Threshold (Strom)
   |       +-- 6.2.2 Adaptive Threshold (Top-k, AdaComp, gTop-k, SBC, STC)
   +-- 6.3 Coordinate Descent (BCD, IBCD)
   +-- 6.4 Proximal methods (L0/L1 regularization)

Representative methods:

Method	Idea
Random-k	Pick k random indices each iteration
Random Mask	Pre-defined random sparsity pattern, regenerated per iteration
Top-k [Aji&Heafield, Lin DGC]	Pick k largest
gTop-k [Shi]	Top-k applied a second time after global aggregation
AdaComp [Chen]	Self-adapting compression rate per layer
SBC [Sattler]	Sparse Binary Compression: sparsify + sign average + quantize
STC [Sattler]	Sparse Ternary Compression — same idea for federated learning
Truncated grad [Langford]	Threshold-based sparsity (online learning origin)

Concept	Type
Sparsification vs. no compression	[DESIGN]
`k` value or sparsity ratio	[KNOB] (per-tensor or global)
Fixed-threshold value	[KNOB]
Random-k vs. Top-k vs. Threshold	[DESIGN]
Use of error feedback (EF-Top-k)	[KNOB] (essential at high compression)
Layer-wise vs. global threshold	[KNOB]

The survey notes (open problem 3): adaptive per-layer / per-peer compression ratios are an open research direction.

4.3 Empirical comparison (Table 8)

ResNet-20 / 32 workers / gamma=0.1:

Compression scheme	Compression Ratio	Final Accuracy
BSP, no compression	1	89.21%
BSP + quant (16-bit)	2	89.34%
BSP + quant (2-bit)	16	85.37%
BSP + Top-k	10	86.75%
BSP + Top-k	100	77.66%
BSP + Top-k	1000	61.98%
BSP + EF-Top-k	10	88.65%
BSP + EF-Top-k	100	88.08%
BSP + EF-Top-k	1000	87.76%
DPSGD (gossip, no comp)	1	88.99%
DCD-PSGD (gossip + comp 4)	4	85.78%
CHOCO-SGD (gossip + comp 100)	100	89.00%

Key empirical insight: error feedback rescues sparsification at extreme compression; gossip + compression (CHOCO-SGD) at 100x matches BSP at 1x.

5. Dimension 4 — Computation/Communication Parallelism (Section 7)

5.1 Pipelining

Backward layers:   |bwd_3|bwd_2|bwd_1|
Communication:           |c_3 |c_2 |c_1|   <- WFBP overlaps c_l with bwd_{l-1}

WFBP (Wait-Free Backward Propagation) [Awan, Zhang]: overlap layer-l gradient comm with layer-(l-1) gradient computation.
MG-WFBP (merged-gradient WFBP) [Shi]: tensor fusion to amortize alpha startup; addresses small-tensor latency dominance at large scale.
For sparsified gradients: three-stage pipeline (compute, sparsify, communicate).
All-Reduce decomposed into two operations (reduce-scatter + all-gather) to enable interleaving with forward/backward compute [Wang et al.].

5.2 Scheduling

Priority scheduling + tensor partitioning [Peng et al.]: parallelize feed-forward computation with communication of earlier-layer gradients.
Communication-contention-aware scheduling [Wang]: avoids cross-job interference on shared GPU clusters.
DAG-aware ordering [Shi]: enforces a global tensor order across all workers to prevent deadlock and ensure correctness.

Concept	Type
WFBP vs. blocking comm	[DESIGN] (framework feature)
Merge-gradient (tensor fusion) threshold	[KNOB]
Tensor partition size	[KNOB]
Priority order of layers	[KNOB]
Concurrent-collective scheduling	[DESIGN]

5.3 Open issues

For very large tensors, the "long communication" can stall subsequent small tensors — exactly the problem MG-WFBP and tensor partitioning try to solve; no single solution dominates.
2022 benchmark [Agarwal et al.] shows WFBP-style overlap reduces training time substantially; without overlap, compression alone is much less useful.

6. Convergence Analysis (Section 8)

The survey collects standard assumptions:

L-Lipschitz gradient.
Unbiased stochastic gradient.
Bounded variance: E ||grad F - grad f||^2 <= sigma^2.
Bounded second moment: E ||grad F||^2 <= M^2.
(Optional) mu-strong convexity.

For gossip:

W is symmetric doubly stochastic.
Spectral gap rho < 1.

For compression:

k-contraction: E ||x - C(x)||^2 <= (1 - d/n) ||x||^2.
Unbiased compression (sometimes): E[C(x)] = x.

Selected convergence rates (Table 9 of survey):

Architecture	Sync	Compression	Convex	Non-convex
PS / AR	BSP	None	O(1/T)	O(1/sqrt T)
PS / AR	BSP	Quant	O(1/T)	O(1/sqrt T)
PS / AR	BSP	Spars.	O(1/T)	O(1/sqrt T)
PS	SSP	None	--	O(1/sqrt T)
PS	ASP	None	O(1/T)	O(1/sqrt T)
PS / AR	LocalSGD	None	O(1/T)	O(1/sqrt T)
Gossip	BSP	None	--	O(1/sqrt T)
Gossip	BSP	Quant	--	O(1/sqrt T)
Gossip	ASP	None	--	O(1/sqrt T)

[INFO] All four taxonomy combinations recover the same asymptotic rate O(1/sqrt T) for non-convex problems — the difference is in the constant factors (variance, compression contraction d, staleness s, peer count, etc.) that govern the practically-important wall-clock time.

7. Auxiliary Technologies (Section 9)

These are orthogonal correctness/convergence helpers that plug into any compression scheme. All are [KNOB] or [DESIGN] choices that augment the base compression.

7.1 Error Accumulation (9.1)

Step recipe:

C_{i,t} = Sparse(v_{i,t-1} + grad_{i,t})
v_{i,t} = (v_{i,t-1} + grad_{i,t}) - C_{i,t}    [residual carried forward]
x_{t+1} = x_t - gamma * (1/n) * sum_i C_{i,t}

Used in 1-bit SGD, EF-SignSGD, ECQ-SGD, EF-Top-k, etc.

[KNOB]: turn on/off error feedback; controls whether high-compression schemes will converge.

7.2 Momentum Correction (9.2)

DGC-style momentum applied to the residual error vector:

u_{i,t} = m * u_{i,t-1} + grad_{i,t}
v_{i,t} = v_{i,t-1} + u_{i,t}
x_{t+1} = x_t - gamma * sum_i sparse(v_{i,t})

7.3 Low-rank Decomposition (9.3)

ATOMO [Wang]: gradients sparsified in an atomic decomposition basis (entry-wise, SVD, Fourier). 1-bit-QSGD and TernGrad are special cases.
Spectral-ATOMO: SVD-based ATOMO; 2x and 3x faster than QSGD/TernGrad.
Count Sketch [Ivkin]: O((1/eps) log n) sketch approximates every coordinate and l2 norm; server recovers d largest after summing.

7.4 Local Gradient Clipping (9.4)

Standard gradient clipping is applied after aggregation; for sparsified BSP, clipping must be applied locally (and the threshold scaled by sqrt(N)) to recover variance equivalence.

7.5 Warm-up Training (9.5)

During warm-up: lower learning rate AND lower compression aggressiveness (sparsity grows exponentially toward final value).
Avoids extreme delayed-gradient effects in the first few epochs.

8. Conclusion and Future Directions (Section 10)

The survey explicitly enumerates four open problems:

Foundation model training: do current comm-efficient methods scale to GPT-3 / GShard / RecSys-class models?
Higher compression level: above 1000x without accuracy loss?
Adaptive compression: per-layer, per-tensor, or per-peer compression ratios chosen automatically.
Fault-tolerant algorithms: handling stragglers, network congestion, and worker failures in heterogeneous deployments.

(3) and (4) are directly aligned with DynamICCL's research thesis: an RL agent that adapts collective configuration to heterogeneous, congested, runtime-variable conditions.

9. Knob-vs-Design Master Table (across all four dimensions)

This is the practical reference for DynamICCL design.

+---------------------+----------+------+--------------------------------+
| Decision            | Type     | Who  | Notes                          |
+---------------------+----------+------+--------------------------------+
| BSP/SSP/ASP/Local   | [DESIGN] | User | Set in training script (DDP=BSP)|
| s (staleness bound) | [KNOB]   | RL?  | Only in SSP regimes            |
| tau (local steps)   | [KNOB]   | RL?  | Only in Local-SGD regimes      |
| Architecture        | [DESIGN] | User | PS/AR/Gossip — DDP=AR          |
| AR algorithm        | [KNOB]   | RL   | Ring / Tree / Recursive        |
| AR protocol         | [KNOB]   | RL   | LL / LL128 / Simple            |
| nChannels           | [KNOB]   | RL   | NCCL channels                  |
| numThreads          | [KNOB]   | RL   | NCCL threads per channel       |
| chunkSize           | [KNOB]   | RL   | NCCL chunk granularity         |
| Hierarchical AR     | [KNOB]   | RL?  | Whether to use hierarchy       |
| numHierarchies      | [KNOB]   | RL?  | Levels of hierarchy            |
| Topology            | [DESIGN] | HW   | Torus / BCube / Fat-Tree       |
| Quantization on/off | [DESIGN] | User | Library / training script      |
| Quantization bits   | [KNOB]   | User | If on                          |
| Sparsification on/off| [DESIGN]| User | Library / training script      |
| Top-k k value       | [KNOB]   | User | If on                          |
| Error feedback      | [KNOB]   | User | If compression on              |
| WFBP overlap        | [DESIGN] | Frmwk| Built into Horovod / DDP       |
| Tensor fusion thresh| [KNOB]   | User | Horovod_FUSION_THRESHOLD       |
| Priority schedule   | [KNOB]   | User | BytePS / P3 / TicTac           |
+---------------------+----------+------+--------------------------------+

10. Mapping the Taxonomy to DynamICCL

DynamICCL's RL agent (Agent-2) outputs <algo, proto, nChannels, numThreads> for each NCCL collective call. In taxonomy terms:

DynamICCL operates ENTIRELY INSIDE this cell of the survey:

   Architecture  = All-Reduce         (FIXED by user)
   Synchronization = BSP              (FIXED by user)
   Compression  = None                (FIXED by user)
   Pipelining   = WFBP                (FIXED by framework)

   AR algorithm choice                <-- DynamICCL ACTION
   AR protocol choice                 <-- DynamICCL ACTION
   nChannels / numThreads             <-- DynamICCL ACTION
   chunkSize                          <-- adjacent action / next experiment

What DynamICCL gains from this survey

Analytical reward shaping: Table 6 gives closed-form latency and bandwidth costs for Ring / Tree / Recursive. The agent's optimal policy should converge toward these analytically-predicted regions, with deviations capturing real runtime effects (congestion, contention) that the analytical model omits.
Action-space justification: The survey explicitly notes Ring is bandwidth-optimal but latency-linear in n; Tree is logarithmic-latency. This justifies including BOTH in DynamICCL's action set rather than defaulting to one — exactly the point at which NCCL's heuristic switches between protocols and where the heuristic frequently mis-decides.
Confirmed orthogonality: Synchronization (BSP), architecture (AR), compression (off), and pipelining (WFBP) are the user / framework's choice. DynamICCL does NOT need to model them — they are exogenous constants, simplifying the RL state space.
State-feature ideas: Table 5 highlights "communication congestion" as a system-level signal that varies by architecture and sync. DynamICCL's state can include a real-time congestion estimate (per the LSTM detector in Saraswati's notes) without modeling the underlying architecture choice.

What this survey does NOT address (DynamICCL's research gap)

Runtime configuration tuning of the All-Reduce primitive itself. The survey treats AR as a single algorithmic choice (Ring or Tree) — but modern NCCL exposes (algo, proto, nChannels, numThreads, chunkSize) as a configuration manifold with dozens of points per collective. The survey's cost model (Table 6) is too coarse to predict performance within this manifold; this is exactly the gap DynamICCL fills.
Adaptive selection per collective call. The survey only considers static algorithm choice across an entire training run. DynamICCL adapts per-collective, per-iteration, per-message-size — a finer granularity consistent with the survey's open problem (3) ("different compression ratios for different layers/tensors").
Online congestion-aware reconfiguration. The survey acknowledges fault-tolerance and congestion as open problem (4) but offers no mechanism. DynamICCL's RL+LSTM stack is one such mechanism.

Future-extension targets within the taxonomy

If DynamICCL's action space expands beyond pure NCCL knobs:

Future axis	Taxonomy cell unlocked
Tune `tau` for Local-SGD / FedAvg	Sync dim — knob 2.2
Tune compression `k` per tensor	Compression dim — open problem (3)
Tune tensor-fusion threshold	Pipelining dim — knob 5.2
Tune number of AR hierarchies	Architecture dim — knob 3.3
Switch BSP vs Local-SGD at runtime	Sync dim (very ambitious)

The survey thus serves both as a map of where DynamICCL currently lives (the AR + BSP cell, with NCCL-specific sub-knobs the survey does not detail) and as a map of where DynamICCL could expand (the other taxonomy cells, each with their own knob-vs-design decomposition that the same RL framework could in principle absorb).