Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey

Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu | HKBU / HIT-Shenzhen / HKUST / HKUST-GZ | arXiv:2003.06307v2 (Sep 2023, ACM CSUR-style submission, 35 pages)

Problem

Distributed deep learning (DDL) reduces wall-clock training time by spreading work across many GPUs/TPUs/CPUs, but as device throughput grows much faster than interconnect bandwidth, the communication step becomes the dominant bottleneck. The training of modern foundation models such as GPT-3, GShard, and Baidu RecSys consumes orders-of-magnitude more memory and FLOPs than even the latest H100/A100 hardware can support, and naively scaling out across slow interconnects (PCIe, 10 GbE, WAN) yields very low hardware utilization. The field has produced a fragmented body of work — synchronization relaxations, new system architectures, gradient/model compression schemes, and communication-computation overlap — but no single resource cleanly maps the trade-off space, the convergence-vs-communication tension, or which knobs are worth tuning under what conditions.

Core Insight

A data-parallel SGD training algorithm can be decomposed into four orthogonal dimensions — communication synchronization, system architecture, compression technique, and computation/communication parallelism — and almost every communication-efficient algorithm in the literature is a particular point in this 4-D design space. Treating these dimensions as independent makes it possible to systematically combine them, reason about their joint convergence behavior, and benchmark them under uniform settings.

Method (Taxonomy)

The survey's central artifact is the four-dimension taxonomy of distributed SGD (Table 1 / Fig. 2 of the paper):

                    Distributed Data-Parallel SGD
                                |
   +------------+--------+------+------+--------------------+
   |            |               |                          |
Synchronization  Architecture   Compression       Comm/Compute Parallelism
   |            |               |                          |
   +-Synchronous +-Param Server +-Quantization (low-bit)   +-Pipelining (WFBP, MG-WFBP)
   +-Stale-Sync  +-All-Reduce   |  - 1-bit, QSGD, TernGrad  +-Scheduling
   +-Async       +-Gossip       |  - Sign-SGD, DIANA          (priority,
   +-Local-SGD                  +-Sparsification (top-k)       tensor partitioning)
                                   - Random-k
                                   - Top-k / gTop-k
                                   - Threshold (fixed/adaptive)
                                   - SBC, STC, AdaComp
                                +-Coordinate Descent (BCD, IBCD)
                                +-Proximal / low-variance

Two further axes are mentioned but not deeply treated: communication protocol (TCP/IP, RoCE, IB Verbs, RDMA) and physical network topology (Torus, BCube, Fat-Tree). The survey is explicitly algorithm-level, not protocol-level.

For each dimension the paper:

Formalizes the update rule mathematically (BSP, SSP, ASP, Local-SGD, QSGD, Top-k, etc.).
Surveys 5–20 representative methods.
States convergence rate and communication cost in big-O.
Reports its own benchmarks on CIFAR-10/ResNet-20 and Shakespeare/LSTM with 4, 8, 16, 32 workers.

A unified comparison table (Table 9) lists communication cost and convergence rate for every (architecture, sync, compression) combination. A second table (Table 5) summarizes architecture-vs-sync trade-offs in terms of model consistency, communication frequency, communication congestion, and convergence stability.

Auxiliary techniques (Section 9) — error accumulation, momentum correction, low-rank decomposition (ATOMO, Count Sketch), local gradient clipping, warm-up training — are presented as orthogonal correctness/convergence helpers that plug into any compression scheme.

Results (selected from the survey's own benchmarks)

The survey runs uniform experiments on ResNet-20 (CIFAR-10) and a character-level LSTM (Shakespeare) with 4–32 workers on RTX 2080 Ti.

Algorithm	4-worker accuracy	32-worker accuracy	Notes
BSP-SGD (PS)	91.25%	89.21%	Reference baseline
ASP-SGD	92.27%	0.00% (diverges)	Staleness collapses at 32 workers
Local-SGD (tau=4)	90.90%	89.44%	Insensitive to tau in 2..16
FedAvg	89.81%	83.45%	Notable drop at large N
BSP + 16x quant	86.44%	85.37%	16-bit quant tolerable
BSP + Top-k (1000x)	62.66%	61.98%	Heavy accuracy loss
BSP + EF-Top-k (1000x)	90.88%	87.76%	Error feedback recovers accuracy
DP-SGD (Gossip)	91.08%	88.99%	Comparable to BSP
CHOCO-SGD (Gossip + 100x comp)	91.27%	89.00%	High compression + decentralized works

Headline take-aways:

Error-feedback Top-k survives 1000x compression; vanilla Top-k does not.
ASP fails at scale (32 workers diverge under most learning rates).
Local-SGD is insensitive to local-step count in the [2, 16] range.
Quantization is bounded by the 32x ceiling; sparsification can reach 100x–1000x but only with auxiliary techniques.
All-Reduce algorithms have well-known cost models: Ring = 2(n-1)alpha + 2(n-1)/n * beta * N; Tree = 2 alpha log n + 2 beta N log n; hierarchical / 2D-Torus / BCube reduce the latency term further.

Limitations

Offline benchmarks only — uses a small CNN (ResNet-20) and LSTM, no large language models or modern transformer training.
Limited to data parallelism (model and pipeline parallelism are explicitly out of scope, with brief context).
Network-protocol and physical-topology layers (Fig 2 axes 5 and 6) are acknowledged but not surveyed.
No coverage of NCCL internals (algorithm/protocol/channel/numThreads), tuner-plugin APIs, or runtime configuration tuning.
Survey ends at Sep 2023; pre-dates many 2024–2025 works on collective scheduling and runtime tuning (AutoCCL, MSCCL++, NCCL Device API).

Relevance to DynamICCL

DynamICCL operates inside the All-Reduce / collective-comm cell of this taxonomy: the agent's action space (algorithm, protocol, nChannels, numThreads) is a configuration of the All-Reduce architecture (Section 4.2). The other three dimensions of the taxonomy (sync, compression, scheduling) are fixed by the user's framework (e.g. PyTorch DDP / Horovod choosing BSP, no compression, WFBP overlap), not by the NCCL tuner.

Concretely:

Taxonomy dimension	DynamICCL's relationship
Synchronization (BSP/SSP/ASP/Local)	Frozen by framework — DynamICCL inherits BSP-AllReduce
System architecture (PS/AR/Gossip)	Frozen as All-Reduce
Compression (quant/sparsification)	Out of scope — DynamICCL tunes uncompressed transfer
Pipelining/scheduling (WFBP, MG-WFBP)	Above DynamICCL — runs at framework / NCCL launch level
All-Reduce algorithm (Ring/Tree/2D)	Knob — DynamICCL action
All-Reduce protocol (LL/LL128/Simple)	Knob — DynamICCL action
nChannels / numThreads	Knob — DynamICCL action
chunkSize / numPipeOps	Knob — adjacent action space

The survey's communication-cost models for All-Reduce (Table 6: Ring vs Tree vs Recursive Doubling latency/bandwidth) provide a direct analytical reward shaping target for DynamICCL: the agent should learn to pick Ring at moderate n (bandwidth-optimal) and Tree at large n (logarithmic latency), matching the survey's analytical predictions while accounting for runtime factors the cost model omits (network congestion, GPU contention, message size).

The survey's open problem (3) — "Can different compression ratios be set for different layers/tensors in a deep model or for peers with varying network bandwidth to achieve optimal system throughput?" — is a per-tensor decision problem of the same family DynamICCL solves at the (algo, proto, channels) level. Once DynamICCL is mature, the same RL framework could extend to per-tensor compression-ratio tuning.

The survey's open problem (4) — fault-tolerance under stragglers, congestion, and worker failure — is exactly the regime where DynamICCL should adapt: the RL state can be augmented with congestion / latency-jitter signals (LSTM detector) to switch the collective configuration online.