Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey
Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu | HKBU / HIT-Shenzhen / HKUST / HKUST-GZ | arXiv:2003.06307v2 (Sep 2023, ACM CSUR-style submission, 35 pages)
Problem
Distributed deep learning (DDL) reduces wall-clock training time by spreading work across many GPUs/TPUs/CPUs, but as device throughput grows much faster than interconnect bandwidth, the communication step becomes the dominant bottleneck. The training of modern foundation models such as GPT-3, GShard, and Baidu RecSys consumes orders-of-magnitude more memory and FLOPs than even the latest H100/A100 hardware can support, and naively scaling out across slow interconnects (PCIe, 10 GbE, WAN) yields very low hardware utilization. The field has produced a fragmented body of work — synchronization relaxations, new system architectures, gradient/model compression schemes, and communication-computation overlap — but no single resource cleanly maps the trade-off space, the convergence-vs-communication tension, or which knobs are worth tuning under what conditions.
Core Insight
A data-parallel SGD training algorithm can be decomposed into four orthogonal dimensions — communication synchronization, system architecture, compression technique, and computation/communication parallelism — and almost every communication-efficient algorithm in the literature is a particular point in this 4-D design space. Treating these dimensions as independent makes it possible to systematically combine them, reason about their joint convergence behavior, and benchmark them under uniform settings.
Method (Taxonomy)
The survey's central artifact is the four-dimension taxonomy of distributed SGD (Table 1 / Fig. 2 of the paper):
Distributed Data-Parallel SGD
|
+------------+--------+------+------+--------------------+
| | | |
Synchronization Architecture Compression Comm/Compute Parallelism
| | | |
+-Synchronous +-Param Server +-Quantization (low-bit) +-Pipelining (WFBP, MG-WFBP)
+-Stale-Sync +-All-Reduce | - 1-bit, QSGD, TernGrad +-Scheduling
+-Async +-Gossip | - Sign-SGD, DIANA (priority,
+-Local-SGD +-Sparsification (top-k) tensor partitioning)
- Random-k
- Top-k / gTop-k
- Threshold (fixed/adaptive)
- SBC, STC, AdaComp
+-Coordinate Descent (BCD, IBCD)
+-Proximal / low-variance
Two further axes are mentioned but not deeply treated: communication protocol (TCP/IP, RoCE, IB Verbs, RDMA) and physical network topology (Torus, BCube, Fat-Tree). The survey is explicitly algorithm-level, not protocol-level.
For each dimension the paper:
- Formalizes the update rule mathematically (BSP, SSP, ASP, Local-SGD, QSGD, Top-k, etc.).
- Surveys 5–20 representative methods.
- States convergence rate and communication cost in big-O.
- Reports its own benchmarks on CIFAR-10/ResNet-20 and Shakespeare/LSTM with 4, 8, 16, 32 workers.
A unified comparison table (Table 9) lists communication cost and convergence rate for every (architecture, sync, compression) combination. A second table (Table 5) summarizes architecture-vs-sync trade-offs in terms of model consistency, communication frequency, communication congestion, and convergence stability.
Auxiliary techniques (Section 9) — error accumulation, momentum correction, low-rank decomposition (ATOMO, Count Sketch), local gradient clipping, warm-up training — are presented as orthogonal correctness/convergence helpers that plug into any compression scheme.
Results (selected from the survey's own benchmarks)
The survey runs uniform experiments on ResNet-20 (CIFAR-10) and a character-level LSTM (Shakespeare) with 4–32 workers on RTX 2080 Ti.
| Algorithm | 4-worker accuracy | 32-worker accuracy | Notes |
|---|---|---|---|
| BSP-SGD (PS) | 91.25% | 89.21% | Reference baseline |
| ASP-SGD | 92.27% | 0.00% (diverges) | Staleness collapses at 32 workers |
| Local-SGD (tau=4) | 90.90% | 89.44% | Insensitive to tau in 2..16 |
| FedAvg | 89.81% | 83.45% | Notable drop at large N |
| BSP + 16x quant | 86.44% | 85.37% | 16-bit quant tolerable |
| BSP + Top-k (1000x) | 62.66% | 61.98% | Heavy accuracy loss |
| BSP + EF-Top-k (1000x) | 90.88% | 87.76% | Error feedback recovers accuracy |
| DP-SGD (Gossip) | 91.08% | 88.99% | Comparable to BSP |
| CHOCO-SGD (Gossip + 100x comp) | 91.27% | 89.00% | High compression + decentralized works |
Headline take-aways:
- Error-feedback Top-k survives 1000x compression; vanilla Top-k does not.
- ASP fails at scale (32 workers diverge under most learning rates).
- Local-SGD is insensitive to local-step count in the [2, 16] range.
- Quantization is bounded by the 32x ceiling; sparsification can reach 100x–1000x but only with auxiliary techniques.
- All-Reduce algorithms have well-known cost models: Ring = 2(n-1)alpha + 2(n-1)/n * beta * N; Tree = 2 alpha log n + 2 beta N log n; hierarchical / 2D-Torus / BCube reduce the latency term further.
Limitations
- Offline benchmarks only — uses a small CNN (ResNet-20) and LSTM, no large language models or modern transformer training.
- Limited to data parallelism (model and pipeline parallelism are explicitly out of scope, with brief context).
- Network-protocol and physical-topology layers (Fig 2 axes 5 and 6) are acknowledged but not surveyed.
- No coverage of NCCL internals (algorithm/protocol/channel/numThreads), tuner-plugin APIs, or runtime configuration tuning.
- Survey ends at Sep 2023; pre-dates many 2024–2025 works on collective scheduling and runtime tuning (AutoCCL, MSCCL++, NCCL Device API).
Relevance to DynamICCL
DynamICCL operates inside the All-Reduce / collective-comm cell of this taxonomy: the agent's action space (algorithm, protocol, nChannels, numThreads) is a configuration of the All-Reduce architecture (Section 4.2). The other three dimensions of the taxonomy (sync, compression, scheduling) are fixed by the user's framework (e.g. PyTorch DDP / Horovod choosing BSP, no compression, WFBP overlap), not by the NCCL tuner.
Concretely:
| Taxonomy dimension | DynamICCL's relationship |
|---|---|
| Synchronization (BSP/SSP/ASP/Local) | Frozen by framework — DynamICCL inherits BSP-AllReduce |
| System architecture (PS/AR/Gossip) | Frozen as All-Reduce |
| Compression (quant/sparsification) | Out of scope — DynamICCL tunes uncompressed transfer |
| Pipelining/scheduling (WFBP, MG-WFBP) | Above DynamICCL — runs at framework / NCCL launch level |
| All-Reduce algorithm (Ring/Tree/2D) | Knob — DynamICCL action |
| All-Reduce protocol (LL/LL128/Simple) | Knob — DynamICCL action |
| nChannels / numThreads | Knob — DynamICCL action |
| chunkSize / numPipeOps | Knob — adjacent action space |
The survey's communication-cost models for All-Reduce (Table 6: Ring vs Tree vs Recursive Doubling latency/bandwidth) provide a direct analytical reward shaping target for DynamICCL: the agent should learn to pick Ring at moderate n (bandwidth-optimal) and Tree at large n (logarithmic latency), matching the survey's analytical predictions while accounting for runtime factors the cost model omits (network congestion, GPU contention, message size).
The survey's open problem (3) — "Can different compression ratios be set for different layers/tensors in a deep model or for peers with varying network bandwidth to achieve optimal system throughput?" — is a per-tensor decision problem of the same family DynamICCL solves at the (algo, proto, channels) level. Once DynamICCL is mature, the same RL framework could extend to per-tensor compression-ratio tuning.
The survey's open problem (4) — fault-tolerance under stragglers, congestion, and worker failure — is exactly the regime where DynamICCL should adapt: the RL state can be augmented with congestion / latency-jitter signals (LSTM detector) to switch the collective configuration online.