Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu | Shenzhen MSU-BIT, Lanzhou U., UBC, BIT | arXiv:2404.06114v1 [cs.DC] | April 2024 | 47 pages | covers 2018-2023
Problem
Modern distributed deep learning is no longer about scaling ResNet to 32 GPUs. The workload of interest is now trillion-parameter foundation models (GPT-class LLMs, recommendation models with TB-scale embedding tables) trained across thousands of GPUs in heterogeneous, fault-prone clusters. At this scale, communication is the dominant bottleneck, but the bottlenecks themselves are qualitatively different from earlier-generation DDL: model and pipeline parallelism become mandatory; intra-node interconnect (NVLink, NVSwitch) and inter-node fabric (PCIe, RDMA, programmable switches) must be co-designed with the algorithm; and trillion-parameter training cannot tolerate straggler-driven re-synchronization, so fault tolerance and adaptive scheduling become first-class concerns. Existing surveys (notably the 0025 survey by Tang et al., 2023, which this paper explicitly succeeds) cover the algorithmic side of comm-efficient data-parallel training but stop short of the infrastructure and 3D-parallelism layers that define large-scale practice.
Core Insight
Communication efficiency at large scale is a three-layer co-design problem: algorithms (synchronization protocols, gradient compression), frameworks (resource allocation, task and pipeline scheduling), and infrastructure (GPU interconnects, programmable network devices, collective libraries, physical topology). A survey scoped to any single layer underspecifies the design space; the practitioner must reason across all three. The paper's contribution is the first comprehensive treatment of all three layers in one document, anchored by an LLM-training case study that demonstrates how the layers interact in practice.
Method (Survey Structure)
The survey organizes ~340 references into six dimensions. Each dimension has its own taxonomy:
+------------------------------------------------------------------------+
| Layer 1 (Algorithm) |
| +-- Section III: Model Synchronization |
| | SyncSGD, AsyncSGD, SSP, Local-SGD, Decentralized, FL-tailored |
| +-- Section IV: Data Compression |
| Quantization, Sparsification, Hybrid (sparse+quant), Low-rank |
+------------------------------------------------------------------------+
| Layer 2 (Framework) |
| +-- Section V: Resource Allocation & Task Scheduling |
| GPU sharing (Gandiva, AntMan, Salus), Pipeline scheduling |
| (PipeDream, GPipe, Chimera, Varuna), Flow scheduling (Geryon), |
| Inference serving (spatial/temporal/hybrid sharing) |
+------------------------------------------------------------------------+
| Layer 3 (Infrastructure) <-- where DynamICCL acts |
| +-- Section VI: Communication Infrastructures |
| Intra-node interconnect: PCIe, NVLink, NVSwitch |
| Inter-node fabric: RDMA, InfiniBand |
| Programmable network devices: SwitchML, ATP, NetReduce, SmartNICs|
| Collective libraries: NCCL, MSCCLang, SCCL, TACCL |
| Topologies: Torus, BCube/BML, HammingMesh, Silicon-Photonic |
+------------------------------------------------------------------------+
| Section VII: LLM case study (3D parallelism, fault tolerance) |
+------------------------------------------------------------------------+
The survey uses tables to compare convergence rates (Local-SGD, Async-SGD) and qualitative cost/benefit summaries for each system class. There is no unified empirical benchmark (unlike Tang et al., 2023, which ran FedML+MPI ablations); the contribution here is breadth and infrastructure depth.
Headline Results / Empirical Numbers Cited
- LLM-class scaling math reproduced from the case study: a 100B-parameter pure-DP all-reduce takes ~1600s on a 1000Mbps inter-node link versus ~0.22s on NVLink — a 7000x ratio that motivates 3D parallelism over flat DP.
- Hardware bandwidths cited: PCIe v7 = 242 GB/s; NVLink v4 = 900 GB/s; NVSwitch (3rd gen) = 57.6 TB/s aggregate all-to-all.
- GPU utilization in shared clusters cited at 25-50%, motivating Gandiva, AntMan, Salus.
- Top-k sparsification routinely achieves 99% sparsity at acceptable accuracy (DGC, ScaleCom).
- Programmable-switch in-network aggregation (SwitchML, ATP) reduces traffic by an N-fold factor for n-way all-reduce.
Limitations / Open Problems Enumerated
The survey closes with five open directions, all of which are relevant to DynamICCL:
- Complexity of environment — real-time adaptation to dynamic data distribution and network topology.
- Finer-grained gradient assessment — balancing the cost of identifying significant gradients with the benefit.
- Hybrid compression optimization — choosing among quantization, sparsification, low-rank, and combinations — search space is huge.
- Co-design of infrastructure — integrating in-network aggregation with transport scheduling and workload balancing.
- Fault-tolerant LLM training — efficient recovery from frequent device failures in trillion-parameter sessions.
Items (1) and (4) are the closest match to DynamICCL's research thesis.
What Makes This Survey Different from the 0025 Survey (Tang et al. 2023)
| Axis | 0025 (Tang et al.) | 0026 (Liang et al., this paper) |
|---|---|---|
| Scope | Data-parallel comm-efficient SGD | All three layers: algorithm + framework + infrastructure |
| Year | arXiv 2003.06307v2, 2020-2023 | arXiv 2404.06114, 2018-2023 |
| Parallelism modes | DP only | DP + MP + PP + 3D parallelism |
| Foundation models | Mentioned as open problem | Dedicated case study (Section VII) on LLMs |
| Hardware coverage | Brief | Deep: PCIe / NVLink / NVSwitch / programmable switches / SmartNICs |
| Collective library coverage | Ring, Tree, Recursive Doubling formulas | Adds MSCCLang, SCCL, TACCL (synthesized collectives) |
| Topology coverage | Torus, BCube, Fat-Tree | Adds HammingMesh and silicon-photonic reconfigurable |
| In-network aggregation | One mention (Sapio) | Full subsection: SwitchML, ATP, NetReduce |
| Empirical benchmarks | Yes, FedML+MPI on 4-32 RTX 2080 Ti | No unified benchmark |
| Fault tolerance | Brief | First-class topic: SWARM, Oobleck, pipeline replicas |
The two surveys are complementary. 0025 is the right reference for the algorithm layer with empirical receipts; 0026 is the right reference for the infrastructure and framework layers and for the LLM-scale story.
Knobs vs. Design (the lens that matters for DynamICCL)
NCCL-actuatable knobs DynamICCL can change at runtime are a small slice of this survey. The vast majority of methods discussed are framework-fixed system choices made before NCCL is ever invoked.
NCCL-actuatable (DynamICCL action space)
- Collective algorithm: Ring / Tree / CollNet / NVLS
- Protocol: LL / LL128 / Simple
- nChannels, numThreads, chunkSize
- Whether to use synthesized collective code path (MSCCLang)
Framework-fixed (set in user training script, exogenous to DynamICCL)
- DP / MP / PP / 3D parallelism choice
- Quantization on/off and bit-width
- Sparsification on/off and k
- Synchronization protocol (BSP / SSP / ASP / Local-SGD)
- Pipeline schedule (1F1B / interleaved / Chimera)
- ZeRO stage / FSDP
Hardware-fixed (capital-expenditure decision, exogenous to DynamICCL)
- PCIe vs NVLink vs NVSwitch
- InfiniBand vs Ethernet vs Silicon-Photonic
- In-network aggregation (programmable switch presence)
- Topology (Fat-Tree / Torus / BCube / HammingMesh)
DynamICCL operates entirely inside the "NCCL-actuatable" row. The survey's contribution to DynamICCL is to enumerate everything outside that row that the agent must treat as fixed context, plus to flag a handful of NCCL-adjacent infrastructure features (CollNet/in-network aggregation, synthesized collectives via MSCCLang) that constitute future expansion targets for DynamICCL's action space.
Relevance to DynamICCL
DynamICCL (Tennessee Tech, Rajat Bisht) is an RL-based NCCL
configuration optimizer in which Agent-2 selects per-collective
<algorithm, protocol, nChannels, numThreads> to
minimize collective completion time on HPC GPU clusters. This survey
informs DynamICCL in five concrete ways:
Confirms the action-space scope. The paper treats the collective library (NCCL) as the lowest-level actuatable interface and cites synthesized collectives (MSCCLang/SCCL/TACCL) as the next frontier. This places DynamICCL squarely on the contemporary research path: NCCL knob tuning is the practical lever above the synthesis layer.
Validates "infrastructure-aware RL" as the open problem. Open problems (1) and (4) — adaptive runtime tuning under dynamic topology, and co-design of in-network compute with scheduling — are precisely DynamICCL's research thesis. The survey explicitly identifies this gap.
Provides the LLM workload context. Section VII gives the per-collective message-size distribution that DynamICCL will encounter in real foundation-model training: 3D parallelism produces a heterogeneous mix of small (TP all-reduce, ~MB) and large (DP all-reduce, ~GB) collectives in each step. DynamICCL's per-collective adaptation is exactly the granularity needed.
Bandwidth ratios bound the optimization. The 7000x ratio between slow-Ethernet DP and NVLink-DP communication for a 100B model defines the expected performance ceiling and the cost of misconfiguration on HPC clusters with mixed interconnect (e.g., Chameleon Cloud's 1GbE plus PCIe).
Maps adjacent expansion targets. When DynamICCL's action space eventually expands beyond NCCL knobs, this survey is the field guide: tensor-fusion thresholds (Prophet), priority schedules (Geryon), synthesized collective selection (MSCCLang), and CollNet/in-network aggregation toggles all become candidate dimensions.
This survey is the right "context document" to cite when justifying DynamICCL's scope in a research proposal or related-work section, and the right complement to 0025 for a complete picture of the comm-efficient DDL landscape.