Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey — Detailed Summary

Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu | Shenzhen MSU-BIT University, Lanzhou U., UBC, Beijing IT | arXiv:2404.06114v1 [cs.DC] | April 2024 | 47 pages | covers literature 2018-2023

This summary follows the structure used for the companion survey 0025 (Tang et al., 2023) but emphasizes how this paper differs. Within each branch, concepts are tagged:

The final two sections give the knob/design master table and DynamICCL mapping.


0. Paper-Level Summary

0.1 Bibliographic and scope

This is the largest and most up-to-date survey of communication-efficient distributed deep learning at the time of publication. It is explicitly scoped to large-scale training and inference: thousands of GPUs, foundation models (LLMs and trillion-parameter recommender models), and heterogeneous inter-/intra-node interconnect. It supersedes the 0025 survey of Tang et al. (2023) along two axes: (a) it adds the framework and infrastructure layers that 0025 deliberately scopes out, and (b) it brings the literature window forward to 2023, capturing the LLM-era papers (PowerSGD, ScaleCom, GPT-class 3D parallelism, fault-tolerant pipeline systems).

0.2 Abstract distillation

The abstract identifies three new challenges introduced by the large-scale regime: (1) fault tolerance — long training runs across thousands of devices guarantee non-trivial failure rates, (2) scalability of algorithms and infrastructure — sublinear scaling becomes the rule rather than the exception, and (3) heterogeneity — datasets, model partitions, and resources are non-uniform. Communication is the bottleneck for all three. The survey covers the period 2018-2023 across three layers (algorithm, framework, infrastructure) and concludes with an LLM training case study.

0.3 What is new vs. prior surveys

Prior surveys explicitly compared:

Liang et al. claim novelty in:

  1. Joint coverage of the algorithm + framework + infrastructure stack.
  2. Treating LLM-scale training as a first-class scenario (Section VII).
  3. Including programmable network devices (in-network aggregation, SmartNICs) and synthesized collective libraries (SCCL, TACCL, MSCCLang) — both absent from 0025.
  4. Treating fault tolerance, GPU resource sharing, and inference serving alongside training.

0.4 No unified empirical benchmark

Unlike 0025, this survey does not run its own experiments. It cites benchmarks and numbers from the surveyed papers. The contribution is breadth plus the LLM case-study analysis, not new measurements.


1. Section I and II: Introduction and Fundamentals

1.1 Three large-scale challenges (Section I)

The paper organizes its motivation around three challenges that distinguish large-scale DDL from earlier-generation DDL:

  1. Algorithmic and architectural complexity. Search spaces (compression, parallelism, scheduling) explode at scale; static heuristics no longer suffice.
  2. Heterogeneity. Devices, networks, datasets, and model partitions are all non-uniform. Stragglers and load imbalance dominate wall-clock time.
  3. Foundation-model scale. GPT-class models exceed any single-device memory budget; pure data parallelism is no longer feasible.

1.2 Parallelism modes (Section II)

Data Parallelism (DP):   replicate model, shard data, all-reduce gradients.
Model Parallelism (MP):  shard model, route activations across shards.
                         Tensor Parallelism (TP) is intra-layer slicing.
Pipeline Parallelism (PP): partition by layer; overlap micro-batches.
3D Parallelism (DP+MP+PP): the standard for trillion-parameter LLMs.

[FX-FW] All four parallelism modes are training-script choices. DynamICCL inherits whichever combination the user has chosen and operates on the collectives produced by that choice. The collectives' message-size distribution differs sharply across modes:

Mode              Typical collective       Message size      Frequency
DP                AllReduce                 100MB - 10GB      once/step
TP (intra-layer)  AllReduce / ReduceScatter  1-100MB           many/step
PP (inter-layer)  Send/Recv (point-to-point) <1MB - 100MB      many/step
ZeRO-3 / FSDP     AllGather / ReduceScatter  10MB - 1GB        many/step

1.3 Paradigms

The survey distinguishes:

DynamICCL targets the first paradigm only.


2. Section III: Communication-Efficient Model Synchronization

This section is the closest analog of 0025 Section 3. Liang et al. cover the same six-class synchronization taxonomy with newer references, and add a federated-learning subsection.

2.1 Six-class synchronization taxonomy

Class Representative work Key idea
Synchronous SGD OSP [85] Overlap "unimportant" gradients with next-iter compute
Asynchronous SGD Downpour [86], PS+ [87] No barrier; eager pull; segment PS into shards
Stale Synchronous (SSP) SSP [88], R2P [90], HSP [91] Bounded staleness s; round-robin sync to dodge contention
Local SGD Stich, Yu, Haddadpour, Woodworth tau local steps before averaging
Adaptive Local AdaComm [93], Post-local [94], SlowMo [95] Adapt tau to network/loss state
Decentralized D-PSGD [99], D^2 [100] No central PS; gossip-style averaging

[FX-FW] all six classes — DynamICCL inherits the choice.

2.2 Convergence rate tables (Tables IV and V of the paper)

Local SGD (Table IV, simplified):

Reference Convergence rate Required sync rounds R Setting
Stich [105] O(G^2 / NT) Omega(N^{1/2} T^{1/2}) Bounded grad; strongly convex
Yu [106] O(G^2 / sqrt(NT)) Omega(N^{3/4} T^{3/4}) Bounded grad; non-convex
Haddadpour [107] O(1 / NT) Omega(N^{1/3} T^{1/3}) PL condition; non-convex
Woodworth [109] O(sqrt(1/NT) + 1/cbrt(TR)) Omega(N * polylog T) Convex

Asynchronous SGD (Table V):

Reference Required iterations Setting
Stich [112] Omega(sigma^2/eps^2 + tau_max/eps) Quasi-convex
Aviv [113] Omega(sigma^2/eps^2 + tau_avg/eps) Delay-adaptive LR
Cohen [114] Omega(sigma^2/eps^4 + tau_avg/eps^2) Smooth non-convex

[INFO] These tables generalize 0025's Table 9 to the large-scale regime — note the explicit dependence on tau_max and tau_avg (worst-case vs. mean delay), which is the practical handle for fault tolerance and straggler mitigation. DynamICCL's RL state can include observed delay statistics to predict convergence-region effects, but the survey treats this as a training-script tuning problem, not a runtime collective-config problem.

2.3 FL-specific synchronization

Federated-learning-specific issues are flagged as out-of-scope for HPC DDL but covered for completeness:

Not relevant to DynamICCL.


3. Section IV: Communication-Efficient Data Compression

This is the second algorithmic-layer section, also covered in 0025 but with newer methods.

3.1 Quantization

G  -- Quant_b() --> G_quant  -- send -->  Unquant() --> G_approx
delta = G - Unquant(G_quant)         (error feedback, optional)
Method Bits Idea
1-bit SGD [142] 1 Sign + threshold; speech apps
SignSGD [143] 1 Element-wise sign with majority vote
1-bit Adam [145] 1 Warmup phase, then quantize Adam momentum
QSGD [146] family Stochastic quantization, unbiased estimator
TernGrad 2 Ternary {-1,0,+1} with scalar sharing
PowerSGD [153] low-rank Power-iteration matrix decomposition
ATOMO [152] low-rank Atomic decomposition (SVD / Fourier)

PowerSGD is highlighted as the dominant low-rank technique at the LLM scale.

[FX-FW] On/off and bit-width are training-script choices.

3.2 Sparsification

Method Idea
Top-k [166] Largest-magnitude k coordinates; 99% sparsity feasible
DGC [168] Top-k + momentum correction + clipping + warmup
ScaleCom [172] Cyclic local Top-k; designed for large clusters
Global Top-k [171] Top-k applied after global aggregation
MSTopK [174] Multi-stage Top-k for scalability

ScaleCom and MSTopK are the large-scale-specific additions vs. 0025.

3.3 Hybrid compression

Combinations of sparsification and quantization:

Hybrid schemes raise survey open problem (3): how to choose the optimum quant-bits + sparsity-k pair from the combinatorial product space.

[KNOB] Ratio choice can in principle be tuned at runtime; in practice no production framework exposes this as a per-collective knob.


4. Section V: Resource Allocation and Task Scheduling

This section has no analog in 0025; it is an entirely new layer.

4.1 GPU sharing systems (training)

System Ref Idea
Gandiva [206] Time-share GPU; live migration of jobs
AntMan [207] Co-design DL framework with cluster scheduler
Salus [210] Fine-grained sharing via fast PyTorch context switch

Motivation: cited GPU utilization in shared clusters is 25-50%. Sharing reclaims the gap.

4.2 Pipeline scheduling

System Ref Schedule
GPipe [253] Synchronous pipeline; bubble at tail
PipeDream [252] 1F1B asynchronous pipeline; weight-version mismatch
Chimera [63] Bidirectional pipeline; halves the bubble
Varuna [333] Spot-instance-aware pipeline

4.3 Network flow scheduling

4.4 Inference serving

A subsection on inference: spatial sharing, temporal sharing, and hybrid sharing of GPUs across model-serving requests. Out of DynamICCL scope (the agent targets training collectives), but flagged as adjacent territory if DynamICCL ever extends to inference.

[FX-FW] All scheduling systems are framework-level — DynamICCL inherits.


5. Section VI: Communication Infrastructures

The most novel section relative to 0025 and the most directly relevant to DynamICCL's deployment substrate.

5.1 GPU interconnects

Hierarchy of interconnect, from fastest to slowest, with cited bandwidths:

  Intra-GPU memory:                ~3 TB/s (HBM)
  NVSwitch (3rd gen):             57.6 TB/s aggregate, all-to-all
  NVLink v4:                       900 GB/s point-to-point
  PCIe v7 (forthcoming):           242 GB/s
  PCIe v4 (current):               64  GB/s
  InfiniBand HDR:                  200 Gb/s (~25 GB/s) per port
  Ethernet 100 GbE:                100 Gb/s (~12.5 GB/s) per port
  Ethernet 1 GbE (Chameleon):      ~120 MB/s

[FX-HW] DynamICCL cannot change interconnect; it must read which is present (via topology probe at startup, or NCCL_DEBUG=INFO) and use that as state.

5.2 Programmable network devices

This subsection has no counterpart in 0025.

In-network aggregation (INA) executes a partial all-reduce sum inside the programmable switch fabric, eliminating the N-fold redundant traffic of ring/tree all-reduce.

System Ref Idea
SwitchML [282] Tofino-switch all-reduce with floating-point hack
ATP [283] Aggregation transport protocol over programmable switches
NetReduce [289] NIC-resident in-network aggregation
Sapio et al. (cited in 0025 too) -- First programmable-switch INA

SmartNIC offload moves work off the host CPU/GPU.

[FX-HW] Whether INA is available depends on the cluster's switch hardware. NCCL exposes a "CollNet" code path that uses INA when present; this becomes a candidate DynamICCL action when the cluster supports it.

5.3 Collective communication libraries

Library Ref Notes
NCCL [295] NVIDIA; tuned for NVLink/NVSwitch; closed-source-ish
Gloo -- CPU-friendly; used by PyTorch on CPU
MPI-based (OpenMPI, MPICH) -- HPC default; via UCX
MSCCLang [296] DSL for custom collectives over NCCL infrastructure
SCCL [303] SMT-solver synthesis of collective algorithms
TACCL [304] Topology-aware collective synthesis (extends SCCL)

The synthesized-collective work (MSCCLang/SCCL/TACCL) is the most important new development relative to 0025. These tools generate per-topology custom collective algorithms that outperform NCCL's hand-coded Ring/Tree/CollNet choices in many regimes. They are NOT yet runtime-tunable in production; they require offline synthesis. This is precisely the conceptual inverse of DynamICCL: they search the algorithm space offline for a fixed topology; DynamICCL searches the configuration space online for a fixed algorithm library (NCCL).

The two approaches are complementary: synthesized algorithms expand the algorithm slot of DynamICCL's action space.

5.4 Network topologies

Topology Refs Notes
Fat-Tree (classic) The HPC default
Torus / 2D-Torus (classic, Mikami et al.) Google TPU pods
BCube BML [305] DC-network rack-aware
HammingMesh [306] "Affordable" alternative to Fat-Tree at scale
Silicon Photonic (SiP) (cited) Reconfigurable optical fabric

[FX-HW] DynamICCL inherits topology. HammingMesh and SiP are flagged as trends — if SiP topologies become reconfigurable at run-time, DynamICCL's state space gains a "current logical topology" feature.


6. Section VII: Case Study on LLM Training

This section is the survey's narrative climax and the place where layers 1-3 are shown to interact.

6.1 The 100B-parameter math

For a 100B-parameter model in FP16 (~200 GB of gradients):

Pure DP all-reduce time:
  on 1 Gbps WAN:    ~1600 s per step (clearly infeasible)
  on 100 Gbps IB:   ~16   s per step (still too slow)
  on 900 GB/s NVLink: ~0.22 s per step (feasible)

This 7000x ratio — between cheap Ethernet and NVLink — is the survey's single most important number. It explains why LLM training requires intra-node NVLink and inter-node InfiniBand/RoCE at minimum, and why pure DP fails at this scale even on best-effort hardware: the model must be split (MP) and the work must be pipelined (PP).

6.2 3D parallelism

The standard recipe for trillion-parameter training:

+-----------------------------------------------------+
| 3D Parallelism = TP-within-node + PP-across-nodes + |
|                   DP-across-replicas                |
+-----------------------------------------------------+

Within a single 8-GPU node:
   Tensor Parallelism (Megatron-style) over NVLink/NVSwitch.
   Each GPU holds 1/8th of each layer's weights.
   AllReduce on activations + gradients per layer.

Across nodes (one pipeline stage = one node):
   Pipeline Parallelism (1F1B or interleaved).
   Send/Recv activations between adjacent stages.

Across pipeline replicas:
   Data Parallelism + ZeRO-3 / FSDP.
   AllGather weights, ReduceScatter gradients.

[FX-FW] All of this is configured in the user's training script via DeepSpeed / Megatron-LM / FSDP. DynamICCL acts on the resulting collectives.

6.3 Fault tolerance

System Ref Idea
SWARM [336] Stochastic 3D-parallel training; tolerates worker churn
Oobleck [337] Pipeline templates; pre-computed recovery plans
Bamboo (cited) Redundant pipeline stages on spot instances
Varuna [333] Same, on heterogeneous spot

[INFO] These systems sit at the framework layer. They affect the collective mix DynamICCL sees (recovery rebuilds collectives mid-training) but do not change the per-collective action space.

6.4 LLM unlearning and temporal heterogeneity

A short discussion of "unlearning" — removing a training example's influence — and non-stationary data. Not relevant to DynamICCL.


7. Section VIII: Open Problems

The survey closes with five explicitly enumerated open directions:

  1. Complexity of environment. Real-time adaptation to dynamic data distribution and network topology — exactly DynamICCL's pitch.
  2. Finer-grained gradient assessment. Cost-of-detection vs. benefit trade-off when identifying significant gradients.
  3. Hybrid compression optimization. Choosing optimal quant-bits + sparsity-k jointly under huge search space — RL territory.
  4. Co-design of infrastructure. Integrating in-network aggregation with transport scheduling and workload balancing — the layer-3 counterpart of DynamICCL's research thesis.
  5. Fault-tolerant LLM training. Sub-minute recovery in trillion-parameter training.

Items (1) and (4) are direct alignment points with DynamICCL; items (2) and (3) are adjacent.


8. Knob-vs-Design Master Table

This is the practical reference for DynamICCL design under the lens of this survey.

+---------------------------------+----------+--------+---------------------------------+
| Decision                        | Type     | Layer  | Notes                           |
+---------------------------------+----------+--------+---------------------------------+
| Parallelism mode (DP/TP/PP/3D)  | [FX-FW]  | Algo   | Training script (Megatron etc.) |
| Synchronization (BSP/SSP/ASP)   | [FX-FW]  | Algo   | DDP=BSP                         |
| Local-SGD tau                   | [KNOB]?  | Algo   | Only in Local-SGD regimes       |
| Compression on/off              | [FX-FW]  | Algo   | Library (DGC, ScaleCom)         |
| Compression bits / k            | [KNOB]?  | Algo   | Conceptually tunable            |
| Pipeline schedule (1F1B/Chimera)| [FX-FW]  | Frmwk  | DeepSpeed / Megatron config     |
| Tensor-fusion threshold         | [KNOB]   | Frmwk  | Horovod_FUSION_THRESHOLD        |
| Job-level scheduler             | [FX-FW]  | Frmwk  | Tiresias / Themis               |
| GPU sharing policy              | [FX-FW]  | Frmwk  | Gandiva / AntMan / Salus        |
+---------------------------------+----------+--------+---------------------------------+
| Collective library              | [FX-FW]  | Infra  | NCCL / Gloo / MSCCL             |
| AR algorithm (Ring/Tree/...)    | [KNOB]   | Infra  | <-- DynamICCL                   |
| AR protocol (LL/LL128/Simple)   | [KNOB]   | Infra  | <-- DynamICCL                   |
| nChannels                       | [KNOB]   | Infra  | <-- DynamICCL                   |
| numThreads                      | [KNOB]   | Infra  | <-- DynamICCL                   |
| chunkSize                       | [KNOB]   | Infra  | <-- DynamICCL (next axis)       |
| Synthesized collective on/off   | [KNOB]   | Infra  | If MSCCLang available           |
| CollNet / INA path              | [KNOB]   | Infra  | If switch supports              |
+---------------------------------+----------+--------+---------------------------------+
| Intra-node fabric (NVLink etc.) | [FX-HW]  | Infra  | Cluster-fixed                   |
| Inter-node fabric (IB/RoCE)     | [FX-HW]  | Infra  | Cluster-fixed                   |
| Programmable switch presence    | [FX-HW]  | Infra  | Determines INA availability     |
| Topology (Fat-Tree/Torus/...)   | [FX-HW]  | Infra  | Cluster-fixed (until SiP)       |
+---------------------------------+----------+--------+---------------------------------+

Compared to 0025's master table, this version explicitly distinguishes framework-fixed (FX-FW) vs. hardware-fixed (FX-HW) so the operator can see at a glance whether a given lever requires a code change or a hardware refresh. The "[KNOB]" rows that DynamICCL actually targets are a small slice — but they are precisely the slice that can be changed zero-cost at runtime.


9. Mapping the Taxonomy to DynamICCL

DynamICCL's RL agent (Agent-2) outputs <algo, proto, nChannels, numThreads> for each NCCL collective call. In this survey's three-layer model:

Layer 1 (Algorithm)        : exogenous to DynamICCL.
Layer 2 (Framework)        : exogenous to DynamICCL.
Layer 3 (Infrastructure)   : DynamICCL acts here, but only on the
                              NCCL-actuatable subset.

Concretely:

                            Survey scope
                  +--------------+--------------+
                  | Algorithm    | (Layer 1)    |
                  | Framework    | (Layer 2)    |
                  | Infrastructure (Layer 3)    |
                  |   - HW: NVLink, IB         |   <-- FX-HW (cluster)
                  |   - SW: NCCL libraries     |   <-- FX-FW (chosen)
                  |   - Synth: MSCCL/SCCL/TACCL|   <-- offline synthesis
                  |   - INA: SwitchML/ATP      |   <-- FX-HW
                  |   - NCCL knobs:            |
                  |       <algo, proto, nCh,   |   <-- DynamICCL ACTION
                  |        numTh, chunkSize>   |
                  +-----------------------------+

9.1 What DynamICCL gains from this survey

  1. Three-layer mental model. DynamICCL's research narrative gains the correct framing: "we tune the bottommost software-controllable layer of communication, complementary to higher-layer choices the user has already made." This positions DynamICCL alongside SCCL/TACCL/MSCCLang without overlapping their contribution.

  2. LLM workload context. Section VII gives the realistic message-size distribution DynamICCL will face on a foundation-model training run: per-iteration mix of TP all-reduce (small, frequent), PP send/recv (medium, frequent), DP all-reduce (large, less frequent). A per-collective adaptive policy is the right granularity for this mix — a static NCCL config would mis-tune at least one of the three.

  3. Validation that the open problem is real. Survey open problems (1) and (4) are the precise gap DynamICCL fills. Citing this paper lets DynamICCL claim a peer-reviewed survey-level identification of the research need.

  4. Adjacent expansion targets. When DynamICCL's action space eventually grows, the most natural next axes are:

    • chunkSize — already on the roadmap.
    • CollNet on/off — the in-network aggregation toggle, conditional on hardware support.
    • Synthesized collective selection — if the cluster ships MSCCLang/SCCL outputs, DynamICCL's algorithm slot can include them.
    • Tensor-fusion threshold — at the framework boundary; would require deeper integration with the user's training script.
  5. Scaling math as sanity check. The 7000x NVLink-vs-1Gbps ratio bounds what DynamICCL can achieve. On Chameleon Cloud's 1 GbE-only fabric, raw collective bandwidth is fundamentally limited; DynamICCL's wins will come from latency-side knobs (LL128, nChannels) and from avoiding pathological misconfigurations rather than from bandwidth optimization.

9.2 What this survey does NOT address (DynamICCL's gap)

9.3 Differentiation from 0025

The two surveys are best read in order:

  1. Read 0025 (Tang et al., 2023) for the algorithmic foundations: sync protocols, compression schemes, decentralized SGD, communication-cost formulas (Ring, Tree, Recursive Doubling), and FedML+MPI empirical receipts. This is where DynamICCL's state-feature ideas (congestion estimate, per-tensor gradient norm) come from.

  2. Read 0026 (this paper) for the system context: where NCCL sits in the three-layer stack, what large-scale workloads look like, how programmable switches and synthesized collectives extend the infrastructure layer, and what fault-tolerance and 3D-parallelism imply for the collective mix DynamICCL must handle.

For DynamICCL's related-work section, both surveys should be cited. 0025 justifies the algorithm-layer assumptions DynamICCL makes; 0026 justifies the infrastructure-layer scope DynamICCL operates within and the open problems it addresses.


10. Future-Extension Targets within the Taxonomy

If DynamICCL's action space expands beyond pure NCCL knobs, this survey maps the natural directions:

Future axis Survey cell Layer Difficulty
chunkSize Section VI / NCCL Infra Low — already an NCCL env var
CollNet (INA) toggle Section VI / SwitchML Infra Medium — needs hardware
Synthesized collective selection Section VI / MSCCLang Infra Medium — offline synthesis pipeline
Tensor-fusion threshold Section V / Prophet Frmwk Medium — Horovod/DDP boundary
Local-SGD tau Section III Algo High — touches training script
Compression bits / k Section IV Algo High — touches training script
Pipeline schedule Section V Frmwk Very high — Megatron/DeepSpeed level

The strict rule of thumb: items in the Infra row are zero-or-low training-script impact and are the natural near-term expansions. Frmwk items require coordinated tuning with the user's framework. Algo items require reasoning about convergence and accuracy — out of scope for a collective-config agent.


11. Closing Notes

This survey is the most current and most infrastructure-aware comm-efficient DDL survey available at the time of writing. Its weaknesses are (a) breadth-without-depth on the algorithmic side (0025 covers that better with its empirical benchmark) and (b) no unified evaluation. Its strengths are (a) the only survey that joins algorithm, framework, and infrastructure layers and (b) the only one that treats LLM-scale training as the primary scenario rather than the exception.

For DynamICCL specifically: this is the survey to cite when arguing that NCCL configuration tuning matters at the LLM scale, when justifying the three-layer mental model that puts DynamICCL at the lowest software layer, and when enumerating the open problems DynamICCL addresses.