Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey — Detailed Summary
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu | Shenzhen MSU-BIT University, Lanzhou U., UBC, Beijing IT | arXiv:2404.06114v1 [cs.DC] | April 2024 | 47 pages | covers literature 2018-2023
This summary follows the structure used for the companion survey 0025 (Tang et al., 2023) but emphasizes how this paper differs. Within each branch, concepts are tagged:
- [KNOB] a parameter that an external optimizer (such as DynamICCL's RL agent) could plausibly set at runtime;
- [FX-FW] framework-fixed choice baked into the training stack (PyTorch DDP / DeepSpeed / Megatron / Horovod) before NCCL is invoked;
- [FX-HW] hardware-fixed choice (capital expenditure: NVLink presence, switch programmability, fabric topology);
- [INFO] an empirical or analytical observation that informs the agent but is not actuated.
The final two sections give the knob/design master table and DynamICCL mapping.
0. Paper-Level Summary
0.1 Bibliographic and scope
This is the largest and most up-to-date survey of communication-efficient distributed deep learning at the time of publication. It is explicitly scoped to large-scale training and inference: thousands of GPUs, foundation models (LLMs and trillion-parameter recommender models), and heterogeneous inter-/intra-node interconnect. It supersedes the 0025 survey of Tang et al. (2023) along two axes: (a) it adds the framework and infrastructure layers that 0025 deliberately scopes out, and (b) it brings the literature window forward to 2023, capturing the LLM-era papers (PowerSGD, ScaleCom, GPT-class 3D parallelism, fault-tolerant pipeline systems).
0.2 Abstract distillation
The abstract identifies three new challenges introduced by the large-scale regime: (1) fault tolerance — long training runs across thousands of devices guarantee non-trivial failure rates, (2) scalability of algorithms and infrastructure — sublinear scaling becomes the rule rather than the exception, and (3) heterogeneity — datasets, model partitions, and resources are non-uniform. Communication is the bottleneck for all three. The survey covers the period 2018-2023 across three layers (algorithm, framework, infrastructure) and concludes with an LLM training case study.
0.3 What is new vs. prior surveys
Prior surveys explicitly compared:
- Ben-Nun & Hoefler (2019): operator-level concurrency in DNNs.
- Tang et al. (2020/2023, the companion 0025 survey): comm-efficient data-parallel SGD with empirical FedML+MPI benchmarks.
- Verbraeken et al. (2020): general distributed ML.
- Mayer & Jacobsen (2020): scalable DL frameworks.
Liang et al. claim novelty in:
- Joint coverage of the algorithm + framework + infrastructure stack.
- Treating LLM-scale training as a first-class scenario (Section VII).
- Including programmable network devices (in-network aggregation, SmartNICs) and synthesized collective libraries (SCCL, TACCL, MSCCLang) — both absent from 0025.
- Treating fault tolerance, GPU resource sharing, and inference serving alongside training.
0.4 No unified empirical benchmark
Unlike 0025, this survey does not run its own experiments. It cites benchmarks and numbers from the surveyed papers. The contribution is breadth plus the LLM case-study analysis, not new measurements.
1. Section I and II: Introduction and Fundamentals
1.1 Three large-scale challenges (Section I)
The paper organizes its motivation around three challenges that distinguish large-scale DDL from earlier-generation DDL:
- Algorithmic and architectural complexity. Search spaces (compression, parallelism, scheduling) explode at scale; static heuristics no longer suffice.
- Heterogeneity. Devices, networks, datasets, and model partitions are all non-uniform. Stragglers and load imbalance dominate wall-clock time.
- Foundation-model scale. GPT-class models exceed any single-device memory budget; pure data parallelism is no longer feasible.
1.2 Parallelism modes (Section II)
Data Parallelism (DP): replicate model, shard data, all-reduce gradients.
Model Parallelism (MP): shard model, route activations across shards.
Tensor Parallelism (TP) is intra-layer slicing.
Pipeline Parallelism (PP): partition by layer; overlap micro-batches.
3D Parallelism (DP+MP+PP): the standard for trillion-parameter LLMs.
[FX-FW] All of these parallelism modes are training-script choices. DynamICCL inherits whichever combination the user has chosen and operates on the collectives produced by that choice. The collectives' message-size distribution differs sharply across modes:
| Mode | Typical collective | Message size | Frequency |
|---|---|---|---|
| DP | AllReduce | 100 MB - 10 GB | once/step |
| TP (intra-layer) | AllReduce / ReduceScatter | 1 - 100 MB | many/step |
| PP (inter-layer) | Send/Recv (point-to-point) | <1 MB - 100 MB | many/step |
| ZeRO-3 / FSDP | AllGather / ReduceScatter | 10 MB - 1 GB | many/step |
1.3 Paradigms
The survey distinguishes:
- Cluster / cloud-based DDL (HPC; relevant to DynamICCL).
- Federated learning (millions of edge devices).
- Peer-to-peer (gossip, Tor-like topologies).
DynamICCL targets the first paradigm only.
2. Section III: Communication-Efficient Model Synchronization
This section is the closest analog of 0025 Section 3. Liang et al. cover the same six-class synchronization taxonomy with newer references, and add a federated-learning subsection.
2.1 Six-class synchronization taxonomy
| Class | Representative work | Key idea |
|---|---|---|
| Synchronous SGD | OSP [85] | Overlap "unimportant" gradients with next-iter compute |
| Asynchronous SGD | Downpour [86], PS+ [87] | No barrier; eager pull; segment PS into shards |
| Stale Synchronous (SSP) | SSP [88], R2P [90], HSP [91] | Bounded staleness s; round-robin sync to dodge contention |
| Local SGD | Stich, Yu, Haddadpour, Woodworth | tau local steps before averaging |
| Adaptive Local | AdaComm [93], Post-local [94], SlowMo [95] | Adapt tau to network/loss state |
| Decentralized | D-PSGD [99], D^2 [100] | No central PS; gossip-style averaging |
[FX-FW] all six classes — DynamICCL inherits the choice.
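To make the Local SGD row of the table concrete, here is a minimal single-process sketch of "tau local steps before averaging". The function names, the toy quadratic objective, and the hyperparameters are illustrative assumptions, not taken from the survey; model averaging stands in for the all-reduce.

```python
import numpy as np


def local_sgd(grad_fn, x0, num_workers, tau, rounds, lr, seed=0):
    """Run `rounds` communication rounds of Local SGD.

    Each worker takes `tau` local SGD steps starting from the shared model,
    then all workers average their models (the only communication per round).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for worker in range(num_workers):
            xw = x.copy()
            for _ in range(tau):
                xw -= lr * grad_fn(xw, worker, rng)
            local_models.append(xw)
        # Model averaging stands in for the all-reduce.
        x = np.mean(local_models, axis=0)
    return x


def noisy_quadratic_grad(x, worker, rng):
    # Hypothetical objective: f(x) = 0.5 * ||x||^2 with Gaussian gradient noise.
    return x + 0.01 * rng.standard_normal(x.shape)


x_final = local_sgd(noisy_quadratic_grad, x0=np.ones(4),
                    num_workers=4, tau=8, rounds=20, lr=0.1)
```

With tau = 8, the 160 gradient steps per worker cost only 20 synchronizations, which is the communication saving the Local SGD class trades against possible model drift between workers.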
2.2 Convergence rate tables (Tables IV and V of the paper)
Local SGD (Table IV, simplified):
| Reference | Convergence rate | Required sync rounds R | Setting |
|---|---|---|---|
| Stich [105] | O(G^2 / NT) | Omega(N^{1/2} T^{1/2}) | Bounded grad; strongly convex |
| Yu [106] | O(G^2 / sqrt(NT)) | Omega(N^{3/4} T^{3/4}) | Bounded grad; non-convex |
| Haddadpour [107] | O(1 / NT) | Omega(N^{1/3} T^{1/3}) | PL condition; non-convex |
| Woodworth [109] | O(sqrt(1/NT) + 1/cbrt(TR)) | Omega(N * polylog T) | Convex |
Asynchronous SGD (Table V):
| Reference | Required iterations | Setting |
|---|---|---|
| Stich [112] | Omega(sigma^2/eps^2 + tau_max/eps) | Quasi-convex |
| Aviv [113] | Omega(sigma^2/eps^2 + tau_avg/eps) | Delay-adaptive LR |
| Cohen [114] | Omega(sigma^2/eps^4 + tau_avg/eps^2) | Smooth non-convex |
[INFO] These tables generalize 0025's Table 9 to the large-scale regime — note the explicit dependence on tau_max and tau_avg (worst-case vs. mean delay), which is the practical handle for fault tolerance and straggler mitigation. DynamICCL's RL state can include observed delay statistics to predict convergence-region effects, but the survey treats this as a training-script tuning problem, not a runtime collective-config problem.
2.3 FL-specific synchronization
Federated-learning-specific issues are flagged as out-of-scope for HPC DDL but covered for completeness:
- FedAvg [117]: baseline; equivalent to Local-SGD with client subsampling.
- FedProx [122]: adds proximal term to handle non-IID drift.
- FedNova [123]: normalizes client updates by local-step count.
Not relevant to DynamICCL.
3. Section IV: Communication-Efficient Data Compression
This is the second algorithmic-layer section, also covered in 0025 but with newer methods.
3.1 Quantization
G -- Quant_b() --> G_quant -- send --> Unquant() --> G_approx
delta = G - Unquant(G_quant) (error feedback, optional)
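The two-line pipeline above can be sketched as follows, using a scaled-sign (1-bit) quantizer as the example `Quant_b`. Feeding `delta` back into the next gradient is the usual error-feedback convention and is an assumption here, as are all class and function names.

```python
import numpy as np


def quantize_sign(g):
    """1-bit quantizer: keep only the signs plus one shared scale."""
    scale = float(np.mean(np.abs(g)))
    signs = np.where(g >= 0, 1, -1).astype(np.int8)
    return signs, scale


def unquantize_sign(signs, scale):
    return scale * signs.astype(np.float64)


class ErrorFeedbackCompressor:
    """Carries the quantization error (delta) into the next step."""

    def __init__(self, shape):
        self.delta = np.zeros(shape)

    def step(self, g):
        corrected = g + self.delta            # re-inject last step's error
        signs, scale = quantize_sign(corrected)
        g_approx = unquantize_sign(signs, scale)
        self.delta = corrected - g_approx     # delta = G - Unquant(G_quant)
        return signs, scale, g_approx


rng = np.random.default_rng(0)
comp = ErrorFeedbackCompressor(shape=(1000,))
sent_total = np.zeros(1000)
true_total = np.zeros(1000)
for _ in range(50):
    g = rng.standard_normal(1000)
    _, _, g_approx = comp.step(g)
    sent_total += g_approx
    true_total += g
```

The useful invariant is that everything transmitted so far plus the current `delta` equals the sum of the true gradients, so no gradient mass is lost, only delayed.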
| Method | Bits | Idea |
|---|---|---|
| 1-bit SGD [142] | 1 | Sign + threshold; speech apps |
| SignSGD [143] | 1 | Element-wise sign with majority vote |
| 1-bit Adam [145] | 1 | Warmup phase, then quantize Adam momentum |
| QSGD [146] | family | Stochastic quantization, unbiased estimator |
| TernGrad | 2 | Ternary {-1,0,+1} with scalar sharing |
| PowerSGD [153] | low-rank | Power-iteration matrix decomposition |
| ATOMO [152] | low-rank | Atomic decomposition (SVD / Fourier) |
PowerSGD is highlighted as the dominant low-rank technique at the LLM scale.
[FX-FW] On/off and bit-width are training-script choices.
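Since PowerSGD is singled out in this subsection, a minimal sketch of its core idea (one power-iteration step producing a rank-r factorization of a gradient matrix) may help. This is a single-process illustration under stated assumptions: the all-reduce of the two factors, the error feedback, and the reuse of Q across steps that the published method relies on are all omitted, and the names are hypothetical.

```python
import numpy as np


def powersgd_step(M, Q):
    """One rank-r power-iteration step on a gradient matrix M (n x m).

    Returns (P, Q_new) such that M is approximated by P @ Q_new.T.
    In distributed training P and Q_new would each be all-reduced;
    that communication is not modeled here.
    """
    P = M @ Q                   # (n x r)
    P, _ = np.linalg.qr(P)      # orthonormalize the columns of P
    Q_new = M.T @ P             # (m x r)
    return P, Q_new


rng = np.random.default_rng(0)
n, m, r = 64, 32, 2
# Toy gradient that is exactly rank 2, so a rank-2 step can recover it.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
Q0 = rng.standard_normal((m, r))

P, Q1 = powersgd_step(M, Q0)
M_approx = P @ Q1.T
compression_ratio = (n * m) / ((n + m) * r)
```

Only the (n + m) * r factor entries would be communicated instead of the n * m matrix entries. Real gradients are not exactly low-rank, so the approximation is lossy in practice.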
3.2 Sparsification
| Method | Idea |
|---|---|
| Top-k [166] | Largest-magnitude k coordinates; 99% sparsity feasible |
| DGC [168] | Top-k + momentum correction + clipping + warmup |
| ScaleCom [172] | Cyclic local Top-k; designed for large clusters |
| Global Top-k [171] | Top-k applied after global aggregation |
| MSTopK [174] | Multi-stage Top-k for scalability |
ScaleCom and MSTopK are the large-scale-specific additions vs. 0025.
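As a minimal sketch of the Top-k row, the following selects the k largest-magnitude coordinates and keeps the unsent remainder as a local residual. Residual accumulation is the common companion of Top-k but is an assumption here, as are the helper names. DGC's momentum correction, clipping, and warmup are not modeled.

```python
import numpy as np


def topk_sparsify(g, k):
    """Return (indices, values) of the k largest-magnitude entries of g."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]


class TopKWithResidual:
    """Sends only k coordinates per step; the rest accumulate locally."""

    def __init__(self, size, k):
        self.k = k
        self.residual = np.zeros(size)

    def step(self, g):
        acc = g + self.residual
        idx, vals = topk_sparsify(acc, self.k)
        self.residual = acc.copy()
        self.residual[idx] = 0.0      # sent coordinates leave the residual
        return idx, vals


sparsifier = TopKWithResidual(size=5, k=2)
idx, vals = sparsifier.step(np.array([0.1, -5.0, 0.3, 2.0, -0.2]))
sent = dict(zip(idx.tolist(), vals.tolist()))
```

For the 99% sparsity figure in the table, k would be set to about 1% of the gradient length, so each worker transmits roughly 1% of the values plus their indices.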
3.3 Hybrid compression
Combinations of sparsification and quantization:
- 3LC [192]: 3-level lossy compression, sparse + quant + entropy code.
- SBC, STC (Sattler): sparse binary / ternary compression.
Hybrid schemes raise survey open problem (3): how to choose the optimum quant-bits + sparsity-k pair from the combinatorial product space.
[KNOB] Ratio choice can in principle be tuned at runtime; in practice no production framework exposes this as a per-collective knob.
4. Section V: Resource Allocation and Task Scheduling
This section has no analog in 0025; it is an entirely new layer.
4.1 GPU sharing systems (training)
| System | Ref | Idea |
|---|---|---|
| Gandiva | [206] | Time-share GPU; live migration of jobs |
| AntMan | [207] | Co-design DL framework with cluster scheduler |
| Salus | [210] | Fine-grained sharing via fast job switching and GPU memory sharing |
Motivation: cited GPU utilization in shared clusters is 25-50%. Sharing reclaims the gap.
4.2 Pipeline scheduling
| System | Ref | Schedule |
|---|---|---|
| GPipe | [253] | Synchronous pipeline; bubble at tail |
| PipeDream | [252] | 1F1B asynchronous pipeline; weight-version mismatch |
| Chimera | [63] | Bidirectional pipeline; halves the bubble |
| Varuna | [333] | Spot-instance-aware pipeline |
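The "bubble" in the table can be quantified with the standard idealized estimate for a synchronous GPipe-style schedule: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The sketch below assumes all stages take equal time per micro-batch. The function name is hypothetical and the formula is a textbook approximation, not a number quoted from the survey.

```python
def gpipe_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a synchronous pipeline with equal-cost stages."""
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (p - 1) / (m + p - 1)


few_microbatches = gpipe_bubble_fraction(num_stages=4, num_microbatches=4)
many_microbatches = gpipe_bubble_fraction(num_stages=4, num_microbatches=32)
```

Raising the micro-batch count shrinks the bubble but also shrinks each Send/Recv message, which shifts the pipeline traffic toward the small-message regime.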
4.3 Network flow scheduling
- Geryon [257]: priority-aware flow scheduling for tensor sync.
- Tiresias [244]: job-level GPU scheduling, no DL knowledge required.
- Sched2 [250], OSDL [249]: topology-aware scheduling.
4.4 Inference serving
A subsection on inference: spatial sharing, temporal sharing, and hybrid sharing of GPUs across model-serving requests. Out of DynamICCL scope (the agent targets training collectives), but flagged as adjacent territory if DynamICCL ever extends to inference.
[FX-FW] All scheduling systems are framework-level — DynamICCL inherits.
5. Section VI: Communication Infrastructures
The most novel section relative to 0025 and the most directly relevant to DynamICCL's deployment substrate.
5.1 GPU interconnects
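Hierarchy of interconnects, roughly from fastest to slowest, with cited bandwidths (the NVSwitch figure is an aggregate over the whole switch fabric, so it is not directly comparable with the per-link and per-port rates):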
Intra-GPU memory: ~3 TB/s (HBM)
NVSwitch (3rd gen): 57.6 TB/s aggregate, all-to-all
NVLink v4: 900 GB/s point-to-point
PCIe v7 (forthcoming): 242 GB/s
PCIe v4 (current): 64 GB/s
InfiniBand HDR: 200 Gb/s (~25 GB/s) per port
Ethernet 100 GbE: 100 Gb/s (~12.5 GB/s) per port
Ethernet 1 GbE (Chameleon): ~120 MB/s
[FX-HW] DynamICCL cannot change interconnect; it must read which is present (via topology probe at startup, or NCCL_DEBUG=INFO) and use that as state.
5.2 Programmable network devices
This subsection has no counterpart in 0025.
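In-network aggregation (INA) executes a partial all-reduce sum inside the programmable switch fabric, so each worker sends its gradient once and receives the aggregate once, rather than sending roughly twice its gradient volume (2(N-1)/N for N workers) as in ring all-reduce.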
| System | Ref | Idea |
|---|---|---|
| SwitchML | [282] | Tofino-switch all-reduce; converts floats to fixed-point integers because the switch lacks floating-point units |
| ATP | [283] | Aggregation transport protocol over programmable switches |
| NetReduce | [289] | NIC-resident in-network aggregation |
| Sapio et al. (cited in 0025 too) | -- | First programmable-switch INA |
SmartNIC offload moves work off the host CPU/GPU.
- Guo et al. [290]: SmartNIC caches DLRM embedding tables to reduce all-to-all.
[FX-HW] Whether INA is available depends on the cluster's switch hardware. NCCL exposes a "CollNet" code path that uses INA when present; this becomes a candidate DynamICCL action when the cluster supports it.
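To put a number on the traffic argument for INA, the sketch below compares the bytes each worker sends under a bandwidth-optimal ring all-reduce with an idealized in-network aggregation in which each worker uploads its gradient once. It is a back-of-the-envelope model: switch memory limits, packet overheads, and retransmissions are ignored, and the function names are hypothetical.

```python
def ring_allreduce_bytes_sent(size_bytes, num_workers):
    """Bytes each worker sends in a bandwidth-optimal ring all-reduce."""
    n = num_workers
    return 2 * (n - 1) / n * size_bytes


def ina_bytes_sent(size_bytes):
    """Bytes each worker sends when the switch does the summation."""
    return size_bytes


size = 1e9  # 1 GB gradient, an illustrative value
ring_sent = ring_allreduce_bytes_sent(size, num_workers=8)
ina_sent = ina_bytes_sent(size)
saving = ring_sent / ina_sent
```

Under this model the per-worker saving approaches 2x as the worker count grows, which is the upper bound on what toggling a CollNet-style path could buy on the bandwidth side.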
5.3 Collective communication libraries
| Library | Ref | Notes |
|---|---|---|
| NCCL | [295] | NVIDIA; tuned for NVLink/NVSwitch; source is public but development is NVIDIA-controlled |
| Gloo | -- | CPU-friendly; used by PyTorch on CPU |
| MPI-based (OpenMPI, MPICH) | -- | HPC default; via UCX |
| MSCCLang | [296] | DSL for custom collectives over NCCL infrastructure |
| SCCL | [303] | SMT-solver synthesis of collective algorithms |
| TACCL | [304] | Topology-aware collective synthesis (extends SCCL) |
The synthesized-collective work (MSCCLang/SCCL/TACCL) is the most important new development relative to 0025. These tools generate per-topology custom collective algorithms that outperform NCCL's hand-coded Ring/Tree/CollNet choices in many regimes. They are NOT yet runtime-tunable in production; they require offline synthesis. This is precisely the conceptual inverse of DynamICCL: they search the algorithm space offline for a fixed topology; DynamICCL searches the configuration space online for a fixed algorithm library (NCCL).
The two approaches are complementary: synthesized algorithms expand the algorithm slot of DynamICCL's action space.
5.4 Network topologies
| Topology | Refs | Notes |
|---|---|---|
| Fat-Tree | (classic) | The HPC default |
| Torus / 2D-Torus | (classic, Mikami et al.) | Google TPU pods |
| BCube | BML [305] | DC-network rack-aware |
| HammingMesh | [306] | "Affordable" alternative to Fat-Tree at scale |
| Silicon Photonic (SiP) | (cited) | Reconfigurable optical fabric |
[FX-HW] DynamICCL inherits topology. HammingMesh and SiP are flagged as trends — if SiP topologies become reconfigurable at run-time, DynamICCL's state space gains a "current logical topology" feature.
6. Section VII: Case Study on LLM Training
This section is the survey's narrative climax and the place where layers 1-3 are shown to interact.
6.1 The 100B-parameter math
For a 100B-parameter model in FP16 (~200 GB of gradients):
Pure DP all-reduce time:
on 1 Gbps WAN: ~1600 s per step (clearly infeasible)
on 100 Gbps IB: ~16 s per step (still too slow)
on 900 GB/s NVLink: ~0.22 s per step (feasible)
This roughly 7000x ratio between 1 Gbps Ethernet and NVLink (1600 s / 0.22 s is about 7200) is the survey's single most important number. It explains why LLM training requires intra-node NVLink and inter-node InfiniBand/RoCE at minimum, and why pure DP fails at this scale even on the best available hardware: the model must be split (MP) and the work must be pipelined (PP).
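The arithmetic behind these three figures can be reproduced in a few lines. The sketch assumes the simplest possible model, one pass of the full 200 GB gradient over a single link at line rate. A real ring all-reduce moves roughly twice that volume and adds latency terms, so these are lower bounds, not predictions.

```python
PARAMS = 100e9          # 100B parameters
BYTES_PER_PARAM = 2     # FP16
grad_bytes = PARAMS * BYTES_PER_PARAM   # 200 GB of gradients

# Link rates in bytes per second (network rates are quoted in bits/s).
link_bytes_per_s = {
    "1 Gbps": 1e9 / 8,
    "100 Gbps IB": 100e9 / 8,
    "900 GB/s NVLink": 900e9,
}

transfer_s = {name: grad_bytes / bw for name, bw in link_bytes_per_s.items()}
ratio = transfer_s["1 Gbps"] / transfer_s["900 GB/s NVLink"]

for name, seconds in transfer_s.items():
    print(f"{name:>16}: {seconds:10.2f} s per step")
```

The ratio comes out at 7200, which the text rounds to 7000x.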
6.2 3D parallelism
The standard recipe for trillion-parameter training:
+-----------------------------------------------------+
| 3D Parallelism = TP-within-node + PP-across-nodes + |
|                  DP-across-replicas                 |
+-----------------------------------------------------+
Within a single 8-GPU node:
    Tensor Parallelism (Megatron-style) over NVLink/NVSwitch.
    Each GPU holds 1/8th of each layer's weights.
    AllReduce on activations + gradients per layer.
Across nodes (one pipeline stage = one node):
    Pipeline Parallelism (1F1B or interleaved).
    Send/Recv activations between adjacent stages.
Across pipeline replicas:
    Data Parallelism + ZeRO-3 / FSDP.
    AllGather weights, ReduceScatter gradients.
[FX-FW] All of this is configured in the user's training script via DeepSpeed / Megatron-LM / FSDP. DynamICCL acts on the resulting collectives.
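One way to see which collectives a given GPU participates in is to map its global rank to (DP, PP, TP) coordinates. The sketch below assumes TP is the fastest-varying dimension, then PP, then DP, with consecutive ranks placed on the same node. Real frameworks choose their own rank ordering, so the layout and all names here are illustrative assumptions.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int   # data-parallel replica index
    pp: int   # pipeline stage index
    tp: int   # tensor-parallel shard index


def rank_to_coord(rank, tp_size, pp_size, dp_size):
    """Map a global rank to 3D-parallel coordinates (TP varies fastest)."""
    world = tp_size * pp_size * dp_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} outside world size {world}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coord(dp=dp, pp=pp, tp=tp)


def tp_group(rank, tp_size):
    """Global ranks that share a tensor-parallel group with `rank`."""
    start = (rank // tp_size) * tp_size
    return list(range(start, start + tp_size))


# 8-way TP inside a node, 4 pipeline stages, 2 data-parallel replicas.
coord = rank_to_coord(13, tp_size=8, pp_size=4, dp_size=2)
group = tp_group(13, tp_size=8)
```

With this layout, rank 13's tensor-parallel all-reduce stays among ranks 8 to 15, which is one 8-GPU node, matching the "one pipeline stage = one node" arrangement above.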
6.3 Fault tolerance
| System | Ref | Idea |
|---|---|---|
| SWARM | [336] | Stochastic 3D-parallel training; tolerates worker churn |
| Oobleck | [337] | Pipeline templates; pre-computed recovery plans |
| Bamboo | (cited) | Redundant pipeline stages on spot instances |
| Varuna | [333] | Same, on heterogeneous spot |
[INFO] These systems sit at the framework layer. They affect the collective mix DynamICCL sees (recovery rebuilds collectives mid-training) but do not change the per-collective action space.
6.4 LLM unlearning and temporal heterogeneity
A short discussion of "unlearning" — removing a training example's influence — and non-stationary data. Not relevant to DynamICCL.
7. Section VIII: Open Problems
The survey closes with five explicitly enumerated open directions:
- (1) Complexity of environment. Real-time adaptation to dynamic data distribution and network topology, which is exactly DynamICCL's pitch.
- (2) Finer-grained gradient assessment. Cost-of-detection vs. benefit trade-off when identifying significant gradients.
- (3) Hybrid compression optimization. Choosing optimal quant-bits + sparsity-k jointly under a huge search space; this is RL territory.
- (4) Co-design of infrastructure. Integrating in-network aggregation with transport scheduling and workload balancing; this is the layer-3 counterpart of DynamICCL's research thesis.
- (5) Fault-tolerant LLM training. Sub-minute recovery in trillion-parameter training.
Items (1) and (4) are direct alignment points with DynamICCL; items (2) and (3) are adjacent.
8. Knob-vs-Design Master Table
This is the practical reference for DynamICCL design under the lens of this survey.
+---------------------------------+----------+--------+---------------------------------+
| Decision | Type | Layer | Notes |
+---------------------------------+----------+--------+---------------------------------+
| Parallelism mode (DP/TP/PP/3D) | [FX-FW] | Algo | Training script (Megatron etc.) |
| Synchronization (BSP/SSP/ASP) | [FX-FW] | Algo | DDP=BSP |
| Local-SGD tau | [KNOB]? | Algo | Only in Local-SGD regimes |
| Compression on/off | [FX-FW] | Algo | Library (DGC, ScaleCom) |
| Compression bits / k | [KNOB]? | Algo | Conceptually tunable |
| Pipeline schedule (1F1B/Chimera)| [FX-FW] | Frmwk | DeepSpeed / Megatron config |
| Tensor-fusion threshold         | [KNOB]   | Frmwk  | HOROVOD_FUSION_THRESHOLD        |
| Job-level scheduler | [FX-FW] | Frmwk | Tiresias / Themis |
| GPU sharing policy | [FX-FW] | Frmwk | Gandiva / AntMan / Salus |
+---------------------------------+----------+--------+---------------------------------+
| Collective library | [FX-FW] | Infra | NCCL / Gloo / MSCCL |
| AR algorithm (Ring/Tree/...) | [KNOB] | Infra | <-- DynamICCL |
| AR protocol (LL/LL128/Simple) | [KNOB] | Infra | <-- DynamICCL |
| nChannels | [KNOB] | Infra | <-- DynamICCL |
| numThreads | [KNOB] | Infra | <-- DynamICCL |
| chunkSize | [KNOB] | Infra | <-- DynamICCL (next axis) |
| Synthesized collective on/off | [KNOB] | Infra | If MSCCLang available |
| CollNet / INA path | [KNOB] | Infra | If switch supports |
+---------------------------------+----------+--------+---------------------------------+
| Intra-node fabric (NVLink etc.) | [FX-HW] | Infra | Cluster-fixed |
| Inter-node fabric (IB/RoCE) | [FX-HW] | Infra | Cluster-fixed |
| Programmable switch presence | [FX-HW] | Infra | Determines INA availability |
| Topology (Fat-Tree/Torus/...) | [FX-HW] | Infra | Cluster-fixed (until SiP) |
+---------------------------------+----------+--------+---------------------------------+
Compared to 0025's master table, this version explicitly distinguishes framework-fixed (FX-FW) vs. hardware-fixed (FX-HW) so the operator can see at a glance whether a given lever requires a code change or a hardware refresh. The "[KNOB]" rows that DynamICCL actually targets are a small slice, but they are precisely the slice that can be changed at runtime at zero cost.
9. Mapping the Taxonomy to DynamICCL
DynamICCL's RL agent (Agent-2) outputs <algo, proto, nChannels, numThreads> for each NCCL collective call. In this survey's three-layer model:
Layer 1 (Algorithm)      : exogenous to DynamICCL.
Layer 2 (Framework)      : exogenous to DynamICCL.
Layer 3 (Infrastructure) : DynamICCL acts here, but only on the NCCL-actuatable subset.
Concretely:
         Survey scope
+--------------+--------------+
| Algorithm    | (Layer 1)    |
| Framework    | (Layer 2)    |
| Infrastructure (Layer 3)    |
|  - HW: NVLink, IB           | <-- FX-HW (cluster)
|  - SW: NCCL libraries       | <-- FX-FW (chosen)
|  - Synth: MSCCL/SCCL/TACCL  | <-- offline synthesis
|  - INA: SwitchML/ATP        | <-- FX-HW
|  - NCCL knobs:              |
|    <algo, proto, nCh,       | <-- DynamICCL ACTION
|     numTh, chunkSize>       |
+-----------------------------+
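To give a feel for the size of the action space in the bottom cell of the diagram, the sketch below enumerates the four-tuple <algo, proto, nChannels, numThreads> over a small grid and renders one action as NCCL environment variables. The value grids are hypothetical. NCCL reads these variables when a communicator is initialized, so this shows only a static, process-wide approximation and is not a description of how DynamICCL applies per-collective decisions.

```python
from itertools import product

# Hypothetical discretization of each axis; real deployments may differ.
ALGOS = ("Ring", "Tree")
PROTOS = ("LL", "LL128", "Simple")
N_CHANNELS = (1, 2, 4, 8, 16)
N_THREADS = (64, 128, 256, 512)


def action_space():
    """All <algo, proto, nChannels, numThreads> combinations on the grid."""
    return list(product(ALGOS, PROTOS, N_CHANNELS, N_THREADS))


def to_nccl_env(action):
    """Render one action as NCCL environment variables (static setting)."""
    algo, proto, n_channels, n_threads = action
    return {
        "NCCL_ALGO": algo,
        "NCCL_PROTO": proto,
        # Pin the channel count by setting min and max to the same value.
        "NCCL_MIN_NCHANNELS": str(n_channels),
        "NCCL_MAX_NCHANNELS": str(n_channels),
        "NCCL_NTHREADS": str(n_threads),
    }


actions = action_space()
example_env = to_nccl_env(("Ring", "LL128", 4, 256))
```

Even this coarse grid yields 120 configurations per collective, and adding chunkSize as a further axis multiplies that again, which is why exhaustive offline sweeps scale poorly.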
9.1 What DynamICCL gains from this survey
Three-layer mental model. DynamICCL's research narrative gains the correct framing: "we tune the bottommost software-controllable layer of communication, complementary to higher-layer choices the user has already made." This positions DynamICCL alongside SCCL/TACCL/MSCCLang without overlapping their contribution.
LLM workload context. Section VII gives the realistic message-size distribution DynamICCL will face on a foundation-model training run: per-iteration mix of TP all-reduce (small, frequent), PP send/recv (medium, frequent), DP all-reduce (large, less frequent). A per-collective adaptive policy is the right granularity for this mix — a static NCCL config would mis-tune at least one of the three.
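Validation that the open problem is real. Survey open problems (1) and (4) are the precise gap DynamICCL fills. Citing this paper lets DynamICCL claim a survey-level identification of the research need (the version summarized here is the arXiv v1 preprint, so it should be cited as such unless a peer-reviewed version is confirmed).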
Adjacent expansion targets. When DynamICCL's action space eventually grows, the most natural next axes are:
- chunkSize — already on the roadmap.
- CollNet on/off — the in-network aggregation toggle, conditional on hardware support.
- Synthesized collective selection — if the cluster ships MSCCLang/SCCL outputs, DynamICCL's algorithm slot can include them.
- Tensor-fusion threshold — at the framework boundary; would require deeper integration with the user's training script.
Scaling math as sanity check. The 7000x NVLink-vs-1Gbps ratio bounds what DynamICCL can achieve. On Chameleon Cloud's 1 GbE-only fabric, raw collective bandwidth is fundamentally limited; DynamICCL's wins will come from latency-side knobs (LL128, nChannels) and from avoiding pathological misconfigurations rather than from bandwidth optimization.
9.2 What this survey does NOT address (DynamICCL's gap)
Per-collective runtime tuning of the All-Reduce primitive itself. The survey treats NCCL as a single fixed library. The configuration space within an NCCL collective call (algo x proto x nChannels x numThreads, with chunkSize as a planned fifth axis) is not enumerated. This is the unmodeled space DynamICCL fills.
Online adaptation to congestion. The survey lists this as open problem (1) but offers no mechanism. DynamICCL's RL stack is one such mechanism.
Probe-vs-production tensor gap. The mismatch between offline-tuned configs (e.g., ncclTuner) and runtime-observed congestion is not discussed. This is also a DynamICCL contribution.
9.3 Differentiation from 0025
The two surveys are best read in order:
Read 0025 (Tang et al., 2023) for the algorithmic foundations: sync protocols, compression schemes, decentralized SGD, communication-cost formulas (Ring, Tree, Recursive Doubling), and FedML+MPI empirical receipts. This is where DynamICCL's state-feature ideas (congestion estimate, per-tensor gradient norm) come from.
Read 0026 (this paper) for the system context: where NCCL sits in the three-layer stack, what large-scale workloads look like, how programmable switches and synthesized collectives extend the infrastructure layer, and what fault-tolerance and 3D-parallelism imply for the collective mix DynamICCL must handle.
For DynamICCL's related-work section, both surveys should be cited. 0025 justifies the algorithm-layer assumptions DynamICCL makes; 0026 justifies the infrastructure-layer scope DynamICCL operates within and the open problems it addresses.
10. Future-Extension Targets within the Taxonomy
If DynamICCL's action space expands beyond pure NCCL knobs, this survey maps the natural directions:
| Future axis | Survey cell | Layer | Difficulty |
|---|---|---|---|
| chunkSize | Section VI / NCCL | Infra | Low — already an NCCL env var |
| CollNet (INA) toggle | Section VI / SwitchML | Infra | Medium — needs hardware |
| Synthesized collective selection | Section VI / MSCCLang | Infra | Medium — offline synthesis pipeline |
| Tensor-fusion threshold | Section V / Prophet | Frmwk | Medium — Horovod/DDP boundary |
| Local-SGD tau | Section III | Algo | High — touches training script |
| Compression bits / k | Section IV | Algo | High — touches training script |
| Pipeline schedule | Section V | Frmwk | Very high — Megatron/DeepSpeed level |
As a rule of thumb: Infra-layer items have little or no training-script impact and are the natural near-term expansions. Frmwk items require coordinated tuning with the user's framework. Algo items require reasoning about convergence and accuracy, which is out of scope for a collective-config agent.
11. Closing Notes
This survey is the most current and most infrastructure-aware comm-efficient DDL survey available at the time of writing. Its weaknesses are (a) breadth without depth on the algorithmic side (0025 covers that better with its empirical benchmark) and (b) the lack of a unified evaluation. Its strengths are that it is (a) the only survey that joins the algorithm, framework, and infrastructure layers and (b) the only one that treats LLM-scale training as the primary scenario rather than the exception.
For DynamICCL specifically: this is the survey to cite when arguing that NCCL configuration tuning matters at the LLM scale, when justifying the three-layer mental model that puts DynamICCL at the lowest software layer, and when enumerating the open problems DynamICCL addresses.