A Survey on Distributed Machine Learning — Detailed Summary

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer | TU Delft + imec/Ghent | ACM Computing Surveys 53(2), Article 30, 2020 | DOI 10.1145/3377454

Per-section summary organized by the survey's headings. Each section flags whether its content is in scope for DynamICCL (intra-cluster synchronous distributed deep learning over NCCL) or out of scope (federated learning, gossip, asynchronous parameter servers, and similar).


Abstract


1. Introduction

DynamICCL relevance: sets the framing — DynamICCL targets the training side of this picture, specifically the communication phase.


2. Machine Learning — A High-Performance Computing Challenge?

2.1 Scaling Up (hardware acceleration)

2.2 Scaling Out (distributed systems)

2.3 Discussion

DynamICCL relevance: Sec. 2.1's GPU material and Sec. 2.2's communication discussion are foundational. NCCL is the layer that makes scale-out work for GPU clusters; DynamICCL tunes that layer.


3. A Reference Architecture for Distributed Machine Learning

3.1 Machine Learning Algorithms

Three orthogonal taxonomies:

3.2 Hyperparameter Optimization

3.3 Combining Multiple Algorithms: Ensemble Methods

3.4 Topologies

Four topology classes (Figure 3 in the paper):

(a) Centralized          (b) Tree (all-reduce-style)
        Aggregator              root
       /    |    \              /  \
      W     W     W           W     W
                              / \   / \
                             W   W W   W

(c) Parameter Server      (d) Peer-to-Peer
   PS1   PS2  PS3            W -- W -- W
   |  \ / |  / |             |  X  |  X |
   W   W  W   W              W -- W -- W

DynamICCL relevance (high): Tree (b) is the all-reduce-style class, which NCCL realizes through its Ring and Tree algorithms. Parameter server (c) and peer-to-peer / gossip (d) are out of scope.
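To make the all-reduce family concrete, the following is a minimal single-process sketch of a ring all-reduce (reduce-scatter followed by all-gather) over plain Python lists. It is a textbook illustration, not NCCL's implementation; the function name `ring_all_reduce` and the assumption that the buffer length divides evenly by the number of workers are choices made for this example only.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce on a ring; buffers[r] is worker r's local vector."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "sketch assumes the vector splits evenly into p chunks"
    chunk = n // p
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def span(c):
        # Index range covered by chunk c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) mod p
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, data[r][span((r - s) % p)]) for r in range(p)]
        for r, c, payload in sends:
            dst = (r + 1) % p
            local = data[dst][span(c)]
            data[dst][span(c)] = [a + b for a, b in zip(local, payload)]
    # Worker r now holds the fully reduced chunk (r + 1) mod p.

    # Phase 2: all-gather. In step s, worker r forwards chunk (r + 1 - s) mod p,
    # which is already fully reduced, and the neighbour overwrites its copy.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, data[r][span((r + 1 - s) % p)]) for r in range(p)]
        for r, c, payload in sends:
            data[(r + 1) % p][span(c)] = payload
    return data


if __name__ == "__main__":
    workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(workers))
```

Each worker sends and receives 2(p - 1) chunks of size n/p, which is why the ring pattern is bandwidth-efficient for large messages, while tree patterns need fewer steps and tend to favour small ones. That trade-off is the kind of choice a per-collective tuner has to make.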

3.5 Communication

3.5.1 Computation Time vs. Communication vs. Accuracy

3.5.2 Bridging Computation and Communication — synchronization models

Model | Behavior | Pros | Cons
BSP (Bulk Synchronous Parallel) | global barrier each step | guaranteed correctness, simple | straggler-bound
SSP (Stale Synchronous Parallel) | fastest worker may be at most s steps ahead of slowest | bounded error, faster than BSP | tuning s; complex
ASP (Approximate Synchronous Parallel) | sync only when parameter delta is significant | adaptive, skips unimportant updates | hard to bound error
BAP/TAP (Barrierless / Total Async) | no barriers at all | fastest | error grows with delay; convergence risk
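The difference between BSP and SSP can be sketched as a single admission check on worker clocks. The code below is an illustrative toy, not taken from the survey: `may_advance`, `simulate`, and the tick-based `periods` schedule are hypothetical names and modelling choices. The gap is measured on the step a worker is about to start, so a staleness of 0 reproduces the BSP barrier.

```python
def may_advance(clocks, worker, staleness):
    """Return True if `worker` may start its next step.

    clocks[i] counts the steps worker i has completed. With staleness = 0
    nobody starts step k + 1 until every worker has finished step k (BSP);
    with staleness = s a worker may run up to s steps ahead of the slowest (SSP).
    """
    return clocks[worker] - min(clocks) <= staleness


def simulate(periods, staleness, ticks):
    """Run workers for `ticks` time units and return their completed-step counts.

    periods[i] is a hypothetical number of ticks worker i needs per step.
    """
    clocks = [0] * len(periods)
    progress = [0] * len(periods)
    for _ in range(ticks):
        snapshot = list(clocks)  # all workers decide on the same view
        for w, period in enumerate(periods):
            if progress[w] == 0 and not may_advance(snapshot, w, staleness):
                continue  # blocked at the barrier or staleness bound
            progress[w] += 1
            if progress[w] == period:
                clocks[w] += 1
                progress[w] = 0
    return clocks


if __name__ == "__main__":
    # One fast worker (1 tick per step) and one straggler (3 ticks per step).
    print("BSP:", simulate([1, 3], staleness=0, ticks=6))
    print("SSP:", simulate([1, 3], staleness=2, ticks=6))
```

Under BSP the fast worker idles while the straggler finishes each step; under SSP it keeps working until it hits the staleness bound, which is the "faster than BSP" entry in the table.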

3.5.3 Communication Strategies

DynamICCL relevance: BSP is DynamICCL's regime. WFBP-style overlap of gradient communication with backpropagation is already provided by the framework's integration with NCCL; DynamICCL's per-collective tuning is a complementary lever. Compression and asynchronous modes are out of scope.
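Why overlap and per-collective tuning are complementary can be seen in a small timing model. The sketch below assumes a single communication link that serializes gradient transfers and per-layer costs given in backprop order; the function name `sync_finish_time` and all numbers are hypothetical, chosen only to illustrate the effect.

```python
def sync_finish_time(backward_ms, comm_ms, overlap):
    """Time from the start of the backward pass until the last gradient is synchronized.

    backward_ms[i] and comm_ms[i] are the backward-compute time and gradient
    all-reduce time of the i-th layer visited by backprop (output layer first).
    """
    if not overlap:
        # Communicate only after the whole backward pass has finished.
        return sum(backward_ms) + sum(comm_ms)

    grad_ready = 0.0  # when the current layer's gradient becomes available
    link_free = 0.0   # when the (single, assumed) link finishes its previous transfer
    for b, c in zip(backward_ms, comm_ms):
        grad_ready += b
        start = max(grad_ready, link_free)  # wait for the gradient and for the link
        link_free = start + c
    return max(grad_ready, link_free)


if __name__ == "__main__":
    backward = [4, 4, 4]
    comm = [3, 3, 3]
    print(sync_finish_time(backward, comm, overlap=False))
    print(sync_finish_time(backward, comm, overlap=True))
    # Faster collectives still shorten the exposed tail after the last layer.
    print(sync_finish_time(backward, [2, 2, 2], overlap=True))
```

In this model overlap hides every transfer except the ones that are still on the link when the backward pass ends. Shortening individual collectives reduces that exposed tail, which is the lever a per-collective tuner acts on.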


4. The Distributed Machine Learning Ecosystem

4.1 General Purpose Distributed Computing Frameworks

DynamICCL relevance: out of scope (wrong abstraction layer).

4.2 Natively Distributed Machine Learning Systems

4.2.1 Distributed Ensemble Learning

4.2.2 Parallel Synchronous SGD

DynamICCL relevance (very high): these are precisely the frameworks under which DynamICCL operates. NCCL is named explicitly as the GPU collective backend. DynamICCL tunes the layer Horovod and Caffe2 call into.

4.2.3 Parallel Asynchronous SGD and Parameter Servers

DynamICCL relevance: out of scope for the async PS variants. TensorFlow's all-reduce mode is in scope (it uses NCCL underneath).

4.2.4 Parallel Stale-Synchronous SGD

DynamICCL relevance: out of scope.

4.2.5 Parallel Hybrid-Synchronous SGD

DynamICCL relevance: the all-reduce sub-component is in scope; the PS component is not.

4.3 Machine Learning in the Cloud

DynamICCL relevance: orthogonal — DynamICCL targets bare-metal HPC (Chameleon Cloud) but the same NCCL tuning applies in cloud GPU instances that expose NCCL.


5. Conclusions and Current Challenges

5.1 Performance

DynamICCL relevance: directly motivates DynamICCL — the survey identifies the unsolved problem DynamICCL addresses.

5.2 Fault Tolerance

DynamICCL relevance: worth flagging, since DynamICCL inherits NCCL's fault-tolerance limitations.

5.3 Privacy (Federated Learning)

DynamICCL relevance: out of scope.

5.4 Portability

DynamICCL relevance: out of scope.


Tables / Figures


Survey-Level Limitations (acknowledged or evident)


Scope Filter for DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects, per collective call, the tuple (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on a synchronous data-parallel GPU cluster. Mapping the survey's design space to DynamICCL:
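The action space described above can be written down as a small data structure. The sketch below is illustrative only: the class and function names, the candidate values, and the reward (negative completion time) are assumptions for this example, not DynamICCL's actual design. The environment variable names are NCCL's documented tuning knobs, but they apply process-wide, so they are only a coarse stand-in for true per-call selection.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CollectiveConfig:
    algorithm: str    # e.g. "Ring" or "Tree"
    protocol: str     # e.g. "LL", "LL128" or "Simple"
    n_channels: int
    num_threads: int

    def as_env(self):
        """Map the tuple onto NCCL environment variables (process-wide settings)."""
        return {
            "NCCL_ALGO": self.algorithm,
            "NCCL_PROTO": self.protocol,
            "NCCL_MIN_NCHANNELS": str(self.n_channels),
            "NCCL_MAX_NCHANNELS": str(self.n_channels),
            "NCCL_NTHREADS": str(self.num_threads),
        }


def config_space(algorithms=("Ring", "Tree"),
                 protocols=("LL", "LL128", "Simple"),
                 channels=(1, 2, 4, 8),
                 threads=(64, 128, 256, 512)):
    """Enumerate every candidate tuple; the value lists are illustrative."""
    return [CollectiveConfig(*combo)
            for combo in product(algorithms, protocols, channels, threads)]


def reward(completion_time_us):
    """Assumed reward: shorter collectives score higher."""
    return -completion_time_us


def best_config(measured):
    """Pick the config with the highest reward from {config: completion_time_us}."""
    return max(measured, key=lambda cfg: reward(measured[cfg]))


if __name__ == "__main__":
    space = config_space()
    print(len(space), "candidate configurations")
    # Hypothetical timings for two candidates on one message size.
    timings = {space[0]: 120.0, space[-1]: 95.0}
    print(best_config(timings).as_env())
```

An exhaustive argmax over measured timings is used here only to keep the example short; a learned policy would replace `best_config` and choose a tuple from the same space based on the state it observes.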

Survey axis | DynamICCL's choice | In/Out of scope
Parallelism | Data-parallel | In scope
Topology | Tree / all-reduce (NCCL Ring, Tree) | In scope
Synchronization | BSP | In scope
Communication strategy | NCCL collectives (per-call config) | In scope; this is the lever
Compression | None (DynamICCL doesn't compress) | Out of scope
Framework layer | Horovod / PyTorch DDP / TF | Out of scope (above DynamICCL)
Hardware | GPU clusters with NVLink + IB/Ethernet | In scope
Cloud vs. bare-metal | Bare-metal Chameleon Cloud | Orthogonal

In scope for DynamICCL (cite directly):

  1. Sec. 3.4 tree / all-reduce topology — the family DynamICCL tunes.
  2. Sec. 3.5.1 compute-vs-communication trade-off — DynamICCL's reward function operationalizes this.
  3. Sec. 3.5.3 WFBP and HybComm (dynamic strategy switching at the gradient layer); DynamICCL applies the same idea at the NCCL configuration layer.
  4. Sec. 4.2.2 Horovod, Caffe2 — explicitly use NCCL; the production environment DynamICCL plugs into.
  5. Sec. 5.1 performance challenge — survey-identified open problem that DynamICCL directly addresses.

Out of scope for DynamICCL (cite for boundary-setting):

  1. Federated learning (Sec. 5.3).
  2. Peer-to-peer / gossip / SFB (Sec. 3.4(d)).
  3. Asynchronous parameter servers (Sec. 4.2.3, DistBelief Downpour SGD).
  4. Stale-synchronous parallel (Sec. 4.2.4, Petuum).
  5. General-purpose frameworks (Sec. 4.1, Hadoop, Spark).
  6. Gradient compression (1-bit SGD).
  7. Privacy / portability (Sec. 5.3, 5.4).

Recommended use of this survey in the DynamICCL paper: cite once in the Background section to anchor the design space, then point at the specific cell DynamICCL occupies. The survey is too broad to be a direct comparison target but is ideal for explaining what DynamICCL is not doing and why that is a deliberate scoping choice.