A Survey on Distributed Machine Learning — Detailed Summary

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer | TU Delft + imec/Ghent | ACM Computing Surveys 53(2), Article 30, 2020 | DOI 10.1145/3377454

Per-section summary organized by the survey's headings. Each section flags whether its content is in scope (intra-cluster sync DDL with NCCL — relevant to DynamICCL) or out of scope (federated, gossip, async PS, etc.).

Abstract

AI demand has surged thanks to ML advances and hardware acceleration.
Larger training datasets and models have outstripped single-machine compute, forcing workloads to be distributed across many machines.
Distribution introduces two core challenges: efficient parallelization of training, and producing a coherent global model.
The survey provides an extensive overview of state-of-the-art techniques and systems, framed against conventional ML.

1. Introduction

Frames the question of why ML is now an HPC problem: training-data growth and parameter counts are outpacing Moore's Law.
Establishes the distinction between training (compute-heavy, distributed) and inference/prediction (latency-sensitive, often single-node).
Sets up the reference architecture used in later sections.

DynamICCL relevance: sets the framing — DynamICCL targets the training side of this picture, specifically the communication phase.

2. Machine Learning — A High-Performance Computing Challenge?

2.1 Scaling Up (hardware acceleration)

GPUs: Nvidia Tesla / V100 lineage; programmable many-thread architectures that have shifted from rigid SIMD to more flexible SIMT.
ASICs: Google's TPU with a systolic array for matrix multiply, approximated as MIMD; DianNao family with Neuro-Functional Units.
CPU vector instructions: AVX-512 brings ML-relevant throughput to general-purpose CPUs.
Other accelerators: Adapteva Epiphany (many-core MIMD), FPGAs deployed by Microsoft Azure (Project Brainwave).
Limits: memory bandwidth, energy density, and die area cap how far scale-up can go for very large models.

2.2 Scaling Out (distributed systems)

Once a single node's accelerators are saturated, multi-node distribution is the only path forward.
Communication latency and bandwidth become first-order costs; high-performance interconnects (InfiniBand, NVLink) matter as much as compute.
Introduces the tension that drives the rest of the paper: more nodes give more compute but also more communication.

2.3 Discussion

Scaling up and scaling out are complementary; modern systems do both (multi-GPU per node, many nodes per cluster).
The "right" mix depends on model size, batch size, and interconnect.

DynamICCL relevance: Sec. 2.1's GPU material and Sec. 2.2's communication discussion are foundational. NCCL is the layer that makes scale-out work for GPU clusters; DynamICCL tunes that layer.

3. A Reference Architecture for Distributed Machine Learning

3.1 Machine Learning Algorithms

Three orthogonal taxonomies:

Feedback type: supervised, unsupervised, semi-supervised, reinforcement.
Purpose: anomaly detection, classification, clustering, regression, dimensionality reduction.
Method: evolutionary algorithms, SGD-based methods, SVMs, ANNs, RNNs, decision trees.

3.2 Hyperparameter Optimization

Outer loop over the training process: learning rate, batch size, architecture choices.
Naturally distributable (embarrassingly parallel across configurations) but expensive at scale.

3.3 Combining Multiple Algorithms: Ensemble Methods

Bagging, boosting, stacking — each can be trivially distributed because ensemble members train independently.

3.4 Topologies

Four topology classes (Figure 3 in the paper):

(a) Centralized          (b) Tree (all-reduce-style)
        Aggregator              root
       /    |    \              /  \
      W     W     W           W     W
                              / \   / \
                             W   W W   W

(c) Parameter Server      (d) Peer-to-Peer
   PS1   PS2  PS3            W -- W -- W
   |  \ / |  / |             |  X  |  X |
   W   W  W   W              W -- W -- W

(a) Centralized: one aggregator combines independently trained models; hierarchical ensembling.
(b) Tree: workers communicate up-and-down a tree; the natural shape of reduction collectives. All-reduce is described here as accumulating gradients up the tree and broadcasting back down.
(c) Parameter Server (PS): sharded master nodes hold the global model as a key-value store; workers push updates and pull current parameters.
(d) Peer-to-Peer: every node holds a full copy of parameters; no central point of failure. Examples: Sufficient Factor Broadcasting (SFB), gossip learning.

DynamICCL relevance (high): Tree (b) is exactly the family of topologies NCCL implements as Ring and Tree algorithms. (c) PS and (d) gossip are out of scope.

3.5 Communication

3.5.1 Computation Time vs. Communication vs. Accuracy

Three-way trade-off: more synchronization improves accuracy but increases communication; more compute per step amortizes communication but slows iteration; lossy compression cuts communication but degrades accuracy.

3.5.2 Bridging Computation and Communication — synchronization models

Model	Behavior	Pros	Cons
BSP (Bulk Synchronous Parallel)	global barrier each step	guaranteed correctness, simple	straggler-bound
SSP (Stale Synchronous Parallel)	fastest worker may be at most s steps ahead of slowest	bounded error, faster than BSP	tuning s; complex
ASP (Approximate Synchronous Parallel)	sync only when parameter delta is significant	adaptive, skips unimportant updates	hard to bound error
BAP/TAP (Barrierless / Total Async)	no barriers at all	fastest	error grows with delay; convergence risk

3.5.3 Communication Strategies

Continuous communication: updates streamed as they are produced.
Wait-Free Backpropagation (WFBP): during backward pass, send gradients for already-computed (top) layers while still computing lower layers — classic compute/communication overlap.
HybComm (Hybrid Communication): dynamically switch between PS and SFB per parameter based on sparsity.
Gradient compression: 1-bit SGD in CNTK quantizes each gradient to a single bit with error feedback.

DynamICCL relevance: BSP is DynamICCL's regime. WFBP-style overlap is already done by NCCL+framework integration; DynamICCL's per-collective tuning is a complementary lever. Compression and async modes are out of scope.

4. The Distributed Machine Learning Ecosystem

4.1 General Purpose Distributed Computing Frameworks

Apache Hadoop: MapReduce + HDFS. Foundational but disk-bound; not suitable for iterative ML.
Apache Spark: RDDs, MLlib; in-memory iterative compute fixes Hadoop's iteration cost. Used for classical ML at scale; not the dominant choice for deep learning.

DynamICCL relevance: out of scope (wrong abstraction layer).

4.2 Natively Distributed Machine Learning Systems

4.2.1 Distributed Ensemble Learning

Embarrassingly parallel ensemble members; simplest form of distribution.

4.2.2 Parallel Synchronous SGD

Baidu's all-reduce SGD: ring all-reduce on top of MPI; brought HPC-style collectives to deep learning.
Horovod: Uber's open-source library; uses MPI + NVIDIA NCCL to perform all-reduce across multi-GPU multi-node clusters. Headline framework for sync DDL.
Caffe2: uses NCCL and Facebook's Gloo library for collectives.
CNTK: Microsoft; pioneered 1-bit SGD for gradient compression.

DynamICCL relevance (very high): these are precisely the frameworks under which DynamICCL operates. NCCL is named explicitly as the GPU collective backend. DynamICCL tunes the layer Horovod and Caffe2 call into.

4.2.3 Parallel Asynchronous SGD and Parameter Servers

DistBelief (Google): introduced Downpour SGD (async PS) and Distributed L-BFGS.
TensorFlow: evolution of DistBelief; dataflow graph with placement on CPUs/GPUs/TPUs; supports both PS and all-reduce paradigms.
MXNet: KVStore abstraction; supports sync/async PS modes; flexible imperative + symbolic graphs.

DynamICCL relevance: out of scope for the async PS variants. TensorFlow's all-reduce mode is in scope (it uses NCCL underneath).

4.2.4 Parallel Stale-Synchronous SGD

Petuum: SSP execution model with bounded staleness; dynamic scheduling for very large models that don't fit on one node.

DynamICCL relevance: out of scope.

4.2.5 Parallel Hybrid-Synchronous SGD

MXNet-MPI: combines MPI all-reduce within groups and PS across groups.

DynamICCL relevance: the all-reduce sub-component is in scope; the PS component is not.

4.3 Machine Learning in the Cloud

AWS SageMaker, Google Cloud AI, Azure ML are commoditizing distributed ML.
The cloud is the dominant deployment model for non-hyperscaler users.
Frameworks compete on managed infrastructure as much as on algorithms.

DynamICCL relevance: orthogonal — DynamICCL targets bare-metal HPC (Chameleon Cloud) but the same NCCL tuning applies in cloud GPU instances that expose NCCL.

5. Conclusions and Current Challenges

5.1 Performance

Communication remains the dominant performance lever at scale.
Compute-communication overlap, gradient compression, topology-aware collective design, and synchronization-model relaxation are all active fronts.
Cross-framework benchmarking at scale (10k+ nodes) is largely missing.

DynamICCL relevance: directly motivates DynamICCL — the survey identifies the unsolved problem DynamICCL addresses.

5.2 Fault Tolerance

Synchronous all-reduce systems (MPI/NCCL) lack production-grade fault tolerance: a single failed node stalls the collective.
Async PS systems are naturally more resilient because workers can drop in and out.
Open problem: fault-tolerant all-reduce.

DynamICCL relevance: flag — DynamICCL inherits NCCL's fault-tolerance limitations.

5.3 Privacy (Federated Learning)

Federated learning trains across mutually-distrusting devices without centralizing data.
Differential privacy was a hoped-for solution; Hitaj et al. showed GAN-based attacks recover record-level info despite record-level DP.
Most production frameworks have no built-in privacy primitives.

DynamICCL relevance: out of scope.

5.4 Portability

Models are typically locked to their training framework.
ONNX is a partial answer.
Cross-framework portability of trained models, training infrastructure, and datasets remain open.

DynamICCL relevance: out of scope.

Tables / Figures

Figure 1: Conceptual view: training (optimization + hyperparameters) vs. prediction (inference).
Figure 2: Side-by-side data parallelism (replicated model, split data) vs. model parallelism (replicated data, split model).
Figure 3: Topology taxonomy: (a) centralized, (b) tree, (c) parameter server, (d) peer-to-peer.
Figure 4: Distributed ML ecosystem Venn: cloud, general-purpose frameworks, natively distributed systems.

Survey-Level Limitations (acknowledged or evident)

2020 cutoff — misses Megatron-LM, ZeRO, modern pipeline parallelism (GPipe is current; later schemes are not), PowerSGD-class compression refinements, modern federated systems.
Pipeline parallelism gets shallow treatment relative to data/model.
No quantitative cross-framework benchmarking.
Benchmarking depth at >1k-node scale is acknowledged as missing.

Scope Filter for DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects, per collective call, the tuple (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on a synchronous data-parallel GPU cluster. Mapping the survey's design space to DynamICCL:

Survey axis	DynamICCL's choice	In/Out of scope
Parallelism	Data-parallel	In scope
Topology	Tree / all-reduce (NCCL Ring, Tree)	In scope
Synchronization	BSP	In scope
Communication strategy	NCCL collectives (per-call config)	In scope; this is the lever
Compression	None (DynamICCL doesn't compress)	Out of scope
Framework layer	Horovod / PyTorch DDP / TF	Out of scope (above DynamICCL)
Hardware	GPU clusters with NVLink + IB/Ethernet	In scope
Cloud vs. bare-metal	Bare-metal Chameleon Cloud	Orthogonal

In scope for DynamICCL (cite directly):

Sec. 3.4 tree / all-reduce topology — the family DynamICCL tunes.
Sec. 3.5.1 compute-vs-communication trade-off — DynamICCL's reward function operationalizes this.
Sec. 3.5.3 WFBP and HybComm — dynamic strategy switching at the gradient layer; DynamICCL does the same idea at the NCCL config layer.
Sec. 4.2.2 Horovod, Caffe2 — explicitly use NCCL; the production environment DynamICCL plugs into.
Sec. 5.1 performance challenge — survey-identified open problem that DynamICCL directly addresses.

Out of scope for DynamICCL (cite for boundary-setting):

Federated learning (Sec. 5.3).
Peer-to-peer / gossip / SFB (Sec. 3.4(d)).
Asynchronous parameter servers (Sec. 4.2.3, DistBelief Downpour SGD).
Stale-synchronous parallel (Sec. 4.2.4, Petuum).
General-purpose frameworks (Sec. 4.1, Hadoop, Spark).
Gradient compression (1-bit SGD).
Privacy / portability (Sec. 5.3, 5.4).

Recommended use of this survey in the DynamICCL paper: cite once in the Background section to anchor the design space, then point at the specific cell DynamICCL occupies. The survey is too broad to be a direct comparison target but is ideal for explaining what DynamICCL is not doing and why that is a deliberate scoping choice.