A Survey on Distributed Machine Learning — Detailed Summary
Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer | TU Delft + imec/Ghent | ACM Computing Surveys 53(2), Article 30, 2020 | DOI 10.1145/3377454
Per-section summary organized by the survey's headings. Each section flags whether its content is in scope (intra-cluster sync DDL with NCCL — relevant to DynamICCL) or out of scope (federated, gossip, async PS, etc.).
Abstract
- AI demand has surged thanks to ML advances and hardware acceleration.
- Larger training datasets and models have outstripped single-machine compute, forcing workloads to be distributed across many machines.
- Distribution introduces two core challenges: efficient parallelization of training, and producing a coherent global model.
- The survey provides an extensive overview of state-of-the-art techniques and systems, framed against conventional ML.
1. Introduction
- Frames the question of why ML is now an HPC problem: training-data growth and parameter counts are outpacing Moore's Law.
- Establishes the distinction between training (compute-heavy, distributed) and inference/prediction (latency-sensitive, often single-node).
- Sets up the reference architecture used in later sections.
DynamICCL relevance: sets the framing — DynamICCL targets the training side of this picture, specifically the communication phase.
2. Machine Learning — A High-Performance Computing Challenge?
2.1 Scaling Up (hardware acceleration)
- GPUs: Nvidia Tesla / V100 lineage; programmable many-thread architectures that have shifted from rigid SIMD to more flexible SIMT.
- ASICs: Google's TPU with a systolic array for matrix multiply, approximated as MIMD; DianNao family with Neuro-Functional Units.
- CPU vector instructions: AVX-512 brings ML-relevant throughput to general-purpose CPUs.
- Other accelerators: Adapteva Epiphany (many-core MIMD), FPGAs deployed by Microsoft Azure (Project Brainwave).
- Limits: memory bandwidth, energy density, and die area cap how far scale-up can go for very large models.
2.2 Scaling Out (distributed systems)
- Once a single node's accelerators are saturated, multi-node distribution is the only path forward.
- Communication latency and bandwidth become first-order costs; high-performance interconnects (InfiniBand, NVLink) matter as much as compute.
- Introduces the tension that drives the rest of the paper: more nodes give more compute but also more communication.
2.3 Discussion
- Scaling up and scaling out are complementary; modern systems do both (multi-GPU per node, many nodes per cluster).
- The "right" mix depends on model size, batch size, and interconnect.
DynamICCL relevance: Sec. 2.1's GPU material and Sec. 2.2's communication discussion are foundational. NCCL is the layer that makes scale-out work for GPU clusters; DynamICCL tunes that layer.
3. A Reference Architecture for Distributed Machine Learning
3.1 Machine Learning Algorithms
Three orthogonal taxonomies:
- Feedback type: supervised, unsupervised, semi-supervised, reinforcement.
- Purpose: anomaly detection, classification, clustering, regression, dimensionality reduction.
- Method: evolutionary algorithms, SGD-based methods, SVMs, ANNs, RNNs, decision trees.
3.2 Hyperparameter Optimization
- Outer loop over the training process: learning rate, batch size, architecture choices.
- Naturally distributable (embarrassingly parallel across configurations) but expensive at scale.
3.3 Combining Multiple Algorithms: Ensemble Methods
- Bagging, boosting, stacking — each can be trivially distributed because ensemble members train independently.
3.4 Topologies
Four topology classes (Figure 3 in the paper):
(a) Centralized (b) Tree (all-reduce-style)
Aggregator root
/ | \ / \
W W W W W
/ \ / \
W W W W
(c) Parameter Server (d) Peer-to-Peer
PS1 PS2 PS3 W -- W -- W
| \ / | / | | X | X |
W W W W W -- W -- W
- (a) Centralized: one aggregator combines independently trained models; hierarchical ensembling.
- (b) Tree: workers communicate up-and-down a tree; the natural shape of reduction collectives. All-reduce is described here as accumulating gradients up the tree and broadcasting back down.
- (c) Parameter Server (PS): sharded master nodes hold the global model as a key-value store; workers push updates and pull current parameters.
- (d) Peer-to-Peer: every node holds a full copy of parameters; no central point of failure. Examples: Sufficient Factor Broadcasting (SFB), gossip learning.
DynamICCL relevance (high): Tree (b) is exactly the family of topologies NCCL implements as Ring and Tree algorithms. (c) PS and (d) gossip are out of scope.
3.5 Communication
3.5.1 Computation Time vs. Communication vs. Accuracy
- Three-way trade-off: more synchronization improves accuracy but increases communication; more compute per step amortizes communication but slows iteration; lossy compression cuts communication but degrades accuracy.
3.5.2 Bridging Computation and Communication — synchronization models
| Model | Behavior | Pros | Cons |
|---|---|---|---|
| BSP (Bulk Synchronous Parallel) | global barrier each step | guaranteed correctness, simple | straggler-bound |
| SSP (Stale Synchronous Parallel) | fastest worker may be at most s steps ahead of slowest | bounded error, faster than BSP | tuning s; complex |
| ASP (Approximate Synchronous Parallel) | sync only when parameter delta is significant | adaptive, skips unimportant updates | hard to bound error |
| BAP/TAP (Barrierless / Total Async) | no barriers at all | fastest | error grows with delay; convergence risk |
3.5.3 Communication Strategies
- Continuous communication: updates streamed as they are produced.
- Wait-Free Backpropagation (WFBP): during backward pass, send gradients for already-computed (top) layers while still computing lower layers — classic compute/communication overlap.
- HybComm (Hybrid Communication): dynamically switch between PS and SFB per parameter based on sparsity.
- Gradient compression: 1-bit SGD in CNTK quantizes each gradient to a single bit with error feedback.
DynamICCL relevance: BSP is DynamICCL's regime. WFBP-style overlap is already done by NCCL+framework integration; DynamICCL's per-collective tuning is a complementary lever. Compression and async modes are out of scope.
4. The Distributed Machine Learning Ecosystem
4.1 General Purpose Distributed Computing Frameworks
- Apache Hadoop: MapReduce + HDFS. Foundational but disk-bound; not suitable for iterative ML.
- Apache Spark: RDDs, MLlib; in-memory iterative compute fixes Hadoop's iteration cost. Used for classical ML at scale; not the dominant choice for deep learning.
DynamICCL relevance: out of scope (wrong abstraction layer).
4.2 Natively Distributed Machine Learning Systems
4.2.1 Distributed Ensemble Learning
- Embarrassingly parallel ensemble members; simplest form of distribution.
4.2.2 Parallel Synchronous SGD
- Baidu's all-reduce SGD: ring all-reduce on top of MPI; brought HPC-style collectives to deep learning.
- Horovod: Uber's open-source library; uses MPI + NVIDIA NCCL to perform all-reduce across multi-GPU multi-node clusters. Headline framework for sync DDL.
- Caffe2: uses NCCL and Facebook's Gloo library for collectives.
- CNTK: Microsoft; pioneered 1-bit SGD for gradient compression.
DynamICCL relevance (very high): these are precisely the frameworks under which DynamICCL operates. NCCL is named explicitly as the GPU collective backend. DynamICCL tunes the layer Horovod and Caffe2 call into.
4.2.3 Parallel Asynchronous SGD and Parameter Servers
- DistBelief (Google): introduced Downpour SGD (async PS) and Distributed L-BFGS.
- TensorFlow: evolution of DistBelief; dataflow graph with placement on CPUs/GPUs/TPUs; supports both PS and all-reduce paradigms.
- MXNet: KVStore abstraction; supports sync/async PS modes; flexible imperative + symbolic graphs.
DynamICCL relevance: out of scope for the async PS variants. TensorFlow's all-reduce mode is in scope (it uses NCCL underneath).
4.2.4 Parallel Stale-Synchronous SGD
- Petuum: SSP execution model with bounded staleness; dynamic scheduling for very large models that don't fit on one node.
DynamICCL relevance: out of scope.
4.2.5 Parallel Hybrid-Synchronous SGD
- MXNet-MPI: combines MPI all-reduce within groups and PS across groups.
DynamICCL relevance: the all-reduce sub-component is in scope; the PS component is not.
4.3 Machine Learning in the Cloud
- AWS SageMaker, Google Cloud AI, Azure ML are commoditizing distributed ML.
- The cloud is the dominant deployment model for non-hyperscaler users.
- Frameworks compete on managed infrastructure as much as on algorithms.
DynamICCL relevance: orthogonal — DynamICCL targets bare-metal HPC (Chameleon Cloud) but the same NCCL tuning applies in cloud GPU instances that expose NCCL.
5. Conclusions and Current Challenges
5.1 Performance
- Communication remains the dominant performance lever at scale.
- Compute-communication overlap, gradient compression, topology-aware collective design, and synchronization-model relaxation are all active fronts.
- Cross-framework benchmarking at scale (10k+ nodes) is largely missing.
DynamICCL relevance: directly motivates DynamICCL — the survey identifies the unsolved problem DynamICCL addresses.
5.2 Fault Tolerance
- Synchronous all-reduce systems (MPI/NCCL) lack production-grade fault tolerance: a single failed node stalls the collective.
- Async PS systems are naturally more resilient because workers can drop in and out.
- Open problem: fault-tolerant all-reduce.
DynamICCL relevance: flag — DynamICCL inherits NCCL's fault-tolerance limitations.
5.3 Privacy (Federated Learning)
- Federated learning trains across mutually-distrusting devices without centralizing data.
- Differential privacy was a hoped-for solution; Hitaj et al. showed GAN-based attacks recover record-level info despite record-level DP.
- Most production frameworks have no built-in privacy primitives.
DynamICCL relevance: out of scope.
5.4 Portability
- Models are typically locked to their training framework.
- ONNX is a partial answer.
- Cross-framework portability of trained models, training infrastructure, and datasets remain open.
DynamICCL relevance: out of scope.
Tables / Figures
- Figure 1: Conceptual view: training (optimization + hyperparameters) vs. prediction (inference).
- Figure 2: Side-by-side data parallelism (replicated model, split data) vs. model parallelism (replicated data, split model).
- Figure 3: Topology taxonomy: (a) centralized, (b) tree, (c) parameter server, (d) peer-to-peer.
- Figure 4: Distributed ML ecosystem Venn: cloud, general-purpose frameworks, natively distributed systems.
Survey-Level Limitations (acknowledged or evident)
- 2020 cutoff — misses Megatron-LM, ZeRO, modern pipeline parallelism (GPipe is current; later schemes are not), PowerSGD-class compression refinements, modern federated systems.
- Pipeline parallelism gets shallow treatment relative to data/model.
- No quantitative cross-framework benchmarking.
- Benchmarking depth at >1k-node scale is acknowledged as missing.
Scope Filter for DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects, per collective call, the tuple (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on a synchronous data-parallel GPU cluster. Mapping the survey's design space to DynamICCL:
| Survey axis | DynamICCL's choice | In/Out of scope |
|---|---|---|
| Parallelism | Data-parallel | In scope |
| Topology | Tree / all-reduce (NCCL Ring, Tree) | In scope |
| Synchronization | BSP | In scope |
| Communication strategy | NCCL collectives (per-call config) | In scope; this is the lever |
| Compression | None (DynamICCL doesn't compress) | Out of scope |
| Framework layer | Horovod / PyTorch DDP / TF | Out of scope (above DynamICCL) |
| Hardware | GPU clusters with NVLink + IB/Ethernet | In scope |
| Cloud vs. bare-metal | Bare-metal Chameleon Cloud | Orthogonal |
In scope for DynamICCL (cite directly):
- Sec. 3.4 tree / all-reduce topology — the family DynamICCL tunes.
- Sec. 3.5.1 compute-vs-communication trade-off — DynamICCL's reward function operationalizes this.
- Sec. 3.5.3 WFBP and HybComm — dynamic strategy switching at the gradient layer; DynamICCL does the same idea at the NCCL config layer.
- Sec. 4.2.2 Horovod, Caffe2 — explicitly use NCCL; the production environment DynamICCL plugs into.
- Sec. 5.1 performance challenge — survey-identified open problem that DynamICCL directly addresses.
Out of scope for DynamICCL (cite for boundary-setting):
- Federated learning (Sec. 5.3).
- Peer-to-peer / gossip / SFB (Sec. 3.4(d)).
- Asynchronous parameter servers (Sec. 4.2.3, DistBelief Downpour SGD).
- Stale-synchronous parallel (Sec. 4.2.4, Petuum).
- General-purpose frameworks (Sec. 4.1, Hadoop, Spark).
- Gradient compression (1-bit SGD).
- Privacy / portability (Sec. 5.3, 5.4).
Recommended use of this survey in the DynamICCL paper: cite once in the Background section to anchor the design space, then point at the specific cell DynamICCL occupies. The survey is too broad to be a direct comparison target but is ideal for explaining what DynamICCL is not doing and why that is a deliberate scoping choice.