A Survey on Distributed Machine Learning

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer | TU Delft + imec/Ghent | ACM Computing Surveys 53(2), Article 30, 2020

Problem

Modern ML training data and model sizes have outpaced single-machine compute, making distribution mandatory. The literature, however, is fragmented across many sub-communities — HPC, parameter-server systems, federated learning, gossip protocols, ensemble learning, accelerator design — each with its own vocabulary, trade-offs, and assumed system model. There was no comprehensive map covering how parallelism, topology, synchronization, and communication strategy compose into the design space of a distributed ML system, nor how the major production frameworks (TensorFlow, MXNet, Horovod, Caffe2, Petuum) instantiate that space.

Core Insight

Distributed ML systems can be cleanly decomposed along four orthogonal axes — parallelism (data, model, hybrid), topology (centralized, tree, parameter server, peer-to-peer), synchronization (BSP, SSP, ASP, BAP/TAP), and communication strategy (WFBP, hybrid PS+SFB, compression) — and every existing framework is an instantiation of choices along these axes.

Method

This is a survey, not a system. The authors construct a reference architecture that enumerates the design dimensions of distributed ML and then place every major framework into it. They proceed bottom-up:

Hardware (Sec. 2): scaling up via GPUs, TPUs, FPGAs, DianNao, AVX-512 — bandwidth and arithmetic-density limits motivate scaling out.
Algorithm taxonomy (Sec. 3.1–3.3): by feedback (sup/unsup/RL), by purpose (classification, clustering, anomaly), by method (SGD, SVM, ANN, RNN, evolutionary), plus hyperparameter optimization and ensembles.
Topology (Sec. 3.4): centralized aggregator, tree (all-reduce-like), parameter server (sharded KV master), peer-to-peer (Sufficient Factor Broadcasting, gossip).
Synchronization (Sec. 3.5.2): BSP (barrier per step), SSP (bounded staleness), ASP (bounded parameter error), BAP/TAP (no barrier).
Communication strategy (Sec. 3.5.3): Wait-Free Backpropagation overlaps gradient send with backward pass; HybComm switches between PS and SFB based on sparsity; 1-bit SGD compresses gradients.
Ecosystem (Sec. 4): general-purpose (Hadoop, Spark/MLlib) vs. native distributed ML (Horovod+NCCL, Caffe2+NCCL+Gloo, CNTK, DistBelief/Downpour SGD, TensorFlow, MXNet, Petuum, MXNet-MPI).
Open challenges (Sec. 5): performance, fault tolerance, privacy (federated learning, GAN-based attacks on differential privacy), portability (ONNX).

Results

The survey produces no measured numbers of its own; it summarizes the landscape. Key qualitative conclusions:

Scaling out beats scaling up at scale, but communication is the dominant cost; bridging compute and communication (overlap, scheduling, compression) is where most engineering effort goes.
BSP is correct but straggler-bound; ASP/BAP is fast but risks divergence; SSP is the practical middle ground.
All-reduce (NCCL/MPI) excels in tightly-coupled GPU clusters but lacks production fault tolerance; parameter servers are more naturally resilient but add a stateful tier.
Cloud (SageMaker, Google Cloud AI, Azure ML) is commoditizing distributed ML, and frameworks are converging toward the same middle ground regardless of their starting point (general-purpose vs. single-machine).
Federated/private learning is open: record-level differential privacy was shown breakable by GAN-based attacks (Hitaj et al.), and most production frameworks ship without built-in privacy primitives.

Limitations

Survey is from 2020; misses post-2020 developments — large-scale pipeline-parallel systems (GPipe was current; Megatron-LM and ZeRO get limited or no treatment), modern federated systems, gradient-compression refinements (PowerSGD, etc.), and recent NCCL evolution.
Coverage of pipeline parallelism is thin compared to data and model parallelism.
Quantitative cross-framework benchmarking is acknowledged as an open problem the survey itself does not provide.
Benchmarks reviewed are mostly at the few-hundred-node scale; behavior at DistBelief-scale (10k+ nodes) is largely unmapped.

Relevance to DynamICCL

DynamICCL operates inside one specific cell of this survey's design space: data-parallel, all-reduce topology, BSP synchronization, NCCL communication on tightly-coupled GPU clusters. Most of the survey is therefore out of scope, but the in-scope material is highly relevant.

In scope (directly relevant):

Sec. 3.4 tree/all-reduce topology — describes the family DynamICCL tunes; NCCL Ring and Tree algorithms are concrete instances of these topology classes.
Sec. 3.5.1 compute-vs-communication trade-off — exactly DynamICCL's optimization objective. The survey frames the trade-off generically; DynamICCL operationalizes it via per-collective config selection.
Sec. 3.5.3 Wait-Free Backpropagation and HybComm — overlap and strategy-switching ideas worth adopting at the NCCL configuration level (e.g., switching algorithm/protocol per collective, analogous to switching between PS and SFB per parameter).
Sec. 4.2.2 Horovod (MPI + NCCL all-reduce) and Caffe2 (NCCL + Gloo) — the production frameworks DynamICCL tunes underneath. The survey confirms NCCL is the de facto GPU collective layer.
Sec. 5.1 performance challenge — the survey explicitly flags communication tuning as the dominant unsolved performance lever, validating DynamICCL's premise.

Out of scope (DynamICCL does not address):

Federated learning (Sec. 5.3) — cross-device, non-IID, privacy-driven; different system model entirely.
Peer-to-peer / gossip protocols (Sec. 3.4) — Sufficient Factor Broadcasting and gossip target loosely-coupled or WAN settings.
Asynchronous parameter servers (Sec. 4.2.3, DistBelief Downpour SGD, MXNet KVStore async mode) — DynamICCL targets synchronous all-reduce.
Stale-synchronous parallel (Sec. 4.2.4, Petuum) — a different correctness regime; orthogonal to DynamICCL's scope.
General-purpose frameworks (Sec. 4.1, Hadoop, Spark) — wrong abstraction layer.

Take-away framing for the DynamICCL writeup: cite this survey to position the work as targeting one specific axis of the design space (NCCL configuration within data-parallel synchronous all-reduce), and note that prior work has surveyed the broader space without addressing per-collective parameter selection.