Architecture & Design-Space Analysis
A Survey on Distributed Machine Learning
Source: Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer — Delft University of Technology / Ghent University, ACM Computing Surveys Vol. 53, No. 2, Article 30, March 2020 (33 pp., DOI 10.1145/3377454). Analyst: Vishwakarma Date: 2026-04-28
Table of Contents
- Survey Scope and Organizing Frame
- Master Taxonomy Tree (paper-grounded)
- The Reference Architecture (Section 3 — three-layer model)
- Axis A — ML Algorithm Family (Section 3.1)
- Axis B — Parallelism Strategy (Section 3, Fig 2)
- Axis C — Topology / Architecture (Section 3.4, Fig 3)
- Axis D — Bridging Sync Model (Section 3.5.2)
- Axis E — Communication Strategy (Section 3.5.3)
- The Distributed ML Ecosystem (Section 4, Fig 4)
- In-Scope vs Out-of-Scope Partition for DynamICCL
- Per-Axis Trade-off Tables
- Cross-Axis Design-Space Matrix
- What to Borrow for DynamICCL
- Summary Table of Borrowed Patterns
- Analogy
1. Survey Scope and Organizing Frame
The paper's load-bearing claim, stated in Section 1 and crystallized in Section 3 plus Fig. 2 and Fig. 3, is that distributed machine learning systems can be characterized along three architectural layers (machine-learning algorithm, parallelism, topology) bound together by a fourth concern — communication. The authors explicitly state these layers are not independent: their combining factor is "the amount of communication required to train the model" (end of Section 3, just before §3.1).
┌────────────────────────────────────────────────────────────────────┐
│ The reference architecture (Section 3, Fig. 1 and Fig. 2) │
│ │
│ Layer 1: ML ALGORITHM │
│ {feedback type, purpose, method} │
│ "what is being learned, by what optimizer?" │
│ │
│ Layer 2: PARALLELISM │
│ {Data-Parallel, Model-Parallel, Ensembling} │
│ "what gets split across machines?" │
│ │
│ Layer 3: TOPOLOGY │
│ {Centralized, Tree, Parameter Server, P2P} │
│ "how are nodes connected?" │
│ │
│ Cross-cutting: COMMUNICATION (Section 3.5) │
│ sync regime {BSP, SSP, ASP, BAP/TAP} │
│ comm strategy {WFBP, HybComm, SFB, continuous} │
└────────────────────────────────────────────────────────────────────┘
▲ Fig 1: The three-layer reference architecture the paper proposes
in Section 3, plus the cross-cutting communication concern. Every
surveyed system instantiates a specific choice on each layer.
DynamICCL Agent-2 currently selects (algo, proto, nChannels, numThreads) within a single combination of these layers — namely (SGD-based DNN, Data-Parallel, Tree/Ring, BSP, WFBP). The survey reveals that decisions made in Layers 1-3 (and on the sync axis) generate the workload that Agent-2 must serve.
Important honesty note on scope: unlike the more recent communication-efficient DDL surveys (e.g., 0025), this 2020 survey does not name pipeline parallelism or tensor parallelism as top-level categories — it only distinguishes Data-Parallel and Model-Parallel (Section 3, Fig. 2), with DistBelief's coarse-grain model partitioning as the canonical model-parallel example. It also does not treat federated learning as a first-class topology; McMahan's FedAvg is cited only once, in §5.3 Privacy. Compression is mentioned only as 1-bit SGD and block-momentum SGD inside the CNTK system description (§4.2.2). I will not invent coverage the paper does not provide.
2. Master Taxonomy Tree (paper-grounded)
Distributed Machine Learning (Survey 0029)
|
+----------------------+----------------------+----------+
| | | |
Layer 1 Layer 2 Layer 3 Section 5
ML Algorithm Parallelism Topology Challenges
(Sec 3.1) (Sec 3, Fig 2) (Sec 3.4, (perf, FT,
Fig 3) privacy,
portability)
| | |
+---+---+ +---+---+ +--------+------+
| | | | | | |
Feedback Method Data- Model- Centralized PS P2P
+Purpose +(EA,SGD, Parallel Parallel (ensembl- (sharded (Gossip
(super, RBML,TM, ing / master) Learning,
unsuper, MF) tree) SFB)
semi, |
RL) v
AllReduce
(ring,
recursive
halving-doubling,
hierarchical)
Cross-cutting Layer 4: COMMUNICATION
+----------------+----------------+
| | |
Sync Model Comm Strategy Hardware Concerns
(Sec 3.5.2) (Sec 3.5.3) (Sec 2: scale-up
GPU/TPU/ASIC vs
scale-out cluster)
| |
BSP / SSP / WFBP / HybComm /
ASP / BAP-TAP Continuous / SFB
Section 4: ECOSYSTEM (Fig 4)
+----------------------+-----------------------+---------------+
| | | |
General-purpose Native distributed Cloud ML Single-
(Hadoop, Spark, (Caffe2, CNTK, DistBelief,(GCP, Azure, machine
MapReduce, Flink) DIANNE, Tensorflow, AWS Sage- libs
MXNet, Petuum, Horovod, Maker, IBM (Theano,
Baidu AllReduce, MXNet- Watson) Caffe,
MPI, DMTK) Sklearn,
MLPack,
NCCL/
Tensor-
flow link)
Section 5: ChALLENGES (open problems)
+-------------+-------------+-------------+-------------+
| | | | |
Performance Fault Privacy Portability (no separate
(efficiency Tolerance (DP, FedAvg, (ONNX, Core section but
vs wall- (AllReduce GAN attacks) ML, NNEF, sub-implied)
clock, lacks FT, x86 vs ARM
SDG bench- ASP toler- vs ASIC)
marks miss ates strag-
large-scale) glers)
▲ Fig 2: Master taxonomy reconstructed strictly from Sections 2-5
and Figures 1-4. Every leaf is an algorithm, system, or concept
the paper actually cites. Items the paper does NOT cover (pipeline
parallelism, tensor parallelism, ZeRO, Top-k sparsification,
QSGD, gossip-SGD theory) are deliberately omitted.
3. The Reference Architecture (Section 3 — three-layer model)
┌─────────────────────────────────────────────────────────────────┐
│ Verbraeken et al. reference stack (Section 3) │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Layer 1: ML ALGORITHM │ │
│ │ feedback ∈ {supervised, unsupervised, semi-, RL} │ │
│ │ purpose ∈ {anomaly, classify, cluster, dim-reduce, │ │
│ │ representation, regression} │ │
│ │ method ∈ {EA, SGD-based, RBML, Topic-Model, │ │
│ │ Matrix-Factorization} │ │
│ │ sub-cat (DNN family): DNN, CNN, RNN, Hopfield, SOM, │ │
│ │ stochastic NN, autoencoder, GAN │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ defines comm pattern │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Layer 2: PARALLELISM (Fig. 2) │ │
│ │ Data-Parallel: same model M on disjoint data D^(i) │ │
│ │ Model-Parallel: shard M across nodes; same data D │ │
│ │ (Ensembling: independent models, late aggregate; │ │
│ │ covered separately in §3.3) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ defines who-talks-to-whom │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Layer 3: TOPOLOGY (Fig. 3) │ │
│ │ (a) Centralized (ensembling root) │ │
│ │ (b) Decentralized — Tree (AllReduce up + broadcast) │ │
│ │ (c) Decentralized — Parameter Server (sharded master) │ │
│ │ (d) Fully Distributed — P2P (Gossip / SFB) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ binds to wire-level transport │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Layer 4 (cross-cutting): COMMUNICATION (§3.5) │ │
│ │ sync model: BSP / SSP / ASP / BAP-TAP │ │
│ │ overlap: WFBP, HybComm, continuous comm │ │
│ │ payload: SFB (factorized), 1-bit SGD │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Hardware substrate (§2): GPU/TPU/ASIC scale-up vs │ │
│ │ commodity HPC scale-out via MPI/InfiniBand/Ethernet │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
▲ Fig 3: The reference architecture as a top-down stack. Each layer
constrains the layer below: the ML algorithm dictates what data
must move; the parallelism strategy dictates how often; the
topology dictates the path; the communication choice dictates
the wire format. DynamICCL operates *inside* Layer 4, parameter-
ising NCCL collectives that implement Layer 3's tree/AllReduce.
The paper's framing places NCCL — and therefore DynamICCL — inside Layer 4 (Communication), specifically realising the decentralized tree topology (Fig. 3b) under BSP with WFBP overlap. The authors explicitly link NCCL to this position in Section 4.2.2: "Horovod uses the NVIDIA Collective Communications Library (NCCL) for increased efficiency... when training on (Nvidia) GPUs." and in §4.2.2 on Caffe2: "Caffe2 distributes ML through, once again, AllReduce algorithms. It does this by using NCCL between GPUs...".
4. Axis A — ML Algorithm Family (Section 3.1)
The paper defines an algorithm by three orthogonal characteristics: feedback × purpose × method (§3.1).
FEEDBACK PURPOSE METHOD
──────── ─────── ──────
Supervised Anomaly detection Evolutionary (EA, GA)
Unsupervised Classification SGD-based
Semi-supervised Clustering (SVM, Perceptron, ANN,
Reinforcement Dim. reduction DNN, CNN, RNN, Hopfield,
Representation SOM, Stochastic NN,
Regression Autoencoder, GAN)
Rule-Based ML (RBML)
(Association rules, CART)
Topic Models
(LDA, LSA/LSI, Naïve Bayes,
PLSA/PLSI)
Matrix Factorization
▲ Fig 4: Axis A — the ML-algorithm cube from §3.1. DNN training is
one cell: supervised + classification/regression + SGD-based ANN.
This is the cell DynamICCL targets; collective patterns differ
for other cells (e.g., LDA/MF have different traffic structure).
The crucial architectural takeaway from §3.1 is that SGD is the common case that drives the vast majority of distributed-ML communication patterns the rest of the survey addresses. RBML, EA, TM and MF have qualitatively different communication shapes — but the paper still routes them through the same Layer-2/3/4 stack. For DynamICCL, the input distribution Agent-2 sees is dominated by SGD-based DNN training (Section 4.2 systems), but the agent should not break for matrix-factorization or topic-model jobs that occasionally share the cluster.
5. Axis B — Parallelism Strategy (Section 3, Fig 2)
The paper's parallelism axis has only two top-level branches plus ensembling. It is narrower than the 0025 survey's 4-axis taxonomy.
Parallelism (§3, Fig. 2)
│
┌────────────────────┼────────────────────┐
│ │ │
DATA-PARALLEL MODEL-PARALLEL ENSEMBLING
(most common) (rare; needed (independent
for huge models) models, late
aggregate)
│ │ │
D^(1)→ M D → M^(1) D^(1)→ M_1 ┐
D^(2)→ M → AllReduce D → M^(2) D^(2)→ M_2 │ vote/
D^(3)→ M / SUM / D → M^(3) D^(3)→ M_3 │ avg /
avg (one shared D; ... │ stack
sub-models ┘
communicate at (§3.3 also covers
graph cut points; bagging, boosting,
DistBelief is bucketing, RF,
the canonical stacking, LCS)
example)
Reqts: i.i.d. data Reqts: model has Reqts: data can be
over samples; full decomposable partitioned without
model fits in node compute-graph biasing trained models
with sparse cross
dependencies
▲ Fig 5: Axis B — parallelism types as the survey defines them.
Note: pipeline parallelism, tensor parallelism, and ZeRO/sharded
optimizer state are NOT named in this 2020 survey. Sufficient
Factor Broadcasting (SFB) appears separately under topology §3.4
as a P2P comm-reduction technique.
DistBelief (§4.2.3) is presented as the model-parallel reference: it "defines a model as a computation graph where each node implements an operation" and "locally connected models lend themselves better for model-parallelism because of limited cross-partition communication". This is a precursor to what later papers call tensor parallelism, but the survey does not formalize it as a separate axis.
6. Axis C — Topology / Architecture (Section 3.4, Fig 3)
Fig. 3 of paper — the four canonical topologies
────────────────────────────────────────────────
(a) CENTRALIZED (b) DECENTRALIZED — TREE
(ensembling root) (AllReduce up; broadcast down)
[Trained model] [ML node]
▲ ▲ ▲
[Ensembling] [aggregate] [aggregate]
▲ ▲ ▲ ▲ ▲ ▲ ▲
[n] [n] [n] [n][n] [n] [n] [n]
│ │ │ │ │ │ │ │
data data data data data data data data
(c) DECENTRALIZED — PARAMETER SERVER (d) FULLY DISTRIBUTED — P2P
[PS] -- [PS] -- [PS] [n]──[n]
▲ ▲ ▲ \ │ \
│ │ │ [n]──[n]──[n]
[n] [n] [n] │
│ │ │ [n]
data data data
(sharded model; (each node owns its own model copy;
workers pull/push gossip / random walks; SFB to
per-shard params) reduce broadcast volume)
▲ Fig 6: Section 3.4 / Fig. 3. Four topologies organized by *degree
of distribution*: from single-root ensembling to full P2P. NCCL
occupies (b) — decentralized tree under AllReduce. The survey
treats P2P / Gossip as a peer first-class option.
The paper's Section 3.4 enumerates the topology primitives:
Trees: scale, simple parent/child links, AllReduce-friendly
Rings: when broadcast is unsupported; minimal comm overhead;
"commonly used between multiple GPUs on the same
machine" (§3.4) ← directly names the NCCL ring case
Parameter
Server: sharded master; workers read/write KV-store (Mu Li
et al. [92,93,94] cited)
Peer-to-Peer: each node owns its own copy; full broadcast
expensive; SFB ([94 — Xie '15]) decomposes parameter
matrix into two factors; Gossip Learning ([112])
does random-walk model exchange
The bandwidth/latency trade-off is not given as big-O cost formulas in this survey (unlike 0025 Table 6). The paper handles costs qualitatively through Section 3.5.1 ("Computation Time vs. Communication vs. Accuracy") — the trade-off is named but not formalised.
7. Axis D — Bridging Sync Model (Section 3.5.2)
Section 3.5.2 enumerates four sync regimes, ordered by relaxation of the consistency constraint:
Synchronization spectrum (§3.5.2)
─────────────────────────────────────────────────────────────────
BSP (Bulk Synchronous Parallel)
Workers synchronize between every comp+comm phase.
Cited example: MapReduce as a BSP engine ([161] Xing et al.).
Pro: serializable; provably correct convergence.
Con: slowest worker stalls everyone (§3.5.2 calls out [34]).
SSP (Stale Synchronous Parallel)
Faster workers may move ahead by ≤ s iterations; barrier
fires only when staleness bound is hit.
Cited example: Petuum (§4.2.4) — "uses stale synchronicity to
exploit inherent tolerance of ML against errors".
Pro: strong convergence guarantees retained.
Con: convergence degrades sharply if staleness too large.
ASP (Approximate Synchronous Parallel)
Limits how INACCURATE a parameter can be (vs SSP which limits
STALENESS). Server delays sync until aggregated update is
significant.
Pro: indefinite delay possible.
Con: hard to set the "significance" threshold ([73] cited).
BAP / TAP (Barrierless / Total Asynchronous Parallel)
Workers communicate without waiting; highest possible speedup.
Pro: best wall-clock on heterogeneous clusters.
Con: error grows with delay; convergence not guaranteed
([65] Han & Daudjee on Giraph Unchained).
▲ Fig 7: The four sync regimes from §3.5.2, listed top-to-bottom in
decreasing strictness. The paper places this trade-off as
"fast/correct convergence (top) vs faster/fresher updates (bottom)".
The paper's Section 3.5.3 then enumerates communication strategies that overlap with sync rather than replace it:
Continuous communication: prevent bursts after a mapper finishes
(Bösen [155] cited as state of the art)
WFBP (Wait-free Backprop): top-layer gradients sent during lower-
layer compute ([171] Poseidon)
HybComm: combine PS + SFB depending on tensor sparsity ([171])
SFB-style payload reduction: send factorized form, not full update
8. Axis E — Communication Strategy (Section 3.5.3)
Although the paper folds this into §3.5, it is a distinct architectural concern. Strategy choices the survey calls out:
Strategy | Mechanism | Cited in
─────────────────────|───────────────────────────────────|──────────
Continuous comm | Spread comm over time to avoid | Bösen [155]
| bursts after map phase |
WFBP | Send top-layer grads during | Poseidon
| bottom-layer backprop compute | [171]
HybComm | Switch PS↔SFB by tensor sparsity | Poseidon
| | [171]
1-bit SGD | Quantize gradients to one bit per | CNTK
| value; reduces comm volume | (§4.2.2,
| | Seide [130])
Block-momentum SGD | Per-block split + average + apply | CNTK
| block-level momentum | (§4.2.2,
| | [31])
SFB | Decompose parameter matrix into | §3.4 P2P,
| two sufficient-factor vectors | [94 Xie'15]
▲ Fig 8: Communication-strategy menu the paper lists. This is
shorter than the dedicated 0025 compression survey because 0029
treats compression as a system-specific feature, not as a first-
class axis. Still, the four overlap/payload tactics are explicit.
9. The Distributed ML Ecosystem (Section 4, Fig 4)
┌───────────────────────────────────────────────────────────────┐
│ Distributed Machine Learning Ecosystem │
│ (Fig. 4) │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ General-Purpose│ ← Mahout / MLlib → │ Single-Machine │ │
│ │ Distributed │ │ ML Systems │ │
│ │ Frameworks │ │ │ │
│ │ │ │ - Theano │ │
│ │ - Apache Hadoop│ │ - Caffe │ │
│ │ - Apache Spark │ │ - Sklearn │ │
│ │ - Apache Flink │ │ - MLPack │ │
│ └───────┬────────┘ │ - NVIDIA libs │ │
│ │ ┌──────────────┐ └───────┬────────┘ │
│ │ Hadoop+ │ Natively │ Keras/ │ │
│ │ Spark+ │ Distributed │ NCCL link │ │
│ └──AllReduce─► ML Systems ◄────────────────── │ │
│ │ │ │
│ │ - Caffe2 │ │
│ │ - CNTK │ │
│ │ - DistBelief │ │
│ │ - DIANNE │ │
│ │ - Tensorflow │ │
│ │ - MXNet │ │
│ │ - Petuum │ │
│ │ - Horovod │ │
│ │ - Baidu AR │ │
│ │ - DMTK │ │
│ │ - MXNet-MPI │ │
│ └──────┬───────┘ │
│ │ │
│ ┌──────▼───────┐ │
│ │ Cloud ML │ │
│ │ │ │
│ │ - Google CAI │ │
│ │ - Azure ML │ │
│ │ - AWS Sage- │ │
│ │ Maker │ │
│ │ - IBM Watson │ │
│ └──────────────┘ │
└───────────────────────────────────────────────────────────────┘
▲ Fig 9: Section 4 ecosystem map. Three flows feed into native
distributed-ML systems: general-purpose Hadoop/Spark via
Mahout/MLlib; single-machine libraries via NCCL/Keras bindings;
cloud-as-a-service consuming all of the above. NCCL is named
explicitly as the interconnect linking single-machine GPU libs
into native distributed-ML systems (Caffe2, Horovod-on-TF).
The survey catalogs each native system by its (parallelism × topology × sync) signature:
System | Parallelism | Topology | Sync | Section
──────────────────┼──────────────┼──────────────┼──────────┼────────
Distributed | data | centralized | n/a | §4.2.1
Ensemble | | | |
Baidu AllReduce | data | tree (ring) | BSP | §4.2.2
Horovod (NCCL) | data | tree (ring + | BSP | §4.2.2
| | recursive | |
| | halve-dbl) | |
Caffe2 (NCCL+Gloo)| data | ring | BSP | §4.2.2
CNTK | data | ring + 1-bit | BSP | §4.2.2
| | SGD opt. | |
DistBelief | data + model | parameter | ASP | §4.2.3
| | server | (Down- |
| | | pour) |
DIANNE | model + data | PS | ASP | §4.2.3
Tensorflow | data + model | PS or AR | flexible | §4.2.3
| | | |
MXNet (KVStore) | data | PS | ASP | §4.2.3
DMTK Multiverso | data | PS | ASP | §4.2.3
Petuum | data + model | PS + SSP | SSP | §4.2.4
| | | (bounded |
| | | stale) |
MXNet-MPI | data | hybrid AR+PS | hybrid | §4.2.5
▲ Fig 10: Cross-tabulation of native distributed-ML systems by
their three-axis signature. NCCL appears as the AR transport in
Horovod (§4.2.2) and Caffe2 (§4.2.2). DynamICCL acts inside the
AR row.
10. In-Scope vs Out-of-Scope Partition for DynamICCL
This is the most important diagram for a DynamICCL architect: where in the survey's design space does Agent-2 actually live, and what should it ignore?
┌──────────────────────────────────────────────────────────────────┐
│ DynamICCL Scope Partition over Survey 0029 │
│ │
│ ╔════════════════════════════════════════════════════════════╗ │
│ ║ IN-SCOPE FOR AGENT-2 ║ │
│ ║ ║ │
│ ║ Layer 4: COMMUNICATION (Section 3.5) ║ │
│ ║ ► Sync = BSP (the only regime AllReduce supports) ║ │
│ ║ ► Comm strategy: WFBP overlap (default in PT-DDP) ║ │
│ ║ ║ │
│ ║ Layer 3: TOPOLOGY (Section 3.4, Fig 3b) ║ │
│ ║ ► Decentralized — Tree ║ │
│ ║ ► Tree primitive = NCCL `algo=tree` ║ │
│ ║ ► Ring primitive = NCCL `algo=ring` ║ │
│ ║ ► Hierarchical AR = NCCL `algo=collnet*` ║ │
│ ║ ║ │
│ ║ NCCL knobs Agent-2 controls today: ║ │
│ ║ algo ∈ {ring, tree, collnet_chain, collnet_direct, ║ │
│ ║ nvls, nvls_tree} ║ │
│ ║ proto ∈ {LL, LL128, Simple} ║ │
│ ║ nChannels ∈ {1..32} ║ │
│ ║ numThreads (default 1024) ║ │
│ ╚════════════════════════════════════════════════════════════╝ │
│ │
│ ╔════════════════════════════════════════════════════════════╗ │
│ ║ IN-SCOPE AS STATE FEATURES ONLY ║ │
│ ║ (observe; do not control) ║ │
│ ║ ║ │
│ ║ Layer 1 ML Algorithm §3.1 ║ │
│ ║ s_method ∈ {SGD, EA, RBML, TM, MF} ║ │
│ ║ s_dnn_family ∈ {DNN, CNN, RNN, GAN, ...} ║ │
│ ║ → controls msg-size distribution & frequency ║ │
│ ║ ║ │
│ ║ Layer 2 Parallelism §3, Fig 2 ║ │
│ ║ s_parallel ∈ {data, model, ensembling, hybrid} ║ │
│ ║ → defines whether AllReduce is even invoked ║ │
│ ║ ║ │
│ ║ Layer 3 Topology §3.4 ║ │
│ ║ s_arch ∈ {centralized, tree, PS, P2P} ║ │
│ ║ → if PS or P2P, NCCL AllReduce isn't the right ║ │
│ ║ primitive; agent should defer or no-op ║ │
│ ║ ║ │
│ ║ Layer 4 Sync model §3.5.2 ║ │
│ ║ s_sync ∈ {BSP, SSP, ASP, BAP/TAP} ║ │
│ ║ → SSP/ASP/TAP imply non-AR transport at framework lvl ║ │
│ ║ ║ │
│ ║ Comm strategy §3.5.3 ║ │
│ ║ s_overlap ∈ {WFBP, HybComm, continuous} ║ │
│ ║ → WFBP shapes msg-size mix ║ │
│ ║ ║ │
│ ║ Hardware substrate §2 ║ │
│ ║ s_compute ∈ {GPU, TPU, ASIC, CPU/AVX-512} ║ │
│ ║ s_topo_phys ∈ {NVLink-only, NVL+IB, IB-only, ║ │
│ ║ RoCE, TCP/IP} ║ │
│ ║ → SHM/NVLink vs IB drastically changes α and β ║ │
│ ╚════════════════════════════════════════════════════════════╝ │
│ │
│ ╔════════════════════════════════════════════════════════════╗ │
│ ║ OUT-OF-SCOPE ║ │
│ ║ (controlled by application/framework, not NCCL) ║ │
│ ║ ║ │
│ ║ Choosing data-parallel vs model-parallel (§3, Fig 2)║ │
│ ║ Choosing topology family (PS vs AR vs P2P) (§3.4) ║ │
│ ║ Choosing sync regime (BSP vs SSP vs ASP) (§3.5.2) ║ │
│ ║ Choosing comm strategy (WFBP vs continuous) (§3.5.3) ║ │
│ ║ 1-bit SGD / block-momentum SGD (§4.2.2) ║ │
│ ║ SFB / Sufficient Factor Broadcasting (§3.4 P2P)║ │
│ ║ Ensembling aggregation logic (§3.3) ║ │
│ ║ Hyperparameter optimization (§3.2) ║ │
│ ║ Fault tolerance protocol (checkpoint, replay) (§5.2) ║ │
│ ║ Privacy mechanisms (DP, FedAvg, GAN attacks) (§5.3) ║ │
│ ║ Model portability (ONNX, Core ML, NNEF) (§5.4) ║ │
│ ║ Storage layer (GFS/HDFS) and scheduler (§4.1) ║ │
│ ║ Hardware substrate selection (GPU vs TPU vs (§2) ║ │
│ ║ ASIC) ║ │
│ ╚════════════════════════════════════════════════════════════╝ │
└──────────────────────────────────────────────────────────────────┘
▲ Fig 11: Three-zone partition. Agent-2 ACTS on the inner box; it
READS the middle box as state features; it MUST NOT try to control
the outer box. The boundary between the inner and middle box is
the NCCL plugin API surface (`getCollInfo`, `tuneCollInfo`).
The middle ring is the architecturally interesting one — these are the features Agent-2 needs to OBSERVE to make its inner-box decisions correctly, but cannot CHANGE.
11. Per-Axis Trade-off Tables
11.1 Parallelism strategy trade-off (Section 3, Fig 2)
| Property | Data-Parallel | Model-Parallel | Ensembling | Winner (DynamICCL) |
|---|---|---|---|---|
| Per-iteration comm volume | one AR per step | low (graph cuts) | none until end | Data-Parallel |
| Memory per node | full model | partial model | full model | Model-Parallel |
| Algorithmic constraint | i.i.d. data | decomposable G | i.i.d. partition | Data-Parallel |
| AR-friendly | Yes | Partial | No | Data-Parallel |
| Fault tolerance | low (barrier) | low | high (independ) | Ensembling |
| Operating regime for NCCL | primary | rare | n/a | Data-Parallel |
For DynamICCL, the Data-Parallel cell is the home cell. Model-Parallel generates much smaller, sparser collectives at unpredictable points in the graph; ensembling generates one collective at the end of training. Agent-2's training distribution comes overwhelmingly from data-parallel SGD.
11.2 Topology trade-off (Section 3.4, Fig 3)
| Dimension | Centralized | Tree (AR) | Parameter Server | P2P (Gossip) | Winner (DynamICCL) |
|---|---|---|---|---|---|
| Single point of failure | Yes (root) | None | PS shard | None | Tree / P2P |
| Easy to scale | No | Yes | Yes (shard) | Yes | Tree |
| Sync-mode flexibility | n/a | BSP only | All four | All except SSP | PS |
| Bandwidth efficiency | low | high (ring) | medium (server) | low (broadcast) | Tree |
| Latency for small msg | n/a | log N (tree) | 2 RTT (push/pull) | O(diameter) | Tree |
| NCCL-applicable | No | YES | No | No | Tree |
| Heterogeneity tolerance | n/a | Low | High | High | PS |
The survey itself draws these conclusions (§3.4 + §5.2): "Synchronous AllReduce-based approaches seem to scale significantly better than the parameter server approach (up to a certain cluster size), but suffer from a lack of fault-tolerance". This is exactly the AR-vs-PS trade-off Agent-2 inherits as a constraint, not a choice.
11.3 Sync model trade-off (Section 3.5.2)
| Property | BSP | SSP | ASP | BAP / TAP | Winner (DynamICCL) |
|---|---|---|---|---|---|
| Convergence guarantee | provable | provable s≤bound | weak (heuristic) | none guaranteed | BSP |
| Wall-clock on heterogeneous | poor | better | best (no wait) | best | ASP/TAP |
| Compatible with AllReduce | YES | partial | NO | NO | BSP |
| Straggler sensitivity | high | bounded | none | none | TAP |
| Cited example system | MapReduce | Petuum (§4.2.4) | Downpour (§4.2.3) | Giraph Unchained | — |
| State-feature for Agent-2 | observe | observe | observe | observe | s_sync |
For DynamICCL, BSP is the operating regime — and the only one compatible with AllReduce. SSP/ASP/TAP are out-of-scope as actions but in-scope as state features (the agent must know if its caller has switched to PS+ASP and the AllReduce path is therefore being bypassed).
11.4 Communication-strategy trade-off (Section 3.5.3)
| Property | Continuous | WFBP | HybComm | 1-bit SGD | SFB | Winner (DynamICCL) |
|---|---|---|---|---|---|---|
| Hides comm behind compute | partial | yes (deep) | yes | no | no | WFBP |
| Reduces payload volume | no | no | no | 32× | factorized | 1-bit SGD |
| Standard in PT-DDP | partial | YES | no | no | no | WFBP |
| Affects msg-size mix | smooths | many small | mixed | 32× smaller | small | observe |
| Layer of implementation | framework | framework | framework | caller | caller | (all out-of-scope) |
| Effect on Agent-2 | smooths | shapes mix | shapes mix | shrinks msg | shrinks msg | feature: msg_size |
All of these techniques live above NCCL. Agent-2 cannot enable or disable them — but their presence shows up as the effective message size distribution the agent has to optimise over.
11.5 Hardware substrate trade-off (Section 2)
| Dimension | GPU (Volta/Turing) | TPU | ASIC (DianNao) | CPU AVX-512 | Winner (DynamICCL) |
|---|---|---|---|---|---|
| Branch divergence cost | high (SIMD) | low (MIMD) | low (fixed func) | low | TPU/ASIC |
| Peak throughput | 47× CPU [107] | 70× CPU [128] | 1000× CPU [32] | 1× baseline | ASIC |
| Energy efficiency | medium | 200×/W vs CPU | 20× vs DianNao | low | ASIC |
| NCCL availability | YES (NVLink+IB) | partial (TF) | no | no | GPU |
| Operating regime today | dominant for DL | Google-internal | niche | niche | GPU |
For DynamICCL, GPU is the substrate. The agent's policy is keyed to the (NVLink, IB, RoCE) transport hierarchy that GPU clusters use. The survey catalogs other substrates, but they are out-of-scope for Agent-2 because they do not run NCCL.
12. Cross-Axis Design-Space Matrix
A reconstructed (parallelism × topology × sync) matrix from Section 4's system catalog:
│ BSP │ SSP │ ASP/TAP
─────────────────┼──────────────────┼──────────────────┼────────────────
Centralized │ (ensembling │ — │ —
(root) │ training) │ │
─────────────────┼──────────────────┼──────────────────┼────────────────
Tree / AllReduce │ Baidu AR │ (rarely paired) │ INCOMPATIBLE
│ Horovod (NCCL) │ │ (AR requires
│ Caffe2 (NCCL) │ │ barrier)
│ CNTK (ring + │ │
│ 1bit SGD) │ │
│ ◄ DynamICCL ► │ │
─────────────────┼──────────────────┼──────────────────┼────────────────
Parameter │ (rare; usually │ Petuum │ DistBelief
Server │ async) │ │ Downpour
│ │ │ TF (PS mode)
│ │ │ MXNet KVStore
│ │ │ DMTK Multiverso
─────────────────┼──────────────────┼──────────────────┼────────────────
P2P (Gossip) │ SFB-broadcast │ — │ Gossip Learning
│ (synchronous │ │ (random walk)
│ p2p) │ │
─────────────────┴──────────────────┴──────────────────┴────────────────
Hybrid │ MXNet-MPI: AR within group, PS across groups
│ Tensorflow: configurable both modes
▲ Fig 12: Cross-axis matrix populated from Section 4's system list.
AR + ASP is incompatible (the survey is consistent with 0025 here).
DynamICCL's home cell is bold-italic: AR + BSP. Petuum populates
the unique PS+SSP cell that the rest of the literature largely
ignores. The right column (ASP) is exclusively PS or P2P.
13. What to Borrow for DynamICCL
Agent-2 selects (algo, proto, nChannels, numThreads). The survey's value to DynamICCL is in its system-level observations about where collective communication sits in the broader stack. The borrows below are tighter than the 0025 borrows because this survey is broader and shallower; many of its lessons are about what NOT to absorb into Agent-2.
13.1 Three-layer reference architecture as state-vector schema
Survey's pattern (§3, Fig. 1+2+3). Distributed-ML systems decompose into Layer 1 (algorithm) × Layer 2 (parallelism) × Layer 3 (topology), bound by Layer 4 (communication).
DynamICCL application. Agent-2's state vector should mirror this schema with one slot per layer:
s = ( s_algo_method = SGD | EA | RBML | TM | MF # L1
, s_dnn_family = DNN | CNN | RNN | GAN | ... # L1
, s_parallel = data | model | ensembling | hybrid # L2
, s_arch = centralized | tree | PS | P2P # L3
, s_sync = BSP | SSP | ASP | TAP # L4
, s_overlap = none | WFBP | HybComm | continuous # L4
, s_payload_compress = none | 1bit-SGD | block-momentum |
SFB | other # L4
, s_topo_phys = NVLink-only | NVL+IB | IB-only |
RoCE | TCP # HW
, s_compute = GPU | TPU | ASIC | CPU # HW
, s_msg_size_log = log_2(effective_bytes) # derived
, s_history_lstm_h = LSTM hidden state of past collectives # derived
)
The schema is paper-grounded: every slot maps to a concrete section
in the survey. The agent's policy gates on (s_arch, s_sync)
to verify it is even in the AR + BSP cell before emitting any NCCL
config; otherwise it falls through to a no-op.
13.2 NCCL is the AR transport — no expansion of action space
Survey's pattern (§4.2.2). NCCL is named only as the AllReduce transport in Horovod, Caffe2, and (via NCCL 2.0) cross-node DDP. The survey does NOT propose any framework-level knobs that NCCL itself exposes — the AllReduce primitive is treated as monolithic.
DynamICCL application. The action space stays at (algo, proto, nChannels, numThreads). The survey provides no evidence that adding a fifth or sixth NCCL-internal action (e.g., chunk-size, aggregation factor) would map to a surveyed technique. This is the opposite finding to 0025 (HiCCL/HiCCL-style hierarchical decomposition would expand the action space) and is consistent with this survey being broader and less NCCL-specific.
13.3 BSP is the only sync regime AllReduce supports — gate on it
Survey's pattern (§3.5.2 + §4.2.2). The four sync regimes are mutually exclusive. AR is only compatible with BSP. SSP is Petuum's unique cell (PS-based). ASP belongs to DistBelief / TF-PS / MXNet KVStore (PS-based). TAP belongs to Giraph Unchained (graph-engine).
DynamICCL application. Agent-2's first decision rule should be:
if s_sync ≠ BSP:
return NO_OP (the framework is using PS or P2P; NCCL AR
is not on the critical path)
if s_arch ≠ tree:
return NO_OP (collective primitive is not AllReduce)
if s_parallel = ensembling:
return NO_OP (no per-step collective; one collective at end)
Only after these gates does the agent emit an (algo, proto, nCh, nT) selection. This prevents wasting policy capacity on cells the agent does not control.
13.4 Effective message size dominates over nominal tensor size
Survey's pattern (§3.5.3 + §4.2.2). 1-bit SGD (Seide et al. [130] in CNTK) and SFB ([94] Xie '15) both shrink the AllReduce payload without changing the tensor's nominal shape. WFBP ([171] Poseidon) shifts the per-collective size distribution toward many small messages.
DynamICCL application. Agent-2's
s_msg_size_log feature must be the
post-compression, post-WFBP-fusion
byte count, not the raw tensor count. A 1 GB tensor compressed by 1-bit
SGD is a 31 MB collective; the right algorithm for 31 MB (likely
tree+LL128) is different from the right algorithm for 1 GB (likely
ring+Simple). This is exactly the same pattern borrowed from 0025
§11.4.1, and this survey reinforces it with two additional concrete
techniques (1-bit SGD and SFB) that produce the same effect.
13.5 Topology is exogenous — encode it as an ordinal feature
Survey's pattern (§3.4). Topology is a deployment-time decision ($A) the cluster admin makes, not an inference-time choice. The paper enumerates four canonical topologies (centralized, tree, PS, P2P).
DynamICCL application. s_arch is set
once per training job and remains constant across all of Agent-2's
collectives in that job. It is a cluster-level feature, not a
per-collective feature. The same applies to s_topo_phys
(NVLink-only vs NVL+IB vs IB-only). Both should be passed as one-hot
inputs at agent initialization and held constant; only
s_msg_size_log, s_history_lstm_h, and
s_congestion_signal vary per collective.
13.6 Performance benchmarks generalize poorly above ~hundreds of GPUs
Survey's pattern (§5.1). "...most of these benchmarks test at most a few hundred machines, whereas the scale at which, e.g., DistBelief, is demonstrated, can be two orders of magnitude larger." The survey's own conclusion is that the published evaluation regime of distributed-ML papers does not reflect production scale.
DynamICCL application. Agent-2's training data must include multi-node IB-RDMA traces, not just NVLink-only intra-node traces. The Pensieve borrow ("FCC + HSDPA + Verizon traces for generalization") is reinforced here: train on at least three cluster topologies (NVLink-only DGX, NVL+IB DGX-A100 cluster, IB-only HPC) and use the topology feature to discriminate. Single-cluster training will not transfer.
13.7 Fault tolerance is the unsolved problem — Agent-2 must
notice failures, not cause them
Survey's pattern (§5.2). "Common implementations of these HPC- inspired patterns, such as MPI and NCCL, lack fault-tolerance completely... Failure of a single machine blocks the entire training process." AllReduce is a barrier; one rank's hung collective hangs all ranks.
DynamICCL application. Two design constraints follow:
1. Never select a config that requires more channels than the
cluster supports — a config that fails at NCCL init kills all
ranks.
2. The agent's congestion detector (Phase-2 Trigger Agent) must
distinguish "slow" from "hung". A hung rank is not a congestion
signal; it is a fault that requires job restart, not a config
change.
This is a negative borrow: the agent must not interpret long
collective times during a hang as "this config is bad" and switch
configs, because no config will work after a rank is dead. The state
machine must include a FAULTED terminal state that
suppresses all agent action until the job framework restarts.
13.8 Privacy and DP / GAN attacks are out-of-scope
Survey's pattern (§5.3). Federated learning, differential privacy, GAN-based reconstruction attacks (Hitaj et al. [71]) are application- layer privacy concerns. The agent does not see plaintext gradients and cannot influence privacy properties.
DynamICCL application. Agent-2 must not log gradient values, only sizes and latencies. This is a soft architectural constraint that keeps the agent privacy-neutral: it observes the AllReduce envelope (bytes-in, time-out) but never the contents.
13.9 Portability — the ONNX precedent for Agent-2 deployment
Survey's pattern (§5.4). ONNX, Core ML, NNEF are framework- independent model interchange formats. The survey identifies portability as a real constraint: a model trained in TF cannot be deployed in PyTorch without conversion.
DynamICCL application. The Agent-2 policy network
should be serialized in a portable format (ONNX or TorchScript) so that
the same policy weights work whether the NCCL plugin is loaded by
PyTorch DDP, Horovod, or DeepSpeed. The plugin's input pipeline
(building s from getCollInfo) is the only
framework-specific shim; the policy itself is portable.
13.10 Cloud delivery model — Agent-2 as a "managed tuner"
Survey's pattern (§4.3). "The cloud-based delivery model is becoming more important, as it reduces the burden of entry into designing smart applications that facilitate machine learning techniques." SageMaker, Azure ML, GCP all wrap distributed-ML behind managed APIs.
DynamICCL application. A long-term deployment story for DynamICCL is as a hosted policy server: the cluster's NCCL plugin queries a central Agent-2 instance over a thin RPC, returning (algo, proto, nCh, nT). This separates the policy update path (managed, centralized, retrained nightly) from the inference path (per-cluster, local, low-latency). The Pensieve server-side deployment pattern maps directly. Caveat: the per-collective RPC latency must remain below the smallest collective's completion time, which is ~10 µs intra-node — likely too tight for a remote server. So the realistic deployment is a local plugin with periodic policy pulls, not per-collective remote inference.
14. Summary Table of Borrowed Patterns
| Pattern | Survey origin | DynamICCL application |
|---|---|---|
| 3-layer + 1 cross-cutting reference arch | §3, Figs 1-3 | State-vector schema with one slot per layer |
| AR + BSP is the only valid cell | §3.5.2 + Fig 12 | Gate-rule: emit no-op outside (AR, BSP, data-parallel) cell |
| Topology and parallelism are exogenous | §3.4, §3 Fig 2 | s_arch, s_parallel as init-time constants, not per-call inputs |
| Effective msg size dominates | §3.5.3, §4.2.2 | s_msg_size = post-compression, post-WFBP-fusion byte count |
| AllReduce lacks fault tolerance | §5.2 | FAULTED state suppresses agent action; never confuse hang w/ slow |
| Benchmarks underestimate large scale | §5.1 | Train policy on ≥3 cluster topologies (NVL, NVL+IB, IB-only) |
| NCCL is named transport, no new knobs | §4.2.2 | Action space stays at (algo, proto, nCh, nT) — no expansion |
| Privacy and DP are application-layer | §5.3 | Agent logs sizes/latencies only, never gradient contents |
| Portability via interchange formats | §5.4 | Serialize Agent-2 policy as ONNX/TorchScript; framework-agnostic |
| Cloud / managed delivery | §4.3 | Long-term: hosted policy server + local inference |
| Hardware substrate selection | §2 | s_compute, s_topo_phys as one-hot init features |
| ML-algo family shapes traffic | §3.1 | s_algo_method, s_dnn_family as policy-conditioning context |
Analogy
The survey's three-layer reference architecture (algorithm × parallelism × topology) plus the cross-cutting communication concern is structurally a modern factory floor. Layer 1 is what product is being made (the ML algorithm — phones, cars, or pharmaceuticals each demand different production lines). Layer 2 is how the work is split among workers (data-parallel = each worker assembles complete units from their slice of incoming materials; model-parallel = an assembly line where each worker performs one specialized step on a single unit moving past them; ensembling = each worker builds a complete competing prototype, judges decide which wins). Layer 3 is the factory's communication topology — hierarchical foreman tree, centralized whiteboard with a clerk (parameter server), or peer talkback radios (P2P). Layer 4 is the shop-floor protocol — strict shift-bell synchronization (BSP), "finish your task within 5 minutes of others" (SSP), "post your update whenever you're done" (ASP), or "shout it across the room" (TAP).
Agent-2's role is the shop-floor PLC controller that decides which conveyor belt speed and which packaging line to use for each batch leaving Layer 3. It does not choose what to build, how to split the work, or which topology to install — those are the plant manager's decisions made before the shift starts. It only optimises the throughput of one specific subsystem (the AllReduce conveyor) given everything else is fixed. The survey's contribution is the floor plan that tells the controller exactly which subsystem it owns and which it must not touch.