Architecture & Design-Space Analysis

A Survey on Distributed Machine Learning

Source: Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer — Delft University of Technology / Ghent University, ACM Computing Surveys Vol. 53, No. 2, Article 30, March 2020 (33 pp., DOI 10.1145/3377454). Analyst: Vishwakarma Date: 2026-04-28


Table of Contents

  1. Survey Scope and Organizing Frame
  2. Master Taxonomy Tree (paper-grounded)
  3. The Reference Architecture (Section 3 — three-layer model)
  4. Axis A — ML Algorithm Family (Section 3.1)
  5. Axis B — Parallelism Strategy (Section 3, Fig 2)
  6. Axis C — Topology / Architecture (Section 3.4, Fig 3)
  7. Axis D — Bridging Sync Model (Section 3.5.2)
  8. Axis E — Communication Strategy (Section 3.5.3)
  9. The Distributed ML Ecosystem (Section 4, Fig 4)
  10. In-Scope vs Out-of-Scope Partition for DynamICCL
  11. Per-Axis Trade-off Tables
  12. Cross-Axis Design-Space Matrix
  13. What to Borrow for DynamICCL
  14. Summary Table of Borrowed Patterns
  15. Analogy

1. Survey Scope and Organizing Frame

The paper's load-bearing claim, stated in Section 1 and crystallized in Section 3 plus Fig. 2 and Fig. 3, is that distributed machine learning systems can be characterized along three architectural layers (machine-learning algorithm, parallelism, topology) bound together by a fourth concern — communication. The authors explicitly state these layers are not independent: their combining factor is "the amount of communication required to train the model" (end of Section 3, just before §3.1).

┌────────────────────────────────────────────────────────────────────┐
│      The reference architecture (Section 3, Fig. 1 and Fig. 2)     │
│                                                                    │
│   Layer 1: ML ALGORITHM                                            │
│            {feedback type, purpose, method}                        │
│            "what is being learned, by what optimizer?"             │
│                                                                    │
│   Layer 2: PARALLELISM                                             │
│            {Data-Parallel, Model-Parallel, Ensembling}             │
│            "what gets split across machines?"                      │
│                                                                    │
│   Layer 3: TOPOLOGY                                                │
│            {Centralized, Tree, Parameter Server, P2P}              │
│            "how are nodes connected?"                              │
│                                                                    │
│   Cross-cutting: COMMUNICATION (Section 3.5)                       │
│            sync regime {BSP, SSP, ASP, BAP/TAP}                    │
│            comm strategy {WFBP, HybComm, SFB, continuous}          │
└────────────────────────────────────────────────────────────────────┘
▲ Fig 1: The three-layer reference architecture the paper proposes
  in Section 3, plus the cross-cutting communication concern. Every
  surveyed system instantiates a specific choice on each layer.

DynamICCL Agent-2 currently selects (algo, proto, nChannels, numThreads) within a single combination of these layers — namely (SGD-based DNN, Data-Parallel, Tree/Ring, BSP, WFBP). The survey reveals that decisions made in Layers 1-3 (and on the sync axis) generate the workload that Agent-2 must serve.

Important honesty note on scope: unlike the more recent communication-efficient DDL surveys (e.g., 0025), this 2020 survey does not name pipeline parallelism or tensor parallelism as top-level categories — it only distinguishes Data-Parallel and Model-Parallel (Section 3, Fig. 2), with DistBelief's coarse-grain model partitioning as the canonical model-parallel example. It also does not treat federated learning as a first-class topology; McMahan's FedAvg is cited only once, in §5.3 Privacy. Compression is mentioned only as 1-bit SGD and block-momentum SGD inside the CNTK system description (§4.2.2). I will not invent coverage the paper does not provide.


2. Master Taxonomy Tree (paper-grounded)

              Distributed Machine Learning (Survey 0029)
                              |
       +----------------------+----------------------+----------+
       |                      |                      |          |
   Layer 1               Layer 2                Layer 3      Section 5
   ML Algorithm          Parallelism            Topology     Challenges
   (Sec 3.1)             (Sec 3, Fig 2)         (Sec 3.4,    (perf, FT,
                                                 Fig 3)       privacy,
                                                              portability)
       |                      |                      |
   +---+---+              +---+---+         +--------+------+
   |       |              |       |         |        |       |
 Feedback Method        Data-   Model-   Centralized  PS    P2P
 +Purpose +(EA,SGD,    Parallel Parallel (ensembl-  (sharded (Gossip
 (super,  RBML,TM,                         ing /     master) Learning,
  unsuper, MF)                              tree)             SFB)
  semi,                                        |
  RL)                                          v
                                          AllReduce
                                          (ring,
                                           recursive
                                           halving-doubling,
                                           hierarchical)

                Cross-cutting Layer 4: COMMUNICATION
                +----------------+----------------+
                |                |                |
            Sync Model      Comm Strategy    Hardware Concerns
            (Sec 3.5.2)     (Sec 3.5.3)      (Sec 2: scale-up
                                              GPU/TPU/ASIC vs
                                              scale-out cluster)
                |                |
        BSP / SSP /          WFBP / HybComm /
        ASP / BAP-TAP        Continuous / SFB

                       Section 4: ECOSYSTEM (Fig 4)
   +----------------------+-----------------------+---------------+
   |                      |                       |               |
General-purpose      Native distributed        Cloud ML        Single-
(Hadoop, Spark,      (Caffe2, CNTK, DistBelief,(GCP, Azure,    machine
 MapReduce, Flink)    DIANNE, Tensorflow,       AWS Sage-      libs
                      MXNet, Petuum, Horovod,   Maker, IBM     (Theano,
                      Baidu AllReduce, MXNet-   Watson)         Caffe,
                      MPI, DMTK)                                Sklearn,
                                                                MLPack,
                                                                NCCL/
                                                                Tensor-
                                                                flow link)

                Section 5: ChALLENGES (open problems)
   +-------------+-------------+-------------+-------------+
   |             |             |             |             |
Performance   Fault         Privacy      Portability    (no separate
(efficiency  Tolerance     (DP, FedAvg,  (ONNX, Core    section but
 vs wall-    (AllReduce    GAN attacks)  ML, NNEF,       sub-implied)
 clock,       lacks FT,                   x86 vs ARM
 SDG bench-   ASP toler-                  vs ASIC)
 marks miss   ates strag-
 large-scale)  glers)
▲ Fig 2: Master taxonomy reconstructed strictly from Sections 2-5
  and Figures 1-4. Every leaf is an algorithm, system, or concept
  the paper actually cites. Items the paper does NOT cover (pipeline
  parallelism, tensor parallelism, ZeRO, Top-k sparsification,
  QSGD, gossip-SGD theory) are deliberately omitted.

3. The Reference Architecture (Section 3 — three-layer model)

  ┌─────────────────────────────────────────────────────────────────┐
  │          Verbraeken et al. reference stack (Section 3)          │
  │                                                                 │
  │  ┌──────────────────────────────────────────────────────────┐   │
  │  │  Layer 1: ML ALGORITHM                                   │   │
  │  │   feedback ∈ {supervised, unsupervised, semi-, RL}       │   │
  │  │   purpose  ∈ {anomaly, classify, cluster, dim-reduce,    │   │
  │  │              representation, regression}                  │   │
  │  │   method   ∈ {EA, SGD-based, RBML, Topic-Model,          │   │
  │  │              Matrix-Factorization}                        │   │
  │  │   sub-cat (DNN family): DNN, CNN, RNN, Hopfield, SOM,     │   │
  │  │              stochastic NN, autoencoder, GAN              │   │
  │  └──────────────────────────────────────────────────────────┘   │
  │                              │                                  │
  │                              ▼ defines comm pattern             │
  │  ┌──────────────────────────────────────────────────────────┐   │
  │  │  Layer 2: PARALLELISM (Fig. 2)                           │   │
  │  │    Data-Parallel: same model M on disjoint data D^(i)    │   │
  │  │    Model-Parallel: shard M across nodes; same data D     │   │
  │  │    (Ensembling: independent models, late aggregate;      │   │
  │  │     covered separately in §3.3)                          │   │
  │  └──────────────────────────────────────────────────────────┘   │
  │                              │                                  │
  │                              ▼ defines who-talks-to-whom        │
  │  ┌──────────────────────────────────────────────────────────┐   │
  │  │  Layer 3: TOPOLOGY (Fig. 3)                              │   │
  │  │    (a) Centralized (ensembling root)                     │   │
  │  │    (b) Decentralized — Tree (AllReduce up + broadcast)   │   │
  │  │    (c) Decentralized — Parameter Server (sharded master) │   │
  │  │    (d) Fully Distributed — P2P (Gossip / SFB)            │   │
  │  └──────────────────────────────────────────────────────────┘   │
  │                              │                                  │
  │                              ▼ binds to wire-level transport    │
  │  ┌──────────────────────────────────────────────────────────┐   │
  │  │  Layer 4 (cross-cutting): COMMUNICATION (§3.5)           │   │
  │  │    sync model: BSP / SSP / ASP / BAP-TAP                 │   │
  │  │    overlap:    WFBP, HybComm, continuous comm            │   │
  │  │    payload:    SFB (factorized), 1-bit SGD               │   │
  │  └──────────────────────────────────────────────────────────┘   │
  │                              │                                  │
  │                              ▼                                  │
  │  ┌──────────────────────────────────────────────────────────┐   │
  │  │  Hardware substrate (§2): GPU/TPU/ASIC scale-up vs        │   │
  │  │  commodity HPC scale-out via MPI/InfiniBand/Ethernet      │   │
  │  └──────────────────────────────────────────────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
▲ Fig 3: The reference architecture as a top-down stack. Each layer
  constrains the layer below: the ML algorithm dictates what data
  must move; the parallelism strategy dictates how often; the
  topology dictates the path; the communication choice dictates
  the wire format. DynamICCL operates *inside* Layer 4, parameter-
  ising NCCL collectives that implement Layer 3's tree/AllReduce.

The paper's framing places NCCL — and therefore DynamICCL — inside Layer 4 (Communication), specifically realising the decentralized tree topology (Fig. 3b) under BSP with WFBP overlap. The authors explicitly link NCCL to this position in Section 4.2.2: "Horovod uses the NVIDIA Collective Communications Library (NCCL) for increased efficiency... when training on (Nvidia) GPUs." and in §4.2.2 on Caffe2: "Caffe2 distributes ML through, once again, AllReduce algorithms. It does this by using NCCL between GPUs...".


4. Axis A — ML Algorithm Family (Section 3.1)

The paper defines an algorithm by three orthogonal characteristics: feedback × purpose × method (§3.1).

   FEEDBACK              PURPOSE                METHOD
   ────────              ───────                ──────
   Supervised            Anomaly detection      Evolutionary (EA, GA)
   Unsupervised          Classification         SGD-based
   Semi-supervised       Clustering             (SVM, Perceptron, ANN,
   Reinforcement         Dim. reduction          DNN, CNN, RNN, Hopfield,
                         Representation          SOM, Stochastic NN,
                         Regression              Autoencoder, GAN)
                                                Rule-Based ML (RBML)
                                                 (Association rules, CART)
                                                Topic Models
                                                 (LDA, LSA/LSI, Naïve Bayes,
                                                  PLSA/PLSI)
                                                Matrix Factorization
▲ Fig 4: Axis A — the ML-algorithm cube from §3.1. DNN training is
  one cell: supervised + classification/regression + SGD-based ANN.
  This is the cell DynamICCL targets; collective patterns differ
  for other cells (e.g., LDA/MF have different traffic structure).

The crucial architectural takeaway from §3.1 is that SGD is the common case that drives the vast majority of distributed-ML communication patterns the rest of the survey addresses. RBML, EA, TM and MF have qualitatively different communication shapes — but the paper still routes them through the same Layer-2/3/4 stack. For DynamICCL, the input distribution Agent-2 sees is dominated by SGD-based DNN training (Section 4.2 systems), but the agent should not break for matrix-factorization or topic-model jobs that occasionally share the cluster.


5. Axis B — Parallelism Strategy (Section 3, Fig 2)

The paper's parallelism axis has only two top-level branches plus ensembling. It is narrower than the 0025 survey's 4-axis taxonomy.

                    Parallelism (§3, Fig. 2)
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
   DATA-PARALLEL        MODEL-PARALLEL       ENSEMBLING
   (most common)        (rare; needed        (independent
                        for huge models)     models, late
                                              aggregate)
        │                    │                    │
   D^(1)→ M                  D → M^(1)           D^(1)→ M_1 ┐
   D^(2)→ M  → AllReduce     D → M^(2)           D^(2)→ M_2 │ vote/
   D^(3)→ M  / SUM /         D → M^(3)           D^(3)→ M_3 │ avg /
            avg              (one shared D;     ...         │ stack
                              sub-models                    ┘
                              communicate at    (§3.3 also covers
                              graph cut points; bagging, boosting,
                              DistBelief is     bucketing, RF,
                              the canonical     stacking, LCS)
                              example)
   Reqts: i.i.d. data         Reqts: model has  Reqts: data can be
   over samples; full         decomposable      partitioned without
   model fits in node         compute-graph     biasing trained models
                              with sparse cross
                              dependencies
▲ Fig 5: Axis B — parallelism types as the survey defines them.
  Note: pipeline parallelism, tensor parallelism, and ZeRO/sharded
  optimizer state are NOT named in this 2020 survey. Sufficient
  Factor Broadcasting (SFB) appears separately under topology §3.4
  as a P2P comm-reduction technique.

DistBelief (§4.2.3) is presented as the model-parallel reference: it "defines a model as a computation graph where each node implements an operation" and "locally connected models lend themselves better for model-parallelism because of limited cross-partition communication". This is a precursor to what later papers call tensor parallelism, but the survey does not formalize it as a separate axis.


6. Axis C — Topology / Architecture (Section 3.4, Fig 3)

   Fig. 3 of paper — the four canonical topologies
   ────────────────────────────────────────────────

  (a) CENTRALIZED                  (b) DECENTRALIZED — TREE
     (ensembling root)                (AllReduce up; broadcast down)

         [Trained model]                    [ML node]
              ▲                            ▲          ▲
           [Ensembling]              [aggregate]   [aggregate]
          ▲     ▲     ▲                ▲    ▲       ▲    ▲
        [n] [n] [n]               [n][n]  [n]    [n] [n]
         │   │   │                  │ │    │      │   │
       data data data              data data data data data

  (c) DECENTRALIZED — PARAMETER SERVER   (d) FULLY DISTRIBUTED — P2P

        [PS] -- [PS] -- [PS]                   [n]──[n]
         ▲      ▲      ▲                         \   │   \
         │      │      │                       [n]──[n]──[n]
        [n]    [n]    [n]                            │
         │      │      │                            [n]
        data   data   data
        (sharded model;            (each node owns its own model copy;
         workers pull/push          gossip / random walks; SFB to
         per-shard params)          reduce broadcast volume)
▲ Fig 6: Section 3.4 / Fig. 3. Four topologies organized by *degree
  of distribution*: from single-root ensembling to full P2P. NCCL
  occupies (b) — decentralized tree under AllReduce. The survey
  treats P2P / Gossip as a peer first-class option.

The paper's Section 3.4 enumerates the topology primitives:

  Trees:        scale, simple parent/child links, AllReduce-friendly
  Rings:        when broadcast is unsupported; minimal comm overhead;
                "commonly used between multiple GPUs on the same
                machine" (§3.4) ← directly names the NCCL ring case
  Parameter
   Server:      sharded master; workers read/write KV-store (Mu Li
                et al. [92,93,94] cited)
  Peer-to-Peer: each node owns its own copy; full broadcast
                expensive; SFB ([94 — Xie '15]) decomposes parameter
                matrix into two factors; Gossip Learning ([112])
                does random-walk model exchange

The bandwidth/latency trade-off is not given as big-O cost formulas in this survey (unlike 0025 Table 6). The paper handles costs qualitatively through Section 3.5.1 ("Computation Time vs. Communication vs. Accuracy") — the trade-off is named but not formalised.


7. Axis D — Bridging Sync Model (Section 3.5.2)

Section 3.5.2 enumerates four sync regimes, ordered by relaxation of the consistency constraint:

   Synchronization spectrum (§3.5.2)
   ─────────────────────────────────────────────────────────────────

   BSP (Bulk Synchronous Parallel)
     Workers synchronize between every comp+comm phase.
     Cited example: MapReduce as a BSP engine ([161] Xing et al.).
     Pro: serializable; provably correct convergence.
     Con: slowest worker stalls everyone (§3.5.2 calls out [34]).

   SSP (Stale Synchronous Parallel)
     Faster workers may move ahead by ≤ s iterations; barrier
     fires only when staleness bound is hit.
     Cited example: Petuum (§4.2.4) — "uses stale synchronicity to
     exploit inherent tolerance of ML against errors".
     Pro: strong convergence guarantees retained.
     Con: convergence degrades sharply if staleness too large.

   ASP (Approximate Synchronous Parallel)
     Limits how INACCURATE a parameter can be (vs SSP which limits
     STALENESS). Server delays sync until aggregated update is
     significant.
     Pro: indefinite delay possible.
     Con: hard to set the "significance" threshold ([73] cited).

   BAP / TAP (Barrierless / Total Asynchronous Parallel)
     Workers communicate without waiting; highest possible speedup.
     Pro: best wall-clock on heterogeneous clusters.
     Con: error grows with delay; convergence not guaranteed
     ([65] Han & Daudjee on Giraph Unchained).
▲ Fig 7: The four sync regimes from §3.5.2, listed top-to-bottom in
  decreasing strictness. The paper places this trade-off as
  "fast/correct convergence (top) vs faster/fresher updates (bottom)".

The paper's Section 3.5.3 then enumerates communication strategies that overlap with sync rather than replace it:

  Continuous communication: prevent bursts after a mapper finishes
                            (Bösen [155] cited as state of the art)
  WFBP (Wait-free Backprop): top-layer gradients sent during lower-
                             layer compute ([171] Poseidon)
  HybComm:   combine PS + SFB depending on tensor sparsity ([171])
  SFB-style payload reduction: send factorized form, not full update

8. Axis E — Communication Strategy (Section 3.5.3)

Although the paper folds this into §3.5, it is a distinct architectural concern. Strategy choices the survey calls out:

Strategy             | Mechanism                         | Cited in
─────────────────────|───────────────────────────────────|──────────
Continuous comm      | Spread comm over time to avoid    | Bösen [155]
                     | bursts after map phase            |
WFBP                 | Send top-layer grads during       | Poseidon
                     | bottom-layer backprop compute     | [171]
HybComm              | Switch PS↔SFB by tensor sparsity  | Poseidon
                     |                                   | [171]
1-bit SGD            | Quantize gradients to one bit per | CNTK
                     | value; reduces comm volume        | (§4.2.2,
                     |                                   | Seide [130])
Block-momentum SGD   | Per-block split + average + apply | CNTK
                     | block-level momentum              | (§4.2.2,
                     |                                   | [31])
SFB                  | Decompose parameter matrix into   | §3.4 P2P,
                     | two sufficient-factor vectors     | [94 Xie'15]
▲ Fig 8: Communication-strategy menu the paper lists. This is
  shorter than the dedicated 0025 compression survey because 0029
  treats compression as a system-specific feature, not as a first-
  class axis. Still, the four overlap/payload tactics are explicit.

9. The Distributed ML Ecosystem (Section 4, Fig 4)

  ┌───────────────────────────────────────────────────────────────┐
  │              Distributed Machine Learning Ecosystem           │
  │                          (Fig. 4)                             │
  │                                                               │
  │  ┌────────────────┐                       ┌────────────────┐  │
  │  │ General-Purpose│  ← Mahout / MLlib →   │ Single-Machine │  │
  │  │ Distributed    │                       │ ML Systems     │  │
  │  │ Frameworks     │                       │                │  │
  │  │                │                       │ - Theano       │  │
  │  │ - Apache Hadoop│                       │ - Caffe        │  │
  │  │ - Apache Spark │                       │ - Sklearn      │  │
  │  │ - Apache Flink │                       │ - MLPack       │  │
  │  └───────┬────────┘                       │ - NVIDIA libs  │  │
  │          │            ┌──────────────┐    └───────┬────────┘  │
  │          │  Hadoop+   │  Natively    │   Keras/           │   │
  │          │  Spark+    │  Distributed │   NCCL link        │   │
  │          └──AllReduce─►  ML Systems  ◄──────────────────  │   │
  │                       │              │                        │
  │                       │ - Caffe2     │                        │
  │                       │ - CNTK       │                        │
  │                       │ - DistBelief │                        │
  │                       │ - DIANNE     │                        │
  │                       │ - Tensorflow │                        │
  │                       │ - MXNet      │                        │
  │                       │ - Petuum     │                        │
  │                       │ - Horovod    │                        │
  │                       │ - Baidu AR   │                        │
  │                       │ - DMTK       │                        │
  │                       │ - MXNet-MPI  │                        │
  │                       └──────┬───────┘                        │
  │                              │                                │
  │                       ┌──────▼───────┐                        │
  │                       │  Cloud ML    │                        │
  │                       │              │                        │
  │                       │ - Google CAI │                        │
  │                       │ - Azure ML   │                        │
  │                       │ - AWS Sage-  │                        │
  │                       │   Maker      │                        │
  │                       │ - IBM Watson │                        │
  │                       └──────────────┘                        │
  └───────────────────────────────────────────────────────────────┘
▲ Fig 9: Section 4 ecosystem map. Three flows feed into native
  distributed-ML systems: general-purpose Hadoop/Spark via
  Mahout/MLlib; single-machine libraries via NCCL/Keras bindings;
  cloud-as-a-service consuming all of the above. NCCL is named
  explicitly as the interconnect linking single-machine GPU libs
  into native distributed-ML systems (Caffe2, Horovod-on-TF).

The survey catalogs each native system by its (parallelism × topology × sync) signature:

System            | Parallelism  | Topology     | Sync     | Section
──────────────────┼──────────────┼──────────────┼──────────┼────────
Distributed       | data         | centralized  | n/a      | §4.2.1
 Ensemble         |              |              |          |
Baidu AllReduce   | data         | tree (ring)  | BSP      | §4.2.2
Horovod (NCCL)    | data         | tree (ring + | BSP      | §4.2.2
                  |              |   recursive  |          |
                  |              |   halve-dbl) |          |
Caffe2 (NCCL+Gloo)| data         | ring         | BSP      | §4.2.2
CNTK              | data         | ring + 1-bit | BSP      | §4.2.2
                  |              |   SGD opt.   |          |
DistBelief        | data + model | parameter    | ASP      | §4.2.3
                  |              | server       | (Down-   |
                  |              |              |  pour)   |
DIANNE            | model + data | PS           | ASP      | §4.2.3
Tensorflow        | data + model | PS or AR     | flexible | §4.2.3
                  |              |              |          |
MXNet (KVStore)   | data         | PS           | ASP      | §4.2.3
DMTK Multiverso   | data         | PS           | ASP      | §4.2.3
Petuum            | data + model | PS + SSP     | SSP      | §4.2.4
                  |              |              | (bounded |
                  |              |              |  stale)  |
MXNet-MPI         | data         | hybrid AR+PS | hybrid   | §4.2.5
▲ Fig 10: Cross-tabulation of native distributed-ML systems by
  their three-axis signature. NCCL appears as the AR transport in
  Horovod (§4.2.2) and Caffe2 (§4.2.2). DynamICCL acts inside the
  AR row.

10. In-Scope vs Out-of-Scope Partition for DynamICCL

This is the most important diagram for a DynamICCL architect: where in the survey's design space does Agent-2 actually live, and what should it ignore?

┌──────────────────────────────────────────────────────────────────┐
│           DynamICCL Scope Partition over Survey 0029             │
│                                                                  │
│  ╔════════════════════════════════════════════════════════════╗  │
│  ║                  IN-SCOPE FOR AGENT-2                      ║  │
│  ║                                                            ║  │
│  ║   Layer 4: COMMUNICATION (Section 3.5)                     ║  │
│  ║    ► Sync = BSP (the only regime AllReduce supports)       ║  │
│  ║    ► Comm strategy: WFBP overlap (default in PT-DDP)       ║  │
│  ║                                                            ║  │
│  ║   Layer 3: TOPOLOGY (Section 3.4, Fig 3b)                  ║  │
│  ║    ► Decentralized — Tree                                  ║  │
│  ║    ► Tree primitive   = NCCL `algo=tree`                   ║  │
│  ║    ► Ring  primitive  = NCCL `algo=ring`                   ║  │
│  ║    ► Hierarchical AR  = NCCL `algo=collnet*`               ║  │
│  ║                                                            ║  │
│  ║   NCCL knobs Agent-2 controls today:                       ║  │
│  ║     algo ∈ {ring, tree, collnet_chain, collnet_direct,     ║  │
│  ║             nvls, nvls_tree}                               ║  │
│  ║     proto ∈ {LL, LL128, Simple}                            ║  │
│  ║     nChannels ∈ {1..32}                                    ║  │
│  ║     numThreads (default 1024)                              ║  │
│  ╚════════════════════════════════════════════════════════════╝  │
│                                                                  │
│  ╔════════════════════════════════════════════════════════════╗  │
│  ║              IN-SCOPE AS STATE FEATURES ONLY               ║  │
│  ║              (observe; do not control)                     ║  │
│  ║                                                            ║  │
│  ║   Layer 1 ML Algorithm   §3.1                              ║  │
│  ║      s_method ∈ {SGD, EA, RBML, TM, MF}                    ║  │
│  ║      s_dnn_family ∈ {DNN, CNN, RNN, GAN, ...}              ║  │
│  ║      → controls msg-size distribution & frequency          ║  │
│  ║                                                            ║  │
│  ║   Layer 2 Parallelism    §3, Fig 2                         ║  │
│  ║      s_parallel ∈ {data, model, ensembling, hybrid}        ║  │
│  ║      → defines whether AllReduce is even invoked           ║  │
│  ║                                                            ║  │
│  ║   Layer 3 Topology       §3.4                              ║  │
│  ║      s_arch ∈ {centralized, tree, PS, P2P}                 ║  │
│  ║      → if PS or P2P, NCCL AllReduce isn't the right        ║  │
│  ║        primitive; agent should defer or no-op              ║  │
│  ║                                                            ║  │
│  ║   Layer 4 Sync model     §3.5.2                            ║  │
│  ║      s_sync ∈ {BSP, SSP, ASP, BAP/TAP}                     ║  │
│  ║      → SSP/ASP/TAP imply non-AR transport at framework lvl ║  │
│  ║                                                            ║  │
│  ║   Comm strategy          §3.5.3                            ║  │
│  ║      s_overlap ∈ {WFBP, HybComm, continuous}               ║  │
│  ║      → WFBP shapes msg-size mix                            ║  │
│  ║                                                            ║  │
│  ║   Hardware substrate     §2                                ║  │
│  ║      s_compute ∈ {GPU, TPU, ASIC, CPU/AVX-512}             ║  │
│  ║      s_topo_phys ∈ {NVLink-only, NVL+IB, IB-only,          ║  │
│  ║                     RoCE, TCP/IP}                          ║  │
│  ║      → SHM/NVLink vs IB drastically changes α and β        ║  │
│  ╚════════════════════════════════════════════════════════════╝  │
│                                                                  │
│  ╔════════════════════════════════════════════════════════════╗  │
│  ║                       OUT-OF-SCOPE                         ║  │
│  ║          (controlled by application/framework, not NCCL)   ║  │
│  ║                                                            ║  │
│  ║   Choosing data-parallel vs model-parallel       (§3, Fig 2)║  │
│  ║   Choosing topology family (PS vs AR vs P2P)     (§3.4)    ║  │
│  ║   Choosing sync regime (BSP vs SSP vs ASP)       (§3.5.2)  ║  │
│  ║   Choosing comm strategy (WFBP vs continuous)    (§3.5.3)  ║  │
│  ║   1-bit SGD / block-momentum SGD                 (§4.2.2)  ║  │
│  ║   SFB / Sufficient Factor Broadcasting           (§3.4 P2P)║  │
│  ║   Ensembling aggregation logic                   (§3.3)    ║  │
│  ║   Hyperparameter optimization                    (§3.2)    ║  │
│  ║   Fault tolerance protocol (checkpoint, replay)  (§5.2)    ║  │
│  ║   Privacy mechanisms (DP, FedAvg, GAN attacks)   (§5.3)    ║  │
│  ║   Model portability (ONNX, Core ML, NNEF)        (§5.4)    ║  │
│  ║   Storage layer (GFS/HDFS) and scheduler         (§4.1)    ║  │
│  ║   Hardware substrate selection (GPU vs TPU vs    (§2)      ║  │
│  ║   ASIC)                                                    ║  │
│  ╚════════════════════════════════════════════════════════════╝  │
└──────────────────────────────────────────────────────────────────┘
▲ Fig 11: Three-zone partition. Agent-2 ACTS on the inner box; it
  READS the middle box as state features; it MUST NOT try to control
  the outer box. The boundary between the inner and middle box is
  the NCCL plugin API surface (`getCollInfo`, `tuneCollInfo`).

The middle ring is the architecturally interesting one — these are the features Agent-2 needs to OBSERVE to make its inner-box decisions correctly, but cannot CHANGE.


11. Per-Axis Trade-off Tables

11.1 Parallelism strategy trade-off (Section 3, Fig 2)

Property Data-Parallel Model-Parallel Ensembling Winner (DynamICCL)
Per-iteration comm volume one AR per step low (graph cuts) none until end Data-Parallel
Memory per node full model partial model full model Model-Parallel
Algorithmic constraint i.i.d. data decomposable G i.i.d. partition Data-Parallel
AR-friendly Yes Partial No Data-Parallel
Fault tolerance low (barrier) low high (independ) Ensembling
Operating regime for NCCL primary rare n/a Data-Parallel

For DynamICCL, the Data-Parallel cell is the home cell. Model-Parallel generates much smaller, sparser collectives at unpredictable points in the graph; ensembling generates one collective at the end of training. Agent-2's training distribution comes overwhelmingly from data-parallel SGD.

11.2 Topology trade-off (Section 3.4, Fig 3)

Dimension Centralized Tree (AR) Parameter Server P2P (Gossip) Winner (DynamICCL)
Single point of failure Yes (root) None PS shard None Tree / P2P
Easy to scale No Yes Yes (shard) Yes Tree
Sync-mode flexibility n/a BSP only All four All except SSP PS
Bandwidth efficiency low high (ring) medium (server) low (broadcast) Tree
Latency for small msg n/a log N (tree) 2 RTT (push/pull) O(diameter) Tree
NCCL-applicable No YES No No Tree
Heterogeneity tolerance n/a Low High High PS

The survey itself draws these conclusions (§3.4 + §5.2): "Synchronous AllReduce-based approaches seem to scale significantly better than the parameter server approach (up to a certain cluster size), but suffer from a lack of fault-tolerance". This is exactly the AR-vs-PS trade-off Agent-2 inherits as a constraint, not a choice.

11.3 Sync model trade-off (Section 3.5.2)

Property BSP SSP ASP BAP / TAP Winner (DynamICCL)
Convergence guarantee provable provable s≤bound weak (heuristic) none guaranteed BSP
Wall-clock on heterogeneous poor better best (no wait) best ASP/TAP
Compatible with AllReduce YES partial NO NO BSP
Straggler sensitivity high bounded none none TAP
Cited example system MapReduce Petuum (§4.2.4) Downpour (§4.2.3) Giraph Unchained
State-feature for Agent-2 observe observe observe observe s_sync

For DynamICCL, BSP is the operating regime — and the only one compatible with AllReduce. SSP/ASP/TAP are out-of-scope as actions but in-scope as state features (the agent must know if its caller has switched to PS+ASP and the AllReduce path is therefore being bypassed).

11.4 Communication-strategy trade-off (Section 3.5.3)

Property Continuous WFBP HybComm 1-bit SGD SFB Winner (DynamICCL)
Hides comm behind compute partial yes (deep) yes no no WFBP
Reduces payload volume no no no 32× factorized 1-bit SGD
Standard in PT-DDP partial YES no no no WFBP
Affects msg-size mix smooths many small mixed 32× smaller small observe
Layer of implementation framework framework framework caller caller (all out-of-scope)
Effect on Agent-2 smooths shapes mix shapes mix shrinks msg shrinks msg feature: msg_size

All of these techniques live above NCCL. Agent-2 cannot enable or disable them — but their presence shows up as the effective message size distribution the agent has to optimise over.

11.5 Hardware substrate trade-off (Section 2)

Dimension GPU (Volta/Turing) TPU ASIC (DianNao) CPU AVX-512 Winner (DynamICCL)
Branch divergence cost high (SIMD) low (MIMD) low (fixed func) low TPU/ASIC
Peak throughput 47× CPU [107] 70× CPU [128] 1000× CPU [32] 1× baseline ASIC
Energy efficiency medium 200×/W vs CPU 20× vs DianNao low ASIC
NCCL availability YES (NVLink+IB) partial (TF) no no GPU
Operating regime today dominant for DL Google-internal niche niche GPU

For DynamICCL, GPU is the substrate. The agent's policy is keyed to the (NVLink, IB, RoCE) transport hierarchy that GPU clusters use. The survey catalogs other substrates, but they are out-of-scope for Agent-2 because they do not run NCCL.


12. Cross-Axis Design-Space Matrix

A reconstructed (parallelism × topology × sync) matrix from Section 4's system catalog:

                    │   BSP            │   SSP            │   ASP/TAP
   ─────────────────┼──────────────────┼──────────────────┼────────────────
   Centralized      │  (ensembling     │  —               │  —
   (root)           │   training)      │                  │
   ─────────────────┼──────────────────┼──────────────────┼────────────────
   Tree / AllReduce │  Baidu AR        │  (rarely paired) │  INCOMPATIBLE
                    │  Horovod (NCCL)  │                  │  (AR requires
                    │  Caffe2 (NCCL)   │                  │   barrier)
                    │  CNTK (ring +    │                  │
                    │     1bit SGD)    │                  │
                    │  ◄ DynamICCL ►   │                  │
   ─────────────────┼──────────────────┼──────────────────┼────────────────
   Parameter        │  (rare; usually  │  Petuum          │  DistBelief
   Server           │   async)         │                  │  Downpour
                    │                  │                  │  TF (PS mode)
                    │                  │                  │  MXNet KVStore
                    │                  │                  │  DMTK Multiverso
   ─────────────────┼──────────────────┼──────────────────┼────────────────
   P2P (Gossip)     │  SFB-broadcast   │  —               │  Gossip Learning
                    │  (synchronous    │                  │  (random walk)
                    │   p2p)           │                  │
   ─────────────────┴──────────────────┴──────────────────┴────────────────
   Hybrid           │  MXNet-MPI: AR within group, PS across groups
                    │  Tensorflow: configurable both modes
▲ Fig 12: Cross-axis matrix populated from Section 4's system list.
  AR + ASP is incompatible (the survey is consistent with 0025 here).
  DynamICCL's home cell is bold-italic: AR + BSP. Petuum populates
  the unique PS+SSP cell that the rest of the literature largely
  ignores. The right column (ASP) is exclusively PS or P2P.

13. What to Borrow for DynamICCL

Agent-2 selects (algo, proto, nChannels, numThreads). The survey's value to DynamICCL is in its system-level observations about where collective communication sits in the broader stack. The borrows below are tighter than the 0025 borrows because this survey is broader and shallower; many of its lessons are about what NOT to absorb into Agent-2.

13.1 Three-layer reference architecture as state-vector schema

Survey's pattern (§3, Fig. 1+2+3). Distributed-ML systems decompose into Layer 1 (algorithm) × Layer 2 (parallelism) × Layer 3 (topology), bound by Layer 4 (communication).

DynamICCL application. Agent-2's state vector should mirror this schema with one slot per layer:

  s = ( s_algo_method      = SGD | EA | RBML | TM | MF             # L1
      , s_dnn_family       = DNN | CNN | RNN | GAN | ...           # L1
      , s_parallel         = data | model | ensembling | hybrid    # L2
      , s_arch             = centralized | tree | PS | P2P         # L3
      , s_sync             = BSP | SSP | ASP | TAP                 # L4
      , s_overlap          = none | WFBP | HybComm | continuous    # L4
      , s_payload_compress = none | 1bit-SGD | block-momentum |
                             SFB | other                           # L4
      , s_topo_phys        = NVLink-only | NVL+IB | IB-only |
                             RoCE | TCP                            # HW
      , s_compute          = GPU | TPU | ASIC | CPU                # HW
      , s_msg_size_log     = log_2(effective_bytes)                # derived
      , s_history_lstm_h   = LSTM hidden state of past collectives # derived
      )

The schema is paper-grounded: every slot maps to a concrete section in the survey. The agent's policy gates on (s_arch, s_sync) to verify it is even in the AR + BSP cell before emitting any NCCL config; otherwise it falls through to a no-op.

13.2 NCCL is the AR transport — no expansion of action space

Survey's pattern (§4.2.2). NCCL is named only as the AllReduce transport in Horovod, Caffe2, and (via NCCL 2.0) cross-node DDP. The survey does NOT propose any framework-level knobs that NCCL itself exposes — the AllReduce primitive is treated as monolithic.

DynamICCL application. The action space stays at (algo, proto, nChannels, numThreads). The survey provides no evidence that adding a fifth or sixth NCCL-internal action (e.g., chunk-size, aggregation factor) would map to a surveyed technique. This is the opposite finding to 0025 (HiCCL/HiCCL-style hierarchical decomposition would expand the action space) and is consistent with this survey being broader and less NCCL-specific.

13.3 BSP is the only sync regime AllReduce supports — gate on it

Survey's pattern (§3.5.2 + §4.2.2). The four sync regimes are mutually exclusive. AR is only compatible with BSP. SSP is Petuum's unique cell (PS-based). ASP belongs to DistBelief / TF-PS / MXNet KVStore (PS-based). TAP belongs to Giraph Unchained (graph-engine).

DynamICCL application. Agent-2's first decision rule should be:

  if s_sync ≠ BSP:
      return NO_OP    (the framework is using PS or P2P; NCCL AR
                       is not on the critical path)
  if s_arch ≠ tree:
      return NO_OP    (collective primitive is not AllReduce)
  if s_parallel = ensembling:
      return NO_OP    (no per-step collective; one collective at end)

Only after these gates does the agent emit an (algo, proto, nCh, nT) selection. This prevents wasting policy capacity on cells the agent does not control.

13.4 Effective message size dominates over nominal tensor size

Survey's pattern (§3.5.3 + §4.2.2). 1-bit SGD (Seide et al. [130] in CNTK) and SFB ([94] Xie '15) both shrink the AllReduce payload without changing the tensor's nominal shape. WFBP ([171] Poseidon) shifts the per-collective size distribution toward many small messages.

DynamICCL application. Agent-2's s_msg_size_log feature must be the post-compression, post-WFBP-fusion byte count, not the raw tensor count. A 1 GB tensor compressed by 1-bit SGD is a 31 MB collective; the right algorithm for 31 MB (likely tree+LL128) is different from the right algorithm for 1 GB (likely ring+Simple). This is exactly the same pattern borrowed from 0025 §11.4.1, and this survey reinforces it with two additional concrete techniques (1-bit SGD and SFB) that produce the same effect.

13.5 Topology is exogenous — encode it as an ordinal feature

Survey's pattern (§3.4). Topology is a deployment-time decision ($A) the cluster admin makes, not an inference-time choice. The paper enumerates four canonical topologies (centralized, tree, PS, P2P).

DynamICCL application. s_arch is set once per training job and remains constant across all of Agent-2's collectives in that job. It is a cluster-level feature, not a per-collective feature. The same applies to s_topo_phys (NVLink-only vs NVL+IB vs IB-only). Both should be passed as one-hot inputs at agent initialization and held constant; only s_msg_size_log, s_history_lstm_h, and s_congestion_signal vary per collective.

13.6 Performance benchmarks generalize poorly above ~hundreds of GPUs

Survey's pattern (§5.1). "...most of these benchmarks test at most a few hundred machines, whereas the scale at which, e.g., DistBelief, is demonstrated, can be two orders of magnitude larger." The survey's own conclusion is that the published evaluation regime of distributed-ML papers does not reflect production scale.

DynamICCL application. Agent-2's training data must include multi-node IB-RDMA traces, not just NVLink-only intra-node traces. The Pensieve borrow ("FCC + HSDPA + Verizon traces for generalization") is reinforced here: train on at least three cluster topologies (NVLink-only DGX, NVL+IB DGX-A100 cluster, IB-only HPC) and use the topology feature to discriminate. Single-cluster training will not transfer.

13.7 Fault tolerance is the unsolved problem — Agent-2 must

notice failures, not cause them

Survey's pattern (§5.2). "Common implementations of these HPC- inspired patterns, such as MPI and NCCL, lack fault-tolerance completely... Failure of a single machine blocks the entire training process." AllReduce is a barrier; one rank's hung collective hangs all ranks.

DynamICCL application. Two design constraints follow:

  1. Never select a config that requires more channels than the
     cluster supports — a config that fails at NCCL init kills all
     ranks.
  2. The agent's congestion detector (Phase-2 Trigger Agent) must
     distinguish "slow" from "hung". A hung rank is not a congestion
     signal; it is a fault that requires job restart, not a config
     change.

This is a negative borrow: the agent must not interpret long collective times during a hang as "this config is bad" and switch configs, because no config will work after a rank is dead. The state machine must include a FAULTED terminal state that suppresses all agent action until the job framework restarts.

13.8 Privacy and DP / GAN attacks are out-of-scope

Survey's pattern (§5.3). Federated learning, differential privacy, GAN-based reconstruction attacks (Hitaj et al. [71]) are application- layer privacy concerns. The agent does not see plaintext gradients and cannot influence privacy properties.

DynamICCL application. Agent-2 must not log gradient values, only sizes and latencies. This is a soft architectural constraint that keeps the agent privacy-neutral: it observes the AllReduce envelope (bytes-in, time-out) but never the contents.

13.9 Portability — the ONNX precedent for Agent-2 deployment

Survey's pattern (§5.4). ONNX, Core ML, NNEF are framework- independent model interchange formats. The survey identifies portability as a real constraint: a model trained in TF cannot be deployed in PyTorch without conversion.

DynamICCL application. The Agent-2 policy network should be serialized in a portable format (ONNX or TorchScript) so that the same policy weights work whether the NCCL plugin is loaded by PyTorch DDP, Horovod, or DeepSpeed. The plugin's input pipeline (building s from getCollInfo) is the only framework-specific shim; the policy itself is portable.

13.10 Cloud delivery model — Agent-2 as a "managed tuner"

Survey's pattern (§4.3). "The cloud-based delivery model is becoming more important, as it reduces the burden of entry into designing smart applications that facilitate machine learning techniques." SageMaker, Azure ML, GCP all wrap distributed-ML behind managed APIs.

DynamICCL application. A long-term deployment story for DynamICCL is as a hosted policy server: the cluster's NCCL plugin queries a central Agent-2 instance over a thin RPC, returning (algo, proto, nCh, nT). This separates the policy update path (managed, centralized, retrained nightly) from the inference path (per-cluster, local, low-latency). The Pensieve server-side deployment pattern maps directly. Caveat: the per-collective RPC latency must remain below the smallest collective's completion time, which is ~10 µs intra-node — likely too tight for a remote server. So the realistic deployment is a local plugin with periodic policy pulls, not per-collective remote inference.


14. Summary Table of Borrowed Patterns

Pattern Survey origin DynamICCL application
3-layer + 1 cross-cutting reference arch §3, Figs 1-3 State-vector schema with one slot per layer
AR + BSP is the only valid cell §3.5.2 + Fig 12 Gate-rule: emit no-op outside (AR, BSP, data-parallel) cell
Topology and parallelism are exogenous §3.4, §3 Fig 2 s_arch, s_parallel as init-time constants, not per-call inputs
Effective msg size dominates §3.5.3, §4.2.2 s_msg_size = post-compression, post-WFBP-fusion byte count
AllReduce lacks fault tolerance §5.2 FAULTED state suppresses agent action; never confuse hang w/ slow
Benchmarks underestimate large scale §5.1 Train policy on ≥3 cluster topologies (NVL, NVL+IB, IB-only)
NCCL is named transport, no new knobs §4.2.2 Action space stays at (algo, proto, nCh, nT) — no expansion
Privacy and DP are application-layer §5.3 Agent logs sizes/latencies only, never gradient contents
Portability via interchange formats §5.4 Serialize Agent-2 policy as ONNX/TorchScript; framework-agnostic
Cloud / managed delivery §4.3 Long-term: hosted policy server + local inference
Hardware substrate selection §2 s_compute, s_topo_phys as one-hot init features
ML-algo family shapes traffic §3.1 s_algo_method, s_dnn_family as policy-conditioning context

Analogy

The survey's three-layer reference architecture (algorithm × parallelism × topology) plus the cross-cutting communication concern is structurally a modern factory floor. Layer 1 is what product is being made (the ML algorithm — phones, cars, or pharmaceuticals each demand different production lines). Layer 2 is how the work is split among workers (data-parallel = each worker assembles complete units from their slice of incoming materials; model-parallel = an assembly line where each worker performs one specialized step on a single unit moving past them; ensembling = each worker builds a complete competing prototype, judges decide which wins). Layer 3 is the factory's communication topology — hierarchical foreman tree, centralized whiteboard with a clerk (parameter server), or peer talkback radios (P2P). Layer 4 is the shop-floor protocol — strict shift-bell synchronization (BSP), "finish your task within 5 minutes of others" (SSP), "post your update whenever you're done" (ASP), or "shout it across the room" (TAP).

Agent-2's role is the shop-floor PLC controller that decides which conveyor belt speed and which packaging line to use for each batch leaving Layer 3. It does not choose what to build, how to split the work, or which topology to install — those are the plant manager's decisions made before the shift starts. It only optimises the throughput of one specific subsystem (the AllReduce conveyor) given everything else is fixed. The survey's contribution is the floor plan that tells the controller exactly which subsystem it owns and which it must not touch.