Architecture & Design-Space Analysis

Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey

Source: Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu — Hong Kong Baptist Univ. / Harbin IT / HKUST / HKUST(GZ), arXiv:2003.06307v2, 1 Sep 2023 (35 pp.) Analyst: Vishwakarma Date: 2026-04-28


Table of Contents

  1. Survey Scope and the Four-Axis Framing
  2. Master Taxonomy Tree (Design Space)
  3. Axis 1 — Communication Synchronization (Section 3)
  4. Axis 2 — System Architectures (Section 4)
  5. Axis 3 — Compression Techniques (Sections 5, 6)
  6. Axis 4 — Parallelism of Communication and Computing (Section 7)
  7. Auxiliary Technologies (Section 9)
  8. Cross-Axis Design-Space Matrix (paper Tables 1, 5, 9)
  9. Per-Axis Trade-off Tables
  10. Convergence-Cost Coupling (Section 8)
  11. What to Borrow for DynamICCL — Action Space, State Features, Out-of-Scope
  12. Summary Table of Borrowed Patterns
  13. Analogy

1. Survey Scope and the Four-Axis Framing

The paper's load-bearing organizing claim, stated in Section 2 and crystallized in Table 1 plus Fig. 2, is that all communication-efficient data-parallel training algorithms can be decomposed along four orthogonal dimensions, all sitting on top of the BSP-SGD baseline:

┌──────────────────────────────────────────────────────────────────────┐
│             The four orthogonal axes (paper Table 1)                 │
│                                                                      │
│   Axis 1: COMMUNICATION SYNCHRONIZATION                              │
│           {Synchronous, Stale-Synchronous, Asynchronous, Local}      │
│           "when do workers wait for each other?"                     │
│                                                                      │
│   Axis 2: SYSTEM ARCHITECTURE                                        │
│           {Parameter-Server (centralized), All-Reduce, Gossip}       │
│           "what is the communication topology?"                      │
│                                                                      │
│   Axis 3: COMPRESSION                                                │
│           {Quantization (low-bit), Sparsification (top-k / random)}  │
│           "how much volume per gradient transfer?"                   │
│                                                                      │
│   Axis 4: PARALLELISM OF COMM AND COMPUTE                            │
│           {Pipelining, Scheduling}                                   │
│           "how is comm time hidden behind compute time?"             │
│                                                                      │
│           +--- two HW-level appendices: Communication Protocol       │
│                (TCP/IP, RoCE, IB) and Network Topology (Fat-Tree,    │
│                BCube, Torus). Survey treats these as exogenous.      │
└──────────────────────────────────────────────────────────────────────┘
▲ Fig 1: The four orthogonal axes the survey uses to classify
  communication-efficient DDL algorithms. Algorithms are points in
  this 4-D product space; combinations such as (BSP, All-Reduce,
  Top-k, WFBP) name a single deployable training stack.

The paper is explicit that the axes are independent — "these algorithms can be combined to create new ones that are mostly independent of each other" (Section 1.3). Every system surveyed instantiates a specific choice on each axis.

DynamICCL Agent-2 currently selects (algo, proto, nChannels, numThreads) within a single cell of this 4-D space — namely (BSP, All-Reduce, no compression, no scheduling) — i.e., it tunes the NCCL implementation of one collective. The survey reveals that there is an entire layer of decisions above Agent-2 that the agent could in principle absorb into its state, and a layer below (transport / topology) that is exogenous.


2. Master Taxonomy Tree (Design Space)

                Communication-Efficient Data-Parallel DDL
                                  |
     +----------------------------+----------------------------+
     |              |              |             |             |
   Axis 1         Axis 2        Axis 3        Axis 4      (HW-level
   Sync           Arch          Compr.        Pipe/Sched   appendix:
   (Sec 3)        (Sec 4)       (Sec 5,6)     (Sec 7)      proto+topo)
     |              |              |             |
  +--+--+        +--+--+         +-+-+         +-+-+
  |  |  |        |  |  |         |   |         |   |
  v  v  v        v  v  v         v   v         v   v
 BSP SSP ASP    PS  AR  Gos    Quant Spars   Pipe Sched
  |  |  |        |  |  |       |     |       |    |
  |  |  +--+--+  |  |  +-+-+   +-+-+ +-+-+   +-+  +-+
  |  |     |  |  |  |    | |   | | |   | |   | |    |
  v  v     v  v  v  v    v v   v v v   v v   v v    v
 BSP SSP  Backup ASP-  Ring Gossip 1bit QSGD Top-k WFBP   Tensor
 -SGD     workers SGD  AR   Push   SGD       Random  MG-  partition
         + R2SP   (Down-Recur Gossip Sign- Tern- (Random  WFBP   (Peng)
 Local-                pour) sive  Avg    SGD   Grad,    Priority
 SGD                       Doub  Stochastic     Random   Sched
 + FedAvg                  Double      Mask)   (Tictac,
 + post-local              binary tree                   ByteScheduler)
   SGD                     Hierarchical               Contention-
 + 3PC LAG                 (BlueConnect, Rabens.)      aware
                           Topology-aware               (PLink)
                           (BLink, PLink, BML)
                           2D-Torus, BCube
                           Programmable switch (160)

                     Sparsification subtypes:
                       Random-k (Random Mask, Subsampling)
                       Deterministic (Fixed Thr, Top-k,
                                      Adaptive Thr, AdaComp,
                                      gTop-k, SBC, STC)
                       Coordinate Descent (BCD, IBCD)
                       Proximal Methods (small-variance grads only)

                     Quantization subtypes:
                       1-bit SGD, TernGrad, Sign-SGD,
                       QSGD (stochastic), DIANA,
                       Layer-wise / Adaptive bits

                     Auxiliary technologies (Section 9):
                       Error Accumulation (EF, EF-SignSGD, ECQ-SGD)
                       Momentum Correction (DGC, Global-Momentum)
                       Low-rank Decomposition (ATOMO, Spectral-ATOMO,
                                               Count Sketch)
                       Local Gradient Clipping
                       Warm-up Training
▲ Fig 2: Master taxonomy reconstructed from Sections 3-7 and 9.
  The four trunk branches mirror Table 1; each leaf is an algorithm
  the survey actually cites. Auxiliary technologies (Sec 9) modify
  the compression branch but are independent of synchronization
  and architecture choices.

3. Axis 1 — Communication Synchronization (Section 3)

         Synchronization spectrum (timeline view, Fig. 3 of paper)
         ──────────────────────────────────────────────────────────

   SYNCHRONOUS (BSP)   SSP                LOCAL-SGD               ASP
   ───────────────    ────────           ──────────              ──────
   barrier every       relaxed            local steps τ            no
   step                staleness          before sync              barrier
                       bound s

   |####|####|####|   |####|####|         |####...####|sync|     |####|---
   |####|####|####|   |####|####|         |####...####|sync|     |---|####
   |####|####|####|   |####|##|####|      |####...####|sync|     |####|---
        ^                  ^                            ^
   strict barrier   bounded gap          k local iters     unbounded gap
                    (e.g. s=3)           per sync          (stragglers
                                                           dropped or
                                                           backed-up)

   Convergence:        OK if heterogeneous   OK if τ small    Inferior
   stable              (Cui'14, Ho'13)       (Stich, Yu)      conv (Tab.4)
   slowest worker      stragglers don't      fewer comm        no waiting
   bottleneck          stop everyone         rounds            (Downpour)
▲ Fig 3: The four sync regimes the survey covers, with their timeline
  signature. BSP and All-Reduce are tightly coupled (All-Reduce only
  works with BSP-style sync). ASP is incompatible with All-Reduce —
  it requires PS or Gossip.

The survey's Table 5 captures the BSP <-> SSP <-> ASP <-> Local-SGD trade-off as observed in their unified benchmark across PS / All-Reduce / Gossip:

           Model        Comm        Comm           Conv-
Arch  Sync Consistency  Frequency   Congestion     ergence
─────────────────────────────────────────────────────────
PS    BSP  high         high        high           stable
PS    SSP  normal       high        normal         normal
PS    ASP  low          high        low            unstable
PS    LSGD normal       low         high           unstable
AR    BSP  high         high        low            easy
AR    LSGD normal       low         low            stable
Gos   BSP  low          high        low            stable
Gos   ASP  low          high        low            unstable
Gos   LSGD low          low         low            stable
▲ Fig 4: Paper Table 5 reproduced. Cross-product of architecture x
  sync determines four properties. AR-BSP is "easy convergence + low
  congestion + high frequency" — the regime DynamICCL operates in.

Key sub-techniques the survey calls out per cell:

SSP family:
  - Backup workers (Chen'16): drop slowest; PS updates without waiting
  - R2SP (Chen'19): round-robin staggered worker updates → load balance
  - 3PC (Richtarik): unified error-feedback + LAG compressor

Local-SGD family:
  - Vanilla Local-SGD (τ local steps then average)
  - Post-local SGD (Lin'20): mini-batch SGD warmup, then Local-SGD
  - LAG (Chen'18): skip upload if grad ≈ last-sent grad
  - FedAvg (McMahan): proportional to local data, federated regime
  - Slow-momentum / Quasi-Hyperbolic Momentum variants

ASP family:
  - DistBelief Downpour (Dean'12)
  - D-ADMM (Wei): asynchronous ADMM with central clock
  - Delayed Block Proximal Gradient (Li'14)
  - Asynchronous distributed ADMM (Zhang)

4. Axis 2 — System Architectures (Section 4)

   PARAMETER SERVER                ALL-REDUCE                  GOSSIP
   ─────────────────              ──────────────              ──────────

      [Server(s)]                   W1───W2                    W1   W2
       ↑  ↓  ↑  ↓                   │ \ / │                     \  /
     pull  push                     │  X  │                      \/
       ↓  ↑  ↓  ↑                   │ / \ │                      /\
      W1 W2 W3 W4                   W4───W3                     W4   W3

   centralized                     model-centralized           decentralized
   topology                        (all workers same           (workers may
   bottleneck at server            model, no master)           hold different
   bandwidth                       collective op required      models; converge
                                   N-1 latency, 2(N-1)/N BW    via consensus)

   Algorithms surveyed:
    DistBelief      Ring (NCCL,Gloo)              Gossip-SGD (Boyd et al)
    GeePS (GPU PS)  Recursive Doubling            Gossip-Push (Lian'17)
    AdaPS (LeftPS)  Rabenseifner Halve+Doubling   SGP (Assran'19) =
    BytePS          Double Binary Tree                 PUSHSUM + Gossip
    Switch-aggreg   2D-Torus All-Reduce               Push
    (programmable   BlueConnect (hierarchical)    Gossip-PGD (consensus
     switches[160]) BLink, PLink (topo-aware)         learning)
                    BML (BCube optimized)         CHOCO-SGD (biased
                    Hybrid (AR + PS)                  compression OK)
                                                  ECD-PSGD, DCD-PSGD
                                                  D-PSGD (Lian'17)
▲ Fig 5: Three architectures with their algorithm catalogs. NCCL
  lives entirely in the All-Reduce column; PS and Gossip are
  alternative communication topologies the survey treats as
  first-class peers.

Communication-cost model (Section 4.2, Table 6):

Algorithm           Latency      Bandwidth
──────────────────────────────────────────
Binary tree         2α log n     2β(log n)·N
Recursive doubling   α log n      β(log n)·N
Ring                2(n-1)α      (2(n-1)/n)·β·N      ← BW-optimal
Double binary tree   ~α log n     ~β·N               ← NCCL ≥ 2.4
Hierarchical AR      multi-tier   topology-tuned
2D-Torus             √n×√n        N over toroidal links
▲ Fig 6: Paper Table 6 — All-Reduce algorithm cost. Ring is bandwidth-
  optimal but latency-linear in n; binary tree and recursive doubling
  trade BW for log-latency. Hierarchical and topology-aware variants
  combine both (NCCL Tree = double binary tree).

5. Axis 3 — Compression Techniques (Sections 5, 6)

                         COMPRESSION
                              │
                  ┌───────────┴───────────┐
                  │                       │
          QUANTIZATION (Sec 5)       SPARSIFICATION (Sec 6)
          (fewer bits per dim)        (fewer dims)
                  │                       │
   ┌──────┬───────┼───────┬─────┐     ┌───┴────┬─────────┬───────┐
   │      │       │       │     │     │        │         │       │
 1-bit  TernGrad SignSGD QSGD  DIANA Random Determ-   Coordinate Proximal
  SGD   (Wen'17) (Bern.  (Alis-(splits to-k  inistic    Descent  Methods
 (Sei            18)     tarh' subgrads               (Top-k,             (small-
  '14)                    17)                          gTop-k,             var grads
                                                       AdaComp,            only)
   error   layer-wise   majority  stochastic   adaptive  Fixed Thr,
   feed-   ternarize    vote      dithering    bits     SBC, STC,
   back    +  Scalar    Byzantine 32× cap                 Random Mask
   needed  Sharing     fault-tol  unbiased,                ·Subsampling
                                  stochastic               (Konecny'16)

  Compression cap:                 Compression cap:
   ~32× (1-bit)                     up to 1000× (Top-k +
   tradeoff: bias if error           error feedback)
   not accumulated                  tradeoff: compute cost
                                    (sorting), index volume
                                    grows with workers
▲ Fig 7: Compression sub-tree. Quantization compresses *each
  coordinate*; sparsification compresses *which coordinates are
  sent*. The right-side caps from Section 6: quantization plateaus
  at ~32×, top-k with EF reaches 1000× (Section 6.5).

The survey makes the compression-rate vs convergence-rate trade-off explicit (Section 6.5 + Table 8): aggressive compression always slows convergence per-step, but EF-Top-k with rate=1000 converges despite the compression because parameters skipped this step accumulate in the error term and are sent next time. This is the "compression error deferred but never lost" pattern.

Auxiliary technologies (Section 9) that rescue compression:
  9.1 Error Accumulation: v ← v + ∇ - C(v); send C(v); next step adds v
       (used by 1-bit SGD, EF-SignSGD, ECQ-SGD)
  9.2 Momentum Correction (DGC): apply momentum BEFORE sparsification
       so velocity is preserved across compression steps
  9.3 Low-rank Decomposition: ATOMO, Spectral-ATOMO (SVD), Count Sketch
  9.4 Local Gradient Clipping: scale clip threshold by √N to preserve
       expected aggregate variance
  9.5 Warm-up Training: less aggressive sparsity in early epochs when
       gradients are large and unstable

6. Axis 4 — Parallelism of Communication and Computing (Section 7)

   Layer-wise DAG of forward/backward/comm operations:

     L1: Fwd ──► L2: Fwd ──► ... ──► Loss ──► L_n: Bwd ──► L_{n-1}: Bwd ──► ...
                                                  │              │
                                                  ▼              ▼
                                            Allreduce(grad_n)  Allreduce(grad_{n-1})

   Two pipelining strategies the survey enumerates:

   WFBP (Wait-Free Backward Propagation, Awan'17, Zhang'17):
     trigger Allreduce(grad_l) as soon as Bwd_l finishes;
     overlaps with Bwd_{l-1} compute. Standard in PyTorch DDP, Horovod.

   MG-WFBP (Merged-Gradient WFBP, Shi'19, Shi'21):
     fuse small layer gradients into one tensor before Allreduce,
     amortizing per-call latency over more bytes. Solves "small tensor
     dominated by startup" problem on large clusters.

   Tensor partitioning (Peng'19 — ByteScheduler):
     split LARGE gradients into smaller chunks, then re-pipeline.
     Inverse of merging. Enables overlap with feed-forward of next batch.

   Priority Scheduling (Tictac, Hashemi'18; ByteScheduler, Peng'19):
     order Allreduce calls by which layer is needed soonest in the next
     forward pass. Critical path of next iteration drives the ordering.

   Contention-aware (Wang'20):
     when multiple DL jobs share GPUs, schedule cross-job collectives
     to avoid simultaneous bandwidth contention.

▲ Fig 8: Parallelism axis. WFBP is the universal default; MG-WFBP and
  partitioning fight the small-tensor and large-tensor extremes
  respectively; priority scheduling fights latency on the next-batch
  critical path.

7. Auxiliary Technologies (Section 9)

These are not axes — they are correction terms applied on top of compression to rescue convergence. Surveyed:

Auxiliary technique     Mechanism                   Used in
────────────────────────────────────────────────────────────────────
Error Accumulation      v_t = v_{t-1} + ∇ - C(v)    1-bit SGD, EF-SGD,
(error feedback)        send C(v_t); residual       SignSGD-EF,
                        carried to next step        ECQ-SGD, DGC

Momentum Correction     u = m·u + ∇                 DGC (Lin'18),
                        v = v + u                   Global-Momentum (Zhao)
                        send sparse(v)
                        (momentum applied before
                         compression)

Low-rank Decomp         G ≈ U·V → send U, V         ATOMO (Wang'18),
                        for r-rank approx           Spectral-ATOMO,
                                                    Count Sketch (Ivkin)

Local Grad Clipping     thr_k = thr / √N            DGC variant
                        per-worker clip preserves
                        E[‖G‖] = √N · σ

Warm-up Training        rate=1.0 in early epochs,   DGC
                        compress more aggressively
                        as training stabilizes

The error-feedback pattern is the most architecturally important — it decouples compression rate from convergence rate. Without EF, 1-bit SGD diverges. With EF, EF-Top-k converges at 1000× compression.


8. Cross-Axis Design-Space Matrix (paper Tables 1, 5, 9)

The survey's Table 9 (page 24) is the canonical cross-axis matrix. It catalogs every (Architecture × Synchronization × Compression) cell the literature has populated, with communication cost in big-O and references. Reconstructed in compact form:

           |  Compression: NULL     | Quant.        | Spars.
Arch | Sync|  ServerCost / WorkerCost / Conv-rate (convex / non-convex)
─────┼─────┼────────────────────────┼───────────────┼──────────────
PS   | BSP |  O(32nNT) / O(32NT) / O(1/T) / O(1/√T)
PS   | BSP |        Quant: O(32nNT) / O(bNT) / O(1/T) / O(1/√T)
PS   | BSP |        Spars: O(32nNT) / O(k(logN+32)T) / O(1/T) / O(1/√T)
PS   | SSP |  O(32n·ΣT_i) / O(32NT_i) / O(1/T) / O(1/√T)
PS   | ASP |  Null/Quant/Spars all populated, conv = O(1/T) / O(1/√T)
PS   | LSGD|  O(32N·T/τ) / O(32N·T/τ) / O(1/T) / O(1/√T)
AR   | BSP |  -    / O(32NT)  / O(1/T) / O(1/√T)
AR   | BSP |        Quant: - / O(bNT) / O(1/T) / O(1/√T)
AR   | BSP |        Spars: - / O(kn(logN+32)T) / O(1/T) / O(1/√T)
AR   | LSGD|  -    / O(32N·T/τ) / O(1/T) / O(1/√T)
Gos  | BSP |  -    / O(32N·n_peers·T) / - / O(1/√T)
Gos  | ASP |  -    / O(32N·n_peers·T) / - / O(1/√T)
Gos  | LSGD|  -    / O(32N·n_peers·T/τ) / - / O(1/√T)
+ Pipelining and Scheduling are orthogonal to all of the above.
▲ Fig 9: Cross-axis design-space matrix from Table 9. Some cells
  are unpopulated (e.g., AR + ASP — All-Reduce is incompatible with
  asynchronous sync, see Sec 4.2). Convergence rate is NOT a free
  variable across cells — convex problems → O(1/T); non-convex →
  O(1/√T) is the asymptotic floor.

9. Per-Axis Trade-off Tables

9.1 Synchronization regime trade-off (Section 3, Tables 2-5)

Property BSP-SGD SSP-SGD ASP-SGD Local-SGD Winner (DynamICCL)
Model consistency High Bounded by s Low Periodic BSP
Per-step convergence Best Slight slow Worst (drift) Slight slow BSP
Wall-clock convergence Slow w/strag Better Best on hetero Best on uniform Local-SGD if homog
Compatible with AllReduce Yes Partial NO Yes BSP
State signature barrier s-bound unbounded τ steps barrier (Agent-2)
Test acc. drop at N=32 ref -0.5 to -2% up to -90% -0.5 to -3% BSP

For DynamICCL, BSP is the operating regime — it is the only sync mode compatible with All-Reduce, and Agent-2 lives inside the All-Reduce collective. ASP/SSP/Local-SGD are out-of-scope for action selection but in-scope as state features (the agent should know which sync regime the application is using because it changes the comm pattern's frequency and burstiness).

9.2 Architecture trade-off (Section 4)

Dimension Parameter Server All-Reduce Gossip Winner (DynamICCL)
Server bottleneck Yes (high cong) None None All-Reduce
Sync compatibility All four BSP, Local-SGD All except SSP All-Reduce
Topology requirement Star Logical ring/tree Arbitrary graph All-Reduce
Bandwidth-optimal collective N/A Ring 2(N-1)/N · β Spectral-gap dep All-Reduce
Latency-optimal collective N/A Tree α log N n/a All-Reduce
Heterogeneity tolerance High (PS waits) Low (barrier) High (gossip) -
Operating regime today - NCCL = AR - All-Reduce

9.3 Compression trade-off (Sections 5, 6, Table 8)

Property Quantization Random-k Spars. Top-k Spars. EF-Top-k Winner (DynamICCL)
Maximum rate (no div) ~32× ~10× 100-1000× 1000×+ EF-Top-k
Unbiased estimator Yes (QSGD) Yes (Subsampling) No Recovered via EF EF-Top-k
Compute overhead Low Lowest Sort O(N log N) Sort + EF buffer Random-k
Index volume per worker None k indices k indices k indices Quant
Convergence rate (convex) O(1/T) O(1/T) O(1/T) O(1/T) tie
Memory overhead Low None None N-dim residual Quant/Spars
Coupling with NCCL knobs Orthogonal Orthogonal Orthogonal Orthogonal tie

Key observation: compression operates on the payload of the collective, not on the algorithm. NCCL knobs (algo, proto, nChannels) are orthogonal to compression choice. Compression is a state feature for Agent-2 (changes the effective message size).

9.4 Parallelism / scheduling trade-off (Section 7)

Property WFBP MG-WFBP Tensor Partition Priority Sched Winner (DynamICCL)
Fights small tensors No (start cost) Yes (merge) No Partial MG-WFBP
Fights large tensors No No Yes (split) No Tensor Partition
Per-tensor latency Per-call Amortized Lower per chunk Amortized MG-WFBP
Critical-path awareness No No No Yes Priority Sched
Standard in PyTorch DDP Yes Optional Optional Optional WFBP
Effect on Agent-2 None Changes msg dist Changes msg dist Changes order -

Pipelining/scheduling lives above NCCL. Agent-2 cannot perform these transformations but the resulting message-size distribution becomes the input to Agent-2's per-collective decision.

9.5 Synchronous vs error-feedback compression — when does the agent see it?

Scenario Effect on Agent-2's view
BSP + no compression Full-precision msg, msg_size = N·32b
BSP + 1-bit SGD + EF msg_size shrinks to N·1b; same algo
BSP + Top-k with k=1% + EF msg_size = k·(idx+val) bits
Local-SGD Comm fires every τ steps, larger payload
ASP (PS) Comm is asymmetric (push/pull), no AR

The agent's effective_message_size_bytes state feature must be computed from (compression_method, compression_rate, original_msg_size). Treating compressed and uncompressed messages as the same feature is a bug.


10. Convergence-Cost Coupling (Section 8)

The survey's most important reminder for an RL designer: communication efficiency is always in tension with convergence. Section 8 catalogs the proven convergence rates:

Convex case:                Non-convex case (DNNs):
─────────────────           ───────────────────────
BSP-SGD:    O(1/T)          BSP-SGD:    O(1/√T)        (Bottou'18)
QSGD:       O(1/T)          QSGD:       O(1/√T)        (Alistarh'17)
EF-SignSGD: O(1/T)          EF-SignSGD: O(1/√T)        (Karimireddy)
SSP-SGD:    O(1/T)          SSP-SGD:    O(1/√T)        (Ho'13)
ASP-SGD:    O(1/T)          ASP-SGD:    O(1/√T)        (Lian'15)
Top-k:      O(1/T)          Top-k:      O(1/√T)        (Stich'18)
Local-SGD:  O(1/T)          Local-SGD:  O(1/√T)        (Stich'18)
Gossip-SGD: O(1/T)          Gossip-SGD: O(1/√T)        (Lian'17)
CHOCO-SGD:  O(1/(nT)+1/(Tδ²ω))   ← biased compression OK
▲ Fig 10: Asymptotic convergence rates from Section 8. Most
  comm-efficient methods preserve the convergence ORDER but degrade
  the CONSTANT. The wall-clock improvement comes from doing more
  iterations per second, not from faster per-iter convergence.

Practical implication for DynamICCL: Agent-2's reward MUST be end-to-end iteration time (or training step time), not raw collective latency. A configuration that finishes the collective fast but introduces a stale gradient will look great on per-call metrics and catastrophically slow on convergence. This reinforces the Pensieve borrow ("reward = actual metric, not proxy") and the Lee & Lee borrow ("end-to-end iter time is the correct reward").


11. What to Borrow for DynamICCL

Agent-2 selects (algo, proto, nChannels, numThreads) inside one cell of the survey's 4-D space, namely (BSP, All-Reduce, no compression, WFBP). The survey's value to DynamICCL is in identifying which surveyed techniques map onto:

(A) Agent-2 ACTION SPACE — knobs the agent can directly tune (B) Agent-2 STATE FEATURES — observable context the agent reads (C) OUT-OF-SCOPE — things the agent should NOT try to control

11.1 Mapping the survey onto Agent-2's action space

Agent-2's current action vector is (algo, proto, nChannels, numThreads). The survey suggests three additive action dimensions worth considering:

Surveyed technique           | Maps to Agent-2 action?        | Recommendation
──────────────────────────── | ────────────────────────────── | ──────────────
Ring vs Tree vs RecDoubling  | Yes — already in `algo`        | KEEP
LL / LL128 / Simple proto    | Yes — already in `proto`       | KEEP
Hierarchical AR (BlueConn,   | Partial — `algo=collnet`       | KEEP
  BLink, PLink, BML)         |   wraps it on NCCL             |
Double-binary tree           | Yes — `algo=tree`              | KEEP
2D-Torus / BCube AR          | Topology-specific; NCCL hides  | DEFER (HW)
Pipelining (WFBP, MG-WFBP)   | NCCL-external, framework-level | OUT-OF-SCOPE
Tensor partition / fusion    | NCCL-external (BytePS/Horovod) | OUT-OF-SCOPE
Priority scheduling (Tictac, | NCCL-external, framework-level | OUT-OF-SCOPE
  ByteScheduler)             |                                |
Quantization (1-bit, QSGD,   | Outside NCCL: payload-level    | OUT-OF-SCOPE
  TernGrad, SignSGD)         |   compression by the caller    |   (state ftr)
Sparsification (Top-k, EF)   | Outside NCCL                   | OUT-OF-SCOPE
                                                              |   (state ftr)
Sync regime (BSP, SSP, ASP,  | Application-level decision     | OUT-OF-SCOPE
  Local-SGD)                 |                                |   (state ftr)
Architecture (PS, AR, Gossip)| Application-level, set at init | OUT-OF-SCOPE
                                                              |   (state ftr)

Net result: the survey does not expand Agent-2's action space. The four NCCL knobs already cover the algorithm-level decisions the agent can make without modifying the framework. What the survey expands is the state space — the agent must read more context.

11.2 Mapping the survey onto Agent-2's state features

The survey identifies a richer set of context features the agent should ingest:

Agent-2 state additions justified by this survey:

  s_sync ∈ {BSP, SSP, ASP, Local-SGD, FedAvg}
    rationale: sync regime determines comm frequency and burstiness;
    Local-SGD with τ=16 fires 16× less often → different congestion
    background; ASP loses cross-rank coupling entirely.

  s_arch ∈ {PS, All-Reduce, Gossip}
    rationale: identifies whether collective lives in the AR cell at
    all. If arch=PS the "collective" is push/pull; agent should output
    a no-op config or a special PS-aware action.

  s_compression ∈ {none, quant_b, top_k, random_k, signSGD, ...}
    rationale: changes the effective payload size — same nominal
    tensor at 1-bit is 32× smaller than at fp32. Agent-2's
    `msg_size_bytes` feature must be the EFFECTIVE size after
    compression, not the nominal tensor size.

  s_compression_rate ∈ ℝ (e.g., 0.01 for top-1%, 32 for 1-bit)
    rationale: continuous knob orthogonal to method; influences
    whether algo=ring (large) vs algo=tree (small) is preferred.

  s_pipeline_fanout ∈ {WFBP, MG-WFBP, partitioned}
    rationale: same gradient may arrive at NCCL as one big call, many
    small calls (WFBP), or one fused call (MG-WFBP). Distribution of
    msg_size across calls is policy-relevant.

  s_priority_critical ∈ {0, 1}
    rationale: priority-scheduled collectives need low-latency configs
    (tree+LL); background collectives can use high-bandwidth (ring+Simple).

  s_τ_local ∈ ℕ (local-SGD steps between syncs; 0 if BSP)
    rationale: predicts inter-collective gap; long gap = no congestion
    accumulation, short gap (BSP) = persistent congestion regime.

  s_topology ∈ {fat-tree, BCube, 2D-Torus, mixed}
    rationale: topology-aware AR algorithms (BLink, PLink, BML) only
    pay off on matching topology. NCCL's autotuner has limited topo
    awareness; agent should bias toward hierarchical configs when
    s_topology indicates a structured fabric.

  s_protocol ∈ {TCP/IP, RoCE, IB}
    rationale: paper's Section 2 calls out hardware-level protocol as
    a tuning factor; affects α and β cost-model constants.

The Trigger Agent (Phase-2 detection layer) should monitor:

11.3 What is explicitly out-of-scope

The survey makes clear that several techniques are out of Agent-2's control because they live in the application or compiler layer, not in NCCL. Agent-2 should not try to learn to use them, but should observe their effects:

OUT-OF-SCOPE for Agent-2 actions:

  - Switching sync regime (BSP <-> ASP <-> Local-SGD)
      requires modifying the training loop, not the collective.

  - Switching architecture (AR <-> PS <-> Gossip)
      determined at framework init, not per-collective.

  - Choosing compression method (Top-k vs QSGD vs SignSGD)
      lives above NCCL — caller compresses BEFORE calling allreduce.

  - Activating error-feedback / momentum-correction
      auxiliary tech (Sec 9) — caller-side state.

  - WFBP / MG-WFBP / tensor partitioning
      framework-level (Horovod, BytePS, ByteScheduler) — NCCL is
      called once these decisions are made.

  - Hardware protocol selection (RoCE vs IB vs TCP)
      cluster-static. The agent reads it as a feature.

  - Network topology choice (Fat-Tree vs BCube vs Torus)
      cluster-static. The agent reads it as a feature.

The survey's Section 10 future-directions list ("higher compression levels", "adaptive compression per-layer", "fault-tolerant algorithms") all point at the application layer, not the NCCL layer. Agent-2's scope is correctly narrow.

11.4 Three concrete design borrows

11.4.1 EFFECTIVE message size, not nominal tensor size

The single most actionable feature engineering recommendation:

def effective_msg_size(tensor_bytes, compression_method, rate):
    if compression_method == 'none':
        return tensor_bytes
    if compression_method == 'quant':
        return tensor_bytes * (rate / 32)         # bits-per-coord ratio
    if compression_method == 'top_k':
        # k*(index_bits + value_bits) per worker
        return rate * (4 + 4) * tensor_bytes / (4 * tensor_bytes/4)
    if compression_method == 'random_k':
        return rate * tensor_bytes
    ...

agent_state['msg_size_bin'] = log2_bin(effective_msg_size(...))

A 100 MB gradient compressed with 1-bit SGD is a 3.1 MB collective — the agent should pick the algorithm best for 3.1 MB, not 100 MB.

11.4.2 Multi-mode policy keyed on (sync, arch)

The survey's Table 5 shows the four-property profile (consistency, frequency, congestion, convergence) varies dramatically across (arch, sync) pairs. A single Agent-2 policy trained on AR+BSP will not generalize. Two options:

Option A (preferred): one policy + (s_sync, s_arch) as input features
Option B: a small mixture-of-experts gated by (s_sync, s_arch)

The Pensieve borrow ("one model + masked action space") supports Option A. The NCCLX borrow ("MoE over parallel domain") supports Option B if the regime difference is structural enough.

11.4.3 Convergence-aware reward

From Section 8: every comm-efficient method preserves the asymptotic convergence order but changes the constant. A config that delivers fast collectives but stale gradients can hurt convergence. The reward must include a convergence proxy:

r_t = - completion_time_t                          (latency)
      - λ_switch  · 1[config_changed]              (Hopper NLM)
      - λ_cong    · congestion_signal_t            (Trigger Agent)
      - λ_probe   · 1[exploratory_action]          (STAR-MPI cost)
      - λ_conv    · |loss_t - loss_{t-1}|^{-1}     (NEW: convergence
                                                     stall penalty)
      - λ_SM      · SM_consumption_ratio           (NCCLX borrow)

λ_conv punishes "fast but stalling" configurations. Loss is already tracked by the trainer; routing a single scalar into the plugin is cheap.


12. Summary Table of Borrowed Patterns

Pattern Survey origin DynamICCL application
4-axis decomposition of DDL Sec 2, Table 1 State-vector axes: (sync, arch, compression, parallelism)
(arch × sync) determines comm regime Table 5 Two-feature gate: skip non-AR cells; specialize per BSP/Local-SGD
Effective msg size after compression Sec 5, 6 msg_size_bin computed from (method, rate, nominal)
Sync regime as state feature Sec 3 s_sync ∈ {BSP, SSP, ASP, Local-SGD, FedAvg}
Architecture as state feature Sec 4 s_arch ∈ {PS, AR, Gossip}
Compression method/rate as state Sec 5, 6, Sec 9 s_compression, s_compression_rate
Pipelining as msg-distribution context Sec 7 s_pipeline_fanout, s_priority_critical
AR + BSP is Agent-2's home cell Table 5, Table 9 Restrict action space to AR-compatible NCCL configs
Convergence vs comm-efficiency Sec 8, all proofs O(1/√T) λ_conv · convergence-stall in reward
Wall-clock = freq × per-iter time Sec 8 derivation Reward = end-to-end step time, not just collective latency
Auxiliary EF/momentum is caller-side Sec 9 OUT-OF-SCOPE for action; observe via msg-size dynamics
Topology-aware AR (BLink/PLink/BML) Sec 4.2 s_topology biases agent toward hierarchical algo when applicable
Programmable-switch aggregation [160] Sec 4.1 (Sapio'21) OUT-OF-SCOPE; flag s_switch_aggregation if cluster supports it

Analogy

The four-axis decomposition maps cleanly onto building energy management. Axis 1 (synchronization) is the building's thermostat schedule — strict (BSP = same temperature 24/7) vs. relaxed (SSP = within ±2°C of setpoint) vs. asynchronous (ASP = each room sets its own temp). Axis 2 (architecture) is the HVAC topology — central boiler with radiators (PS), interconnected zone controllers (AR), or peer-to-peer split units exchanging info (Gossip). Axis 3 (compression) is the insulation — quantization is replacing single-pane windows with double-pane (less heat per dimension lost), sparsification is closing off unused rooms (drop dimensions entirely). Axis 4 (parallelism) is the time-of-day scheduling — pre-cool the building during off-peak electricity (WFBP overlaps comm with compute).

Auxiliary technologies (Sec 9) are the corrections you apply when insulation alone fails: error-feedback is "remember which rooms got under-heated yesterday and over-heat them today" (the residual accumulates). Momentum correction is "don't suddenly drop heat delivery just because today's measurement is lower."

DynamICCL Agent-2 is the smart thermostat for one zone of one building — it cannot change the boiler topology, cannot change the weekly schedule, cannot install new insulation, cannot add solar panels. It can only adjust the local damper position and fan speed (algo, proto, nChannels, numThreads). But to do that well, it must sense what the rest of the building is doing — that is what the survey's other three axes contribute as context.