Architecture & Measurement-Design Analysis

Collective Communication Performance Evaluation for Distributed Deep Learning Training

Source: Lee, S.; Lee, J. Appl. Sci. 2024, 14, 5100. MDPI. https://doi.org/10.3390/app14125100
Authors: Sookwang Lee (ETRI Supercomputing Tech. Research Center) and Jaehwan Lee (Korea Aerospace University)
Submitted: 16 May 2024
Published: 12 June 2024
Reader: Direct PDF read (gemini-reader quota exhausted; codex-reader model unavailable)
Analyst: Vishwakarma
Date: 2026-04-28


Table of Contents

  1. Evaluation Harness Architecture (the "instrument")
  2. System-Under-Test Architecture (the "specimen")
  3. Design-Space Diagram (workload x configuration x topology axes swept)
  4. Measurement Control Flow Through One Experiment
  5. Quantitative Results — Where Each Library Wins
  6. Configuration-Regime Trade-off Tables
  7. Bottlenecks & Insights Surfaced by the Measurements
  8. Limitations of the Methodology
  9. What to Borrow for DynamICCL
  10. Analogy

1. Evaluation Harness Architecture (the "instrument")

The harness is deliberately minimal — no Nsight, no NCCL_DEBUG=INFO parsing, no per-channel telemetry. The authors instead time function call boundaries from inside the application, in two distinct contexts (Linux shell directly invoking library APIs; PyTorch invoking the same backends through DDP). The architecture is best understood as a matrix of (environment) x (backend) x (architecture) x (subroutine) producing a single latency cell per matrix point.

+------------------------------------------------------------------+
|                 Measurement Harness                              |
|                                                                  |
|  +-------------------+    +---------------------------------+    |
|  | Workload Driver   |--->| Tensor Generator                |    |
|  | (shell script /   |    | 1 GiB random tensor per rank,   |    |
|  |  PyTorch script)  |    | held in either GPU or CPU mem   |    |
|  +-------------------+    | depending on backend            |    |
|           |               +---------------------------------+    |
|           v                            |                         |
|  +---------------------------------------------------------+     |
|  |          Backend-Switch Layer (per Fig 4 / Fig 8)       |     |
|  |                                                         |     |
|  |   if backend==NCCL:    nccl_Bcast / nccl_Allgather /    |     |
|  |                        nccl_Allreduce + (sometimes)     |     |
|  |                        MPI_Send chief->PS               |     |
|  |   if backend==MPI:     MPI_Bcast / MPI_Gather /         |     |
|  |                        MPI_Allreduce + cudaMemcpy       |     |
|  |   if backend==CUDA-MPI: same as MPI but no cudaMemcpy   |     |
|  |   if backend==GLOO:    PyTorch dist primitives over     |     |
|  |                        CPU memory                       |     |
|  +---------------------------------------------------------+     |
|           |                                                       |
|           v                                                       |
|  +---------------------------------------------------------+     |
|  | Wall-clock Timer (per subroutine)                       |     |
|  | total_latency = t_collective + t_cudaMemcpy + t_MPI_Send|     |
|  | reported as components in Tables 2-4                    |     |
|  +---------------------------------------------------------+     |
|           |                                                       |
|           v                                                       |
|  +---------------------------------------------------------+     |
|  | Result Aggregator                                       |     |
|  | (env x backend x architecture x subroutine x nGPU)      |     |
|  | -> bar charts (Fig 13-23) + tables (Tables 2-12)        |     |
|  +---------------------------------------------------------+     |
+------------------------------------------------------------------+
^ Fig 1: Measurement harness — a thin wall-clock instrumentation
  layer wrapped around library API calls, run identically in
  Linux shell and PyTorch. No transport-layer counters captured.

The instrument is shallow on purpose. It captures only three quantities: collective function time, cudaMemcpy time (H<->D), and MPI_Send time (used by NCCL when the parameter server has no GPU allocated). There are no NCCL_DEBUG dumps, no IB performance counters, no GPU SM utilization traces. This means every conclusion in the paper rests on end-to-end wall time — a coarse signal but one that matches what an RL agent actually sees at the ncclAllReduce call boundary, which makes the measurements directly relevant to DynamICCL's reward signal.
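A boundary-level timer of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the `ComponentTimer` class is a hypothetical helper, and the `time.sleep` calls are stand-ins for the real library calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ComponentTimer:
    """Accumulates wall-clock seconds per named component (hypothetical helper)."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def total(self):
        # total_latency = t_collective + t_cudaMemcpy + t_MPI_Send
        return sum(self.seconds.values())


timer = ComponentTimer()
# Stand-ins only: a real harness would wrap the library calls here and,
# for asynchronous GPU work, synchronize the device before stopping the clock.
with timer.measure("cudaMemcpy"):
    time.sleep(0.01)  # stand-in for the D->H copy
with timer.measure("collective"):
    time.sleep(0.02)  # stand-in for the allreduce call
with timer.measure("cudaMemcpy"):
    time.sleep(0.01)  # stand-in for the H->D copy

for name, secs in timer.seconds.items():
    print(f"{name}: {secs:.3f} s")
print(f"total: {timer.total():.3f} s")
```

Keeping the components in separate buckets is what allows a table like the paper's Tables 2-4 to report the collective and the copies side by side.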

Methodology specifics extracted from the paper:

  Knob               Value
  -----------------  ----------------------------------------------
  Tensor size        1 GiB random data, generated fresh each call
  Subroutines timed  Bcast, Gather/Allgather, Allreduce
  Architectures      Parameter Server (PS) and Ring All-Reduce
  Repetitions        Implicit (single-iteration latencies reported)
  Warmup             Not described
  GPU sweep          1, 2, 3, 4 GPUs (intra-node only)
  DL workload        ResNet-18 on CIFAR-10, batch 32, 10 epochs
  Iteration steps    391 (DL) -> Tables 5-6

2. System-Under-Test Architecture (the "specimen")

The specimen is a single multi-GPU node, explicitly intra-node, with consumer GPUs and a PCIe-only interconnect: no NVLink, no NIC, no multi-node fabric. This is critical context because every cross-rank tensor movement in this study traverses PCIe (Gen3 x16, roughly 16 GB/s per direction) or host memory, never NVLink or RDMA.

+----------------------- Single Node (Table 1) ----------------------+
|                                                                    |
|  +-------------------------------------------------------------+   |
|  |  Intel Core i9-10900 (10 cores)   |   32 GiB DDR4-2933      |   |
|  +-------------------------------------------------------------+   |
|                       |                                            |
|                       v PCIe Gen3 x16 (~16 GB/s per direction)     |
|  +--------+   +--------+   +--------+   +--------+                 |
|  | GPU 0  |   | GPU 1  |   | GPU 2  |   | GPU 3  |                 |
|  | RTX    |   | RTX    |   | RTX    |   | RTX    |                 |
|  | 3080   |   | 3080   |   | 3080   |   | 3080   |                 |
|  | 12 GiB |   | 12 GiB |   | 12 GiB |   | 12 GiB |                 |
|  +--------+   +--------+   +--------+   +--------+                 |
|                                                                    |
|   No NVLink between RTX 3080s.                                     |
|   No InfiniBand/RoCE NIC.                                          |
|   All inter-GPU traffic = PCIe peer copy or staged through         |
|   host memory.                                                     |
+--------------------------------------------------------------------+

  Software stack (Section 4):
  +------------------------------------------------+
  |  PyTorch 2.0.1                                 |  application
  +------------------------------------------------+
  |  NCCL 2.4   |   GLOO   |   OpenMPI 4.1.4  /    |  collective libs
  |             |          |   MPICH 3.3 (+ CUDA-  |
  |             |          |   aware OpenMPI)      |
  +------------------------------------------------+
  |  CUDA 11.3  |  NVIDIA driver 515.48            |  GPU runtime
  +------------------------------------------------+
  |  Bare metal | Singularity | Docker             |  container layer
  |             |             | (single + cross)   |
  +------------------------------------------------+
  |  Linux + RTX 3080 PCIe Gen3 x16                |  hardware
  +------------------------------------------------+
^ Fig 2: System under test — 4x RTX 3080 over PCIe-only, varied across
  four virtualization environments (bare metal / Singularity /
  single-docker / cross-docker). NCCL 2.4 is the version studied.

This testbed is closer to a research workstation than to an HPC cluster. The implication for DynamICCL: the regimes the paper exposes most clearly are PCIe-bound intra-node and virtualization-boundary-bound, neither of which is the regime where Ring vs. Tree algorithm choice matters most (that regime needs NVLink + IB and many ranks). What the paper does expose strongly is the cudaMemcpy overhead breakdown and the cross-container latency penalty, both of which are features DynamICCL's Agent-2 should consume.


3. Design-Space Diagram (workload x configuration x topology axes)

The independent variables form a 5-dimensional sweep. The paper does not explicitly enumerate it as a design space, but every figure / table fixes 4 of the 5 axes and varies the fifth.

                   DESIGN SPACE (5 axes)
  +-------------------------------------------------------------+
  |                                                             |
  |  Axis 1: ENVIRONMENT (4 levels)                             |
  |    [bare metal] [Singularity] [single-docker] [cross-docker]|
  |                                                             |
  |  Axis 2: BACKEND LIBRARY (5 levels)                         |
  |    [MPICH] [OpenMPI] [CUDA-aware MPI] [GLOO] [NCCL 2.4]     |
  |                                                             |
  |  Axis 3: PARALLELISM ARCHITECTURE (2 levels)                |
  |    [Parameter Server] [Ring All-Reduce]                     |
  |                                                             |
  |  Axis 4: COLLECTIVE / SUBROUTINE (3 levels)                 |
  |    [Bcast] [Gather / Allgather] [Allreduce]                 |
  |                                                             |
  |  Axis 5: nGPU (4 levels)                                    |
  |    [1] [2] [3] [4]                                          |
  |                                                             |
  |  Held FIXED (no sweep):                                     |
  |    - tensor size: 1 GiB                                     |
  |    - data type: float (random)                              |
  |    - NCCL algorithm/protocol/nChannels/numThreads:          |
  |      DEFAULT (NCCL 2.4 internal selection -- not swept!)    |
  |    - intra-node only (no inter-node experiments)            |
  |    - DL model: ResNet-18, CIFAR-10, batch=32, epochs=10     |
  |                                                             |
  +-------------------------------------------------------------+
^ Fig 3: Design space — 4 x 5 x 2 x 3 x 4 = 480 cells maximum,
  not all populated (e.g. NCCL+PS+gather requires the chief-worker
  workaround; GLOO is PyTorch-only). Note Axis 5 caps at 4 GPUs;
  the paper does not vary message size or any NCCL knob.
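
The cell count in Fig 3 can be checked by enumerating the Cartesian product of the five axes. The sketch below is illustrative Python; the list names are hypothetical, and the axis levels are copied from the diagram.

```python
from itertools import product

ENVIRONMENTS = ["bare metal", "Singularity", "single-docker", "cross-docker"]
BACKENDS = ["MPICH", "OpenMPI", "CUDA-aware MPI", "GLOO", "NCCL"]
ARCHITECTURES = ["Parameter Server", "Ring All-Reduce"]
SUBROUTINES = ["Bcast", "Gather/Allgather", "Allreduce"]
N_GPUS = [1, 2, 3, 4]

# Every combination of the five axes (an upper bound; not all are populated).
cells = list(product(ENVIRONMENTS, BACKENDS, ARCHITECTURES, SUBROUTINES, N_GPUS))
print(len(cells))  # 480

# A typical figure fixes four axes and varies the fifth (here: nGPU).
fixed = ("bare metal", "NCCL", "Ring All-Reduce", "Allreduce")
sweep = [cell for cell in cells if cell[:4] == fixed]
print([cell[4] for cell in sweep])  # [1, 2, 3, 4]
```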

The crucial absence: the paper does not sweep NCCL knobs. No NCCL_ALGO, no NCCL_PROTO, no nChannels, no numThreads, no chunkSize. Every NCCL number reported is at NCCL 2.4 default selection. This means the paper is silent on the exact action space DynamICCL's Agent-2 chooses from. What the paper does tell us is which library (NCCL vs MPI vs GLOO) wins at the higher abstraction layer in each environment — useful as a prior for which backend a DynamICCL deployment should target, but not as evidence about the within-NCCL configuration regime.
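
For contrast, the kind of knob grid the paper leaves unexplored could be enumerated as follows. This is a sketch under assumptions: the environment variables `NCCL_ALGO`, `NCCL_PROTO`, `NCCL_MIN_NCHANNELS` and `NCCL_MAX_NCHANNELS` are documented for recent NCCL releases and may not be honored by NCCL 2.4; the channel counts and the `knob_env` helper are hypothetical.

```python
from itertools import product

ALGOS = ["Ring", "Tree"]
PROTOS = ["LL", "LL128", "Simple"]
N_CHANNELS = [1, 2, 4, 8]  # hypothetical sweep values


def knob_env(algo, proto, n_channels):
    """Environment overrides that pin one NCCL configuration (illustrative)."""
    return {
        "NCCL_ALGO": algo,
        "NCCL_PROTO": proto,
        # Setting min == max pins the channel count to a single value.
        "NCCL_MIN_NCHANNELS": str(n_channels),
        "NCCL_MAX_NCHANNELS": str(n_channels),
    }


grid = [knob_env(a, p, c) for a, p, c in product(ALGOS, PROTOS, N_CHANNELS)]
print(len(grid))  # 24
print(grid[0])
```

Each dictionary would be merged into the process environment before the communicator is created; none of these 24 points is measured in the paper.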

The paper's true contribution to the DynamICCL state vector is Axis 1 (environment) and Axis 3 (PS vs ring) — both of which substantially affect NCCL latency holding all NCCL knobs fixed, which means they are exogenous features the agent must observe but cannot control.


4. Measurement Control Flow Through One Experiment

Reproduced from the paper's Figures 4, 8, 9: Linux-shell flow on the left, PyTorch flow on the right. The branching on "Using MPI?" and "Using CUDA-aware MPI?" is the heart of the methodology: the same 1 GiB tensor takes a different memory-routing path depending on backend, and the timer captures this difference.

  Linux-shell Allreduce flow (Fig 4c)         PyTorch DL flow (Fig 9b)
  +-------------------------------+           +------------------------+
  | (1) Generate 1 GiB tensor in  |           | (1) Load CIFAR-10 in   |
  |     each worker GPU memory    |           |     each node          |
  +---------------+---------------+           +-----------+------------+
                  |                                       |
                  v                                       v
            Using MPI?                         (2) Forward + backward
            +-------+                              ResNet-18 batch=32
            |yes |no|                                     |
            v    v                                        v
  Using CUDA-   Call nccl_Allreduce              (3) Call All-Reduce
  aware MPI?    to execute reduce                    (NCCL, MPI, or
  +-------+     sum                                  GLOO backend)
  |yes |no|                                              |
  v    v                                                 v
  Skip  Call cudaMemcpy           Each iter:    (4) Average parameters
  cuda- D->H to copy data         repeat to         in each node
  Memcpy to CPU mem               391 batches        |
  |     |                         (10 epochs)        v
  v     v                                       (5) Max epoch?
  Call MPI_Allreduce                                |
  on CPU memory                                     v
  |                                              END
  v
  Call cudaMemcpy
  H->D back to GPU
  |
  v
  END
^ Fig 4: Two control flows — synthetic Linux-shell test (left,
  Fig 4c) and PyTorch DDP training loop (right, Fig 9b). The shell
  test exposes per-call latency components in isolation; the
  PyTorch test exposes aggregate training-time impact at 391 steps.

The paper deliberately runs both flows so the reader can attribute DL-time differences to specific subroutine costs measured under the shell flow. This is a clean methodological pattern — the equivalent of separating microbenchmark and end-to-end benchmark — and is directly useful for DynamICCL's evaluation strategy.
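
The backend branching in the shell flow of Fig 4 amounts to a lookup from backend to an ordered list of timed steps. The sketch below is illustrative Python; the function name `allreduce_steps` and the step labels are hypothetical and mirror the diagram rather than the paper's source code.

```python
def allreduce_steps(backend):
    """Ordered steps the shell Allreduce flow takes for a given backend."""
    if backend == "NCCL":
        # GPU-resident tensor, reduced in place by the NCCL call.
        return ["nccl_Allreduce"]
    if backend == "CUDA-aware MPI":
        # GPU buffers are handed to MPI directly, so no staging copies.
        return ["MPI_Allreduce"]
    if backend in ("MPICH", "OpenMPI"):
        # Tensor is staged through host memory around the collective.
        return ["cudaMemcpy D->H", "MPI_Allreduce", "cudaMemcpy H->D"]
    raise ValueError(f"unknown backend: {backend}")


for name in ("NCCL", "CUDA-aware MPI", "MPICH"):
    print(name, "->", allreduce_steps(name))
```

The two extra entries on the non-CUDA-aware path are exactly the cudaMemcpy components that the result tables report separately.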


5. Quantitative Results — Where Each Library Wins

These are the numbers that should be loaded into DynamICCL's training simulator as priors for the (environment, backend) feature combinations.

5.1 Linux Shell Allreduce, 4 GPUs, Bare Metal

  Backend         Latency (s)  Component breakdown (Table 4)
  --------------  -----------  -----------------------------------------
  MPICH           3.877        2.483 allreduce + 0.639 H->D + 0.755 D->H
  OpenMPI         3.296        1.903 allreduce + 0.639 H->D + 0.755 D->H
  CUDA-aware MPI  3.226        3.226 allreduce, no cudaMemcpy
  NCCL            2.285        2.285 allreduce, no cudaMemcpy

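NCCL is reported as 78% faster than MPICH for allreduce on bare metal, the strongest single result in the paper; by the Table 4 totals above, MPICH takes about 1.7x as long as NCCL (3.877 s vs 2.285 s). Most of the gap comes from eliminating the H<->D cudaMemcpy round-trip (1.394 s combined, ~36% of MPICH's total); the collective call itself is also somewhat faster (2.285 s vs 2.483 s).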
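
The component arithmetic behind this subsection can be reproduced directly from the Table 4 values listed above. This is a small Python sketch; the helper names are hypothetical, and `pct_higher` uses the definition (slow / fast - 1) x 100, which is an assumption about how such percentages are computed.

```python
# (allreduce_s, h2d_s, d2h_s) on bare metal with 4 GPUs, from the table above.
TABLE4 = {
    "MPICH": (2.483, 0.639, 0.755),
    "OpenMPI": (1.903, 0.639, 0.755),
    "CUDA-aware MPI": (3.226, 0.0, 0.0),
    "NCCL": (2.285, 0.0, 0.0),
}


def total(backend):
    """End-to-end latency: collective plus both cudaMemcpy legs."""
    return sum(TABLE4[backend])


def memcpy_share(backend):
    """Fraction of end-to-end latency spent in cudaMemcpy."""
    _, h2d, d2h = TABLE4[backend]
    return (h2d + d2h) / total(backend)


def pct_higher(slow, fast):
    """How much higher `slow` is than `fast`, in percent."""
    return (slow / fast - 1.0) * 100.0


print(round(total("MPICH"), 3))                                  # 3.877
print(round(memcpy_share("MPICH") * 100))                        # 36
print(round(pct_higher(total("MPICH"), total("NCCL"))))          # 70
```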

5.2 PyTorch Allreduce, 4 GPUs, Bare Metal (Fig 21)

  Backend  Latency (s)
  -------  -----------
  MPI      2.80
  GLOO     1.61
  NCCL     0.647

NCCL is 332% faster than MPI and 149% faster than GLOO. Quoted finding: "in PyTorch, NCCL showcased a substantial performance advantage, with a latency difference of 345% when compared to MPI" (Section 7.3). For all-reduce, NCCL dominates regardless of environment — this is a robust regime.

5.3 NCCL Cross-Docker Penalty (the "wall")

  Subroutine  Bare metal NCCL (s)  Cross-docker NCCL (s)  Increase
  ----------  -------------------  ---------------------  -------------------
  Bcast       1.008 (Linux shell)  2.384                  +137% / 213% quoted
  Allgather   3.448 (Linux shell)  5.135                  +49%
  Allreduce   2.285 (Linux shell)  2.200                  -3.7% (flat)

In PyTorch (Fig 23): bcast +89%, allgather +54%, allreduce +131% when crossing the cross-docker boundary on 4 GPUs. The headline "213% higher latency compared to single docker" applies specifically to NCCL bcast.

5.4 GLOO Cross-Docker Surprise

GLOO_Gather is 36% faster in cross-docker than in single-docker (Section 7.3 finding 2). This is opposite to NCCL's behavior and is one of the few configurations where crossing container boundaries helps. A plausible explanation is that GLOO is CPU-memory-based, so cross-docker isolation removes some intra-container memory contention; the measurements alone do not confirm this.

5.5 Full DL Training (ResNet-18, CIFAR-10, 10 epochs, 4 GPUs)

  Architecture      Backend  Bare metal training (s)  Cross-docker (s)
  ----------------  -------  -----------------------  ----------------
  Parameter Server  MPI      1095                     1447 (x1.32)
  Parameter Server  GLOO     1676                     1386 (x0.83)
  Parameter Server  NCCL     503                      833 (x1.66)
  Ring All-Reduce   MPI      384                      451 (x1.17)
  Ring All-Reduce   GLOO     650                      750 (x1.15)
  Ring All-Reduce   NCCL     186                      283 (x1.52)

Ring all-reduce is uniformly faster than PS (NCCL: 186 s vs 503 s on bare metal, a 2.7x speedup). The gap arises because PS imposes two collective phases (broadcast + gather) per iteration vs. a single allreduce phase.
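
The cross-docker multipliers and the PS-vs-ring speedup follow directly from the training times in the table above. The sketch below recomputes them in Python; the dictionary and function names are hypothetical.

```python
# (bare_metal_s, cross_docker_s) training times from the table above.
TRAINING_S = {
    ("Parameter Server", "MPI"): (1095, 1447),
    ("Parameter Server", "GLOO"): (1676, 1386),
    ("Parameter Server", "NCCL"): (503, 833),
    ("Ring All-Reduce", "MPI"): (384, 451),
    ("Ring All-Reduce", "GLOO"): (650, 750),
    ("Ring All-Reduce", "NCCL"): (186, 283),
}


def cross_docker_ratio(arch, backend):
    """Cross-docker training time relative to bare metal."""
    bare_metal_s, cross_docker_s = TRAINING_S[(arch, backend)]
    return cross_docker_s / bare_metal_s


for arch, backend in TRAINING_S:
    print(f"{arch:16s} {backend:5s} x{cross_docker_ratio(arch, backend):.2f}")

# Bare-metal NCCL: parameter server time over ring all-reduce time.
ps_over_ring = (TRAINING_S[("Parameter Server", "NCCL")][0]
                / TRAINING_S[("Ring All-Reduce", "NCCL")][0])
print(f"PS / ring (NCCL, bare metal): {ps_over_ring:.1f}x")
```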

5.6 Best / Worst Summary (Tables 7-12 distilled)

Best PyTorch-DL latencies: NCCL bcast 9.98 s (singularity), NCCL gather 356 s (singularity), NCCL allreduce 94.10 s (single-docker). Worst PyTorch-DL: MPI bcast 363 s (cross-docker), MPI gather 1084 s (cross-docker), GLOO allreduce 617 s (cross-docker). The pattern across these extremes: every worst case is cross-docker, so multi-GPU-per-container generally beats single-GPU-per-container, with the GLOO gather result in Section 5.4 as the notable exception.


6. Configuration-Regime Trade-off Tables

6.1 Backend Choice by Architecture (paper's central trade-off)

  Dimension                       MPI / CUDA-aware MPI      GLOO         NCCL             Winner (DynamICCL)
  ------------------------------  ------------------------  -----------  ---------------  ------------------
  Allreduce latency (bare metal)  3.23 s (CUDA-MPI, shell)  1.61 s (PT)  0.647 s (PT)     NCCL
  PS broadcast (bare metal)       1.84 s (CUDA-MPI)         1.20 s (PT)  1.01 s           NCCL
  PS gather (bare metal)          2.225 s                   1.20 s (PT)  3.45 s           MPI (CUDA-aware)
  Cross-docker robustness         Stable                    Improves     Degrades 213%    GLOO/MPI
  GPU resource efficiency         High                      High         -1 GPU for PS    MPI/GLOO
  FSDP / large-scale training     N/A                       N/A          De facto choice  NCCL

For DynamICCL, prefer NCCL because Agent-2's optimization target is knob selection within NCCL, and the paper's evidence for NCCL allreduce dominance (a 78%-345% lead) covers exactly the regime where DynamICCL applies. The GLOO and MPI numbers serve as a floor: if NCCL under the agent's chosen config underperforms GLOO, that is the strongest possible signal that the chosen NCCL knobs are wrong.

6.2 Architecture Choice (PS vs Ring)

  Dimension                 Parameter Server           Ring All-Reduce      Winner (DynamICCL)
  ------------------------  -------------------------  -------------------  ------------------
  Training time (NCCL, BM)  503 s                      186 s                Ring
  GPU resource utilization  -1 GPU for PS (NCCL only)  All GPUs as workers  Ring
  Cross-docker degradation  x1.66 (NCCL)               x1.52 (NCCL)         Ring
  Comm pattern complexity   broadcast + gather         single allreduce     Ring
  Step count (NCCL)         521                        391                  Ring

For DynamICCL, prefer Ring because the paper's data (and DDL practice) confirms ring-allreduce as the dominant data-parallel pattern for intra-node training. Agent-2's training distribution should oversample ring-allreduce regimes accordingly.

6.3 Virtualization Environment

  Dimension                 Bare metal  Singularity  Single-docker  Cross-docker  Winner (DynamICCL)
  ------------------------  ----------  -----------  -------------  ------------  ------------------
  NCCL allreduce (PyTorch)  0.647 s     ~0.62 s      0.64 s         1.49 s        Bare metal / Sing.
  NCCL bcast (PyTorch)      1.30 s      1.06 s       1.27 s         2.60 s        Singularity
  Best DL allreduce time    186 s       162 s        164 s          283 s         Singularity
  Cross-container required  No          No           No             YES           --

For DynamICCL, prefer treating "is_cross_docker" as a binary state feature — it shifts NCCL latency by 50-200% with no change in workload, and the agent must observe it to make correct decisions.


7. Bottlenecks & Insights Surfaced by the Measurements

7.1 cudaMemcpy is a major cost in non-CUDA-aware MPI

Tables 2-4 break out cudaMemcpy as a separate component. For MPICH broadcast, cudaMemcpy is 41% of total time (0.653 s of 1.598 s). For MPICH allreduce, the cudaMemcpy H<->D round-trip is 36% (1.394 s of 3.877 s). This single insight justifies CUDA-aware variants and validates NCCL's GPU-direct architecture: eliminating the H<->D round-trip removes more than a third of the end-to-end latency in these cases.

7.2 NCCL on Cross-Docker hits a structural wall

The 213% bcast latency increase is "structural" — it appears whenever NCCL must cross container boundaries. The paper implies this is due to NCCL's reliance on shared memory IPC for intra-node GPU coordination, which is unavailable across docker namespaces. This is actionable intelligence: an RL agent should recognize this regime and either back off from NCCL-aggressive configs or escalate to a fallback.

7.3 GPU-resource cost of PS+NCCL

Section 6.1 / Fig 10b: NCCL requires a GPU for the parameter server, reducing worker count from N to N-1 (Table 5: 521 worker steps vs 391). This is invisible if you only look at per-call latency; only the end-to-end DL training time (Table 5) exposes the throughput penalty. Pattern for DynamICCL: end-to-end iteration time is the correct reward signal, not per-call latency in isolation.
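
The 391 and 521 step counts are consistent with CIFAR-10's 50,000 training images being split evenly across workers at batch 32. The Python sketch below shows that arithmetic; treating the counts as per-epoch steps under even sharding is an inference from the numbers, not a derivation stated in the paper, and `steps_per_epoch` is a hypothetical helper.

```python
import math

CIFAR10_TRAIN_IMAGES = 50_000  # size of the standard CIFAR-10 training split
BATCH_SIZE = 32                # per-worker batch size used in the study


def steps_per_epoch(n_workers):
    """Steps each worker runs per epoch when the data is sharded evenly."""
    return math.ceil(CIFAR10_TRAIN_IMAGES / (BATCH_SIZE * n_workers))


print(steps_per_epoch(4))  # 391: all four GPUs act as workers
print(steps_per_epoch(3))  # 521: one GPU is reserved for the parameter server
```

Losing one worker raises the per-worker step count by about a third, which is the throughput penalty that per-call latency alone does not show.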

7.4 PCIe-only intra-node is the tested topology

Because there is no NVLink and no NIC, the paper's measurements upper-bound NCCL bandwidth at roughly 16 GB/s per direction (PCIe Gen3 x16). On NVLink (~600 GB/s) the relative ordering of ring-vs-tree and LL-vs-LL128 may differ. The reported NCCL allreduce of 0.647 s for 1 GiB across 4 GPUs corresponds to about 1.7 GB/s algorithm bandwidth, or about 2.5 GB/s ring bus bandwidth under the usual 2(n-1)/n scaling. That is well below PCIe peak, suggesting protocol/sync overhead dominates at this scale.
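
The effective bandwidth implied by the 0.647 s measurement can be worked out as follows. This Python sketch assumes the bandwidth convention used by the nccl-tests suite (algorithm bandwidth = bytes / time; allreduce bus bandwidth = algorithm bandwidth x 2(n-1)/n); the function name is hypothetical.

```python
GIB = 1024 ** 3


def allreduce_bandwidth(bytes_per_rank, seconds, n_ranks):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for an allreduce."""
    algbw = bytes_per_rank / seconds / 1e9
    # A ring allreduce moves 2 * (n - 1) / n of the buffer over each link.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


algbw, busbw = allreduce_bandwidth(1 * GIB, 0.647, 4)
print(f"algbw = {algbw:.2f} GB/s, busbw = {busbw:.2f} GB/s")
# algbw = 1.66 GB/s, busbw = 2.49 GB/s
```

Either figure sits far below the nominal PCIe Gen3 x16 rate, which supports the reading that overhead rather than link bandwidth limits this measurement.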

7.5 Singularity > Single-docker > Bare metal in some cases

Several Singularity numbers beat bare metal (e.g., MPICH allreduce 2.97 s vs 3.87 s on bare metal; Section 5.2: "MPICH recorded a latency of 2.97 s, which is 30% lower than that of bare metal"). This is counter-intuitive; one possible explanation is HPC-tuned filesystem and namespace defaults in Singularity, though the paper's measurements do not isolate the cause. The signal for DynamICCL: the container runtime is itself a configuration regime that can change NCCL behavior.


8. Limitations of the Methodology

  Limitation                             Implication for DynamICCL
  -------------------------------------  ----------------------------------------------
  Single fixed message size (1 GiB)      No data on small-message regime where LL/LL128
                                         matter; no message-size sensitivity surface
  No NCCL knob sweep                     No ground truth on (algo, proto, nCh) choices
  4 GPUs max, intra-node only            No scaling-degradation data; no Ring vs Tree
                                         crossover; no inter-node fabric
  No NVLink                              Cannot validate intra-node high-BW regime
  No repetition counts / variance        Cannot estimate measurement noise floor for
                                         the RL reward signal
  Single workload (ResNet-18)            No model-size sensitivity
  NCCL 2.4 (older release; paper is      Recent NCCL versions may have shifted defaults
  from June 2024)
  No GPU/network telemetry               Cannot supply rich state features; only
                                         end-to-end latency available

The most consequential limitation is the missing knob sweep. The paper validates that NCCL is the right library but provides zero evidence about which NCCL configuration is best — which is exactly the question DynamICCL answers.
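
The missing repetition counts and variance noted in the table above are cheap to add in any follow-up harness. The Python sketch below shows one way to do it; the `benchmark` helper, its default warmup and repeat counts, and the CPU-only stand-in workload are all hypothetical choices, not values from the paper.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeats=10):
    """Time `fn` after warmup runs and summarize the spread of the samples."""
    for _ in range(warmup):
        fn()  # discard warmup runs (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "n": len(samples),
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples),
    }


# Stand-in workload; a real harness would call the collective here.
stats = benchmark(lambda: sum(range(100_000)))
print(stats)
```

Reporting the median with a spread gives the RL reward signal a noise floor against which a configuration change can be judged.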


9. What to Borrow for DynamICCL

The paper is methodologically modest but contributes three concrete items to DynamICCL's design: telemetry features that should enter Agent-2's state vector, evaluation patterns DynamICCL should adopt, and configuration regimes where the policy must be most aggressive.

9.1 State-vector features the paper validates as predictive

These features change NCCL latency without any change in NCCL knobs, which means they are exogenous regime indicators the agent must observe in order to choose knobs correctly.

  Add to Agent-2 state vector s_t:
  +-----------------------------------------------------------+
  |  is_cross_container    : bool         (Sec 7.3 finding 1) |
  |  is_singularity        : bool         (Sec 5.2)           |
  |  is_bare_metal         : bool         (baseline)          |
  |  is_pcie_only_intra    : bool         (no NVLink)         |
  |  parallelism_arch      : {PS, Ring}   (Sec 6.3)           |
  |  cudamemcpy_observed_s : float        (Tables 2-4)        |
  |  prev_allreduce_lat_s  : float        (k=8 history)       |
  |  collective_type       : enum         (already there)     |
  +-----------------------------------------------------------+
^ Fig 5: Borrowed state features. The first four are env binaries
  the agent observes once at startup; cudamemcpy_observed_s and
  prev_allreduce_lat_s are runtime features updated per call.

The cudaMemcpy observation is the most novel addition: it is a backend symptom — high cudaMemcpy time means the backend chose a non-CUDA-aware path, which Agent-2 can correlate with its own algorithm/protocol selections.
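
The borrowed features could be held in a small record like the one below. This is a Python sketch under assumptions: the class names, the numeric encoding, and the zero-padding of the history are hypothetical, and `collective_type` is left out because it is already part of the existing state.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Deque, List

HISTORY_LEN = 8  # k=8 most recent allreduce latencies


class ParallelismArch(Enum):
    PS = 0
    RING = 1


@dataclass
class EnvState:
    # Environment binaries, observed once at startup.
    is_cross_container: bool
    is_singularity: bool
    is_bare_metal: bool
    is_pcie_only_intra: bool
    parallelism_arch: ParallelismArch
    # Runtime features, updated per call.
    cudamemcpy_observed_s: float = 0.0
    prev_allreduce_lat_s: Deque[float] = field(
        default_factory=lambda: deque(maxlen=HISTORY_LEN))

    def record_allreduce(self, latency_s: float) -> None:
        # The bounded deque drops the oldest sample automatically.
        self.prev_allreduce_lat_s.append(latency_s)

    def to_vector(self) -> List[float]:
        # Left-pad the history with zeros until k samples are available.
        pad = [0.0] * (HISTORY_LEN - len(self.prev_allreduce_lat_s))
        return [
            float(self.is_cross_container),
            float(self.is_singularity),
            float(self.is_bare_metal),
            float(self.is_pcie_only_intra),
            float(self.parallelism_arch.value),
            self.cudamemcpy_observed_s,
        ] + pad + list(self.prev_allreduce_lat_s)


state = EnvState(True, False, False, True, ParallelismArch.RING)
state.record_allreduce(1.49)
print(len(state.to_vector()))  # 14
```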

9.2 Evaluation patterns DynamICCL should reuse

The paper's parallel shell + PyTorch measurement pattern is the correct evaluation harness layout for DynamICCL. Specifically:

  Pattern (paper)                                   DynamICCL adoption
  ------------------------------------------------  ----------------------------------------------------------------
  Microbenchmark in Linux shell: time API directly  NCCL-tests microbenchmark per (algo, proto, nCh) cell
  End-to-end test in PyTorch DDP                    Real workload (e.g. Llama-7B step time) with same config grid
  Component breakdown (call vs cudaMemcpy vs send)  Component breakdown (kernel vs proxy vs network) per channel
  4 environments x 5 backends x 3 collectives       Cluster x algo x proto x nCh x msg-size grid
  Best/worst tables (Tables 7-12)                   Best/worst regime tables for Agent-2 sanity checking

The "two harnesses, one workload" pattern is what lets the paper attribute DL-time differences to specific microbenchmark costs. DynamICCL needs the same: a fast microbenchmark simulator (per Pensieve/borrow note 6.4) plus a real DL training loop, with correlated outputs.

9.3 Configuration regimes where Agent-2 should be most aggressive

The paper identifies three regimes where the exogenous environment (not NCCL knobs) shifts latency or training time by up to roughly 3x. Agent-2 should treat these as "high-policy-gradient" regions, where exploration and exploitation deliver the largest reward swings.

Regime A (NCCL Cross-Docker): NCCL latency rises 213% (bcast) and 131% (allreduce). In this regime Agent-2 should aggressively explore reducing nChannels (less SHM contention across container boundaries) and switching from LL128 to Simple (lower coordination density). The paper does not test these; Agent-2 must discover them.

Regime B (Parameter Server with NCCL): -1 GPU + extra broadcast/gather phase. Agent-2 may have no way to fix this within NCCL knobs alone; the right action is to surface it to a higher-level controller. Implication: DynamICCL should expose a "non_actionable_regime" flag for the user when no NCCL knob can recover from the architectural overhead.

Regime C (Singularity vs Bare metal): Singularity is sometimes faster than bare metal. The agent should not assume bare metal is always optimal; it should learn from observed latencies that container choice can be a free win.

9.4 The 1 GiB / 4 GPU sweet spot is not where Agent-2 is most useful

The paper shows that at 1 GiB on 4 GPUs intra-node, NCCL with default config beats every alternative by 78-345%. Agent-2's marginal value is small in this regime — the default config is already near optimal. Agent-2's training data should under-sample this regime and oversample the regimes the paper does not test: small messages (<64 KiB), large rank counts (>=16), inter-node fabrics, and cross-docker NCCL.

9.5 End-to-end iteration time as the reward (not per-call latency)

Section 6.3 + Table 5 demonstrate that per-call latency can mislead: NCCL+PS has a fast per-call broadcast (12 s in Table 5) but loses 1 GPU and runs 521 instead of 391 worker steps. Only end-to-end training time exposes this. Agent-2's reward must include training-step-time, not just collective-latency. This is consistent with the Pensieve-borrow principle (reward = actual metric, not proxy) and the paper provides empirical evidence for it.

9.6 Backend-floor sanity check

Quoted: "NCCL achieves up to 345% lower execution time in all-reduce operations compared to other libraries" (Abstract). DynamICCL can use this as a sanity floor: if Agent-2's chosen NCCL config underperforms GLOO+ring at the same DL workload, the action selection is broken. A runtime guard of the form "if observed_lat > 2.0 * gloo_baseline_lat: revert to NCCL_DEFAULT" is a cheap safety net the paper's data justifies.
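
The guard described above fits in a few lines. This is a minimal Python sketch: the function name `guard_action`, the string sentinel for the default config, and the example config label are hypothetical; the 2.0 factor and the GLOO and NCCL latencies come from this note.

```python
NCCL_DEFAULT = "NCCL_DEFAULT"  # sentinel meaning "use NCCL's own defaults"


def guard_action(chosen_config, observed_lat_s, gloo_baseline_lat_s, factor=2.0):
    """Fall back to the default config when NCCL is far slower than GLOO."""
    if observed_lat_s > factor * gloo_baseline_lat_s:
        return NCCL_DEFAULT
    return chosen_config


# Bare-metal PyTorch allreduce numbers from Section 5.2: GLOO 1.61 s, NCCL 0.647 s.
print(guard_action("ring/LL128/4ch", 0.647, 1.61))  # ring/LL128/4ch (kept)
print(guard_action("ring/LL128/4ch", 4.0, 1.61))    # NCCL_DEFAULT (reverted)
```

Because default NCCL is already about 2.5x faster than GLOO at this operating point, a config that lands above twice the GLOO latency has lost roughly a factor of five, which makes the threshold conservative.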


10. Analogy

The paper is a wind-tunnel test of a generic airframe with the control surfaces locked. The investigators measure how the airframe (1 GiB tensor on 4 GPUs) flies through different atmospheres (bare metal, Singularity, single-docker, cross-docker) using different engines (MPI, GLOO, NCCL), but the rudder, ailerons, and elevator are bolted in their default position. The result is a clean map of which engine wins in which atmosphere, but no information about how to fly the airframe. DynamICCL is the autopilot that the wind-tunnel study cannot replace: it operates, in real time, the control surfaces the study kept locked. The paper's value to DynamICCL is therefore the atmospheric map, which shows where the wind shears (cross-docker) and where the air is calm (bare-metal allreduce at 1 GiB), so the autopilot knows where to apply the most authority. The paper itself, however, never demonstrates that the autopilot is necessary: at its single tested operating point, the locked-default airframe already flies well.


Summary of Borrowed Patterns

  Pattern from Lee & Lee (2024)                          DynamICCL application
  -----------------------------------------------------  -----------------------------------------------------------------
  cudaMemcpy as latency component (Tables 2-4)           Add cudamemcpy_observed_s to Agent-2 state vector
  Cross-docker boundary as latency multiplier (Sec 7.3)  is_cross_container as binary state feature
  Two-harness eval (Linux shell + PyTorch)               Microbenchmark + real DL workload, correlated
  End-to-end DL time vs per-call latency (Table 5)       Reward includes training-step-time, not just collective latency
  NCCL+PS GPU-resource penalty (Sec 6.1)                 "non_actionable_regime" flag when NCCL knobs cannot recover
  Singularity beats bare metal (Sec 5.2)                 container_runtime as ordinal state feature
  1 GiB / 4 GPU is "default-optimal"                     Under-sample this regime in Agent-2 training data
  GLOO baseline as floor (Fig 21)                        Runtime safety guard: revert to NCCL_DEFAULT if NCCL > 2x GLOO
  ResNet-18 / CIFAR-10 baseline workload                 Reuse as DynamICCL's smoke-test workload for plugin correctness
  391 worker steps as a reproducibility anchor           Standardize on a fixed step budget for cross-config DL comparisons