Architecture & Measurement-Design Analysis
Collective Communication Performance Evaluation for Distributed Deep Learning Training
Source: Lee, S.; Lee, J. Appl. Sci. 2024, 14, 5100. MDPI. https://doi.org/10.3390/app14125100
Authors: Sookwang Lee (ETRI Supercomputing Tech. Research Center) and Jaehwan Lee (Korea Aerospace University)
Submitted: 16 May 2024
Published: 12 June 2024
Reader: Direct PDF read (gemini-reader quota exhausted; codex-reader model unavailable)
Analyst: Vishwakarma
Date: 2026-04-28
Table of Contents
- Evaluation Harness Architecture (the "instrument")
- System-Under-Test Architecture (the "specimen")
- Design-Space Diagram (workload x configuration x topology axes swept)
- Measurement Control Flow Through One Experiment
- Quantitative Results — Where Each Library Wins
- Configuration-Regime Trade-off Tables
- Bottlenecks & Insights Surfaced by the Measurements
- Limitations of the Methodology
- What to Borrow for DynamICCL
- Analogy
1. Evaluation Harness Architecture (the "instrument")
The harness is deliberately minimal: no Nsight, no NCCL_DEBUG=INFO parsing, no per-channel telemetry. The authors instead time function-call boundaries from inside the application, in two distinct contexts: programs launched from a Linux shell that call the library APIs directly, and PyTorch invoking the same backends through DDP. The architecture is best understood as a matrix of (environment) x (backend) x (architecture) x (subroutine) x (GPU count), with a single latency cell per matrix point.
+------------------------------------------------------------------+
| Measurement Harness |
| |
| +-------------------+ +---------------------------------+ |
| | Workload Driver |--->| Tensor Generator | |
| | (shell script / | | 1 GiB random tensor per rank, | |
| | PyTorch script) | | held in either GPU or CPU mem | |
| +-------------------+ | depending on backend | |
| | +---------------------------------+ |
| v | |
| +---------------------------------------------------------+ |
| | Backend-Switch Layer (per Fig 4 / Fig 8) | |
| | | |
| | if backend==NCCL: nccl_Bcast / nccl_Allgather / | |
| | nccl_Allreduce + (sometimes) | |
| | MPI_Send chief->PS | |
| | if backend==MPI: MPI_Bcast / MPI_Gather / | |
| | MPI_Allreduce + cudaMemcpy | |
| | if backend==CUDA-MPI: same as MPI but no cudaMemcpy | |
| | if backend==GLOO: PyTorch dist primitives over | |
| | CPU memory | |
| +---------------------------------------------------------+ |
| | |
| v |
| +---------------------------------------------------------+ |
| | Wall-clock Timer (per subroutine) | |
| | total_latency = t_collective + t_cudaMemcpy + t_MPI_Send| |
| | reported as components in Tables 2-4 | |
| +---------------------------------------------------------+ |
| | |
| v |
| +---------------------------------------------------------+ |
| | Result Aggregator | |
| | (env x backend x architecture x subroutine x nGPU) | |
| | -> bar charts (Fig 13-23) + tables (Tables 2-12) | |
| +---------------------------------------------------------+ |
+------------------------------------------------------------------+
^ Fig 1: Measurement harness — a thin wall-clock instrumentation
layer wrapped around library API calls, run identically in
Linux shell and PyTorch. No transport-layer counters captured.
The instrument is shallow on purpose. It captures only three
quantities: collective function time, cudaMemcpy time (H<->D), and
MPI_Send time (used by NCCL when the parameter server has no GPU
allocated). There are no NCCL_DEBUG dumps, no IB performance counters,
no GPU SM utilization traces. This means every conclusion in the paper
rests on end-to-end wall time — a coarse signal but one that matches
what an RL agent actually sees at the ncclAllReduce call
boundary, which makes the measurements directly relevant to DynamICCL's
reward signal.
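The three-component timing model in Fig 1 can be sketched as a small accumulator. This is a minimal illustration, not the authors' harness: the class name `ComponentTimer`, the helper `run_mpi_style_allreduce`, and the component labels are hypothetical, and the callables stand in for the real copy and collective calls. In a real GPU harness each callable would have to block until the device work completes, since CUDA launches are asynchronous.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ComponentTimer:
    """Accumulates wall-clock time per named component of one experiment."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock              # injectable so the timer can be tested
        self.components = defaultdict(float)

    @contextmanager
    def measure(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Add elapsed wall time to this component's running total.
            self.components[name] += self._clock() - start

    @property
    def total_latency(self):
        # total_latency = t_collective + t_cudaMemcpy + t_MPI_Send (when present)
        return sum(self.components.values())


def run_mpi_style_allreduce(timer, copy_d2h, allreduce, copy_h2d):
    """Time the non-CUDA-aware MPI path: D->H copy, collective, H->D copy."""
    with timer.measure("cudaMemcpy_D2H"):
        copy_d2h()
    with timer.measure("collective"):
        allreduce()
    with timer.measure("cudaMemcpy_H2D"):
        copy_h2d()
    return dict(timer.components)


if __name__ == "__main__":
    timer = ComponentTimer()
    work = lambda: sum(range(10_000))    # placeholder for a real library call
    parts = run_mpi_style_allreduce(timer, work, work, work)
    print(parts, timer.total_latency)
```

Keeping the components separate rather than timing only the outer call is what allows a breakdown like the one the paper reports in Tables 2-4.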
Methodology specifics extracted from the paper:
| Knob | Value |
|---|---|
| Tensor size | 1 GiB random data, generated fresh each call |
| Subroutines timed | Bcast, Gather/Allgather, Allreduce |
| Architectures | Parameter Server (PS) and Ring All-Reduce |
| Repetitions | Implicit (single-iteration latencies reported) |
| Warmup | Not described |
| GPU sweep | 1, 2, 3, 4 GPUs (intra-node only) |
| DL workload | ResNet-18 on CIFAR-10, batch 32, 10 epochs |
| Iteration steps | 391 (DL) -> Tables 5-6 |
2. System-Under-Test Architecture (the "specimen")
A single multi-GPU node (explicitly intra-node) with consumer GPUs and a PCIe-only interconnect. No NVLink. No NIC. No multi-node fabric. This is critical context because every cross-rank tensor movement in this study traverses PCIe (Gen3 x16, roughly 16 GB/s per direction) or host memory, never NVLink or RDMA.
+----------------------- Single Node (Table 1) ----------------------+
| |
| +-------------------------------------------------------------+ |
| | Intel Core i9-10900 (10 cores) | 32 GiB DDR4-2933 | |
| +-------------------------------------------------------------+ |
| | |
| v PCIe Gen3 x16 (~16 GB/s per direction) |
| +--------+ +--------+ +--------+ +--------+ |
| | GPU 0 | | GPU 1 | | GPU 2 | | GPU 3 | |
| | RTX | | RTX | | RTX | | RTX | |
| | 3080 | | 3080 | | 3080 | | 3080 | |
| | 12 GiB | | 12 GiB | | 12 GiB | | 12 GiB | |
| +--------+ +--------+ +--------+ +--------+ |
| |
| No NVLink between RTX 3080s. |
| No InfiniBand/RoCE NIC. |
| All inter-GPU traffic = PCIe peer copy or staged through |
| host memory. |
+--------------------------------------------------------------------+
Software stack (Section 4):
+------------------------------------------------+
| PyTorch 2.0.1 | application
+------------------------------------------------+
| NCCL 2.4 | GLOO | OpenMPI 4.1.4 / | collective libs
| | | MPICH 3.3 (+ CUDA- |
| | | aware OpenMPI) |
+------------------------------------------------+
| CUDA 11.3 | NVIDIA driver 515.48 | GPU runtime
+------------------------------------------------+
| Bare metal | Singularity | Docker | container layer
| | | (single + cross) |
+------------------------------------------------+
| Linux + RTX 3080 PCIe Gen3 x16 | hardware
+------------------------------------------------+
^ Fig 2: System under test — 4x RTX 3080 over PCIe-only, varied across
four virtualization environments (bare metal / Singularity /
single-docker / cross-docker). NCCL 2.4 is the version studied.
This testbed is closer to a research workstation than to an HPC cluster. The implication for DynamICCL: the regimes the paper exposes most clearly are PCIe-bound intra-node and virtualization-boundary-bound, neither of which is the regime where Ring vs. Tree algorithm choice matters most (that regime needs NVLink + IB and many ranks). What the paper does expose strongly is the cudaMemcpy overhead breakdown and the cross-container latency penalty, both of which are features DynamICCL's Agent-2 should consume.
3. Design-Space Diagram (workload x configuration x topology axes)
The independent variables form a 5-dimensional sweep. The paper does not explicitly enumerate it as a design space, but every figure / table fixes 4 of the 5 axes and varies the fifth.
DESIGN SPACE (5 axes)
+-------------------------------------------------------------+
| |
| Axis 1: ENVIRONMENT (4 levels) |
| [bare metal] [Singularity] [single-docker] [cross-docker]|
| |
| Axis 2: BACKEND LIBRARY (5 levels) |
| [MPICH] [OpenMPI] [CUDA-aware MPI] [GLOO] [NCCL 2.4] |
| |
| Axis 3: PARALLELISM ARCHITECTURE (2 levels) |
| [Parameter Server] [Ring All-Reduce] |
| |
| Axis 4: COLLECTIVE / SUBROUTINE (3 levels) |
| [Bcast] [Gather / Allgather] [Allreduce] |
| |
| Axis 5: nGPU (4 levels) |
| [1] [2] [3] [4] |
| |
| Held FIXED (no sweep): |
| - tensor size: 1 GiB |
| - data type: float (random) |
| - NCCL algorithm/protocol/nChannels/numThreads: |
| DEFAULT (NCCL 2.4 internal selection -- not swept!) |
| - intra-node only (no inter-node experiments) |
| - DL model: ResNet-18, CIFAR-10, batch=32, epochs=10 |
| |
+-------------------------------------------------------------+
^ Fig 3: Design space — 4 x 5 x 2 x 3 x 4 = 480 cells maximum,
not all populated (e.g. NCCL+PS+gather requires the chief-worker
workaround; GLOO is PyTorch-only). Note Axis 5 caps at 4 GPUs;
the paper does not vary message size or any NCCL knob.
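The 480-cell upper bound follows directly from the Cartesian product of the five axes. The sketch below enumerates it; the list names and the `slice_along_gpus` helper are illustrative choices, and no attempt is made to mark which cells the paper actually populates.

```python
from itertools import product

ENVIRONMENTS = ["bare metal", "Singularity", "single-docker", "cross-docker"]
BACKENDS = ["MPICH", "OpenMPI", "CUDA-aware MPI", "GLOO", "NCCL"]
ARCHITECTURES = ["Parameter Server", "Ring All-Reduce"]
SUBROUTINES = ["Bcast", "Gather/Allgather", "Allreduce"]
GPU_COUNTS = [1, 2, 3, 4]

# Every (env, backend, arch, subroutine, nGPU) combination: the maximum grid.
cells = list(product(ENVIRONMENTS, BACKENDS, ARCHITECTURES, SUBROUTINES, GPU_COUNTS))


def slice_along_gpus(env, backend, arch, subroutine):
    """Fix four axes and vary the fifth (nGPU), as a single figure would."""
    return [c for c in cells if c[:4] == (env, backend, arch, subroutine)]


if __name__ == "__main__":
    print(len(cells))  # 4 * 5 * 2 * 3 * 4
    print(slice_along_gpus("bare metal", "NCCL", "Ring All-Reduce", "Allreduce"))
```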
The crucial absence: the paper does not sweep NCCL
knobs. No NCCL_ALGO, no NCCL_PROTO,
no nChannels, no numThreads, no chunkSize. Every NCCL number reported is
at NCCL 2.4 default selection. This means the paper is silent on the
exact action space DynamICCL's Agent-2 chooses from. What the paper
does tell us is which library (NCCL vs MPI vs GLOO) wins at the
higher abstraction layer in each environment — useful as a
prior for which backend a DynamICCL deployment should target,
but not as evidence about the within-NCCL configuration regime.
The paper's true contribution to the DynamICCL state vector is Axis 1 (environment) and Axis 3 (PS vs ring) — both of which substantially affect NCCL latency holding all NCCL knobs fixed, which means they are exogenous features the agent must observe but cannot control.
4. Measurement Control Flow Through One Experiment
Reproduced from the paper's Figures 4, 8, 9: Linux-shell flow on the left, PyTorch flow on the right. The branching on "Using MPI?" and "Using CUDA-aware MPI?" is the heart of the methodology: the same 1 GiB tensor takes a different memory-routing path depending on backend, and the timer captures this difference.
Linux-shell Allreduce flow (Fig 4c) PyTorch DL flow (Fig 9b)
+-------------------------------+ +------------------------+
| (1) Generate 1 GiB tensor in | | (1) Load CIFAR-10 in |
| each worker GPU memory | | each node |
+---------------+---------------+ +-----------+------------+
| |
v v
Using MPI? (2) Forward + backward
+-------+ ResNet-18 batch=32
|yes |no| |
v v v
Using CUDA- Call nccl_Allreduce (3) Call All-Reduce
aware MPI? to execute reduce (NCCL, MPI, or
+-------+ sum GLOO backend)
|yes |no| |
v v v
Skip Call cudaMemcpy Each iter: (4) Average parameters
cuda- D->H to copy data repeat to in each node
Memcpy to CPU mem 391 batches |
| | (10 epochs) v
v v (5) Max epoch?
Call MPI_Allreduce |
on CPU memory v
| END
v
Call cudaMemcpy
H->D back to GPU
|
v
END
^ Fig 4: Two control flows — synthetic Linux-shell test (left,
Fig 4c) and PyTorch DDP training loop (right, Fig 9b). The shell
test exposes per-call latency components in isolation; the
PyTorch test exposes aggregate training-time impact at 391 steps.
The paper deliberately runs both flows so the reader can attribute DL-time differences to specific subroutine costs measured under the shell flow. This is a clean methodological pattern — the equivalent of separating microbenchmark and end-to-end benchmark — and is directly useful for DynamICCL's evaluation strategy.
5. Quantitative Results — Where Each Library Wins
These are the numbers that should be loaded into DynamICCL's training simulator as priors for the (environment, backend) feature combinations.
5.1 Linux Shell Allreduce, 4 GPUs, Bare Metal
| Backend | Latency (s) | Component breakdown (Table 4) |
|---|---|---|
| MPICH | 3.877 | 2.483 allreduce + 0.639 H->D + 0.755 D->H |
| OpenMPI | 3.296 | 1.903 allreduce + 0.639 H->D + 0.755 D->H |
| CUDA-aware MPI | 3.226 | 3.226 allreduce, no cudaMemcpy |
| NCCL | 2.285 | 2.285 allreduce, no cudaMemcpy |
NCCL is 78% faster than MPICH for allreduce on bare metal — the strongest single result in the paper. The cause is the elimination of the H<->D cudaMemcpy round-trip (1.394 s combined, ~36% of MPICH's total).
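The component arithmetic behind the table can be checked with a few lines. The values are the ones listed above; the dictionary layout and function names are only an illustrative sketch.

```python
# Per-backend latency components in seconds, as listed in the table above.
ALLREDUCE_BREAKDOWN = {
    "MPICH": {"allreduce": 2.483, "H->D": 0.639, "D->H": 0.755},
    "OpenMPI": {"allreduce": 1.903, "H->D": 0.639, "D->H": 0.755},
    "CUDA-aware MPI": {"allreduce": 3.226},
    "NCCL": {"allreduce": 2.285},
}


def total_latency(parts):
    """Sum of all timed components for one backend."""
    return sum(parts.values())


def memcpy_share(parts):
    """Fraction of total latency spent in cudaMemcpy (H->D plus D->H)."""
    copies = sum(v for name, v in parts.items() if name != "allreduce")
    return copies / total_latency(parts)


if __name__ == "__main__":
    for backend, parts in ALLREDUCE_BREAKDOWN.items():
        print(f"{backend}: total={total_latency(parts):.3f} s, "
              f"cudaMemcpy share={memcpy_share(parts):.0%}")
```

The OpenMPI components sum to 3.297 s against the 3.296 s in the table, which is consistent with rounding of the individual entries.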
5.2 PyTorch Allreduce, 4 GPUs, Bare Metal (Fig 21)
| Backend | Latency (s) |
|---|---|
| MPI | 2.80 |
| GLOO | 1.61 |
| NCCL | 0.647 |
MPI's latency is about 333% higher than NCCL's (2.80 s vs 0.647 s, a 4.3x gap), and GLOO's is about 149% higher (2.5x). Quoted finding: "in PyTorch, NCCL showcased a substantial performance advantage, with a latency difference of 345% when compared to MPI" (Section 7.3). For all-reduce, NCCL dominates regardless of environment; this is a robust regime.
5.3 NCCL Cross-Docker Penalty (the "wall")
| Subroutine | Bare metal NCCL (s) | Cross-docker NCCL (s) | Increase |
|---|---|---|---|
| Bcast | 1.008 (Linux shell) | 2.384 | +137% / 213% quoted |
| Allgather | 3.448 (Linux shell) | 5.135 | +49% |
| Allreduce | 2.285 (Linux shell) | 2.200 | -3.7% (flat) |
In PyTorch (Fig 23), moving to cross-docker on 4 GPUs raises latency by 89% for bcast, 54% for allgather, and 131% for allreduce. The headline "213% higher latency compared to single docker" applies specifically to NCCL bcast.
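The "Increase" column for the Linux-shell rows is a plain relative change against the bare-metal baseline. A minimal sketch, with the function and dictionary names chosen here for illustration:

```python
def percent_change(baseline_s, observed_s):
    """Relative latency change in percent; positive means slower than baseline."""
    return (observed_s - baseline_s) / baseline_s * 100.0


# (bare metal, cross-docker) NCCL latencies in seconds from the table above.
SHELL_NCCL = {
    "Bcast": (1.008, 2.384),
    "Allgather": (3.448, 5.135),
    "Allreduce": (2.285, 2.200),
}

penalties = {name: percent_change(*pair) for name, pair in SHELL_NCCL.items()}

if __name__ == "__main__":
    for name, pct in penalties.items():
        print(f"{name}: {pct:+.1f}%")
```

Note that the baseline matters: the 213% figure is quoted against single-docker, whereas this table compares against bare metal, so the two numbers are not expected to agree.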
5.4 GLOO Cross-Docker Surprise
GLOO_Gather is 36% faster in cross-docker than in single-docker (Section 7.3 finding 2). This is opposite to NCCL's behavior and is the only case where splitting GPUs across containers helps. A plausible explanation is that GLOO is CPU-memory-based, so cross-docker isolation removes some intra-container memory contention.
5.5 Full DL Training (ResNet-18, CIFAR-10, 10 epochs, 4 GPUs)
| Architecture | Backend | Bare metal training (s) | Cross-docker (s) |
|---|---|---|---|
| Parameter Server | MPI | 1095 | 1447 (x1.32) |
| Parameter Server | GLOO | 1676 | 1386 (x0.83) |
| Parameter Server | NCCL | 503 | 833 (x1.66) |
| Ring All-Reduce | MPI | 384 | 451 (x1.17) |
| Ring All-Reduce | GLOO | 650 | 750 (x1.15) |
| Ring All-Reduce | NCCL | 186 | 283 (x1.52) |
Ring all-reduce is uniformly faster than PS (NCCL: 186 s vs 503 s on bare metal, a 2.7x speedup). The gap arises because PS imposes two collective phases (broadcast + gather) per iteration, versus a single allreduce phase for ring.
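Both the cross-docker multipliers and the ring-over-PS speedup are simple ratios of the training times in the table. The sketch below recomputes them; the key layout and function names are illustrative only.

```python
# (bare metal, cross-docker) training time in seconds, from the table above.
TRAINING_S = {
    ("PS", "MPI"): (1095, 1447),
    ("PS", "GLOO"): (1676, 1386),
    ("PS", "NCCL"): (503, 833),
    ("Ring", "MPI"): (384, 451),
    ("Ring", "GLOO"): (650, 750),
    ("Ring", "NCCL"): (186, 283),
}


def cross_docker_multiplier(arch, backend):
    """Cross-docker time divided by bare-metal time (>1 means slower)."""
    bare_metal, cross_docker = TRAINING_S[(arch, backend)]
    return cross_docker / bare_metal


def ring_speedup(backend):
    """Bare-metal PS time divided by bare-metal ring time."""
    return TRAINING_S[("PS", backend)][0] / TRAINING_S[("Ring", backend)][0]


if __name__ == "__main__":
    for (arch, backend) in TRAINING_S:
        print(arch, backend, f"x{cross_docker_multiplier(arch, backend):.2f}")
    for backend in ("MPI", "GLOO", "NCCL"):
        print(backend, f"ring speedup {ring_speedup(backend):.1f}x")
```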
5.6 Best / Worst Summary (Tables 7-12 distilled)
Best PyTorch-DL latencies: NCCL bcast 9.98 s (singularity), NCCL gather 356 s (singularity), NCCL allreduce 94.10 s (single-docker). Worst PyTorch-DL: MPI bcast 363 s (cross-docker), MPI gather 1084 s (cross-docker), GLOO allreduce 617 s (cross-docker). The pattern: multi-GPU-per-container always beats single-GPU-per-container.
6. Configuration-Regime Trade-off Tables
6.1 Backend Choice by Architecture (paper's central trade-off)
| Dimension | MPI/CUDA-aware MPI | GLOO | NCCL | Winner (DynamICCL) |
|---|---|---|---|---|
| Allreduce latency (bare metal) | 3.23 s (CUDA-MPI, shell) | 1.61 s (PT) | 0.647 s (PT) | NCCL |
| PS broadcast (bare metal) | 1.84 s (CUDA-MPI) | 1.20 s (PT) | 1.01 s | NCCL |
| PS gather (bare metal) | 2.225 s | 1.20 s (PT) | 3.45 s | MPI (CUDA-aware) |
| Cross-docker robustness | Stable | Improves | Degrades 213% | GLOO/MPI |
| GPU resource efficiency | High | High | -1 GPU for PS | MPI/GLOO |
| FSDP / large-scale training | N/A | N/A | De facto choice | NCCL |
For DynamICCL, prefer NCCL because Agent-2's optimization target is configuration selection within NCCL, and the paper's evidence for NCCL allreduce dominance (78%-345% lead) covers exactly the regime where DynamICCL applies. The GLOO and MPI numbers serve as a floor: if NCCL under the agent's chosen config underperforms GLOO, that is the strongest possible signal that the chosen NCCL knobs are wrong.
6.2 Architecture Choice (PS vs Ring)
| Dimension | Parameter Server | Ring All-Reduce | Winner (DynamICCL) |
|---|---|---|---|
| Training time (NCCL, BM) | 503 s | 186 s | Ring |
| GPU resource utilization | -1 GPU for PS (NCCL only) | All GPUs as workers | Ring |
| Cross-docker degradation | x1.66 (NCCL) | x1.52 (NCCL) | Ring |
| Comm pattern complexity | broadcast + gather | single allreduce | Ring |
| Step count (NCCL) | 521 | 391 | Ring |
For DynamICCL, prefer Ring because the paper's data (and DDL practice) confirms ring-allreduce as the dominant data-parallel pattern for intra-node training. Agent-2's training distribution should oversample ring-allreduce regimes accordingly.
6.3 Virtualization Environment
| Dimension | Bare metal | Singularity | Single-docker | Cross-docker | Winner (DynamICCL) |
|---|---|---|---|---|---|
| NCCL allreduce (PyTorch) | 0.647 s | ~0.62 s | 0.64 s | 1.49 s | Bare metal / Sing. |
| NCCL bcast (PyTorch) | 1.30 s | 1.06 s | 1.27 s | 2.60 s | Singularity |
| Best DL allreduce time | 186 s | 162 s | 164 s | 283 s | Singularity |
| Cross-container required | No | No | No | YES | -- |
For DynamICCL, prefer treating "is_cross_docker" as a binary state feature — it shifts NCCL latency by 50-200% with no change in workload, and the agent must observe it to make correct decisions.
7. Bottlenecks & Insights Surfaced by the Measurements
7.1 cudaMemcpy is a major cost in non-CUDA-aware MPI
Tables 2-4 break out cudaMemcpy as a separate component. For MPICH broadcast, cudaMemcpy is 41% of total time (0.653 s of 1.598 s). For MPICH allreduce, the H<->D cudaMemcpy round-trip is 36% (1.394 s of 3.877 s). This single insight justifies CUDA-aware variants and validates NCCL's GPU-direct architecture: eliminating the H<->D round-trip removes more than a third of the end-to-end latency in these cases.
7.2 NCCL on Cross-Docker hits a structural wall
The 213% bcast latency increase is "structural" — it appears whenever NCCL must cross container boundaries. The paper implies this is due to NCCL's reliance on shared memory IPC for intra-node GPU coordination, which is unavailable across docker namespaces. This is actionable intelligence: an RL agent should recognize this regime and either back off from NCCL-aggressive configs or escalate to a fallback.
7.3 GPU-resource cost of PS+NCCL
Section 6.1 / Fig 10b: NCCL requires a GPU for the parameter server, reducing worker count from N to N-1 (Table 5: 521 worker steps vs 391). This is invisible if you only look at per-call latency; only the end-to-end DL training time (Table 5) exposes the throughput penalty. Pattern for DynamICCL: end-to-end iteration time is the correct reward signal, not per-call latency in isolation.
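The 391 and 521 step counts are consistent with splitting one CIFAR-10 epoch across 4 and 3 workers respectively. The sketch below shows that arithmetic under two assumptions the paper does not spell out here: the batch size of 32 is per worker, and the final partial batch is kept. The function name is hypothetical.

```python
import math

CIFAR10_TRAIN_IMAGES = 50_000  # size of the standard CIFAR-10 training split


def steps_per_epoch(num_workers, per_worker_batch=32,
                    dataset_size=CIFAR10_TRAIN_IMAGES):
    """Steps each worker runs per epoch when the data is sharded evenly.

    Assumes the last partial batch is kept (hence the ceiling).
    """
    return math.ceil(dataset_size / (num_workers * per_worker_batch))


if __name__ == "__main__":
    print(steps_per_epoch(4))  # all four GPUs act as workers (ring)
    print(steps_per_epoch(3))  # one GPU reserved for the parameter server
```

Under this reading, giving up one GPU to the parameter server makes every remaining worker run about a third more steps per epoch, which is the throughput penalty that per-call latency hides.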
7.4 PCIe-only intra-node is the tested topology
Because there is no NVLink and no NIC, the testbed caps NCCL's per-link bandwidth at roughly 16 GB/s per direction (PCIe Gen3 x16). On NVLink (~600 GB/s) the relative ordering of ring-vs-tree and LL-vs-LL128 may differ. The reported NCCL allreduce of 0.647 s for 1 GiB across 4 GPUs corresponds to roughly 1.7 GB/s of algorithm bandwidth (1 GiB / 0.647 s), or about 2.5 GB/s of bus bandwidth after applying the ring factor 2(n-1)/n = 1.5. That is well below PCIe peak, suggesting protocol/sync overhead dominates at this scale.
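Effective bandwidth can be derived from a latency figure with the bus-bandwidth convention used by nccl-tests, where allreduce algorithm bandwidth is scaled by 2(n-1)/n. The sketch below applies it to the 1 GiB, 4-GPU, 0.647 s measurement; the function name is illustrative and the result is in decimal GB/s.

```python
GIB = 1024 ** 3  # bytes in one GiB


def allreduce_bandwidth(message_bytes, latency_s, num_ranks):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for an allreduce.

    Bus bandwidth applies the ring allreduce factor 2(n-1)/n to the
    algorithm bandwidth, following the nccl-tests convention.
    """
    algbw = message_bytes / latency_s / 1e9
    busbw = algbw * 2 * (num_ranks - 1) / num_ranks
    return algbw, busbw


if __name__ == "__main__":
    algbw, busbw = allreduce_bandwidth(GIB, 0.647, 4)
    print(f"algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")
```

Comparing the bus bandwidth against the roughly 16 GB/s per-direction PCIe Gen3 x16 limit gives a quick sense of how far a measurement sits from the link ceiling.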
7.5 Singularity > Single-docker > Bare metal in some cases
Several Singularity numbers beat bare metal (e.g., MPICH allreduce 2.97 s vs 3.877 s on bare metal; Section 5.2: "MPICH recorded a latency of 2.97 s, which is 30% lower than that of bare metal"). This is counter-intuitive; one plausible explanation, which the paper's wall-clock measurements cannot confirm, is HPC-tuned filesystem and namespace defaults in Singularity. The signal for DynamICCL: the container runtime is itself a configuration regime that can change NCCL behavior.
8. Limitations of the Methodology
| Limitation | Implication for DynamICCL |
|---|---|
| Single fixed message size (1 GiB) | No data on small-message regime where LL/LL128 matter; no message-size sensitivity surface |
| No NCCL knob sweep | No ground truth on (algo, proto, nCh) choices |
| 4 GPUs max, intra-node only | No scaling-degradation data; no Ring vs Tree crossover; no inter-node fabric |
| No NVLink | Cannot validate intra-node high-BW regime |
| No repetition counts / variance | Cannot estimate measurement noise floor for the RL reward signal |
| Single workload (ResNet-18) | No model-size sensitivity |
| NCCL 2.4 (an older release for a June 2024 paper) | Recent NCCL versions may have shifted defaults |
| No GPU/network telemetry | Cannot supply rich state features; only end-to-end latency available |
The most consequential limitation is the missing knob sweep. The paper validates that NCCL is the right library but provides zero evidence about which NCCL configuration is best — which is exactly the question DynamICCL answers.
9. What to Borrow for DynamICCL
The paper is methodologically modest but contributes three concrete items to DynamICCL's design: telemetry features that should enter Agent-2's state vector, evaluation patterns DynamICCL should adopt, and configuration regimes where the policy must be most aggressive.
9.1 State-vector features the paper validates as predictive
These features change NCCL latency without any change in NCCL knobs, which means they are exogenous regime indicators the agent must observe in order to choose knobs correctly.
Add to Agent-2 state vector s_t:
+-----------------------------------------------------+
| is_cross_container : bool (Sec 7.3 finding 1)|
| is_singularity : bool (Sec 5.2) |
| is_bare_metal : bool (baseline) |
| is_pcie_only_intra : bool (no NVLink) |
| parallelism_arch : {PS, Ring} (Sec 6.3) |
| cudamemcpy_observed_s: float (Tables 2-4) |
| prev_allreduce_lat_s : float (k=8 history) |
| collective_type : enum (already there) |
+-----------------------------------------------------+
^ Fig 5: Borrowed state features. The first four are env binaries
the agent observes once at startup; cudamemcpy_observed_s and
prev_allreduce_lat_s are runtime features updated per call.
The cudaMemcpy observation is the most novel addition: it is a backend symptom — high cudaMemcpy time means the backend chose a non-CUDA-aware path, which Agent-2 can correlate with its own algorithm/protocol selections.
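One way to carry the features in Fig 5 is a small record type that flattens to a numeric vector. This is a sketch under stated assumptions: the field names come from the figure, but the class name `EnvFeatures`, the numeric encoding, the zero-padding of the history, and the omission of `collective_type` (listed as already present) are illustrative choices, not DynamICCL's actual layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

HISTORY_LEN = 8  # k=8 latency history, as listed in the feature box


class ParallelismArch(Enum):
    PS = 0
    RING = 1


@dataclass
class EnvFeatures:
    # Static environment flags, observed once at startup.
    is_cross_container: bool
    is_singularity: bool
    is_bare_metal: bool
    is_pcie_only_intra: bool
    parallelism_arch: ParallelismArch
    # Runtime features, updated per call.
    cudamemcpy_observed_s: float = 0.0
    prev_allreduce_lat_s: List[float] = field(default_factory=list)

    def record_allreduce(self, latency_s: float) -> None:
        """Append a latency sample, keeping only the most recent HISTORY_LEN."""
        self.prev_allreduce_lat_s.append(latency_s)
        del self.prev_allreduce_lat_s[:-HISTORY_LEN]

    def to_vector(self) -> List[float]:
        """Flatten to floats; history is left-padded with zeros (assumption)."""
        history = self.prev_allreduce_lat_s[-HISTORY_LEN:]
        padded = [0.0] * (HISTORY_LEN - len(history)) + history
        flags = [float(self.is_cross_container), float(self.is_singularity),
                 float(self.is_bare_metal), float(self.is_pcie_only_intra)]
        return flags + [float(self.parallelism_arch.value),
                        self.cudamemcpy_observed_s] + padded


if __name__ == "__main__":
    state = EnvFeatures(True, False, False, True, ParallelismArch.RING)
    state.record_allreduce(1.49)
    print(state.to_vector())
```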
9.2 Evaluation patterns DynamICCL should reuse
The paper's parallel shell + PyTorch measurement pattern is the correct evaluation harness layout for DynamICCL. Specifically:
| Pattern (paper) | DynamICCL adoption |
|---|---|
| Microbenchmark in Linux shell: time API directly | NCCL-tests microbenchmark per (algo, proto, nCh) cell |
| End-to-end test in PyTorch DDP | Real workload (e.g. Llama-7B step time) with same config grid |
| Component breakdown (call vs cudaMemcpy vs send) | Component breakdown (kernel vs proxy vs network) per channel |
| 4 environments x 5 backends x 3 collectives | Cluster x algo x proto x nCh x msg-size grid |
| Best/worst tables (Tables 7-12) | Best/worst regime tables for Agent-2 sanity checking |
The "two harnesses, one workload" pattern is what lets the paper attribute DL-time differences to specific microbenchmark costs. DynamICCL needs the same: a fast microbenchmark simulator (per Pensieve/borrow note 6.4) plus a real DL training loop, with correlated outputs.
9.3 Configuration regimes where Agent-2 should be most aggressive
The paper identifies three regimes where the exogenous environment (not NCCL knobs) shifts latency by large factors, up to roughly 3x, or changes which option is best. Agent-2 should treat these as "high-policy-gradient" regions, where exploration and exploitation deliver the largest reward swings.
Regime A (NCCL cross-docker): NCCL latency rises 213% (bcast) and 131% (allreduce). In this regime Agent-2 should aggressively explore reducing nChannels (less SHM contention across container boundaries) and switching from LL128 to Simple (lower coordination density). The paper does not test these; Agent-2 must discover them.
Regime B (Parameter Server with NCCL): one fewer worker GPU plus an extra broadcast/gather phase. Agent-2 may have no way to fix this within NCCL knobs alone; the right action is to surface it to a higher-level controller. Implication: DynamICCL should expose a "non_actionable_regime" flag for the user when no NCCL knob can recover from the architectural overhead.
Regime C (Singularity vs bare metal): Singularity is sometimes faster than bare metal. The agent should not assume bare metal is always optimal; it should learn from observed latencies that container choice can be a free win.
9.4 The 1 GiB / 4 GPU sweet spot is not where Agent-2 is most useful
The paper shows that at 1 GiB on 4 GPUs intra-node, NCCL with default config beats every alternative by 78-345%. Agent-2's marginal value is small in this regime — the default config is already near optimal. Agent-2's training data should under-sample this regime and oversample the regimes the paper does not test: small messages (<64 KiB), large rank counts (>=16), inter-node fabrics, and cross-docker NCCL.
9.5 End-to-end iteration time as the reward (not per-call latency)
Section 6.3 + Table 5 demonstrate that per-call latency can mislead: NCCL+PS has a fast broadcast phase (12 s in Table 5) but loses 1 GPU and runs 521 instead of 391 worker steps. Only end-to-end training time exposes this. Agent-2's reward must include training-step-time, not just collective-latency. This is consistent with the Pensieve-borrow principle (reward = actual metric, not proxy) and the paper provides empirical evidence for it.
9.6 Backend-floor sanity check
Quoted: "NCCL achieves up to 345% lower execution time in all-reduce
operations compared to other libraries" (Abstract). DynamICCL can use
this as a sanity floor: if Agent-2's chosen NCCL config underperforms
GLOO+ring at the same DL workload, the action selection is broken. A
runtime guard
if observed_lat > 2.0 * gloo_baseline_lat: revert to NCCL_DEFAULT
is a cheap safety net the paper's data justifies.
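The guard above can be written as a small pure function. This is a sketch only: the function name, the string config labels, and the `slack` parameter are hypothetical, and how the GLOO baseline is measured is left to the caller.

```python
def guard_config(observed_lat_s, gloo_baseline_lat_s, current_config,
                 default_config="NCCL_DEFAULT", slack=2.0):
    """Fall back to the default NCCL config when latency exceeds slack x GLOO.

    Returns the config that should be used for the next call.
    """
    if gloo_baseline_lat_s <= 0:
        raise ValueError("GLOO baseline latency must be positive")
    if observed_lat_s > slack * gloo_baseline_lat_s:
        return default_config   # agent's choice is worse than the floor allows
    return current_config       # agent's choice is within the safety margin


if __name__ == "__main__":
    # 1.61 s is the bare-metal PyTorch GLOO allreduce latency from Section 5.2.
    print(guard_config(3.50, 1.61, "agent_choice"))   # exceeds 2x baseline
    print(guard_config(0.647, 1.61, "agent_choice"))  # well under baseline
```

Keeping the guard stateless makes it easy to call after every measured collective without coupling it to the agent's internals.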
10. Analogy
The paper is a wind-tunnel test of a generic airframe with the control surfaces locked. The investigators measure how the airframe (1 GiB tensor on 4 GPUs) flies through different atmospheres (bare metal, Singularity, single-docker, cross-docker) using different engines (MPI, GLOO, NCCL), but the rudder, ailerons, and elevator are bolted in their default position. The result is a clean map of which engine wins in which atmosphere, but no information about how to fly the airframe. DynamICCL is the autopilot that the wind-tunnel study cannot replace: it operates, in real time, the control surfaces the study kept locked. The paper's value to DynamICCL is therefore the atmospheric map: knowing where the wind shears (cross-docker) and where the air is calm (bare-metal allreduce at 1 GiB) tells the autopilot where to apply the most authority. The paper itself, however, never demonstrates that the autopilot is necessary: at its single tested operating point, the locked-default airframe already flies well.
Summary of Borrowed Patterns
| Pattern from Lee & Lee (2024) | DynamICCL application |
|---|---|
| cudaMemcpy as latency component (Tables 2-4) | Add cudamemcpy_observed_s to Agent-2 state vector |
| Cross-docker boundary as latency multiplier (Sec 7.3) | is_cross_container as binary state feature |
| Two-harness eval (Linux shell + PyTorch) | Microbenchmark + real DL workload, correlated |
| End-to-end DL time vs per-call latency (Table 5) | Reward includes training-step-time, not just collective latency |
| NCCL+PS GPU-resource penalty (Sec 6.1) | "non_actionable_regime" flag when NCCL knobs cannot recover |
| Singularity beats bare metal (Sec 5.2) | container_runtime as categorical state feature |
| 1 GiB / 4 GPU is "default-optimal" | Under-sample this regime in Agent-2 training data |
| GLOO baseline as floor (Fig 21) | Runtime safety guard: revert to NCCL_DEFAULT if NCCL > 2x GLOO |
| ResNet-18 / CIFAR-10 baseline workload | Reuse as DynamICCL's smoke-test workload for plugin correctness |
| 391 worker steps as a reproducibility anchor | Standardize on a fixed step budget for cross-config DL comparisons |