Architecture & Measurement-Design Analysis

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Source: De Sensi, D.; Pichetti, L.; Vella, F.; De Matteis, T.; Ren, Z.; Fusco, L.; Turisini, M.; Cesarini, D.; Lust, K.; Trivedi, A.; Roweth, D.; Spiga, F.; Di Girolamo, S.; Hoefler, T. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24), Atlanta, GA, USA, November 17-22, 2024, 12 pages.
DOI / arXiv: arXiv:2408.14090v2 [cs.DC] (15 Nov 2024); SC24 proceedings, IEEE 979-8-3503-5291-7/24.
Code / artifact: Authors release the custom GPU-to-GPU benchmark with the paper as an artifact (mentioned at the end of Sec. III-A: "We publicly released the code as part of the paper artifact").
Authors (by affiliation): Sapienza Univ. of Rome (De Sensi); Univ. of Trento (Pichetti, Vella); Vrije Univ. Amsterdam (De Matteis, Ren); ETH Zurich (Fusco, Hoefler); CINECA (Turisini, Cesarini); Univ. of Antwerp (Lust); IBM Research Europe (Trivedi); HPE Cray (Roweth); NVIDIA (Spiga, Di Girolamo).
Reader: Direct PDF read via PyMuPDF (gemini-reader CLI not available on this host; codex-reader is the documented fallback, but the PDF is short, 12 main pages plus references, so direct PDF page extraction was used). Full text extracted page-by-page from 0051_GPU2GPU_communication.pdf.
Analyst: Vishwakarma
Date: 2026-05-04


Table of Contents

  1. System Architecture (the unified custom benchmark + tuned software stack)
  2. Target-Hardware / SUT (Alps, Leonardo, LUMI — three contrasting interconnects)
  3. Design-Space Diagram (axes swept, axes held fixed)
  4. Algorithm / Control Flow Diagrams (ping-pong, intra-node and inter-node collectives, SL-controlled noise probe)
  5. Quantitative Results — Empirical Findings by Regime
  6. Configuration-Regime Trade-off Tables
  7. Bottlenecks & Insights Surfaced by the Measurements
  8. Limitations of the Methodology
  9. Note on NCCL Tuning
  10. Analogy

1. System Architecture (the unified custom benchmark + tuned software stack)

The paper's "system" is a purpose-built GPU-to-GPU performance characterization harness that runs the same communication benchmark across three architecturally different supercomputers, through four alternative software paths, at scales from 2 GPUs to 4,096 GPUs. The architecture is organized around one structural commitment: one benchmark binary issues the same logical communication call (ping-pong, alltoall, allreduce) over four interchangeable mechanisms (Trivial Staging, Device-Device Copy, *CCL (NCCL/RCCL), and GPU-Aware MPI), so that performance differences across the four can be attributed to the software stack rather than the workload. This is a deliberate departure from existing tools (OSU, nccl-tests, rccl-tests), which lock the user into one mechanism and one timing convention.

+---------------- GPU-to-GPU Benchmark Harness (one logical run) ----------------+
|                                                                                |
|  +----------------------- Workload Driver (custom) -----------------------+    |
|  |                                                                        |    |
|  |   One MPI process per GPU; affinity = closest core to GPU              |    |
|  |   Per-iteration timing via MPI_Wtime (resolution measured              |    |
|  |     experimentally: 25 ns LUMI/Leonardo, 30 ns Alps)                   |    |
|  |   Per-experiment iteration count: 100 to 1,000 (grows with size)       |    |
|  |   Reduction across ranks: max time / min goodput                        |    |
|  |   Excludes: communicator creation, buffer allocation, handle exchange  |    |
|  |   Includes: GPU sync before stop_timer (except MPI -- implicit)        |    |
|  +-----------------------------+------------------------------------------+    |
|                                |                                                |
|                                v                                                |
|  +-------------------- Mechanism-Switch Layer (4 paths) ------------------+    |
|  |                                                                        |    |
|  |   (1) Trivial Staging       : GPU->host->MPI->host->GPU (baseline)    |    |
|  |   (2) Device-Device Copy    : shared mem handles, cudaMemcpy/         |    |
|  |                               hipMemcpy directly between GPU memories |    |
|  |   (3) *CCL (NCCL on Alps/Leonardo, RCCL on LUMI)                      |    |
|  |   (4) GPU-Aware MPI         : MPICH (Alps/LUMI), Open MPI+UCX         |    |
|  |                               (Leonardo)                              |    |
|  +-----------------------------+------------------------------------------+    |
|                                |                                                |
|                                v                                                |
|  +--------------------- Tuned Software Stack -----------------------------+    |
|  |                                                                        |    |
|  |   *CCL knobs set per system (Sec. III-B):                              |    |
|  |     NCCL_IGNORE_CPU_AFFINITY=1   (Alps, LUMI)  -> +1.6x AT, +6x AR    |    |
|  |     NCCL_NET_GDR_LEVEL=3          (Alps, LUMI)  -> +2x AT, +3x AR     |    |
|  |     NCCL_NCHANNELS_PER_PEER=32    (LUMI intra-node P2P) -> +3.5x       |    |
|  |     NCCL_IB_SL=<non-default>      (Leonardo, Sec. VI) -- noise probe   |    |
|  |                                                                        |    |
|  |   GPU-Aware MPI knobs (Sec. III-B):                                    |    |
|  |     MPICH_GPU_IPC_THRESHOLD=1     (Alps small msgs <4 KiB)  -> +2x    |    |
|  |     MPICH_GPU_ALLREDUCE_BLK_SIZE=128 MiB (Alps)             -> +50%   |    |
|  |     HSA_ENABLE_SDMA=0             (LUMI)                   -> +3x    |    |
|  |     LD_LIBRARY_PATH fix for GDRCopy (Leonardo)              -> +6x    |    |
|  |     UCX_IB_SL=<non-default>       (Leonardo, Sec. VI)                  |    |
|  +-----------------------------+------------------------------------------+    |
|                                |                                                |
|                                v                                                |
|  +---------------------- Per-System Backend Bind --------------------------+   |
|  |                                                                         |   |
|  |   Alps     : Cray MPICH 8.1.28 + libfabric 1.15.2 + CUDA 12.3 +         |   |
|  |              aws-ofi-nccl plugin                                        |   |
|  |   Leonardo : Open MPI 4.1.4 + UCX 1.13.0 + CUDA 12.1 + UCC absent      |   |
|  |   LUMI     : Cray clang 16.0.1 + Cray MPICH 8.1.27 + libfabric +        |   |
|  |              ROCm 5.7.1.1 + aws-ofi-rccl 1.4                            |   |
|  +-----------------------------+------------------------------------------+    |
|                                |                                                |
|                                v                                                |
|  +---------------------- Data Reduction & Reporting ----------------------+    |
|  |                                                                        |    |
|  |   Per-iter timings -> mean, IQR (shaded band), p5/p25/p75/p95          |    |
|  |   Goodput := unidirectional bytes / runtime (collectives) or           |    |
|  |               bytes / (runtime/2) (ping-pong); always Gb/s             |    |
|  |   Comparison overlays:                                                 |    |
|  |     - Expected goodput (analytic, see Sec. IV-A,IV-C and V-C)          |    |
|  |     - Trivial staging upper bound (analytic via host-bw model)         |    |
|  |   Output artifacts:                                                    |    |
|  |     - 3-panel figures (Alps | Leonardo | LUMI)                         |    |
|  |     - Heatmap (msg_size x #GPUs) for *CCL/GPU-Aware ratio (Fig. 11)    |    |
|  |     - 8 numbered "Observations" summarizing regime winners              |    |
|  +------------------------------------------------------------------------+    |
+--------------------------------------------------------------------------------+
^ Fig 1: Custom GPU-to-GPU benchmark harness. The "Mechanism-Switch
  Layer" is the core innovation: the same logical communication is
  expressed across four backends so the harness can attribute
  measured differences to software-stack choices rather than to
  workload variations.

The harness is unusual in three ways. First, it deliberately replaces the standard OSU + nccl-tests + rccl-tests trio because those tools lack a unified abstraction over device-device copy (OSU has none) and do not export per-iteration timings (a prerequisite for the noise analysis in Sec. VI). Second, it embeds system-specific tuning into the harness itself rather than reporting only out-of-the-box numbers: the per-system environment-variable lists in Sec. III-B are part of the experimental method, not an appendix. Third, it runs the same logical communication call through four mechanisms, so cross-mechanism comparisons isolate the software stack while holding the workload, hardware, and topology fixed.

The metric definition is consistent across all figures: "we report the maximum time (or minimum goodput) across all the participating ranks", a recommendation directly traceable to Hoefler & Belli's "Twelve Ways" benchmarking guidelines [23] cited at the end of Sec. III-A. This guarantees that goodput numbers reflect the slowest-rank-bottleneck, the regime that any production collective will actually experience.
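
The reduction and goodput conventions above can be sketched in a few lines of Python. This is an illustration, not the authors' code: the function name `goodput_gbps` and its parameters are hypothetical, and the per-rank timings are assumed to have been gathered already (for example through an MPI reduction).

```python
def goodput_gbps(payload_bytes, per_rank_seconds, ping_pong=False):
    """Goodput in Gb/s, computed from the slowest rank's time.

    payload_bytes    -- unidirectional bytes moved in one iteration
    per_rank_seconds -- one measured runtime per participating rank
    ping_pong        -- halve the round-trip time to get the one-way time
    """
    if not per_rank_seconds:
        raise ValueError("need at least one per-rank timing")
    slowest = max(per_rank_seconds)      # max time == min goodput
    if ping_pong:
        slowest /= 2.0                   # round trip -> one direction
    return payload_bytes * 8 / slowest / 1e9
```

Taking the maximum before converting to goodput is what makes the reported number a lower bound across ranks: a single slow rank lowers the result even if every other rank finished early.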

The tuning knobs surface a structural truth that the paper makes explicit in Observation 1: default *CCL and GPU-Aware MPI configurations are not always optimal, and manual tuning can improve performance by up to an order of magnitude. The environment-variable adjustments listed above are not micro-optimizations; they are the difference between using and not using the system's available bandwidth. This is the same unbridged-default pattern seen in the NCCL configuration survey (paper 0018) and the demystification study (paper 0011): the library ships with conservative defaults that the application owner must override to reach line rate.
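
One way to keep such per-system tuning reproducible is to hold it in a table and merge it into the launch environment. The sketch below is a hypothetical helper, not part of the paper's artifact. It includes only the knobs whose values are stated in Fig 1. It assumes `MPICH_GPU_ALLREDUCE_BLK_SIZE` is given in bytes. The Leonardo `LD_LIBRARY_PATH` fix and the non-default service levels are left out because the source gives no concrete values for them.

```python
import os

# Values transcribed from the tuned-stack box in Fig 1.
# ASSUMPTION: the block size is expressed in bytes (128 MiB).
TUNING = {
    "Alps": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NET_GDR_LEVEL": "3",
        "MPICH_GPU_IPC_THRESHOLD": "1",
        "MPICH_GPU_ALLREDUCE_BLK_SIZE": str(128 * 1024 ** 2),
    },
    "LUMI": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NET_GDR_LEVEL": "3",
        "NCCL_NCHANNELS_PER_PEER": "32",
        "HSA_ENABLE_SDMA": "0",
    },
    # Leonardo's GDRCopy library path is site-specific and not given.
    "Leonardo": {},
}

def tuned_env(cluster, base=None):
    """Return a copy of the base environment with the cluster's knobs applied."""
    env = dict(os.environ if base is None else base)
    env.update(TUNING[cluster])
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a benchmark, which keeps the tuning out of the user's login shell.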


2. Target-Hardware / SUT (three contrasting interconnects)

The SUT is three real production-class supercomputers, each ranked top-10 on the June 2024 Top500, deliberately chosen to span the full range of contemporary GPU-to-GPU interconnect technology: NVLink 4.0 (Alps), NVLink 3.0 (Leonardo), AMD Infinity Fabric (LUMI); Slingshot-11 (Alps, LUMI) versus InfiniBand HDR (Leonardo); Dragonfly (Alps, LUMI) versus Dragonfly+ (Leonardo). This three-way architectural split is the paper's central design choice — it turns each "Observation" into a cross-architecture claim rather than a single-cluster anecdote.

+----------------- ALPS (CSCS, #6 Top500, NVIDIA H100 + Slingshot) ---------------+
|                                                                                  |
|  Per-node:                                                                       |
|    4x GH200 Grace Hopper Superchip; each: 72-core Grace + H100 + 96 GB HBM3      |
|    Intra-node: NVLink 4.0, 6 links (200 Gb/s each) per pair = 1.2 Tb/s pair-BW   |
|    All-to-all topology between 4 GPUs (max edge-forwarding-index = 1)            |
|    Inter-node: 4x HPE Cray Cassini-1 NIC, 200 Gb/s each = 800 Gb/s/node          |
|    NVLink-C2C between Grace CPU and H100: 3.6 Tb/s                               |
|    PCIe 5.0 x16 (512 Gb/s) per GPU                                               |
|                                                                                  |
|  Cluster: HPE Cray Slingshot-11, Dragonfly topology, 24 groups, 270 PFlop/s     |
|  Test partition: Santis early-access, 512 nodes (= 2,048 GPUs accessible)        |
|                                                                                  |
+----------------------------------------------------------------------------------+

+--------------- LEONARDO (CINECA, #7 Top500, NVIDIA A100 + IB HDR) --------------+
|                                                                                  |
|  Per-node:                                                                       |
|    1x 32-core Xeon 8358 + 4x A100 (special SKU); 512 GB DDR4 + 64 GB HBM2e/GPU  |
|    Intra-node: NVLink 3.0, 4 links (200 Gb/s each) per pair = 800 Gb/s pair-BW  |
|    All-to-all topology between 4 GPUs (max edge-forwarding-index = 1)            |
|    Inter-node: 2x dual-port ConnectX-6 NIC (4x 100 Gb/s ports/node)              |
|    PCIe 4.0 x16 (256 Gb/s) per GPU                                                |
|                                                                                  |
|  Cluster: NVIDIA InfiniBand HDR, Dragonfly+ topology, 23 groups, 240 PFlop/s    |
|  Each group: 2-level fat tree (18 spine + 18 leaf switches, 180 nodes/group)    |
|  Booster GPU partition: 3,456 nodes (= 13,824 GPUs total)                       |
|  Test partition: up to 256 nodes / 1,024 GPUs (user limit)                       |
|                                                                                  |
+----------------------------------------------------------------------------------+

+----------------- LUMI-G (CSC, #5 Top500, AMD MI250X + Slingshot) ---------------+
|                                                                                  |
|  Per-node:                                                                       |
|    1x 64-core EPYC 7A53 "Trento" (4 NUMA) + 4x MI250X = 8 GCDs total            |
|    Each GCD: 64 GB HBM (128 GB/MI250X module)                                    |
|    Each GCD <-> NUMA: 288 Gb/s Infinity Fabric                                   |
|    Intra-node GCD<->GCD: 1 to 4 IF links of 400 Gb/s each (NOT FULLY CONNECTED) |
|    Inter-node: 1 NIC per MI250X module = 4x Cassini-1 NICs/node                  |
|                                                                                  |
|  Cluster: HPE Cray Slingshot-11, Dragonfly topology, 24 groups,                  |
|           124 nodes/group, 380 PFlop/s, LUMI-G partition = 2,978 nodes          |
|  Test partition: up to 512 nodes / 4,096 GPUs (user limit)                       |
|                                                                                  |
+----------------------------------------------------------------------------------+
^ Fig 2: Three-cluster SUT spanning {NVIDIA NVLink, AMD Infinity
  Fabric} x {Slingshot, InfiniBand} x {Dragonfly, Dragonfly+}.
  Alps and Leonardo are fully connected at the GPU level;
  LUMI-G is *partially* connected (1-4 IF links), which makes
  it the only case where the optimal allreduce algorithm differs
  from a tree (it must use Rabenseifner-style ring).

+------- LUMI-G Intra-Node GCD Connectivity (Fig. 2 of paper) -----------+
|                                                                        |
|       GCD0 ===4=== GCD1                  GCD4 ===4=== GCD5             |
|        |  \      /  |                     |  \      /  |               |
|        2   \    /   2                     2   \    /   2               |
|        |    \  /    |                     |    \  /    |               |
|       GCD2 ===4=== GCD3                  GCD6 ===4=== GCD7             |
|                                                                        |
|        ===4=== : 4 IF links (1.6 Tb/s)                                  |
|         ---2--- : 2 IF links (800 Gb/s)                                 |
|         ---1--- : 1 IF link  (400 Gb/s)  (between halves, e.g. 0<->4)  |
|                                                                        |
|   Edge-forwarding-index analysis (Sec. IV-A):                          |
|     Most-loaded link in alltoall = link between GCD 1<->5 and 7<->3,   |
|     used by 4 distinct shortest paths -> per-pair alltoall BW capped   |
|     at 400/4 = 100 Gb/s, total per-GCD alltoall BW = 6 * 100 = 600 Gb/s|
+------------------------------------------------------------------------+
^ Fig 3: LUMI-G's non-fully-connected intra-node graph. The peak
  intra-node alltoall goodput is set not by the per-GCD injection
  bandwidth but by the most loaded internal link, an analytic
  bottleneck the authors compute via the edge-forwarding index.
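
The edge-forwarding-index argument can be checked with a short script. The sketch below is a simplified illustration with hypothetical helper names. It routes every ordered GPU pair along one BFS shortest path and counts how many routes cross each directed link. The formal index minimizes over all routings, so this count is exact only where shortest paths are unique, as in a fully connected clique.

```python
from collections import deque
from itertools import permutations

def shortest_path(adj, src, dst):
    """One BFS shortest path from src to dst (graph assumed connected)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def max_edge_load(adj):
    """Largest number of routed pairs sharing one directed link."""
    load = {}
    for s, d in permutations(adj, 2):
        p = shortest_path(adj, s, d)
        for a, b in zip(p, p[1:]):
            load[(a, b)] = load.get((a, b), 0) + 1
    return max(load.values())

def per_pair_cap_gbps(link_gbps, edge_load):
    """Per-pair alltoall bandwidth when edge_load routes share one link."""
    return link_gbps / edge_load

def clique(n):
    """Fully connected graph on n GPUs, as on Alps and Leonardo nodes."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```

For the 4-GPU clique, `max_edge_load(clique(4))` evaluates to 1, matching the index quoted for Alps and Leonardo. With the load of 4 reported for LUMI-G's busiest 400 Gb/s link, `per_pair_cap_gbps(400, 4)` reproduces the 100 Gb/s per-pair cap.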

The flat-versus-hierarchical interconnect property is the most consequential SUT difference. Alps and Leonardo present the classical "fully connected 4-GPU clique" assumed in most NCCL/HiCCL performance models, where every link sees one path and the optimal collective is a tree. LUMI-G presents a graph where the same analysis no longer applies: a Rabenseifner-style ring on four edge-disjoint bidirectional rings is required to saturate, and the expected goodput per GCD is 800 Gb/s (Sec. IV-C) — substantially below A100's per-GPU figure on Leonardo despite identical injection bandwidth.

The reported per-system software stack pins the experiment to a specific cross-section of vendor optimizations:

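Cluster  | MPI                         | *CCL                      | Net plugin       | Toolchain
---------+-----------------------------+---------------------------+------------------+----------------------------
Alps     | Cray MPICH 8.1.28           | NCCL via aws-ofi-nccl     | libfabric 1.15.2 | CUDA 12.3
Leonardo | Open MPI 4.1.4 + UCX 1.13.0 | NCCL (no aws-ofi)         | -                | CUDA 12.1, no UCC
LUMI     | Cray MPICH 8.1.27           | RCCL via aws-ofi-rccl 1.4 | libfabric        | Cray clang 16, ROCm 5.7.1.1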

The asymmetry across columns is itself a result: Open MPI on Leonardo lacks UCC, which forces allreduce to fall back to host reduction (Sec. IV-D), exposing one of the paper's largest performance gaps. This is not an artifact — it is the deployed configuration on a real top-10 system.


3. Design-Space Diagram (axes swept, axes held fixed)

The independent variables form a 5-dimensional sweep. Every figure in the paper fixes a cluster and a primitive and varies the three remaining axes (message size, scale, mechanism).

                   DESIGN SPACE (5 axes + held-fixed)
  +---------------------------------------------------------------+
  |                                                               |
  |  Axis 1: CLUSTER / SUT (3 levels)                             |
  |    [Alps]      H100 + NVLink 4.0 + Slingshot Dragonfly        |
  |    [Leonardo]  A100 + NVLink 3.0 + IB HDR Dragonfly+         |
  |    [LUMI]      MI250X + Infinity Fabric + Slingshot Dragonfly|
  |                                                               |
  |  Axis 2: COMMUNICATION MECHANISM (4 levels, Fig 1)            |
  |    [Trivial Staging]  GPU->host->MPI->host->GPU              |
  |    [Device-Device]    cuda/hipMemcpy via shared handles      |
  |    [*CCL]             NCCL (Alps, Leonardo) / RCCL (LUMI)    |
  |    [GPU-Aware MPI]    MPICH (Alps/LUMI), OMPI+UCX (Leonardo)|
  |                                                               |
  |  Axis 3: PRIMITIVE (3 levels)                                 |
  |    [P2P ping-pong]    Sec. III-C, V-A                        |
  |    [Alltoall]         Sec. IV-B, V-C                          |
  |    [Allreduce]        Sec. IV-D, V-D                          |
  |                                                               |
  |  Axis 4: MESSAGE SIZE (11 levels, log scale)                  |
  |    1 B, 8 B, 64 B, 512 B, 4 KiB, 32 KiB,                     |
  |    256 KiB, 2 MiB, 16 MiB, 128 MiB, 1 GiB                    |
  |                                                               |
  |  Axis 5: nGPU / SCALE (10+ levels, powers of 2)               |
  |    Intra-node:  2, 4, 8 (LUMI only)                           |
  |    Inter-node:  8, 16, 32, 64, 128, 256, 512, 1024,           |
  |                 2048, 4096                                    |
  |                                                               |
  |  Hidden Axis (Sec. V-B, "Network Distance"): 3 levels          |
  |    [Same switch]   GPU pair under one ToR                     |
  |    [Diff. switch]  same Dragonfly group                       |
  |    [Diff. group]   different Dragonfly groups                 |
  |                                                               |
  |  Held FIXED (no sweep):                                       |
  |    - Routing algorithm (per-cluster default; minimal +        |
  |      adaptive on Slingshot, RIA + adaptive on IB HDR)         |
  |    - Service Level (default = 0; non-default used as          |
  |      isolation probe for noise analysis only)                 |
  |    - 1 process per GPU (intra-node), 1 process per NIC        |
  |      (inter-node host-mem variants)                           |
  |    - Tuned env-vars (see Sec. III-B; per-cluster, fixed       |
  |      across runs once tuned)                                  |
  |    - Iteration counts: 100-1,000 per data point;              |
  |      "max-time across ranks" reduction                        |
  |                                                               |
  +---------------------------------------------------------------+
^ Fig 4: 5-axis design space. The hidden axis (network distance)
  is varied via job-placement control, not via knob settings,
  and is the input that exposes the noise findings of Sec. VI.
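
The message-size and scale axes are regular enough to generate rather than hard-code. This is a minimal sketch with hypothetical function names. It assumes the sizes are exactly the powers of 8 listed under Axis 4, and that inter-node scales start at 8 GPUs and double up to each system's test-partition limit from Fig 2.

```python
def size_grid(levels=11):
    """Message sizes in bytes: 1 B, 8 B, 64 B, ... (successive powers of 8)."""
    return [8 ** k for k in range(levels)]

def scale_grid(max_gpus, start=8):
    """GPU counts start, 2*start, ... up to and including max_gpus."""
    counts = []
    n = start
    while n <= max_gpus:
        counts.append(n)
        n *= 2
    return counts

# Test-partition limits quoted in the SUT boxes of Fig 2.
GPU_LIMIT = {"Alps": 2048, "Leonardo": 1024, "LUMI": 4096}

if __name__ == "__main__":
    for cluster, limit in GPU_LIMIT.items():
        cells = len(size_grid()) * len(scale_grid(limit))
        print(f"{cluster}: {cells} (size, scale) cells per mechanism and primitive")
```

Because 8^10 equals 2^30, the last generated size is exactly 1 GiB, so the grid ends where the axis listing does.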

Two structural choices define the measurement scope. First, the mechanism axis is treated as a first-class independent variable on par with the hardware and primitive axes. Most prior characterizations fix the mechanism at the start (OSU = MPI; nccl-tests = NCCL); this paper varies it within the same harness so that the cross-mechanism ratio (e.g., the *CCL/GPU-Aware MPI heatmap, Fig. 11) becomes a derived measurement rather than an inferred ratio across incompatible tools. Second, the network-distance axis is exposed through job placement rather than synthetic noise injection — the paper compares "same switch" / "diff. switch" / "diff. group" by asking the scheduler to place GPUs accordingly, which means the distance variable is indirectly entangled with the routing algorithm and the production network's queueing state at the time of each measurement.

For DynamICCL, the mechanism axis becomes a per-call exogenous flag (the runtime knows whether NCCL or MPI is in use), the distance axis becomes a topology-derived state feature, and the message-size axis remains the primary trigger for the agent's protocol/algorithm selection.


4. Algorithm / Control Flow Diagrams

4.1 Intra-node ping-pong (Sec. III-A, the unit benchmark)

  START (one cell: e.g., Leonardo / GPU-Aware MPI / 4 KiB)
       |
       v
  (1) Allocate sendbuf, recvbuf in GPU memory (cudaMalloc/hipMalloc)
       |
       v
  (2) Pin host staging buffers if mechanism = Trivial Staging
       |
       v
  (3) Exchange handles between rank 0 and rank 1 (only for
      Device-Device Copy: cudaIpcOpenMemHandle / hipIpcOpenMemHandle)
       |
       v
  (4) Synchronize ranks (MPI_Barrier)
       |
       v
  (5) WARMUP: a small number of iterations to prime kernels and
      transport buffers (count not specified; consistent with
      nccl-tests practice)
       |
       v
  (6) MEASUREMENT LOOP, N in {100, ..., 1,000} depending on size:
        for i in 1..N:
           t0 = MPI_Wtime()
           rank-0: SEND msg, RECV msg
           rank-1: RECV msg, SEND msg
           sync GPU stream (cudaStreamSynchronize / hipStreamSynchronize)
              -- skipped for MPI: implicit on completion
           t1 = MPI_Wtime()
           record (t1 - t0)
       |
       v
  (7) Reduce: report MAX time across the two ranks
       (Hoefler & Belli "Twelve Ways" recommendation)
       |
       v
  (8) Compute goodput = bytes / (max_time / 2) for unidirectional
       |
       v
  END  -> single point on Fig. 3 / Fig. 7
^ Fig 5: Per-cell ping-pong control flow. The "max-time across
  ranks" reduction is load-bearing because it captures the
  slowest-rank tail latency, which is what real collectives
  encounter under network noise (Sec. VI).
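
The measurement loop of steps (5) and (6) can be sketched independently of any GPU library. In the sketch below, `transfer` and `sync` are hypothetical callables standing in for the send/receive pair and the stream synchronization, and `time.perf_counter` replaces `MPI_Wtime`. The default warmup count is an arbitrary placeholder because the source does not state one.

```python
import time

def measure(transfer, iterations, warmup=5, sync=None, clock=time.perf_counter):
    """Return one elapsed-time sample per measured iteration.

    transfer   -- callable performing one full ping-pong exchange
    sync       -- optional callable that waits for GPU work to finish;
                  pass None when completion is implicit (the MPI case)
    warmup     -- untimed iterations run first (placeholder default)
    """
    for _ in range(warmup):
        transfer()
        if sync is not None:
            sync()
    samples = []
    for _ in range(iterations):
        t0 = clock()
        transfer()
        if sync is not None:
            sync()               # include GPU completion in the timed region
        samples.append(clock() - t0)
    return samples
```

Returning every sample, rather than a running mean, is the design choice that matters: the percentile and noise analyses later in the document need the full distribution.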

4.2 Intra-node alltoall (Sec. IV-B)

  START (one cell: e.g., LUMI / RCCL / 2 MiB / 8 GCDs)
       |
       v
  (1) Each GPU allocates one sendbuf of size n*p (n bytes per peer,
      p peers) and one recvbuf of size n*p
       |
       v
  (2) Mechanism dispatch:
        MPI         -> MPI_Alltoall (native impl)
        RCCL        -> ncclAllToAll (native alltoall collective)
        NCCL        -> documented "trivial" pattern: each GPU loops
                       over all peers, issues async ncclSend+ncclRecv
                       within a ncclGroupStart/End -- NCCL has no
                       native alltoall
        Device-Dev  -> same trivial pattern with cudaMemcpyAsync
                       between peer GPUs (handles pre-exchanged)
       |
       v
  (3) MPI_Barrier; t0 = MPI_Wtime()
       |
       v
  (4) Issue alltoall via dispatched mechanism
       |
       v
  (5) For *CCL / Device-Device, synchronize GPU stream before stop
      For MPI, completion is implicit at MPI_Alltoall return
       |
       v
  (6) t1 = MPI_Wtime(); record (t1-t0); reduce MAX across ranks
       |
       v
  (7) goodput = total_bytes_sent / max_time
       |
       v
  END
^ Fig 6: Alltoall control flow. The two NCCL/RCCL pathways
  diverge: RCCL has native alltoall; NCCL emulates via a
  send/recv group operation -- the documentation-recommended
  pattern. The benchmark uses identical issue patterns across
  Device-Device and NCCL for fair comparison.
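
The data movement of the "trivial" peer-loop pattern can be simulated on plain Python lists to make the buffer layout concrete. This is a simulation under stated assumptions, not benchmark code: the function name is hypothetical, each list stands in for one GPU buffer, and the inner assignment plays the role of one send/receive (or peer copy) pair.

```python
def trivial_alltoall(sendbufs, n):
    """Simulate alltoall where every rank sends n elements to every rank.

    sendbufs[r] holds n * p elements; elements [d*n, (d+1)*n) go to rank d.
    Returns recvbufs, where recvbufs[d][s*n:(s+1)*n] came from rank s.
    """
    p = len(sendbufs)
    recvbufs = [[None] * (n * p) for _ in range(p)]
    for src in range(p):             # each rank loops over all peers
        for dst in range(p):
            chunk = sendbufs[src][dst * n:(dst + 1) * n]
            recvbufs[dst][src * n:(src + 1) * n] = chunk
    return recvbufs
```

In the real NCCL and Device-Device paths the inner loop body would be issued asynchronously and completed by one stream synchronization. The simulation only fixes which bytes land where.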

4.3 Intra-node allreduce (Sec. IV-D)

  START (one cell: e.g., Alps / NCCL / 1 GiB / 4 GPUs)
       |
       v
  (1) Each GPU allocates a single buffer of n bytes (in-place)
       |
       v
  (2) Mechanism dispatch:
        MPI         -> MPI_Allreduce (Open MPI on Leonardo: host
                       reduction; MPICH on Alps/LUMI: GPU staging
                       buffer up to MPICH_GPU_ALLREDUCE_BLK_SIZE)
        *CCL        -> ncclAllReduce (NCCL/RCCL)
        Device-Dev  -> tree reduce to GPU 0 (no pipelining), then
                       broadcast back; reference-only impl
       |
       v
  (3) Barrier; timer start
       |
       v
  (4) For LARGE messages, the EXPECTED-goodput model (Sec. IV-C) is:
        - Alps, Leonardo (fully connected, P=4):
            pipelined ternary tree reduce + ternary tree broadcast
            peak goodput = sum of out-link BW from any one GPU
        - LUMI (partial connectivity, P=8):
            Rabenseifner = ring reduce-scatter + ring allgather
            on 4 edge-disjoint bidirectional rings via 400 Gb/s IF
            peak = 800 Gb/s per GCD (because Rabenseifner doubles
                   the bytes on the wire)
       |
       v
  (5) Issue allreduce via mechanism; sync stream; stop timer
       |
       v
  (6) Reduce MAX time; goodput = bytes / max_time (NOT halved as
      for ping-pong because allreduce is a single collective)
       |
       v
  END
^ Fig 7: Allreduce control flow. The expected-goodput model
  switches between tree-pipelined (Alps, Leonardo) and ring-based
  Rabenseifner (LUMI) depending on the intra-node graph's
  connectivity, illustrating that the right algorithm is a
  function of the topology, not of the message size.
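
The remark that the ring-based scheme roughly doubles the bytes on the wire can be seen in a small simulation of a ring reduce-scatter followed by a ring allgather on a single ring. It is a sketch with hypothetical names, operating on Python lists rather than GPU buffers, and it does not model the four edge-disjoint rings of the LUMI node.

```python
def ring_allreduce(buffers):
    """Sum-allreduce over a ring; returns (results, elements_sent_per_rank)."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "buffer length must be divisible by the rank count"
    chunk = n // p
    data = [list(b) for b in buffers]
    sent = [0] * p

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in step s, rank r sends chunk (r - s) mod p to rank r+1,
    # which adds it to its own copy of that chunk.
    for s in range(p - 1):
        msgs = [((r - s) % p, data[r][span((r - s) % p)]) for r in range(p)]
        for r in range(p):
            sent[r] += chunk
            c, payload = msgs[(r - 1) % p]
            data[r][span(c)] = [a + b for a, b in zip(data[r][span(c)], payload)]

    # Rank r now owns the fully reduced chunk (r + 1) mod p.
    # Allgather: in step s, rank r forwards chunk (r + 1 - s) mod p to rank r+1.
    for s in range(p - 1):
        msgs = [((r + 1 - s) % p, data[r][span((r + 1 - s) % p)]) for r in range(p)]
        for r in range(p):
            sent[r] += chunk
            c, payload = msgs[(r - 1) % p]
            data[r][span(c)] = payload
    return data, sent
```

Each rank sends 2(p-1) chunks of n/p elements, that is 2(p-1)/p times the buffer size. For p = 8 that is 1.75 times the payload, approaching 2 as p grows.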

4.4 Inter-node ping-pong with network-distance probe (Sec. V-B)

  START (probe: same-switch | diff-switch | diff-group)
       |
       v
  (1) Job-launcher requests node placement to satisfy distance class
        - same switch: Slurm hint --gres + locality flag
        - diff switch / same group: explicit topology-aware job spec
        - diff group: deliberately spread across Dragonfly groups
       |
       v
  (2) Verify placement via topology query (Slurm + libfabric)
       |
       v
  (3) Run ping-pong (Sec. III-A), 1 B for latency, 1 GiB for goodput
       |
       v
  (4) Record per-iter timings (NOT just summary) -> needed for
      box-plot percentiles (p5, p25, median, p75, p95) and for
      noise-detection in Sec. VI
       |
       v
  (5) Repeat hundreds of iterations to populate distribution
       |
       v
  END  -> one box on Fig. 8 (six per cluster: GPU vs host x distance)
^ Fig 8: Network-distance probe. The per-iter telemetry is the
  novel feature absent from OSU/nccl-tests. Without it, the
  noise tail of Leonardo (max 132 us at 1 B) would be invisible
  in a mean-only benchmark.
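
A per-iteration summary like the one behind the box plots can be computed with the Python standard library. The sketch below uses a hypothetical function name, and the choice of the "inclusive" interpolation method is an assumption rather than something the paper specifies.

```python
import statistics

def summarize(samples):
    """Mean, selected percentiles, and maximum of per-iteration timings."""
    # quantiles(n=100) returns the 99 cut points p1 .. p99.
    q = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "p5": q[4],
        "p25": q[24],
        "median": q[49],
        "p75": q[74],
        "p95": q[94],
        "max": max(samples),
    }
```

With a distribution in which a few iterations are far slower than the rest, the median stays at the typical value while the mean and the maximum move. That separation is what a mean-only report cannot show.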

4.5 Service-level isolation as a noise probe (Sec. VI-A)

  HYPOTHESIS: variable queueing delays at switch buffers cause the
              Leonardo tail in Fig. 8.
       |
       v
  (1) BASELINE: run 1 GiB allreduce on default SL (SL=0); record
      goodput distribution
       |
       v
  (2) PROBE: rerun on non-default SL via NCCL_IB_SL / UCX_IB_SL env
      vars; record distribution; same job placement, same iters
       |
       v
  (3) Compare distributions (Fig. 12-13):
        - If SL switch removes tail        -> noise = queue contention
        - If SL switch leaves tail         -> noise has another source
       |
       v
  (4) STRESS: launch a concurrent victim+aggressor pair on 128 GPUs
      each; vary aggressor pattern in {alltoall, incast}; vary SL
      sharing in {shared SL, separated SLs}
       |
       v
  (5) Measure victim allreduce goodput under each combination
       (Fig. 12 result: shared-SL incast crushes allreduce; the
        gap closes when the two share a switch, regardless of SL,
        confirming the bottleneck is switch buffer congestion)
       |
       v
  END
^ Fig 9: Service-level isolation control flow. The "natural
  experiment" insight (Sec. VI-B) -- that all production traffic
  on Leonardo defaults to SL=0 -- means switching to a
  non-default SL effectively measures the same job on an empty
  network, giving the unique production-vs-isolated comparison
  in Fig. 13.
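
The baseline-versus-probe comparison of steps (1) to (3) can be sketched as follows. The environment variable names `NCCL_IB_SL` and `UCX_IB_SL` come from the text above. Everything else is a hypothetical illustration: the service-level value is a caller-supplied parameter because the source does not state it, the samples are assumed to be per-iteration runtimes, and the tail metric (p95 over median) and its tolerance are arbitrary choices, not the paper's criterion.

```python
import statistics

def sl_env(service_level):
    """Environment overrides that move *CCL and UCX traffic to one service level."""
    sl = str(service_level)
    return {"NCCL_IB_SL": sl, "UCX_IB_SL": sl}

def tail_ratio(runtimes):
    """p95 divided by the median of per-iteration runtimes."""
    q = statistics.quantiles(runtimes, n=100, method="inclusive")
    return q[94] / q[49]

def tail_removed(baseline, probe, tolerance=1.5):
    """True if the baseline shows a tail that the probe run does not.

    ASSUMPTION: 'tolerance' is an illustrative threshold only.
    """
    return tail_ratio(baseline) > tolerance >= tail_ratio(probe)
```

Under this reading, a `True` result corresponds to the first branch of step (3), where the noise is attributed to queue contention on the default service level, and `False` corresponds to the second.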

4.6 Sweep dispatcher (the loop that produced every figure)

  for cluster in {Alps, Leonardo, LUMI}:                  # Axis 1
    apply per-cluster env vars from Sec. III-B
    for primitive in {p2p, alltoall, allreduce}:           # Axis 3
      for mechanism in {Staging, DevDev, *CCL, GPU-MPI}:   # Axis 2
        for nGPU in scale_grid[cluster, primitive]:        # Axis 5
          for msg_size in size_grid[primitive]:            # Axis 4
            run flow Fig. 5 / 6 / 7
            record (mean, IQR, p5, p25, median, p75, p95)
            log per-iter timings (for noise analysis)
        # accumulate into one panel of one figure
  # post-process: heatmaps, ratio plots, observation extraction
^ Fig 10: The outer dispatcher. Note the per-cluster env-var
  apply step at the top -- tuning is treated as part of the
  experimental method, not as a one-off setup task.
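
The dispatcher's loop nest translates directly into a Python generator. This is a minimal sketch with hypothetical names. The grids are passed in as dictionaries, and the per-cluster tuning step and the actual benchmark run are left to the caller.

```python
def sweep(clusters, primitives, mechanisms, scale_grid, size_grid):
    """Yield one (cluster, primitive, mechanism, n_gpu, msg_size) tuple per cell.

    scale_grid -- dict keyed by (cluster, primitive) -> list of GPU counts
    size_grid  -- dict keyed by primitive -> list of message sizes in bytes
    """
    for cluster in clusters:                                  # Axis 1
        for primitive in primitives:                          # Axis 3
            for mechanism in mechanisms:                      # Axis 2
                for n_gpu in scale_grid[(cluster, primitive)]:    # Axis 5
                    for msg_size in size_grid[primitive]:         # Axis 4
                        yield cluster, primitive, mechanism, n_gpu, msg_size

if __name__ == "__main__":
    cells = list(sweep(
        clusters=["LUMI"],
        primitives=["allreduce"],
        mechanisms=["*CCL", "GPU-MPI"],
        scale_grid={("LUMI", "allreduce"): [8, 16]},
        size_grid={"allreduce": [1, 8, 64]},
    ))
    print(len(cells), "cells")
```

Keeping the iteration order in one generator makes it straightforward to resume a partially completed sweep, since each cell is identified by its tuple.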

5. Quantitative Results — Empirical Findings by Regime

The paper reports its results as eight numbered Observations (Obs. 1-8) that summarize per-regime winners. Each observation maps to a specific figure and a specific cell of the design space; the subsections below condense them with the supporting numbers extracted verbatim from the prose.

5.1 Tuning amplifies measured performance (Obs. 1)

The per-knob deltas extracted verbatim from Sec. III-B:

Knob (system)                                              | Effect
-----------------------------------------------------------+----------------------------------------
NCCL_IGNORE_CPU_AFFINITY=1 (Alps, LUMI)                    | up to 1.6x AT, up to 6x AR (>= 2 nodes)
NCCL_NET_GDR_LEVEL=3 (Alps, LUMI)                          | +2x AT, +3x AR
NCCL_NCHANNELS_PER_PEER=32 (LUMI intra-node P2P)           | +3.5x
MPICH_GPU_IPC_THRESHOLD=1 (Alps small msgs <4 KiB)         | 2x lower runtime
MPICH_GPU_ALLREDUCE_BLK_SIZE=128 MiB (Alps single-node AR) | +50%
HSA_ENABLE_SDMA=0 (LUMI)                                   | up to +3x
GDRCopy LD_LIBRARY_PATH fix (Leonardo, small msgs)         | up to +6x

The headline summary (Sec. III-B):

"The default choices made by *CCL and GPU-Aware MPI are not always optimal, and manual tuning can improve performance up to an order of magnitude."

The ~10x gap between default and tuned is the strongest evidence in the paper that the configuration surface is non-trivial — and it is exactly the surface a runtime tuner is positioned to navigate.

5.2 Intra-node P2P (Obs. 2)

"GPU-Aware MPI provides the highest goodput for intra-node point-to-point transfers on all the analyzed systems. For small transfers, the optimal solution changes across the systems, depending on architectural features and specific optimization implemented by MPI."

Concretely, on Leonardo GPU-Aware MPI beats NCCL by up to 2x on medium messages (Sec. III-C). For small transfers, the paper attributes the differences to GDRCopy on Leonardo and to MPICH's host-memcpy fast path on LUMI, which exploits AMD's permitted CPU load/store to GPU HBM, a feature absent from NVIDIA's H100 on Alps.

5.3 LUMI RCCL bandwidth misestimation (Obs. 3)

"On LUMI, RCCL point-to-point communication primitives do not correctly determine the bandwidth available between GPUs on the same node, thus underutilizing the available bandwidth."

Diagnostic: GPU 0 and GPU 6 versus GPU 0 and GPU 7 have identical nominal bandwidth, but RCCL achieves significantly higher goodput toward GPU 6 because its internal estimator counts hops, not parallel paths. GPU-Aware MPI and Device-Device achieve ~70% of nominal goodput on every pair; RCCL drops below 50% on some pairs. This is a concrete library bug that DynamICCL-style tuners cannot fix from above — it lives in RCCL's topology graph builder.

5.4 Single-node collectives (Obs. 4)

"For single node collectives, *CCL outperforms GPU-Aware MPI in most cases, except for small collectives on LUMI."

The exception: on LUMI, GPU-Aware MPI is up to 3x faster than RCCL on small alltoall (Sec. IV-B), consistent with the small-P2P finding. The paper's interpretation (Sec. IV-D): "*CCL collectives are optimized for the specific GPU models" while MPI's allreduce underperforms because it leans on host-side aggregation (particularly Open MPI on Leonardo, which lacks UCC).

5.5 Inter-node P2P (Obs. 5)

"On inter-node point-to-point communications, MPI outperforms *CCL by up to one order of magnitude on small transfers, and by up to 3x on larger transfers."

The cause is GPU-kernel launch and management overhead in *CCL on the inter-node path. For DynamICCL this is a clean piece of prior knowledge: for small inter-node messages, MPI's lightweight P2P is consistently faster in these measurements. DynamICCL operates inside NCCL, however, so switching to MPI is not an available action. It can instead choose protocols (LL, LL128) that minimize NCCL's per-call setup overhead.

5.6 Network-distance impact (Obs. 6)

Cluster | Same-switch latency | Diff-group latency | Goodput change | Variability
Alps | 4.33 us mean | 5.56 us mean (+28%) | -1% | tight
Leonardo | 2.03 us mean | 4.23 us mean (about 2x) | -17% (395 -> 328 Gb/s) | wide; max 132 us, min 216 Gb/s
LUMI | 3.71 us mean | 4.18 us mean (+13%) | -1% | tight

"On Alps and LUMI, GPU's network location has a marginal impact on average performance (below 30% for latency and 1% for goodput). On the other hand, on Leonardo, the average latency increases by up to 2x when the GPUs are in different groups rather than under the same switch. Similarly, the average goodput decreases by 17%. This is mainly due to network performance variability caused by network noise."

The Slingshot-vs-IB-HDR delta: Slingshot is "largely unaffected by network noise", whereas Leonardo's IB HDR Dragonfly+ shows tail latency up to 132 us at 1 B, a worst case roughly 65x the 2.03 us same-switch mean.
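
The percentages in the table can be re-derived from the tabulated means. The short check below uses only numbers quoted in this section; the helper names are illustrative.

```python
# (same-switch, different-group) mean latency in microseconds, from the table.
LATENCY_US = {
    "Alps": (4.33, 5.56),
    "Leonardo": (2.03, 4.23),
    "LUMI": (3.71, 4.18),
}
LEONARDO_GOODPUT_GBPS = (395.0, 328.0)  # same-switch -> different-group
LEONARDO_MAX_LATENCY_US = 132.0         # worst case at 1 B


def pct_change(before, after):
    """Signed relative change in percent."""
    return (after - before) / before * 100.0


def latency_increase_pct():
    """Latency increase from same-switch to different-group, per cluster."""
    return {c: round(pct_change(a, b), 1) for c, (a, b) in LATENCY_US.items()}


if __name__ == "__main__":
    print(latency_increase_pct())
    print(round(pct_change(*LEONARDO_GOODPUT_GBPS), 1))
    print(round(LEONARDO_MAX_LATENCY_US / LATENCY_US["Leonardo"][0], 1))
```

The Leonardo latency roughly doubles (a ratio of about 2.08), while the Alps and LUMI increases stay below 30%, matching the wording of Obs. 6.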

The host-memory variant of the same probe (Fig. 8b) separates network cost from GPU-management cost: Leonardo's same-switch host-memory latency is 1.02 us versus 3.66 us on Alps, a difference attributed to Slingshot's Ethernet-derived protocol overhead (larger headers). The relative GPU-vs-host gap on a given system is therefore informative: a large gap indicates the bottleneck is GPU management; a small gap indicates the bottleneck is the network itself.

5.7 Inter-node alltoall scalability (Sec. V-C, Fig. 9)

The 2 MiB alltoall scaling, by cluster + mechanism, with the asymptotic per-GPU goodput as the upper bound (200 Gb/s on Alps, 100 Gb/s on Leonardo and LUMI):

Cluster | Mechanism | Max nGPUs reached | Behaviour
Alps | NCCL | 256 (stalls at 512) | Hits asymptote near 200 Gb/s before stall
Alps | GPU-Aware MPI | 2,048 | ~75% asymptotic eff. up to 1,024
Leonardo | NCCL | 1,024 | ~75% efficiency, gradual decline above
Leonardo | GPU-Aware MPI | 1,024 (user limit) | Lower than NCCL
LUMI | RCCL | 512 (stalls at 1,024) | Slightly below 75% efficiency
LUMI | GPU-Aware MPI | 4,096 (user limit) | Closes the gap with RCCL at large nGPUs

The benchmark stalls at 512 GPUs (NCCL) and 1,024 GPUs (RCCL) in alltoall, confirmed both in the authors' custom benchmark and in the official nccl-tests / rccl-tests. Because the same scales work for allreduce, the stalls are consistent with the authors' hypothesis that the number of connections kept active in the alltoall pattern is the limiting factor.

5.8 Inter-node allreduce scalability (Sec. V-D, Fig. 10)

"On Leonardo, we observe an extremely low goodput for GPU-Aware MPI. As discussed in Sec. IV-D, this is due to Open MPI copying the buffer from the device to host memory and then running the allreduce on the host."

"We also observe a sharp drop in *CCL performance on Alps and LUMI from 256 to 512 GPUs."

The 256-to-512 drop is not an algorithm change (verified by re-running with the same algorithm explicitly fixed), and the goodput "steadily decreases between 256 and 512 GPUs, rather than dropping abruptly." The implication is a saturating effect in the collective implementation itself, not a topology phase transition.

5.9 *CCL vs GPU-Aware MPI ratio map on LUMI (Sec. V-E, Fig. 11)

The Fig. 11 heatmap reports the RCCL/GPU-Aware-MPI goodput ratio, with a sharp inversion of the trend around 32 KiB:

Alltoall ratio at 32 KiB (LUMI): 0.07-0.36 (RCCL slower)
Alltoall ratio at 16 MiB (LUMI): 0.93-1.55 (RCCL faster in most cells)
Allreduce ratio at 1 B (LUMI): 0.09-0.10 (RCCL ~10x slower)
Allreduce ratio at 1 GiB (LUMI): 1.77-2.88 (RCCL ~2-3x faster)

"There is a sharp inversion of the trend around 32 KiB, which we believe might be mitigated by tuning the allreduce algorithm selection."

This is the cleanest evidence in the paper that RCCL/NCCL algorithm selection is a tunable knob with multiplicative leverage. With the same workload, hardware, and cluster, switching between mechanism = MPI and mechanism = *CCL changes allreduce goodput by roughly 10x in MPI's favor at 1 B and by up to 2.88x in RCCL's favor at 1 GiB on LUMI.

"On Alps and Leonardo, instead, NCCL outperformed GPU-Aware MPI regardless of the message size and node count."

So the LUMI inversion is RCCL-specific, not a universal small-msg *CCL weakness. NCCL on the NVIDIA platforms does not show the inversion at all. This isolates the finding to RCCL's algorithm selection on AMD MI250X.
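
Reading the Fig. 11 ratios is easier with an explicit rule: a RCCL/GPU-Aware-MPI goodput ratio below 1 means MPI is faster, above 1 means RCCL is faster. The sketch below applies that rule to the four ratio ranges quoted above; the function names and the "mixed" label for ranges that straddle 1 are illustrative choices, not the paper's terminology.

```python
# RCCL / GPU-Aware-MPI goodput ratio ranges on LUMI, as quoted from Fig. 11.
RATIOS = {
    ("alltoall", "32 KiB"): (0.07, 0.36),
    ("alltoall", "16 MiB"): (0.93, 1.55),
    ("allreduce", "1 B"): (0.09, 0.10),
    ("allreduce", "1 GiB"): (1.77, 2.88),
}


def winner(ratio_range):
    """Name the faster mechanism for a (low, high) ratio range."""
    low, high = ratio_range
    if high < 1.0:
        return "GPU-Aware MPI"
    if low > 1.0:
        return "RCCL"
    return "mixed"  # the range straddles parity


def mpi_speedup(ratio_range):
    """MPI's advantage as (min, max) multiples; meaningful when ratio < 1."""
    low, high = ratio_range
    return (1.0 / high, 1.0 / low)


if __name__ == "__main__":
    for cell, rng in RATIOS.items():
        print(cell, winner(rng))
```

The 16 MiB alltoall range straddles parity, so "RCCL faster" at that size holds for most node counts rather than all of them.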

5.10 Network-noise impact at scale (Obs. 8)

nGPU | Alltoall default-SL goodput | Alltoall non-default-SL | Allreduce default | Allreduce non-default
8 - 64 | identical | identical | identical | identical
1,024 | -20% from non-default | (clean baseline) | -50% from non-default | (clean baseline)

"On 1,024 GPUs, network noise causes an additional 20% performance drop on alltoall, and a 50% drop on the allreduce."

The 50%-drop figure is the headline number. It quantifies, for the first time on a real production multi-GPU system, the cost of sharing the inter-node fabric with concurrent jobs. This is the "production cost of multi-tenancy" measured directly, not via synthetic injection — the paper stresses this distinction explicitly in Sec. VI-B.

5.11 The summary count of "Observations"

The paper's eight observations form a compact regime atlas:

Obs | Regime | Winner / Finding
1 | Default vs tuned | Tuning yields up to 10x; defaults not always optimal
2 | Intra-node P2P | GPU-Aware MPI > *CCL for goodput; small-msg winner varies
3 | LUMI intra-node P2P | RCCL underestimates available bandwidth
4 | Intra-node collectives | *CCL > MPI in most cases; LUMI small-msg exception
5 | Inter-node P2P | MPI > *CCL: up to 10x small, up to 3x large
6 | Network distance | Marginal on Alps/LUMI; up to 2x latency / -17% BW on Leonardo
7 | Multi-node *CCL vs MPI on collectives | *CCL > MPI; gap shrinks with scale; *CCL stalls at scale
8 | Network noise (Leonardo) | -20% AT, -50% AR at 1,024 GPUs

6. Configuration-Regime Trade-off Tables

6.1 Mechanism choice (per-cell)

Dimension | Trivial Staging | Device-Device | *CCL | GPU-Aware MPI | Winner (DynamICCL)
Intra-node P2P large-msg | Loses 10x | Saturates | Comparable to MPI | Best | MPI
Intra-node P2P small-msg LUMI | Slow | Slow | Loses to MPI | Best (3x) | MPI
Intra-node P2P small-msg Alps | Slow | n/a (no peer access) | Comparable to MPI | Comparable | Either
Intra-node alltoall large-msg | Loses | Reference | Best (Alps, LUMI) | Comparable (Leo.) | *CCL
Intra-node allreduce | Loses | Reference | Best | Worst on Leo. | *CCL
Inter-node P2P small-msg | n/a | n/a | Loses 10x | Best | MPI
Inter-node P2P large-msg | n/a | n/a | Loses 3x | Best | MPI
Inter-node alltoall scaling >=512 GPU | n/a | n/a | Stalls (NCCL/RCCL) | Reaches 4,096 (LUMI) | MPI in fallback role
Inter-node alltoall <=256 GPU | n/a | n/a | Best (Alps, Leo) | Slower | *CCL
Inter-node allreduce on Leonardo | n/a | n/a | Best | Host-fallback (very slow) | *CCL
Inter-node allreduce on Alps/LUMI | n/a | n/a | Best up to 256 | Slower | *CCL up to 256, watch 256-512 dropout

For DynamICCL, prefer treating mechanism as exogenous state, not action. DynamICCL operates within NCCL/RCCL — it cannot switch to MPI. The paper's cells where MPI wins are the cells where DynamICCL inherits a structural disadvantage; the cells where *CCL wins (which is most large-msg intra-node and most collectives below 256-512 GPUs) are the cells where DynamICCL has the most leverage.

6.2 Topology / connectivity (intra-node graph)

Dimension | Fully connected (Alps, Leonardo) | Partially connected (LUMI) | Winner (DynamICCL)
Edge-forwarding-index | 1 | up to 4 (GCD 1<->5, 7<->3) | --
Optimal allreduce algo | Pipelined ternary tree | Rabenseifner (ring RS + ring AG) | LUMI -> ring family
Peak intra-node BW per GPU | ~3 * out-link BW | 800 Gb/s (4 edge-disj. bidir. rings) | --
Per-pair P2P BW | Uniform | Ranges 1x-4x IF link | NVIDIA boxes have predictable behaviour
Algorithm-selection cost | Low (one good algo) | High (must pick by graph) | LUMI demands more knob agility

For DynamICCL, prefer to gate algorithm choice on a topology descriptor. The paper's topology analysis motivates a non-actionable state feature such as intra_node_graph_class in {fully_connected, partially_connected_with_rings}; to first order, the right NCCL algorithm for an intra-node allreduce follows from this class.

6.3 Message-size regime (small / medium / large)

Dimension | Small (<= 32 KiB) | Medium (32 KiB - 2 MiB) | Large (>= 2 MiB) | Winner (DynamICCL)
Bottleneck | Per-call setup latency | Mixed | Link bandwidth | --
Best mechanism (intra-node) | MPI on LUMI; *CCL/MPI tie elsewhere | *CCL | *CCL | --
Best mechanism (inter-node P2P) | MPI (up to 10x) | MPI (up to 3x) | MPI (up to 3x) | MPI
RCCL/MPI ratio (LUMI allreduce) | 0.09-0.10 | 0.34-0.71 | 1.77-2.88 | --
Recommended NCCL protocol* | LL | LL128 | Simple | Aligns with paper trends
Recommended NCCL algo* | Tree | Mixed | Ring | Aligns with paper trends

*Last two rows are extrapolations from NCCL behavior cited in papers 0011 (Demystifying NCCL) and 0018 (CollComm Config Survey), not measured in this paper.

For DynamICCL, prefer msg-size-bin x cluster as the joint key. The 32 KiB boundary of the small-message bin coincides with the message size at which the LUMI heatmap inverts (Fig. 11). This is direct empirical evidence that a single fixed default is suboptimal across the size grid, and that the right choice depends on the cluster (no inversion on Alps or Leonardo).
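
One way to picture the joint key is a lookup from (cluster, message-size bin) to a protocol and algorithm pair. The sketch below is an assumption-laden illustration: the bin edges and the default pairs come from the table in this section (whose last two rows are extrapolations, not measurements), `NCCL_PROTO` and `NCCL_ALGO` are NCCL environment variables, and the helper names plus the empty per-cluster override table are hypothetical placeholders to be filled from measurements.

```python
KIB = 1024
MIB = 1024 * KIB


def size_bin(nbytes):
    """Map a message size to the small / medium / large bins of Sec. 6.3."""
    if nbytes <= 32 * KIB:
        return "small"
    if nbytes >= 2 * MIB:
        return "large"
    return "medium"


# Default (protocol, algorithm) per bin; None means "leave the knob unset",
# which is how the table's "Mixed" entry is represented here.
DEFAULTS = {
    "small": ("LL", "Tree"),
    "medium": ("LL128", None),
    "large": ("Simple", "Ring"),
}

# Hypothetical per-cluster overrides keyed by (cluster, bin). Left empty on
# purpose: the source gives no per-cluster values.
OVERRIDES = {}


def nccl_env(cluster, nbytes):
    """Return the NCCL environment settings for one (cluster, size) cell."""
    key = (cluster, size_bin(nbytes))
    proto, algo = OVERRIDES.get(key, DEFAULTS[key[1]])
    env = {"NCCL_PROTO": proto}
    if algo is not None:
        env["NCCL_ALGO"] = algo
    return env
```

Environment variables of this kind are normally read when the library initializes, so the function shows the shape of the mapping rather than a per-call switching mechanism.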

6.4 Network-noise sensitivity

Dimension | Slingshot (Alps, LUMI) | InfiniBand HDR Dragonfly+ (Leonardo) | Winner (DynamICCL)
Mean latency same-switch | 4.33 us / 3.71 us | 2.03 us | IB
Mean latency diff-group | 5.56 us / 4.18 us | 4.23 us | Mixed (LUMI slightly lower, Alps higher)
Worst-case latency diff-group | tight (~1 us delta) | up to 132 us at 1 B | Slingshot
Goodput drop diff-group | -1% | -17% | Slingshot
Allreduce noise drop @1024 GPU | not measured (SL analysis only on Leonardo) | -50% vs isolated | Slingshot
Mitigation availability | Routing-engine-driven | SL switch (only if SL is unshared) | Slingshot

For DynamICCL, prefer to encode is_lossy_routing_fabric and is_dragonfly_plus as cluster-level state features. The paper shows that Leonardo's noise tail completely changes the meaning of "observed latency" — an agent trained on isolated runs would overfit to the median and be miscalibrated against the production distribution.

6.5 Scale regime

Dimension | Small scale (<= 32 GPUs) | Medium (32 - 256 GPUs) | Large (>= 1,024 GPUs) | Winner (DynamICCL)
Intra-node fraction of comm | High | Mixed | Low (asymptotic limit) | --
Software bottleneck | Library defaults | Library + topology | Network noise / saturation | --
Most leveraged knob | Library / mechanism | Algorithm + protocol | Routing / SL placement | --
*CCL vs MPI gap (alltoall) | Wide | Narrowing | Often vanishes | --
Likelihood of stalls | Negligible | Watch >= 256 (NCCL AT) | Confirmed (NCCL 512, RCCL 1024) | DynamICCL: alert mode
256->512 *CCL allreduce dropout | n/a | (Sec. V-D) sharp drop | Persistent | --

For DynamICCL, prefer to over-explore the medium-to-large crossover band (256 - 1,024 GPUs). This is the regime where the paper's own data has the most surprising behavior (the unexplained 256-to-512 drop in *CCL allreduce; the alltoall stall at 512/1024). Static defaults are likely to mis-tune at least one of these regimes; an RL agent has the highest marginal value here.


7. Bottlenecks & Insights Surfaced by the Measurements

7.1 Default configurations are off the optimum by up to an order of magnitude

The seven environment-variable changes in Sec. III-B yield individual gains of 1.6x, 6x, 2x, 3x, 3.5x, 2x, 50%, 3x, and 6x (some knobs are reported for more than one primitive). Several of them target the same system and primitive (for example, the two NCCL knobs on Alps and LUMI allreduce) and can compound; others apply to different systems or message sizes and cannot. The "10x" headline number is not hyperbole: it is the paper's own upper bound ("up to an order of magnitude") on the gap between a default and a fully corrected configuration. This is the strongest empirical case for "the configuration surface is large, and the default sits far from any peak", the exact premise that motivates a runtime tuner.

7.2 Per-GPU bandwidth heterogeneity inside a node is real (LUMI)

The LUMI MI250X node is not a uniform fabric. The 8 GCDs are connected with 1, 2, or 4 IF links depending on the pair, giving goodput differences of up to 4x between same-node "neighbors". The paper exposes this in Fig. 4 and Sec. III-D: GPU 0 reaches ~1,200 Gb/s to GPU 1 (4 links) but ~300 Gb/s to GPU 5 (1 link). For DynamICCL, intra-node topology cannot be modeled as uniform. The agent's state vector needs intra_node_distance or peer_link_count per call, mirroring the multi-rail awareness HiCCL exposes (paper 0021).
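
A peer_link_count feature can be sketched from the two data points quoted above (GPU 0 to GPU 1 over 4 links at ~1,200 Gb/s, GPU 0 to GPU 5 over 1 link at ~300 Gb/s). The sketch assumes goodput scales linearly with link count, which those two points are consistent with but do not prove. Only the two pairs named in the text are included; the rest of the LUMI link map is deliberately left out, and the helper names are hypothetical.

```python
PER_LINK_GOODPUT_GBPS = 300  # approximate, from the 1-link GPU 0 -> GPU 5 figure

# Link counts for the two pairs named in the text; other pairs are not listed.
PEER_LINKS = {
    (0, 1): 4,
    (0, 5): 1,
}


def peer_link_count(src, dst):
    """Number of Infinity Fabric links between two GCDs (order-insensitive)."""
    return PEER_LINKS[(min(src, dst), max(src, dst))]


def expected_goodput_gbps(src, dst):
    """Expected goodput under the assumed linear scaling with link count."""
    return peer_link_count(src, dst) * PER_LINK_GOODPUT_GBPS
```

For example, `expected_goodput_gbps(0, 1)` reproduces the ~1,200 Gb/s figure, four times the single-link pair, which is the 4x spread between same-node neighbors.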

7.3 RCCL's hop-based bandwidth estimator misses parallel paths

Sec. III-D identifies the underlying bug: RCCL's internal topology graph counts hops, not paths. GPU 0 -> GPU 6 (4 paths) and GPU 0 -> GPU 7 (4 paths) have the same nominal bandwidth, yet a hop-based estimate treats them differently, and the measured goodput differs accordingly (Sec. 5.3). The fix lives inside RCCL's graph path-discovery logic, the code whose decisions are logged under NCCL_DEBUG_SUBSYS=GRAPH. A runtime tuner above RCCL cannot recover this: the graph is decided at init. This is a clean instance of a class of bugs that DynamICCL cannot fix (it can only set knobs that are already exposed; it cannot change the algorithm's input model).

7.4 Open MPI on Leonardo silently falls back to host allreduce

Sec. IV-D documents that Open MPI 4.1.4 on Leonardo "runs the allreduce on the host" because UCC is not deployed. The result is catastrophic for inter-node allreduce performance (Fig. 10). This is a deployment-level decision masquerading as a library-level choice — and it is invisible to the user without opening the debug output. The lesson is structural: the same software label ("Open MPI 4.1.4") can mean different things on different systems depending on which optional components were built. DynamICCL's state needs mpi_has_ucc as a binary feature — not because DynamICCL would choose MPI over NCCL, but because the application above DynamICCL might switch backends based on this flag, which changes the workload distribution NCCL sees.

7.5 The 32 KiB algorithm-inversion threshold (LUMI)

Fig. 11's heatmap changes trend around 32 KiB: for allreduce, RCCL is about 10x slower than MPI at 1 B and 1.8-2.9x faster at 1 GiB. The paper suggests the inversion "might be mitigated by tuning the allreduce algorithm selection". This is the single clearest piece of empirical evidence in the paper that NCCL/RCCL algorithm selection has multiplicative impact, and that the default threshold is likely mis-tuned on at least one production system. DynamICCL's state must include msg_size_bin and cluster_id; the action over algorithm and protocol must be conditioned on the joint key.

7.6 The unexplained 256-to-512 *CCL allreduce drop

Sec. V-D documents a goodput drop on Alps and LUMI between 256 and 512 GPUs that cannot be explained by an algorithm change (the authors fixed the algorithm and reproduced the same drop). The authors do not pin down the cause but frame it as a saturation effect inside the collective implementation. For DynamICCL, the regime 256-1,024 GPUs is exactly the high-leverage zone: the paper's own measurements show static defaults fail here, the authors cannot explain why, and the drop reproduces under a fixed algorithm — implying nChannels / numThreads / chunkSize tuning is the remaining lever.

7.7 The alltoall stall at 512/1,024 GPUs is a connection-count problem

The benchmark stalls at 512 NCCL-AT and 1,024 RCCL-AT, in the authors' code and in the official nccl-tests / rccl-tests. The authors hypothesize "the higher number of connections that must be kept active in the alltoall compared to the allreduce". If that hypothesis holds, it identifies a different kind of bottleneck: a hard scaling limit in the connection-tracking data structure inside *CCL. That is a re-design question rather than a tuning question. For DynamICCL, the implication is that the action space at >=512 GPUs alltoall must exclude certain configurations or fall back to a hierarchical pattern (for example, an intra-node exchange followed by an inter-node exchange), the kind of compositional rewrite HiCCL provides above NCCL.
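
The connection-count argument can be made concrete with a simplified model. In a flat alltoall every rank exchanges data with every other rank, so each rank has n - 1 peers; in a ring allreduce each rank talks only to its two neighbors. The sketch below counts peers under that model. It is an assumption-based illustration: it does not model NCCL's or RCCL's actual per-channel connection state, and an allreduce may use a tree rather than a ring.

```python
def alltoall_peers_per_rank(n):
    """Peers each rank exchanges data with in a flat alltoall over n ranks."""
    return max(n - 1, 0)


def ring_peers_per_rank(n):
    """Peers per rank in a ring: the left and right neighbor (fewer if n < 3)."""
    return max(min(2, n - 1), 0)


def total_directed_pairs(n, peers_per_rank):
    """Total (sender, receiver) pairs across all ranks."""
    return n * peers_per_rank(n)


if __name__ == "__main__":
    for n in (256, 512, 1024):
        print(n,
              total_directed_pairs(n, alltoall_peers_per_rank),
              total_directed_pairs(n, ring_peers_per_rank))
```

Under this model the alltoall peer count grows quadratically with the number of GPUs while the ring count grows linearly, which is why the same scale can work for allreduce and fail for alltoall.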

7.8 Network noise produces a 50% allreduce drop at 1,024 GPUs

The Sec. VI measurement is a clean separation: same workload, same binary, same hardware, same scale, same algorithm; only the service level changes. The result is -50% allreduce goodput on the default SL versus a non-default SL, in production conditions. This loss sits outside what any in-NCCL tuner can recover on a noisy fabric: the noise cost is exogenous, not knob-controllable. DynamICCL can adapt to it (re-tune for the noisy regime) but cannot eliminate it; that requires routing-level fixes (Hopper-style predictive load balancing, paper 0030 ref) or scheduling-level fixes (group-aware Slurm placement).

7.9 The "host-mem latency" probe isolates GPU-management overhead

Fig. 8b's host-memory probe is the cleanest tool in the paper. The gap between Fig. 8a (GPU memory) and Fig. 8b (host memory) at identical job placement is the GPU-management overhead. Using the same-switch means from Sec. 5.6, that gap is 4.33 - 3.66 = 0.67 us on Alps and 2.03 - 1.02 = 1.01 us on Leonardo. The larger 3.66 - 1.02 = 2.64 us difference is between the two host-memory latencies, so it measures the Slingshot-versus-InfiniBand network gap at same-switch, not GPU management. This decomposition principle is the right tool for a DynamICCL state-feature designer: every observed latency should be split into (a) network transit, (b) library overhead, (c) GPU-management overhead. The agent needs all three to choose between knobs that target different layers.
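
The decomposition can be written out as arithmetic on the same-switch latencies quoted in this document (GPU-memory means from Sec. 5.6, host-memory values from the Fig. 8b discussion). The sketch assumes the two probes were taken at comparable placement so that the values can be subtracted directly; the function names are illustrative.

```python
# Same-switch latencies in microseconds, as quoted in this document.
GPU_MEM_US = {"Alps": 4.33, "Leonardo": 2.03}
HOST_MEM_US = {"Alps": 3.66, "Leonardo": 1.02}


def gpu_management_overhead_us(cluster):
    """GPU-memory latency minus host-memory latency on the same system."""
    return round(GPU_MEM_US[cluster] - HOST_MEM_US[cluster], 2)


def network_gap_us(cluster_a, cluster_b):
    """Host-memory latency difference between two systems (network side)."""
    return round(HOST_MEM_US[cluster_a] - HOST_MEM_US[cluster_b], 2)


if __name__ == "__main__":
    for c in GPU_MEM_US:
        print(c, gpu_management_overhead_us(c))
    print("Alps vs Leonardo host gap:", network_gap_us("Alps", "Leonardo"))
```

Separating the two differences keeps the layers apart: the per-system subtraction isolates GPU management, while the cross-system host-memory subtraction isolates the interconnect.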

7.10 Tuning is a partially shared resource between sites

Sec. III-B closes with: "The optimization of some of these parameters involved discussions with HPC site support teams and Cray/HPE, NVIDIA, and AMD engineers... Understanding and resolving some of these unusual behaviors took several days of investigation." This is the same insight the demystification paper (0011) and the configuration survey (0018) reach: good defaults are tribal knowledge, not documentation. The case for an automated tuner is not that it discovers new optimizations; it is that it removes the "several days of investigation" from the user's critical path.


8. Limitations of the Methodology

Limitation | Implication
Three clusters, all top-10 | No coverage of mid-range / older clusters; no PCIe-only systems
GPU peer-access disabled on Alps | Sec. III-D notes Device-Device-Copy data not collected on Alps
Alps still in commissioning | Some Alps numbers will shift with further tuning (paper acknowledges)
Leonardo node cap = 256 (1,024 GPUs) | Inter-node allreduce on Leonardo is not measured beyond 1,024
LUMI access cap = 512 nodes (4,096 GPUs) | Strong-scaling beyond 4,096 unmeasured
Alltoall stalls at 512 (NCCL) / 1,024 (RCCL) | No alltoall data above these scales -- a measurement boundary
Network-distance probe relies on Slurm hint | Job placement may not match user intent precisely
Service-level analysis only on Leonardo | Slingshot and IB-other clusters unanalyzed for noise
No NCCL knob sweep beyond the cited env-vars | Algo / proto / nChannels / numThreads / chunkSize not swept
One mechanism wins one cell | No analysis of the right knob within a mechanism
Iteration count 100-1,000 | Reasonable, but tail beyond p99.9 not characterized
Production noise sampling is opportunistic | Different time windows would produce different distributions
Mechanism comparison conflates many factors | "MPI vs NCCL" includes plugin layer (UCX, libfabric, OFI) differences
Single MPI version per system | No MPICH-vs-OMPI ablation (per cluster, only one stack tested)
No cost model fit | The expected-goodput dashed lines are analytic targets, not fitted predictions
Single workload class (microbenchmark) | No real DL training / scientific app to validate the regime atlas
No repetition over multiple weeks | Diurnal variability of production workload not characterized
Routing algorithm fixed at vendor default | Adaptive routing variants on Slingshot / IB not compared

The most consequential omission for DynamICCL is the absence of any within-NCCL knob sweep. Paper 0051 documents that mechanisms (NCCL vs MPI vs Device-Device) produce up to 10x differences and that tuning these mechanisms away from their defaults yields up to 10x gains, but it never asks "given NCCL is the chosen mechanism, what is the best (algorithm, protocol, nChannels, numThreads, chunkSize) tuple for this cell?" That is the precise complement DynamICCL is positioned to answer, and the paper provides the calibrated cells against which any tuner must be benchmarked.


9. Note on NCCL Tuning

This paper gives direct evidence that RCCL's algorithm selection around 32 KiB on LUMI is mis-tuned: for allreduce, RCCL is about 10x slower than GPU-Aware MPI at 1 B and 2-3x faster at 1 GiB, with the trend inverting at the size where the paper says "tuning the allreduce algorithm selection" might close the gap (Sec. V-E). The same paper shows that NCCL_NCHANNELS_PER_PEER=32 alone yields 3.5x on LUMI intra-node P2P, and that NCCL_NET_GDR_LEVEL=3 yields 2x-3x on collectives. These are concrete, per-cluster, per-message-size cells where a runtime tuner has measured leverage; the paper's contribution is to identify the cells, not to fill them in.


10. Analogy

The paper is a performance road test of three different car chassis on the same closed track. The three chassis (Alps with NVIDIA H100 + NVLink 4.0, Leonardo with A100 + NVLink 3.0, LUMI with MI250X + Infinity Fabric) are tested by the same drivers running the same lap routes (ping-pong, alltoall, allreduce), and each lap is run through four different transmissions (Trivial Staging, Device-Device Copy, *CCL, GPU-Aware MPI). The investigators record lap times by msg size and scale, and then publish a regime atlas: which transmission to use on which chassis at which speed. The eight Observations are the "if you are driving this chassis at this speed in this gear, you should expect this lap time" entries. The 32 KiB inversion on LUMI is the Mulsanne corner of this dataset: the place where one transmission stops working and another takes over, identified by direct lap-time measurement rather than by manufacturer specification.

The paper's silence on transmission internals is the design space DynamICCL inhabits: every cell is measured with the transmission's factory tune-up, and an adaptive engine controller (the runtime tuner) is the natural next layer above the test bench. The paper's contribution to that controller is the chassis-and-track atlas, a calibrated map of where the road is rough (Leonardo's diff-group regime, the 256-512 *CCL allreduce dropout, the 512/1,024 alltoall stall) and where it is smooth (Alps and LUMI same-switch P2P, intra-node *CCL collectives below 256 GPUs), so the controller knows where to apply the most authority.