Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects — Detailed Summary
- Authors: Daniele De Sensi (Sapienza/ETH), Lorenzo Pichetti (Trento), Flavio Vella (Trento), Tiziano De Matteis (VU Amsterdam), Zebin Ren (VU Amsterdam), Luigi Fusco (ETH), Matteo Turisini (CINECA), Daniele Cesarini (CINECA), Kurt Lust (Antwerp), Animesh Trivedi (IBM/VU Amsterdam), Duncan Roweth (HPE Cray), Filippo Spiga (NVIDIA), Salvatore Di Girolamo (NVIDIA), Torsten Hoefler (ETH).
- Venue: SC24 — International Conference for High Performance Computing, Networking, Storage, and Analysis (Atlanta, GA, Nov 17–22, 2024).
- Year / DOI: 2024; IEEE 979-8-3503-5291-7/24; arXiv:2408.14090v2 (15 Nov 2024).
Per-section, paragraph-level breakdown. Every paragraph in the paper produces at least one bullet. All quantitative numbers are preserved verbatim from the PDF.
Abstract
- Exascale supercomputers increasingly pack many GPUs per node, connected by dedicated intra-node networks reaching a few terabits per second.
- Maximizing efficiency is hard because of the diversity of technologies (NVLink, Infinity Fabric, Slingshot-11, InfiniBand HDR), design options, and software layers (MPI vs. NCCL/RCCL, henceforth *CCL).
- The paper characterizes three SC24-era systems, Alps (NVIDIA H100), Leonardo (NVIDIA A100), and LUMI (AMD MI250X), using a custom benchmark on intra- and inter-node networks, scaling up to 4,096 GPUs.
- Headline message: there is untapped bandwidth and many optimization opportunities, ranging from network routing to software-stack tuning.
I. Introduction
- Supercomputers underpin advances in ML, scientific computing, and big-data analytics; architecture is co-evolving with those workloads.
- Top500 / pre-exascale systems now host up to 8 GPUs per node with up to 3.6 Tb/s per direction, and total counts up to 75,000 GPUs.
- Moving data across so many GPUs is hard for three reasons: interconnect/topology/hardware diversity, non-trivial mapping of communication to mechanism, and large-scale congestion / network noise effects that reduce scalability.
- On the software side, options span trivial host-staged copies, GPU-Aware MPI, and *CCL (NCCL on NVIDIA, RCCL on AMD), but the right choice and the maturity of each are unclear.
- The paper characterizes three architectures: Alps (NVIDIA H100 + HPE Cray Slingshot-11), Leonardo (NVIDIA A100 + InfiniBand HDR), LUMI (AMD MI250X + Slingshot-11), benchmarking up to 4,096 GPUs.
- Contributions are: (1) a detailed analysis of intra- and inter-node data movement across D2D copies, *CCL, and GPU-Aware MPI; (2) a quantification of network-noise impact on scalability; (3) eight numbered observations distilled for system architects, researchers, and developers.
II. Systems Description
- All three interconnects are full-duplex; the paper consistently reports unidirectional bandwidth in bits per second.
- Table I (June-2024 Top500 rankings: Alps #6, Leonardo #7, LUMI #5) summarizes per-system characteristics.
Table I — Main characteristics of the analyzed systems
| Feature | Alps (#6) | Leonardo (#7) | LUMI (#5) |
|---|---|---|---|
| CPU | 72-core NVIDIA Grace | 32-core Intel Ice Lake Xeon 8358 | 64-core AMD Trento EPYC 7A53 |
| GPU | 4× NVIDIA Hopper H100 (GH200) | 4× NVIDIA Ampere A100 (special SKU) | 4× AMD MI250X (8 GCDs) |
| NICs | 4× HPE Cray Cassini-1 200 Gb/s | 2× dual-port NVIDIA ConnectX-6 (4×100 Gb/s) | 4× HPE Cray Cassini-1 200 Gb/s |
| Intra-node | NVLink 4.0, 6×200 Gb/s per pair (1.2 Tb/s) | NVLink 3.0, 4×200 Gb/s per pair (800 Gb/s) | Infinity Fabric, 1–4 links of 400 Gb/s |
| Inter-node | Slingshot-11 (Dragonfly) | InfiniBand HDR (Dragonfly+) | Slingshot-11 (Dragonfly) |
| Software | Cray MPICH 8.1.28, libfabric 1.15.2, CUDA 12.3, aws-ofi-nccl plugin | Open MPI 4.1.4 (over UCX 1.13.0), CUDA 12.1 | Cray clang 16.0.1, Cray MPICH 8.1.27, libfabric 1.15.2, ROCm 5.7.1, aws-ofi-rccl plugin v1.4 |
II-A. Alps
- Alps is a 270 PFlop/s supercomputer at #6 on the June-2024 Top500, deployed by CSCS; the paper used the early-access "Santis" partition with up to 512 nodes, so some numbers may shift before production.
- Node architecture: Four GH200 Grace Hopper Superchips per node, fully connected by NVLink 4.0; each pair has six 200 Gb/s links, totalling 1.2 Tb/s unidirectional between any GPU pair. Each GH200 carries 96 GB HBM3 plus 120 GB LPDDR5X. Each node is a single 8-domain NUMA system with 288 CPU cores and 4 GPUs.
- Inter-node: Each node has one HPE Cray Cassini-1 200 Gb/s NIC per GH200 and is connected through Slingshot-11 in a Dragonfly topology; each switch has 16 endpoint ports, 31 intra-group ports, and 17 inter-group ports.
II-B. Leonardo
- Leonardo is a 240 PFlop/s system at #7 on the June-2024 Top500, owned by EuroHPC and hosted at CINECA; the study uses the Booster partition (3,456 nodes).
- Node architecture: One Intel Xeon 8358 CPU and four NVIDIA A100 GPUs (13,824 GPUs system-wide), 512 GB CPU memory in 8 DDR4 slots, 64 GB HBM2e per GPU. Within a node, GPUs use NVLink 3.0 (four 200 Gb/s links per pair, 800 Gb/s total) and a per-GPU 256 Gb/s 16-lane PCIe Gen4 bus to the CPU and NIC.
- Inter-node: Each node has two dual-port NVIDIA ConnectX-6 NICs (four 100 Gb/s ports), all attached to the same switch, treated as four separate NICs. The Dragonfly+ topology has 23 groups; each group is a two-level fat-tree of 180 nodes with 18 spine and 18 leaf switches; switches expose 40×200 Gb/s ports (configurable as 2×100 Gb/s); leaves connect 40 100 Gb/s ports to 10 nodes (4 GPUs each) and 18 200 Gb/s ports up to spines (with 2×200 Gb/s ports unused). Spine switches use 18 200 Gb/s ports down to leaves and 22 200 Gb/s ports to other groups' spines.
II-C. LUMI
- LUMI is a 380 PFlop/s system at #5 on the June-2024 Top500, EuroHPC + CSC Finland; the paper uses the LUMI-G partition (2,978 nodes).
- Node architecture: One 64-core AMD EPYC 7A53 "Trento" CPU with 4 NUMA domains (each 128 GB DDR4) and 4 MI250X modules; each MI250X has 2 GCDs (8 GCDs / node), each with 64 GB HBM (128 GB per module). For the rest of the paper, a LUMI node is treated as an 8-GPU node. Each GCD attaches to a NUMA domain via a 288 Gb/s Infinity Fabric link; GCDs are connected to each other with one to four 400 Gb/s IF links (heterogeneous topology, see Fig. 2).
- Inter-node: Each MI250X has one 200 Gb/s Cassini-1 NIC; the Slingshot-11 fabric is a Dragonfly with 24 groups × 124 nodes. Switches: 16 endpoint, 31 intra-group, 17 inter-group ports.
III. Intra-Node Point-to-Point Performance
- The paper opens its measurement campaign with intra-node point-to-point: methodology, tuning, latency/goodput, and the GCD-location effect on LUMI.
III-A. Benchmarking Methodology
- Each GPU is bound to one MPI process with affinity to the closest CPU core. Each transfer size is repeated 100–1,000 times depending on size. Reported metric: max time / min goodput across ranks; bandwidth is unidirectional in Gb/s; communicator-creation time is excluded.
- Four mechanisms are evaluated:
- Trivial Staging: host-pinned bounce buffers, no pipelining, store-and-forward; serves as baseline.
- Device-Device (D2D) Copy: IPC memory handles shared across processes for direct device-to-device async transfers (in alltoall, every GPU launches direct copies to all peers).
- *CCL: NCCL on NVIDIA, RCCL on AMD.
- GPU-Aware MPI: direct device-buffer transfers via Cray MPICH / Open MPI.
- The authors built a custom benchmark because OSU lacks D2D, and nccl-tests/rccl-tests only support *CCL and do not expose per-iteration timing, which is needed for the noise study in Sec. VI. Timing uses MPI_Wtime with measured resolutions of 25 ns on LUMI and Leonardo, 30 ns on Alps. One-time costs (buffer allocation, exchange) are excluded; for *CCL the timing brackets the group start/end (consistent with nccl-tests/rccl-tests). Code is publicly released as an artifact.
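The reported metric (maximum time across ranks, which yields the minimum goodput) can be sketched as below. This is a minimal illustration rather than the authors' benchmark code: the function names and the sample timings are hypothetical, and it assumes Gb/s means 10^9 bits per second.

```python
def goodput_gbps(num_bytes, seconds):
    """Unidirectional goodput in Gb/s, assuming 1 Gb = 1e9 bits."""
    return num_bytes * 8 / seconds / 1e9


def reported_goodput_gbps(num_bytes, per_rank_seconds):
    """Goodput of the slowest rank: the max time across ranks gives the min goodput."""
    return goodput_gbps(num_bytes, max(per_rank_seconds))


# Hypothetical per-rank timings (seconds) for a 1 GiB transfer on four ranks.
one_gib = 1 << 30
times = [0.0110, 0.0125, 0.0108, 0.0119]
slowest = reported_goodput_gbps(one_gib, times)
print(f"reported goodput: {slowest:.1f} Gb/s")
```

Taking the slowest rank is a conservative choice: a collective finishes only when its last participant does, so the average across ranks would overstate what an application sees.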
III-B. Performance Tuning
- Default configurations did not unlock the full hardware; tuning required searching environment variables and analyzing each system. Cooperation with site teams and Cray/HPE / NVIDIA / AMD engineers was required.
- *CCL: On Alps and LUMI, setting NCCL_IGNORE_CPU_AFFINITY=1 (so *CCL overrides the Slurm-set affinity) gave up to 1.6× on alltoall and up to 6× on allreduce starting at two nodes. Setting NCCL_NET_GDR_LEVEL=3 (widening the GPU-NIC topological distance over which GPUDirect RDMA is used) produced 2× on alltoall and 3× on allreduce on Alps. On LUMI, NCCL_NCHANNELS_PER_PEER=32 improved P2P tests by 3.5×.
- GPU-Aware MPI: On Alps, forcing device-device copies regardless of size (MPICH_GPU_IPC_THRESHOLD=1) cut runtime by 2× for transfers <4 KiB. Increasing the GPU-attached staging buffer used by MPICH for kernel-based allreduce optimizations to 128 MiB (MPICH_GPU_ALLREDUCE_BLK_SIZE) gave +50% on single-node allreduce. On LUMI, disabling System Direct Memory Access (HSA_ENABLE_SDMA=0) gave up to 3×. On Leonardo, UCX was failing to load GDRCopy because it was installed at the wrong path; fixing LD_LIBRARY_PATH improved small-message performance up to 6×.
- Some optima are still unexplained: Alps runtime did not increase monotonically with message size, prompting further IPC-threshold investigation. The tuning campaign took several days.
Observation 1. Achieving good performance on multi-GPU systems requires non-trivial tuning, which depends on the system, message size, communication library, and number of nodes. The default choices made by *CCL and GPU-Aware MPI are not always optimal, and manual tuning can improve performance by up to an order of magnitude.
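One way to keep such settings organized is a per-system, per-library table of environment overrides that is merged into the job environment at launch time. The sketch below is illustrative only: the `TUNING` table and `tuned_env` helper are hypothetical names, the variables and values are those reported above, and the byte-count encoding of MPICH_GPU_ALLREDUCE_BLK_SIZE is an assumption that should be checked against the local Cray MPICH documentation.

```python
import os

# Settings reported in the tuning campaign, keyed by (system, library).
TUNING = {
    ("alps", "ccl"): {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NET_GDR_LEVEL": "3",
    },
    ("lumi", "ccl"): {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NCHANNELS_PER_PEER": "32",
    },
    ("alps", "mpi"): {
        "MPICH_GPU_IPC_THRESHOLD": "1",
        # Assumption: the variable takes a size in bytes (128 MiB here).
        "MPICH_GPU_ALLREDUCE_BLK_SIZE": str(128 * 1024 * 1024),
    },
    ("lumi", "mpi"): {"HSA_ENABLE_SDMA": "0"},
}


def tuned_env(system, library, base=None):
    """Return a copy of the base environment with the overrides for one system/library."""
    env = dict(os.environ if base is None else base)
    env.update(TUNING.get((system.lower(), library.lower()), {}))
    return env


example = tuned_env("alps", "ccl", base={})
print(example)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a benchmark. Since the reported gains depend on message size and node count, a table like this is a starting point for a per-system search, not a fixed recipe.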
III-C. Point-to-Point Latency and Goodput
- Fig. 3 shows ping-pong unidirectional goodput between two same-node GPUs; goodput = bytes / (runtime/2). Inset: small-message runtime in microseconds. Each point is a mean across runs; shaded band is interquartile range. Dashed lines: nominal GPU–GPU goodput and the trivial-staging expected goodput. On Leonardo, peak per-pair bandwidth is 800 Gb/s (4×200 Gb/s NVLink 3.0 links).
- On LUMI the peak depends on the chosen pair; the experiment uses GCDs 0 and 1 (4×400 Gb/s IF links). Disabling SDMA lets a GCD use multiple IF links concurrently; on Alps, peer access for explicit D2D copies was not yet enabled in the early-access state, so D2D experiments are omitted there.
- Goodput finding: Trivial staging is up to an order of magnitude lower than the other implementations because the data must go through host memory; *CCL, GPU-Aware MPI, and D2D copy are comparable at large sizes. On Leonardo, GPU-Aware MPI medium-message goodput is up to 2× higher than NCCL.
- Latency finding: For small messages, *CCL and MPI are similar on Alps but show a large gap on Leonardo and LUMI. On Leonardo this is due to GDRCopy. On LUMI, Cray MPICH transfers small same-node buffers via host memory (rather than D2D copy), using an optimized memcpy where the CPU directly issues load/store ops to GPU HBM. This is possible on AMD but not NVIDIA GPUs, so on Alps CPU load/store to GPU memory is forbidden, which raises small-message latency.
Observation 2. GPU-Aware MPI provides the highest goodput for intra-node point-to-point transfers on all the analyzed systems. For small transfers, the optimal solution changes across the systems, depending on architectural features and specific optimization implemented by MPI.
III-D. Impact of GPU Location on LUMI
- LUMI's 8 GCDs are partially connected: between any two GCDs there are 1, 2, 3, or 4 IF links. Fig. 4 shows GPU 0 → GPU 1..7 unidirectional goodput for a 1 GiB buffer, with a per-pair dashed nominal-goodput line.
- Trivial staging shows no variation (it never moves data directly between GPUs). GPU-Aware MPI and D2D copy reach ~70% of nominal on every pair. RCCL, however, achieves less than half the goodput of GPU-Aware MPI / D2D in some cases (for example GPU 0 ↔︎ GPU 5).
- Looking at RCCL debug info (NCCL_DEBUG_SUBSYS=INIT,GRAPH and NCCL_DEBUG=INFO), the library appears to estimate available bandwidth from hop count rather than path diversity (number of links connecting two GCDs). This is helpful for collectives where multiple GPUs share links concurrently, but underutilizes the interconnect for sparse / point-to-point patterns.
Observation 3. On LUMI, RCCL point-to-point primitives do not correctly determine the bandwidth available between GPUs on the same node, thus underutilizing the available bandwidth.
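The difference between a hop-count view and a path-diversity view can be illustrated with a toy model. The sketch below is not RCCL's actual logic, which the paper only infers from debug output: `hop_based_estimate_gbps` is a hypothetical stand-in for an estimator that sees only path length, and the single-link pair is an illustrative example (only the four-link GCD 0 / GCD 1 pair is taken from the text).

```python
IF_LINK_GBPS = 400  # per-link Infinity Fabric bandwidth from Table I


def nominal_pair_gbps(num_links):
    """Nominal goodput between two GCDs joined by num_links parallel IF links."""
    return num_links * IF_LINK_GBPS


def hop_based_estimate_gbps(hops):
    """Hypothetical estimator that only looks at hop count, ignoring parallel links."""
    return IF_LINK_GBPS / hops


# GCD 0 and GCD 1 share four links; the single-link pair is illustrative.
pairs = {"four-link pair": 4, "single-link pair": 1}
for name, links in pairs.items():
    nominal = nominal_pair_gbps(links)
    estimate = hop_based_estimate_gbps(hops=1)  # both pairs are one hop apart
    print(f"{name}: nominal {nominal} Gb/s, hop-based estimate {estimate:.0f} Gb/s")
```

Both pairs are one hop apart, so an estimator of this kind assigns them the same bandwidth even though their nominal goodput differs by 4×.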
IV. Intra-Node Collectives Performance
- Collective goodput is defined as the buffer size divided by runtime; the section covers expected and measured performance for alltoall (IV-A, IV-B) and allreduce (IV-C, IV-D).
IV-A. Alltoall — Expected Goodput
- Expected goodput is computed from the edge forwarding index (max number of paths crossing any edge): for an alltoall under shortest-path routing, this is an estimate of worst-case peak bandwidth on the heaviest link.
- On Alps and Leonardo every GPU pair is connected directly (max edge forwarding index = 1); thus expected peak alltoall goodput equals GPU injection bandwidth.
- On LUMI the most loaded link is GCD 1 ↔︎ GCD 5 (and GCD 7 ↔︎ GCD 3), used by four shortest paths. Each IF link is 400 Gb/s, so the alltoall peak between any GCD pair is 100 Gb/s. Because each GCD can send on six IF links simultaneously, the expected per-GPU alltoall goodput is 600 Gb/s. Per-GPU injection bandwidth on LUMI's MI250X equals that of an A100, but the partial intra-node connectivity raises the edge forwarding index and lowers the effective alltoall ceiling.
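The edge forwarding index used above can be computed by routing every ordered pair along a shortest path and counting how many paths cross each directed edge. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the graph is assumed to be connected, ties between equal-length paths are broken by BFS visiting order, and only the 4-GPU full mesh is modelled as a graph (the LUMI figures are taken from the text, not derived from a topology).

```python
from collections import deque
from itertools import permutations


def shortest_path(adj, src, dst):
    """Return one shortest path from src to dst as a list of nodes (BFS)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def edge_forwarding_index(adj):
    """Max number of routed shortest paths that cross any directed edge."""
    load = {}
    for src, dst in permutations(adj, 2):
        path = shortest_path(adj, src, dst)
        for edge in zip(path, path[1:]):
            load[edge] = load.get(edge, 0) + 1
    return max(load.values())


# Full mesh of 4 GPUs, as on Alps and Leonardo.
full_mesh = {g: [p for p in range(4) if p != g] for g in range(4)}
efi_mesh = edge_forwarding_index(full_mesh)

# LUMI figures from the text: 400 Gb/s link, index 4, six links per GCD.
lumi_per_pair_gbps = 400 / 4
lumi_per_gpu_gbps = 6 * lumi_per_pair_gbps
print(efi_mesh, lumi_per_pair_gbps, lumi_per_gpu_gbps)
```

For the full mesh every pair has a dedicated link, so the index is 1 and the alltoall ceiling equals the injection bandwidth; dividing the LUMI link bandwidth by its index of 4 reproduces the 100 Gb/s per-pair figure.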
IV-B. Alltoall — Measured Goodput
- Fig. 5 gives the measured intra-node alltoall goodput; dashed lines mark the expected goodput.
- NCCL does not natively provide an alltoall, so a trivial multi-send algorithm (each GPU sends to all others simultaneously, as suggested in the documentation) is used; the same trivial algorithm is used with D2D copies; no measurable difference was found vs. RCCL's native alltoall.
- On Alps and LUMI, *CCL provides the best large-transfer performance because *CCL collectives are tuned for the target: communications are batched per-topology and the number of in-flight chunks during pipelined ops is matched to per-pair bandwidth. MPI does not apply such fine-grained tuning, so it does not exploit the full intra-node bandwidth. On Leonardo, *CCL is slightly worse than MPI. For small transfers on Alps and Leonardo, *CCL is comparable to MPI; on LUMI, GPU-Aware MPI is up to 3× faster than *CCL for small transfers, consistent with the P2P observation.
IV-C. Allreduce — Expected Goodput
- On Alps and Leonardo (full mesh) the optimal large-message allreduce algorithm is a pipelined ternary tree: one GPU acts as root, the other three as leaves, reduce then broadcast; expected peak is the sum of bandwidth of all outgoing GPU links.
- On LUMI the optimal large-message algorithm is Rabenseifner (ring reduce-scatter then ring allgather) with four edge-disjoint bidirectional rings, each on 400 Gb/s IF links. Because Rabenseifner sends twice the buffer bytes, the peak is 800 Gb/s.
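The expected allreduce peaks follow from simple arithmetic on the link bandwidths in Table I. The sketch below is one reading of the two rules stated above, with hypothetical function names; the Leonardo value is derived here from the 800 Gb/s per-pair figure and is not quoted in the summary.

```python
def tree_allreduce_peak_gbps(num_peers, pair_gbps):
    """Full mesh: sum of the bandwidth of all outgoing links of one GPU."""
    return num_peers * pair_gbps


def ring_allreduce_peak_gbps(num_rings, link_gbps):
    """Edge-disjoint rings; the factor 2 reflects sending twice the buffer bytes."""
    return num_rings * link_gbps / 2


alps_peak = tree_allreduce_peak_gbps(num_peers=3, pair_gbps=1200)
leonardo_peak = tree_allreduce_peak_gbps(num_peers=3, pair_gbps=800)
lumi_peak = ring_allreduce_peak_gbps(num_rings=4, link_gbps=400)
print(alps_peak, leonardo_peak, lumi_peak)
```

The Alps result matches the 3.6 Tb/s per-direction figure quoted in the introduction, and the LUMI result matches the 800 Gb/s peak stated above.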
IV-D. Allreduce — Measured Goodput
- Fig. 6 shows allreduce goodput vs. message size. On Alps and Leonardo, *CCL outperforms MPI at every size. On LUMI, GPU-Aware MPI has the lowest small-transfer runtime, while *CCL leads on large transfers but is far from the 800 Gb/s expected peak.
- GPU-Aware MPI exhibits low performance on all systems and the gap to *CCL is larger for allreduce than alltoall, because allreduce involves aggregation: *CCL performs aggregation on the GPUs, while MPI does it less efficiently. On Leonardo the gap is even larger because Open MPI runs the allreduce on the host (similar to the trivial staging baseline). Open MPI on Leonardo does not support UCC. The host-side reduction on GPU 0 is unpipelined; the authors keep it as a reference data point for completeness.
- The measured-vs-expected gap is larger for collectives than for P2P, indicating room for collective-algorithm optimization. LUMI is closer to its expected peak partly because LUMI's expected peak is itself lower (the ceiling is easier to saturate).
Observation 4. For single-node collectives, *CCL outperforms GPU-Aware MPI in most cases, except for small collectives on LUMI. Indeed, *CCL collectives are optimized for the specific GPU models. Nevertheless, there is still room for collective-algorithm optimization.
V. Inter-Node Performance
- The section analyzes inter-node P2P (V-A), the impact of network distance (V-B), alltoall (V-C), allreduce (V-D), and a global MPI vs. *CCL comparison (V-E), scaling up to 4,096 GPUs.
V-A. Unidirectional Latency and Goodput
- Methodology: ping-pong between two nodes; one MPI process per available GPU; affinity is fixed so each rank uses its closest GPU + closest NIC. To isolate GPU-management overhead, the analysis is also rerun with host-memory buffers (one MPI process per NIC). Results: Fig. 7 (per-node total goodput summed across NICs; latency in inset).
- MPI gives the highest goodput and lowest latency on every system, regardless of buffer location, because of *CCL's extra GPU kernel-launch and management overhead.
Observation 5. On inter-node point-to-point communications, MPI outperforms *CCL by up to one order of magnitude on small transfers, and by up to 3× on larger transfers.
V-B. Impact of Network Distance on Performance
- Three configurations: same switch, same group / different switch, and different group. Results in Fig. 8 (a: GPU buffers; b: host buffers).
- Each box's whiskers show 5th/95th percentile, the cross marks the mean, the median is the middle line, the box edges are quartiles, and the notch shows the 95% CI of the median; outliers are listed numerically as min/max.
- GPU-buffer subsection (Fig. 8a): Same-switch latency: 3.7us–5.7us across all systems; same-group different-switch on Alps and LUMI raises latency by 28% (e.g. 4.33us → 5.56us on Alps); different-group on Leonardo raises latency 2× (2.03us → 4.23us). Goodput drops are 1% on Alps and LUMI, 17% on Leonardo (395 → 328 Gb/s). All three systems reach 95% of theoretical peak bandwidth at same-switch distance.
- The Leonardo variability is attributed to network noise — interference from other concurrent jobs sharing the inter-node fabric (analyzed in Sec. VI). Leonardo's 95th-percentile latency exceeds 8us when GPUs are in different groups, with a maximum of 132us. Minimum measured goodput drops as low as 216 Gb/s.
- Host-buffer subsection (Fig. 8b): Latency on Leonardo is more than 3× lower than on Alps and LUMI at same-switch distance (1.02us vs. 3.66us). The gap is attributed to Slingshot's Ethernet-based protocol, which carries higher overhead than InfiniBand (e.g. larger headers). Alps's host-memory latency is slightly higher than LUMI's because Alps is not yet fully optimized.
Observation 6. On Alps and LUMI, GPU's network location has a marginal impact on average performance (below 30% for latency and 1% for goodput). On the other hand, on Leonardo, the average latency increases by up to 2× when the GPUs are in different groups rather than under the same switch. Similarly, the average goodput decreases by 17%. This is mainly due to network performance variability caused by network noise.
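The percentages and ratios quoted in this section can be checked directly from the raw figures. The short sketch below only redoes that arithmetic; the helper name `relative_change` is hypothetical, and the inputs are the latency (us) and goodput (Gb/s) values listed for Fig. 8a.

```python
def relative_change(before, after):
    """Signed relative change from before to after (0.28 means +28%)."""
    return (after - before) / before


alps_latency_increase = relative_change(4.33, 5.56)      # same switch vs. same group
leonardo_latency_ratio = 4.23 / 2.03                     # same switch vs. different group
leonardo_goodput_change = relative_change(395, 328)      # Gb/s

print(f"Alps latency: {alps_latency_increase:+.0%}")
print(f"Leonardo latency: {leonardo_latency_ratio:.2f}x")
print(f"Leonardo goodput: {leonardo_goodput_change:+.0%}")
```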
V-C. Alltoall
- Fig. 9 plots a 2 MiB alltoall vs. GPU count; "asymptotic" expected goodput is the per-GPU inter-node bandwidth (200 Gb/s on Alps, 100 Gb/s on Leonardo and LUMI).
- Scale limits: Leonardo measurements stop at 1,024 GPUs (256-node user limit); Alps GPU-Aware MPI stops at 2,048 GPUs (512-node early-access limit); NCCL alltoall stops at 256 GPUs because at 512+ GPUs the official nccl-tests and the paper benchmark both stall (also seen by rccl-tests); the allreduce collective is not affected. LUMI MPI stops at 4,096; RCCL stalls at 1,024+ GPUs (alltoall only).
- *CCL outperforms GPU-Aware MPI on all systems because it exploits the intra-node interconnect more effectively; the gap shrinks as GPU count grows (the intra-node share of the traffic becomes smaller relative to the inter-node share). On Alps and Leonardo, *CCL reaches ~75% efficiency at 1,024 GPUs; on LUMI efficiency is slightly lower.
V-D. Allreduce
- Fig. 10 plots a 1 GiB allreduce vs. GPU count. *CCL outperforms GPU-Aware MPI on all systems (same architectural reasons as alltoall). On Leonardo, GPU-Aware MPI shows extremely low goodput because Open MPI copies the buffer to host memory and runs the allreduce on the host.
- A sharp drop in *CCL performance is observed on Alps and LUMI between 256 and 512 GPUs. This is not caused by an algorithm switch: the same drop occurs when the algorithm is held fixed, with goodput decreasing steadily between 256 and 512 GPUs.
V-E. Comparison between MPI and *CCL
- Fig. 11 reports the RCCL/MPI ratio across alltoall and allreduce vector sizes and node counts on LUMI.
- RCCL beats MPI by up to 4× on large vectors; for small collectives, GPU-Aware MPI is up to 10× faster. The crossover for RCCL vs. MPI on LUMI sits around 32 KiB. On Alps and Leonardo, NCCL outperforms GPU-Aware MPI regardless of message size and node count.
Observation 7. *CCL exploits the intra-node GPU-GPU interconnect more effectively than MPI, being specifically optimized for the target devices. Those advantages are more evident at smaller node counts and for larger transfers, for which the performance of intra-node communications has a higher weight on the overall performance. However, we experienced instability at large node counts for the alltoall on both NCCL and RCCL.
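The library choice for collectives described in this section can be written down as a small rule of thumb. This is a sketch of the summary's findings, not a tuned selector: the function name is hypothetical, the 32 KiB threshold is the approximate LUMI crossover quoted above, and which side the exact threshold falls on is an arbitrary choice made here.

```python
LUMI_CROSSOVER_BYTES = 32 * 1024  # approximate RCCL/MPI crossover reported for LUMI


def preferred_collective_library(system, message_bytes):
    """Pick the library the measurements favour for a collective of this size."""
    system = system.lower()
    if system in ("alps", "leonardo"):
        # NCCL was ahead regardless of message size and node count.
        return "NCCL"
    if system == "lumi":
        # Assumption: sizes at or above the crossover go to RCCL.
        return "RCCL" if message_bytes >= LUMI_CROSSOVER_BYTES else "GPU-Aware MPI"
    raise ValueError(f"no measurements for system {system!r}")


for size in (1024, 1 << 20):
    print(size, preferred_collective_library("lumi", size))
```

The rule covers collectives only. For inter-node point-to-point transfers the measurements favour MPI (Observation 5), and the alltoall stalls seen at large GPU counts may rule out *CCL regardless of message size.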
VI. Network Congestion and Noise
- Sec. V-B showed that Leonardo is hurt by network noise; this section quantifies the impact on collectives. No similar analysis is performed on Alps/LUMI because previous work (and Sec. V-B) shows Slingshot is largely unaffected by such noise.
VI-A. Performance Isolation through Service Level Selection
- The variability observed in V-B is queuing delay caused by other jobs' packets. InfiniBand service levels are used to mark traffic class; switches map them to virtual lanes, each with separate buffering and flow control. Round-robin arbitration between virtual lanes ensures that traffic on a lightly used SL sees lower queueing delay.
- Selecting a lightly used SL reduces noise impact. Default Leonardo traffic is mapped to SL 0; setting NCCL_IB_SL for NCCL or UCX_IB_SL for MPI to a non-default SL reduced performance variability to <1% in the same-experiment rerun (Fig. 8 reproduction). Adaptive routing is enabled on every SL on Leonardo, so the noise reduction is not attributable to enabling/disabling adaptive routing.
- The mitigation is effective only because Leonardo's default has all traffic on the same SL. If other applications were also using the non-default SL, the variability would return. To demonstrate, an allreduce on 128 GPUs is run alongside an alltoall or incast on another 128 GPUs that share the same SL; on the default SL the allreduce goodput drops; on the non-default SL the same drop occurs (Fig. 12). When the two applications are placed so their network switches do not overlap, the incast no longer affects the allreduce, but on Leonardo's at-scale Dragonfly+, switch sharing is unavoidable.
- A long-term solution must come from improvements to the adaptive routing algorithm. Slurm on Leonardo already knows the switch and Dragonfly+ group of each node and could optimize placement.
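Selecting a service level amounts to setting the two environment variables named above before launching the job. The sketch below is a minimal illustration with a hypothetical helper name. The summary does not say which non-default SL was used, so the value 1 in the example is a placeholder; the 0 to 15 range check reflects the InfiniBand SL field width.

```python
def service_level_env(service_level, base=None):
    """Return environment overrides that map NCCL and UCX traffic to one IB service level."""
    if not 0 <= service_level <= 15:
        raise ValueError("InfiniBand service levels range from 0 to 15")
    env = dict(base or {})
    env["NCCL_IB_SL"] = str(service_level)  # used by NCCL
    env["UCX_IB_SL"] = str(service_level)   # used by UCX-based MPI
    return env


# Placeholder value: the actual non-default SL must come from the site administrators.
overrides = service_level_env(1)
print(overrides)
```

As the section notes, this only helps while other jobs stay on the default level, so it is a workaround rather than a fix.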
VI-B. Noise Impact at Scale
- Prior work used simulation or synthetic noise; this paper uses real production noise by comparing the default and non-default service levels (the non-default SL approximates an empty-network scenario).
- Fig. 13 shows 2 MiB alltoall and 1 GiB allreduce on default vs. non-default SL. At small GPU counts the gap is small (few inter-switch flows). At 1,024 GPUs, network noise causes an additional 20% drop on alltoall and 50% drop on allreduce. Running on a non-default SL is only a temporary fix because all production traffic shares the default SL today.
- A long-term solution requires improving adaptive routing and Slurm-aware job placement.
Observation 8. Network noise decreases the goodput of allreduce and alltoall up to 50%.
VII. State of the Art
VII-A. Intra-Node Interconnect
- Pearson [41] characterizes interconnect-bandwidth heterogeneity within multi-GPU MI250X nodes; Siefert et al. [42] analyze intra-node GPU-GPU performance on several systems, but neither study compares *CCL to MPI or covers inter-node performance.
- Atchley et al. [5] characterize the Frontier supercomputer (network, storage, intra-node), but using GPCNet [10] over MPI for collectives. The buffers tested were on host memory, so the test does not reflect GPU-Aware MPI.
VII-B. Inter-Node Interconnect
- Li et al. [44] evaluate modern NVIDIA GPU interconnect technologies on different systems including Summit; Khorassani et al. [17] compare MPI implementations to RCCL on the Spock cluster (Slingshot + AMD MI100). Both are limited in scale (8 and 16 nodes); the present work is at-scale.
- Several works analyze network noise and inter-job interference [5], [37], [12], [38], [40], [36], [11], [46], often providing mitigation. Most simulate or synthesize noise; the present work measures the impact of real production noise.
VII-C. Other
- OSU [27], NCCL/RCCL tests [28], [29], Tartan [47], and others [48] are general benchmarks but the present paper is a system-characterization study, not a benchmarking study.
- ML, linear-algebra, computational-physics, biology, and data-management workload studies analyze multi-GPU performance but do not isolate interconnect bottlenecks.
VIII. Discussion
- The benchmarks are general; conclusions about tuning and stack choice are broadly applicable to any multi-GPU system. The tuning impact and the MPI-vs-*CCL crossover should also appear on other systems.
- The authors explicitly call out GPU-Aware MPI allreduce as having room for improvement due to suboptimal host-GPU interaction during data aggregation.
- None of the studied systems uses a fat-tree network. Most conclusions hold regardless of topology, with two exceptions: (1) very large fat-trees may have slightly higher latency due to greater diameter than Dragonfly/Dragonfly+; (2) the routing algorithm and the use of network service levels may differ on other technologies.
IX. Conclusions
- Default software configuration on all three systems failed to fully exploit hardware potential; tuning effort is required at every scale.
- Each communication library has its own optimization set: *CCL wins on collectives, GPU-Aware MPI wins on point-to-point. Exceptions exist (e.g. LUMI MPI beats RCCL on small collectives).
- Some HPC networks remain susceptible to network noise, with up to 50% degradation at scale on Leonardo.
- The work helps users of these supercomputers run efficiently and provides actionable insights for system + software designers.
Appendix — Artifact Description / Artifact Evaluation
- Artifact: Benchmark code, Slurm scripts, post-processing tools on Zenodo (10.5281/zenodo.13312325).
- Reproduction: Up to 4,096-GPU runs require ~10 hours wall-clock; need Slurm + MPI + *CCL.
- Coordinates: LUMI/Alps coordinates from /etc/cray/xname (x####c#s#b#n#); Leonardo switch mapping from topology.conf.
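A node coordinate in the x####c#s#b#n# form can be split into its numeric fields with a regular expression. The sketch below is not part of the artifact: the helper name is hypothetical, the example string is made up, and the field names (cabinet, chassis, slot, board, node) are an assumption based on the usual HPE Cray xname convention.

```python
import re

# One integer per letter-prefixed field, in the order given by the pattern above.
XNAME_RE = re.compile(r"^x(\d+)c(\d+)s(\d+)b(\d+)n(\d+)$")

# Assumed meaning of the five fields (not stated in the summary).
FIELDS = ("cabinet", "chassis", "slot", "board", "node")


def parse_xname(xname):
    """Split an xname such as the contents of /etc/cray/xname into integer fields."""
    match = XNAME_RE.match(xname.strip())
    if match is None:
        raise ValueError(f"not a node xname: {xname!r}")
    return dict(zip(FIELDS, map(int, match.groups())))


# Made-up example value; on a real node the string would be read from the file.
coords = parse_xname("x1000c0s0b0n0\n")
print(coords)
```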
Cross-Cutting Quantitative Take-Aways
| Take-away | Source |
|---|---|
| Manual tuning improves performance up to 1 order of magnitude | Obs. 1 |
| GPU-Aware MPI on Leonardo: medium-message goodput up to 2× higher than NCCL | III-C |
| RCCL underutilizes IF on LUMI for sparse pairs (<50% of MPI/D2D) | III-D, Obs. 3 |
| *CCL wins single-node collectives in most cases (except small on LUMI) | Obs. 4 |
| MPI wins inter-node P2P: up to 10× on small, 3× on large | Obs. 5 |
| Same-switch reaches 95% of theoretical bandwidth on every system | V-B |
| Leonardo: distance → 2× latency, 17% goodput drop, 132us max | V-B |
| 32 KiB is the LUMI MPI ↔︎ RCCL crossover | V-E |
| *CCL reaches ~75% efficiency at 1,024 GPUs on Alps & Leonardo (alltoall) | V-C |
| *CCL alltoall stalls at 512+ GPUs on NCCL & RCCL | V-C |
| Sharp *CCL allreduce drop 256 → 512 GPUs on Alps/LUMI (not algorithm switch) | V-D |
| Network noise on Leonardo: 20% on alltoall, 50% on allreduce at 1,024 GPUs | VI-B, Obs. 8 |
| Service-level switch (NCCL_IB_SL, UCX_IB_SL) reduces variability to <1% | VI-A |
Note on NCCL Tuning
The paper provides directly actionable NCCL configuration evidence for HPC GPU clusters. Setting NCCL_IGNORE_CPU_AFFINITY=1 improved allreduce by up to 6× and alltoall by up to 1.6× starting at two nodes; NCCL_NET_GDR_LEVEL=3 gave 2× alltoall / 3× allreduce on Alps; NCCL_NCHANNELS_PER_PEER=32 gave 3.5× on LUMI P2P; and NCCL_IB_SL (mapping NCCL traffic to a non-default InfiniBand service level) reduced performance variability to <1% on Leonardo, where production noise otherwise degrades goodput by up to 50% at scale (Obs. 1, Obs. 8). These are real environment variables whose impact is large, system-specific, and not knowable from defaults, which is strong empirical support that NCCL tuning is per-system, per-message-size, and per-scale, not a one-shot decision.