Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects — Detailed Summary

Per-section, paragraph-level breakdown. Every paragraph in the paper produces at least one bullet. All quantitative numbers are preserved verbatim from the PDF.


Abstract


I. Introduction


II. Systems Description

Table I — Main characteristics of the analyzed systems

Feature    | Alps (#6) | Leonardo (#7) | LUMI (#5)
-----------|-----------|---------------|----------
CPU        | 72-core NVIDIA Grace | 32-core Intel Ice Lake Xeon 8358 | 64-core AMD Trento EPYC 7A53
GPU        | 4× NVIDIA Hopper H100 (GH200) | 4× NVIDIA Ampere A100 (special SKU) | 4× AMD MI250X (8 GCDs)
NICs       | 4× HPE Cray Cassini-1 200 Gb/s | 2× dual-port NVIDIA ConnectX-6 (4×100 Gb/s) | 4× HPE Cray Cassini-1 200 Gb/s
Intra-node | NVLink 4.0, 6×200 Gb/s per pair (1.2 Tb/s) | NVLink 3.0, 4×200 Gb/s per pair (800 Gb/s) | Infinity Fabric, 1–4 links of 400 Gb/s
Inter-node | Slingshot-11 (Dragonfly) | InfiniBand HDR (Dragonfly+) | Slingshot-11 (Dragonfly)
Software   | Cray MPICH 8.1.28, libfabric 1.15.2, CUDA 12.3, aws-ofi-nccl plugin | Open MPI 4.1.4 (over UCX 1.13.0), CUDA 12.1 | Cray clang 16.0.1, Cray MPICH 8.1.27, libfabric 1.15.2, ROCm 5.7.1, aws-ofi-rccl plugin v1.4
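
The per-pair and per-node figures in Table I follow from simple link arithmetic. The sketch below recomputes them in Python from the link counts and per-link rates listed in the table. The tuple layout, the helper name `nominal_bandwidth`, and the "node injection" label are illustrative assumptions, not names from the paper, and the results are nominal peaks rather than measurements.

```python
# Link counts and per-link rates (Gb/s) copied from Table I.
# The data layout and the helper name are illustrative assumptions.
SYSTEMS = {
    # name: (min links per GPU pair, max links per GPU pair, Gb/s per link,
    #        NIC ports per node, Gb/s per NIC port)
    "Alps": (6, 6, 200, 4, 200),
    "Leonardo": (4, 4, 200, 4, 100),
    "LUMI": (1, 4, 400, 4, 200),
}


def nominal_bandwidth(system: str) -> dict:
    """Return nominal (not measured) bandwidths in Gb/s for one system."""
    min_links, max_links, link_rate, nic_ports, nic_rate = SYSTEMS[system]
    return {
        "pair_min_gbps": min_links * link_rate,
        "pair_max_gbps": max_links * link_rate,
        "node_injection_gbps": nic_ports * nic_rate,
    }


for name in SYSTEMS:
    print(name, nominal_bandwidth(name))
```

On LUMI the per-pair value is a range (400 to 1,600 Gb/s) because the number of Infinity Fabric links depends on which pair of GCDs is communicating.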

II-A. Alps

II-B. Leonardo

II-C. LUMI


III. Intra-Node Point-to-Point Performance

III-A. Benchmarking Methodology

III-B. Performance Tuning

Observation 1. Achieving good performance on multi-GPU systems requires non-trivial tuning, which depends on the system, message size, communication library, and number of nodes. The default choices made by *CCL and GPU-Aware MPI are not always optimal, and manual tuning can improve performance by up to an order of magnitude.

III-C. Point-to-Point Latency and Goodput

Observation 2. GPU-Aware MPI provides the highest goodput for intra-node point-to-point transfers on all the analyzed systems. For small transfers, the optimal solution changes across the systems, depending on architectural features and the specific optimizations implemented by MPI.

III-D. Impact of GPU Location on LUMI

Observation 3. On LUMI, RCCL point-to-point primitives do not correctly determine the bandwidth available between GPUs on the same node, thus underutilizing the available bandwidth.


IV. Intra-Node Collectives Performance

IV-A. Alltoall — Expected Goodput

IV-B. Alltoall — Measured Goodput

IV-C. Allreduce — Expected Goodput

IV-D. Allreduce — Measured Goodput

Observation 4. For single-node collectives, *CCL outperforms GPU-Aware MPI in most cases, except for small collectives on LUMI. Indeed, *CCL collectives are optimized for the specific GPU models. Nevertheless, there is still room for optimizing the collective algorithms.


V. Inter-Node Performance

V-A. Unidirectional Latency and Goodput

Observation 5. On inter-node point-to-point communications, MPI outperforms *CCL by up to one order of magnitude on small transfers, and by up to 3× on larger transfers.

V-B. Impact of Network Distance on Performance

Observation 6. On Alps and LUMI, the GPUs' network location has a marginal impact on average performance (below 30% for latency and 1% for goodput). On Leonardo, by contrast, the average latency increases by up to 2× when the GPUs are in different groups rather than under the same switch, and the average goodput decreases by 17%. This is mainly due to network performance variability caused by network noise.

V-C. Alltoall

V-D. Allreduce

V-E. Comparison between MPI and *CCL

Observation 7. *CCL exploits the intra-node GPU-GPU interconnect more effectively than MPI, being specifically optimized for the target devices. Those advantages are more evident at smaller node counts and for larger transfers, for which the performance of intra-node communications has a higher weight on the overall performance. However, we experienced instability at large node counts for the alltoall on both NCCL and RCCL.


VI. Network Congestion and Noise

VI-A. Performance Isolation through Service Level Selection

VI-B. Noise Impact at Scale

Observation 8. Network noise decreases the goodput of allreduce and alltoall by up to 50%.


VII. State of the Art

VII-A. Intra-Node Interconnect

VII-B. Inter-Node Interconnect

VII-C. Other


VIII. Discussion


IX. Conclusions


Appendix — Artifact Description / Artifact Evaluation


Cross-Cutting Quantitative Take-Aways

Take-away | Source
----------|-------
Manual tuning improves performance up to 1 order of magnitude | Obs. 1
GPU-Aware MPI on Leonardo: medium-message goodput up to 2× higher than NCCL | III-C
RCCL underutilizes IF on LUMI for sparse pairs (<50% of MPI/D2D) | III-D, Obs. 3
*CCL wins single-node collectives in most cases (except small on LUMI) | Obs. 4
MPI wins inter-node P2P: up to 10× on small, 3× on large | Obs. 5
Same-switch reaches 95% of theoretical bandwidth on every system | V-B
Leonardo: distance → 2× latency, 17% goodput drop, 132 µs max | V-B
32 KiB is the LUMI MPI ↔ RCCL crossover | V-E
*CCL reaches ~75% efficiency at 1,024 GPUs on Alps & Leonardo (alltoall) | V-C
*CCL alltoall stalls at 512+ GPUs on NCCL & RCCL | V-C
Sharp *CCL allreduce drop 256 → 512 GPUs on Alps/LUMI (not algorithm switch) | V-D
Network noise on Leonardo: 20% on alltoall, 50% on allreduce at 1,024 GPUs | VI-B, Obs. 8
Service-level switch (NCCL_IB_SL, UCX_IB_SL) reduces variability to <1% | VI-A
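
Figures such as "95% of theoretical bandwidth" are ratios of measured goodput to a nominal peak. The Python sketch below shows one way to compute such a ratio. It assumes goodput means payload bits delivered per second of elapsed time, which is a common definition but is not spelled out in this summary. The function names and the example timing are hypothetical and are not measurements from the paper.

```python
def goodput_gbps(payload_bytes: int, elapsed_s: float) -> float:
    """Payload bits delivered per second, in decimal Gb/s."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return payload_bytes * 8 / elapsed_s / 1e9


def efficiency(measured_gbps: float, peak_gbps: float) -> float:
    """Fraction of the nominal peak bandwidth that was achieved."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return measured_gbps / peak_gbps


# Hypothetical example: a 1 GB payload sent in 42.1 ms over a 200 Gb/s link.
measured = goodput_gbps(10**9, 0.0421)
print(f"{measured:.1f} Gb/s, {efficiency(measured, 200.0):.1%} of peak")
```

Keeping the payload size and the nominal peak explicit makes it easier to compare efficiency across systems whose links run at different rates.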

Note on NCCL Tuning

The paper provides directly actionable evidence for configuring NCCL on HPC GPU clusters. Setting NCCL_IGNORE_CPU_AFFINITY=1 improved allreduce by up to 6× and alltoall by up to 1.6×, starting at two nodes. NCCL_NET_GDR_LEVEL=3 gave 2× on alltoall and 3× on allreduce on Alps. NCCL_NCHANNELS_PER_PEER=32 gave 3.5× on LUMI point-to-point transfers. NCCL_IB_SL, which maps NCCL traffic to a non-default InfiniBand service level, reduced variability on Leonardo to <1%, against a production-noise degradation of up to 50% (Obs. 1, Obs. 8). These are real environment variables whose impact is large, system-specific, and not predictable from the defaults. This is strong empirical support for treating NCCL tuning as a per-system, per-message-size, and per-scale task rather than a one-shot decision.
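
The settings in the note above can be collected into a per-system environment before launching a benchmark. The Python sketch below is one way to do that. The variable names and values come from the note, but the helper `build_env`, the `"any"` key, and the system keys are illustrative assumptions. The note does not say on which system NCCL_IGNORE_CPU_AFFINITY was tested, so applying it everywhere is an assumption, and the service-level value is left as a parameter because no value is given.

```python
import os

# Variables and values as reported in the note. Grouping them by system
# (and the "any" bucket) is an assumption made for this sketch.
REPORTED_SETTINGS = {
    "any": {"NCCL_IGNORE_CPU_AFFINITY": "1"},  # system not named in the note
    "alps": {"NCCL_NET_GDR_LEVEL": "3"},
    "lumi": {"NCCL_NCHANNELS_PER_PEER": "32"},
}


def build_env(system, ib_service_level=None, base_env=None):
    """Return a copy of the environment with the reported settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(REPORTED_SETTINGS["any"])
    env.update(REPORTED_SETTINGS.get(system.lower(), {}))
    if ib_service_level is not None:
        # Non-default InfiniBand service level; the value is site-specific
        # and not stated in this summary.
        env["NCCL_IB_SL"] = str(ib_service_level)
        env["UCX_IB_SL"] = str(ib_service_level)
    return env


# Hypothetical usage: service level 1 is a placeholder, not a recommendation.
env = build_env("leonardo", ib_service_level=1, base_env={})
print(env)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the job launcher. Since the note stresses that the gains are system-specific, each setting is worth re-measuring rather than copying as-is.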