Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

De Sensi, Pichetti, Vella, De Matteis, Ren, Fusco, Turisini, Cesarini, Lust, Trivedi, Roweth, Spiga, Di Girolamo, Hoefler | Sapienza/ETH/Trento/VU/CINECA/HPE/NVIDIA | SC24, Atlanta GA, Nov 17–22 2024 | arXiv:2408.14090v2


Problem

Modern exascale and pre-exascale supercomputers pack up to 8 GPUs per node, connected by dedicated intra-node networks reaching a few terabits per second per direction (up to 3.6 Tb/s) and inter-node networks scaling to tens of thousands of GPUs (up to 75,000). Extracting the full performance of this hardware in production is non-trivial because (1) intra-node interconnects span heterogeneous technologies (NVIDIA NVLink, AMD Infinity Fabric); (2) inter-node fabrics span HPE Slingshot-11 and NVIDIA InfiniBand HDR with different topologies (Dragonfly vs. Dragonfly+); (3) the user-facing software is split between GPU-Aware MPI and *CCL (NCCL on NVIDIA, RCCL on AMD), each with different defaults and maturity levels; and (4) at scale, network noise from competing jobs can severely degrade collective performance. Prior characterisations focus on a single technology, a single library, or fewer than 16 nodes; none compares the full multi-stack landscape at production scale.


Core Insight

Default software configurations leave large amounts of bandwidth on the floor on every system tested. The right answer depends on transfer size, communication pattern, library, and node count: *CCL is the better choice for collectives on most systems and sizes, while GPU-Aware MPI dominates inter-node point-to-point and small intra-node transfers. Topology matters too: Slingshot-based Dragonfly is largely immune to network noise, while Leonardo's InfiniBand HDR Dragonfly+ loses up to 50% of allreduce goodput at 1,024 GPUs to production noise — partially recoverable by routing traffic to a non-default service level. Achieving good performance is a per-system, per-message-size, per-scale tuning problem, not a one-shot choice. The paper distils this empirical landscape into eight numbered observations spanning tuning, P2P, collectives, distance effects, and noise.


Method

The authors built a custom point-to-point and collective benchmark from scratch, because OSU lacks D2D copies and nccl-tests/rccl-tests lack the per-iteration timing needed for noise studies. They ran it on three SC24-era flagship systems (Alps, Leonardo, and LUMI), scaling up to 4,096 GPUs, with four communication mechanisms: Trivial Staging, D2D Copy, *CCL, and GPU-Aware MPI.

Performance tuning involved sweeping environment variables: NCCL_IGNORE_CPU_AFFINITY, NCCL_NET_GDR_LEVEL, NCCL_NCHANNELS_PER_PEER, MPICH_GPU_IPC_THRESHOLD, MPICH_GPU_ALLREDUCE_BLK_SIZE, HSA_ENABLE_SDMA, LD_LIBRARY_PATH (for GDRCopy/UCX), and NCCL_IB_SL and UCX_IB_SL (for InfiniBand service-level selection). Tuning required cooperation with HPC site teams and with Cray/HPE, NVIDIA, and AMD engineers, and took several days of investigation.


Experimental Setup

Parameter | Value
Systems | Alps (270 PFlop/s, #6), Leonardo (240 PFlop/s, #7), LUMI (380 PFlop/s, #5); ranks from the June 2024 Top500
GPUs/node | 4 H100 (Alps), 4 A100 (Leonardo), 4 MI250X = 8 GCDs (LUMI)
NICs/node | 4× Cassini-1 200 Gb/s (Alps, LUMI), 4× ConnectX-6 100 Gb/s (Leonardo)
Intra-node fabric | NVLink 4.0, 6×200 Gb/s = 1.2 Tb/s (Alps); NVLink 3.0, 4×200 Gb/s = 800 Gb/s (Leonardo); IF, 1-4×400 Gb/s (LUMI)
Inter-node fabric | Slingshot-11 Dragonfly (Alps, LUMI); InfiniBand HDR Dragonfly+, 23 groups × 180 nodes (Leonardo)
Mechanisms | Trivial Staging, D2D Copy, *CCL, GPU-Aware MPI
Software | Cray MPICH 8.1.28 + CUDA 12.3 (Alps); Open MPI 4.1.4 + UCX 1.13.0 + CUDA 12.1 (Leonardo); Cray MPICH 8.1.27 + ROCm 5.7.1 (LUMI)
Scale | Up to 512 nodes / 2,048 GPUs on Alps; up to 256 nodes / 1,024 GPUs on Leonardo; up to 4,096 GPUs on LUMI
Repetitions | 100–1,000 per transfer size; max time / min goodput across ranks
Timer | MPI_Wtime; 25 ns resolution on LUMI/Leonardo, 30 ns on Alps
Metric | Unidirectional goodput (Gb/s) and latency (us)
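
The reporting rule in the Repetitions and Metric rows can be illustrated with a short Python sketch. It assumes that "max time / min goodput across ranks" is applied per iteration, which the summary does not state explicitly. The function names and the timing values are hypothetical and are not taken from the authors' benchmark.

```python
from statistics import median


def goodput_gbps(nbytes, seconds):
    """Unidirectional goodput in Gb/s for nbytes moved in `seconds`."""
    return nbytes * 8 / seconds / 1e9


def per_iteration_goodput(nbytes, times_by_rank):
    """times_by_rank[r][i] is the time rank r measured for iteration i.

    Assumption: for each iteration the slowest rank defines completion,
    so the reported goodput is the minimum across ranks.
    """
    n_iters = len(times_by_rank[0])
    worst = [max(rank[i] for rank in times_by_rank) for i in range(n_iters)]
    return [goodput_gbps(nbytes, t) for t in worst]


# Hypothetical timings in seconds: 1 MiB transfer, 2 ranks, 3 iterations.
timings = [
    [50e-6, 52e-6, 49e-6],  # rank 0
    [51e-6, 50e-6, 80e-6],  # rank 1, with one slow (noisy) iteration
]
samples = per_iteration_goodput(1 << 20, timings)
print([round(g, 1) for g in samples], round(median(samples), 1))
```

Keeping one sample per iteration, rather than a single average, is what makes the later noise analysis possible.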

Headline Quantitative Results

Tuning gains (Obs. 1)

Knob | System | Effect
NCCL_IGNORE_CPU_AFFINITY=1 | Alps, LUMI | up to 1.6× alltoall, 6× allreduce (≥2 nodes)
NCCL_NET_GDR_LEVEL=3 | Alps | 2× alltoall, 3× allreduce
NCCL_NCHANNELS_PER_PEER=32 | LUMI | 3.5× P2P
MPICH_GPU_IPC_THRESHOLD=1 | Alps | 2× for transfers < 4 KiB
MPICH_GPU_ALLREDUCE_BLK_SIZE=128 MiB | Alps | +50% on single-node allreduce
HSA_ENABLE_SDMA=0 | LUMI | up to 3×
Fix LD_LIBRARY_PATH for GDRCopy/UCX | Leonardo | up to 6× on small messages
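
As a minimal sketch of how the knobs in the tuning table above might be applied per system, the Python snippet below builds a process environment from a lookup table. The `TUNING` dictionary and `tuned_env` helper are hypothetical names introduced for illustration. The sketch leaves out Leonardo, whose fix is a site-specific library path, and MPICH_GPU_ALLREDUCE_BLK_SIZE, because the summary gives its value as "128 MiB" without the exact string the variable expects.

```python
import os

# Values copied from the tuning table; they were reported as beneficial on
# these specific systems and should not be read as universal defaults.
TUNING = {
    "alps": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NET_GDR_LEVEL": "3",
        "MPICH_GPU_IPC_THRESHOLD": "1",
    },
    "lumi": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NCHANNELS_PER_PEER": "32",
        "HSA_ENABLE_SDMA": "0",
    },
}


def tuned_env(system, base=None):
    """Return a copy of the environment with the system's knobs applied.

    Unknown systems get the base environment unchanged.
    """
    env = dict(os.environ if base is None else base)
    env.update(TUNING.get(system.lower(), {}))
    return env


print(sorted(TUNING["lumi"]))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a benchmark. The launcher command itself is site-specific and is not shown here.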

Intra-node P2P (Obs. 2, 3)

GPU-Aware MPI achieves the highest goodput on every system. On Leonardo, GPU-Aware MPI medium-message goodput is up to 2× higher than NCCL. Trivial staging is up to one order of magnitude lower than every other mechanism. On LUMI, RCCL achieves less than half the goodput of MPI/D2D in some pairs (e.g. GPU 0 ↔︎ 5) because RCCL estimates available bandwidth from hop count rather than path diversity — fine for collectives but underutilising for sparse P2P. Small-message latency is similar on Alps but differs on Leonardo and LUMI: Leonardo benefits from GDRCopy, while LUMI uses an optimized CPU-to-HBM memcpy for small same-node buffers (NVIDIA does not allow CPU load/store to GPU memory, so Alps does not benefit).

Intra-node collectives (Obs. 4)

*CCL beats GPU-Aware MPI in most cases, except small collectives on LUMI where GPU-Aware MPI is up to 3× faster. On Alps/Leonardo (full mesh) the optimal allreduce algorithm is a pipelined ternary tree with the expected peak equal to the sum of bandwidth of all outgoing GPU links. On LUMI (partial mesh) Rabenseifner with four edge-disjoint bidirectional rings on 400 Gb/s IF gives an 800 Gb/s expected peak. Alltoall expected peaks: equal to GPU injection bandwidth on Alps/Leonardo; on LUMI the most-loaded link is GCD 1 ↔︎ GCD 5 (used by 4 paths), so per-pair alltoall peak is 100 Gb/s and per-GPU peak is 600 Gb/s.
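
The expected peaks quoted above can be reproduced with simple arithmetic. The sketch below assumes that the link counts in the Experimental Setup (6×200 Gb/s on Alps, 4×200 Gb/s on Leonardo) are per GPU pair in a 4-GPU full mesh, and that a shared link is divided equally among the paths crossing it. The function names are hypothetical.

```python
def full_mesh_allreduce_peak_gbps(gpus_per_node, links_per_pair, link_gbps):
    """Sum of the bandwidth of all outgoing GPU links in a full mesh."""
    peers = gpus_per_node - 1
    return peers * links_per_pair * link_gbps


def shared_link_per_pair_gbps(link_gbps, paths_on_link):
    """Bandwidth each path gets if the most-loaded link is shared equally."""
    return link_gbps / paths_on_link


alps = full_mesh_allreduce_peak_gbps(4, 6, 200)
leonardo = full_mesh_allreduce_peak_gbps(4, 4, 200)
lumi_pair = shared_link_per_pair_gbps(400, 4)

print(alps)       # → 3600 (Gb/s, i.e. 3.6 Tb/s)
print(leonardo)   # → 2400 (Gb/s, i.e. 2.4 Tb/s)
print(lumi_pair)  # → 100.0 (Gb/s per pair)
```

The Alps figure matches the 3.6 Tb/s intra-node bandwidth quoted in the Problem section, and the LUMI figure matches the 100 Gb/s per-pair alltoall peak.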

Inter-node P2P (Obs. 5, 6)

MPI outperforms *CCL by up to 10× on small transfers and 3× on large transfers, regardless of buffer location, due to *CCL's GPU kernel-launch overhead. All systems reach 95% of theoretical peak at same-switch distance. The impact of network distance differs by fabric. On Alps and LUMI, crossing switches increases latency by 28% with only a 1% goodput drop. On Leonardo, crossing groups roughly doubles latency (2.03 us → 4.23 us) and cuts goodput by 17% (395 → 328 Gb/s), with a maximum latency of 132 us and a minimum goodput of 216 Gb/s. Slingshot's Ethernet-based protocol gives higher base latency than InfiniBand (3.66 us vs. 1.02 us, same switch, host buffers).
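
The Leonardo distance figures can be checked with a short worked calculation. The snippet below recomputes the latency ratio and the goodput drop from the measurements quoted above. The helper name is hypothetical.

```python
def relative_drop(before, after):
    """Fraction lost when a value falls from `before` to `after`."""
    return (before - after) / before


# Leonardo, same-switch vs. different-group (values from the text above).
latency_ratio = 4.23 / 2.03
goodput_drop = relative_drop(395.0, 328.0)

print(f"latency ratio: {latency_ratio:.2f}x")  # → latency ratio: 2.08x
print(f"goodput drop: {goodput_drop:.0%}")     # → goodput drop: 17%
```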

Inter-node collectives (Obs. 7)

*CCL outperforms GPU-Aware MPI on every system; the gap shrinks as GPU count grows because intra-node communication becomes a smaller fraction of total work. *CCL reaches ~75% efficiency at 1,024 GPUs on Alps and Leonardo (2 MiB alltoall). *CCL allreduce goodput drops sharply between 256 and 512 GPUs on Alps and LUMI; this is not caused by an algorithm switch, since the same drop appears with the algorithm fixed. NCCL/RCCL alltoall stalls at 512+ GPUs (also seen with the official nccl-tests/rccl-tests); allreduce is unaffected. On LUMI, RCCL beats MPI by up to 4× on large vectors but MPI is up to 10× faster on small collectives; the crossover is ~32 KiB. On Alps and Leonardo, NCCL outperforms GPU-Aware MPI regardless of message size and node count.
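
The per-system guidance for inter-node collectives can be written as a small selection rule. The sketch below is only an illustration of the observations above, not a tool from the paper. The function name is hypothetical, the 32 KiB threshold is the approximate crossover reported for LUMI, and the behaviour exactly at the threshold is an arbitrary choice.

```python
KIB = 1024
LUMI_CROSSOVER_BYTES = 32 * KIB  # approximate crossover reported for LUMI


def pick_collective_library(system, nbytes):
    """Suggest a library for an inter-node collective of `nbytes` bytes."""
    name = system.lower()
    if name == "lumi":
        # Small collectives favour MPI on LUMI, large vectors favour RCCL.
        return "GPU-Aware MPI" if nbytes < LUMI_CROSSOVER_BYTES else "RCCL"
    if name in ("alps", "leonardo"):
        # NCCL was reported faster regardless of size and node count.
        return "NCCL"
    raise ValueError(f"no guidance recorded for system {system!r}")


print(pick_collective_library("LUMI", 4 * KIB))
print(pick_collective_library("LUMI", 2 * KIB * KIB))
print(pick_collective_library("Alps", 8))
```

A real deployment would measure the crossover on the target system instead of hard-coding it, since the tuning results suggest it can shift with software versions and scale.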

Network noise (Obs. 8)

On Leonardo at 1,024 GPUs, real production noise causes a 20% drop on alltoall and a 50% drop on allreduce. Routing traffic to a non-default IB service level (NCCL_IB_SL, UCX_IB_SL) reduces variability to <1%. The fix is fragile: it works only because production traffic is concentrated on the default SL today. Adaptive routing is enabled on every SL on Leonardo, so the gain is not attributable to enabling/disabling adaptive routing. A long-term solution requires improvements in adaptive-routing algorithms and Slurm-aware placement (Slurm already knows each node's switch and Dragonfly+ group).
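
A hedged sketch of the service-level workaround follows. It sets NCCL_IB_SL and UCX_IB_SL to the same value and quantifies noise as the drop of mean goodput from the best observed iteration. That metric is an assumption made for illustration and may differ from the paper's definition. The helper names are hypothetical, and the service-level number is a caller-supplied parameter, because which level is non-default and lightly used is site-specific.

```python
from statistics import mean


def with_service_level(env, service_level):
    """Return a copy of `env` that routes NCCL and UCX traffic to one IB SL."""
    out = dict(env)
    out["NCCL_IB_SL"] = str(service_level)
    out["UCX_IB_SL"] = str(service_level)
    return out


def mean_drop_from_best(samples_gbps):
    """Relative drop of the mean goodput from the best iteration (assumed metric)."""
    best = max(samples_gbps)
    return (best - mean(samples_gbps)) / best


# Hypothetical per-iteration goodput samples in Gb/s.
noisy = [100.0, 50.0]
quiet = [100.0, 100.0, 100.0]
print(mean_drop_from_best(noisy), mean_drop_from_best(quiet))
print(with_service_level({}, 1))
```

Comparing this metric between the default and a non-default service level is one way to check whether the workaround still helps, which matters because the text above notes that the fix depends on where production traffic currently sits.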


Limitations


Open Problems


Note on NCCL Tuning

The paper provides directly actionable NCCL configuration evidence for HPC GPU clusters. Setting NCCL_IGNORE_CPU_AFFINITY=1 improved allreduce by up to 6× and alltoall by up to 1.6× starting at two nodes; NCCL_NET_GDR_LEVEL=3 gave 2× alltoall / 3× allreduce on Alps; NCCL_NCHANNELS_PER_PEER=32 gave 3.5× on LUMI P2P; and NCCL_IB_SL (mapping NCCL traffic to a non-default InfiniBand service level) reduced production-noise variability from up to 50% allreduce degradation to <1% on Leonardo (Obs. 1, Obs. 8). These knobs are large-effect, system-specific, and not knowable from defaults — strong empirical support that NCCL tuning is per-system, per-message-size, and per-scale rather than a one-shot decision.