Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

De Sensi, Pichetti, Vella, De Matteis, Ren, Fusco, Turisini, Cesarini, Lust, Trivedi, Roweth, Spiga, Di Girolamo, Hoefler | Sapienza/ETH/Trento/VU/CINECA/HPE/NVIDIA | SC24, Atlanta GA, Nov 17–22 2024 | arXiv:2408.14090v2

Problem

Modern exascale and pre-exascale supercomputers pack up to 8 GPUs per node, connected by dedicated intra-node networks reaching a few terabits per second per direction (up to 3.6 Tb/s) and inter-node networks scaling to tens of thousands of GPUs (up to 75,000). Extracting the performance of this hardware in production is non-trivial because (1) intra-node interconnects span heterogeneous technologies (NVIDIA NVLink, AMD Infinity Fabric); (2) inter-node fabrics span HPE Slingshot-11 and NVIDIA InfiniBand HDR with different topologies (Dragonfly vs. Dragonfly+); (3) the user-facing software is split between GPU-Aware MPI and *CCL (NCCL on NVIDIA, RCCL on AMD), each with different defaults and maturity levels; and (4) at scale, network noise from competing jobs can severely degrade collective performance. Prior characterisations focus on a single technology, a single library, or fewer than 16 nodes; none compares the full multi-stack landscape at production scale.

Core Insight

Default software configurations leave substantial bandwidth unused on every system tested. The right choice depends on transfer size, communication pattern, library, and node count: *CCL is the better choice for collectives on most systems and sizes, while GPU-Aware MPI dominates inter-node point-to-point and small intra-node transfers. Topology matters too: Slingshot-based Dragonfly is largely immune to network noise, while Leonardo's InfiniBand HDR Dragonfly+ loses up to 50% of allreduce goodput at 1,024 GPUs to production noise, a loss that can be partially recovered by routing traffic to a non-default service level. Achieving good performance is a per-system, per-message-size, per-scale tuning problem, not a one-shot choice. The paper distils this empirical landscape into eight numbered observations spanning tuning, P2P, collectives, distance effects, and noise.

Method

The authors built a custom point-to-point and collective benchmark from scratch (because OSU lacks D2D copies and nccl-tests/rccl-tests lack per-iteration timing needed for noise studies) and ran it on three SC24-era flagship systems with four communication mechanisms:

Trivial Staging — host-pinned bounce buffer baseline (no pipelining).
Device-to-Device (D2D) Copy — IPC handles for direct async transfers.
*CCL — NCCL on Alps/Leonardo, RCCL on LUMI.
GPU-Aware MPI — Cray MPICH on Alps/LUMI, Open MPI (UCX) on Leonardo.
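
The reason per-iteration timing matters for noise studies can be illustrated with a minimal Python sketch. This is not the authors' benchmark: `time_iterations`, `fake_transfer`, and the warm-up count are hypothetical names and values chosen for illustration, and `time.perf_counter` stands in for the `MPI_Wtime` timer the paper reports using. A real GPU benchmark would also have to synchronise the device before reading the clock.

```python
import time
from typing import Callable, List


def time_iterations(transfer: Callable[[], None], reps: int, warmup: int = 5) -> List[float]:
    """Return one wall-clock duration (in seconds) per timed iteration.

    Keeping every sample, instead of a single average, is what makes it
    possible to see outliers caused by network noise.
    """
    # Warm-up iterations are an assumption of this sketch; they are not timed.
    for _ in range(warmup):
        transfer()

    samples: List[float] = []
    for _ in range(reps):
        start = time.perf_counter()
        transfer()
        samples.append(time.perf_counter() - start)
    return samples


def fake_transfer() -> None:
    """Hypothetical stand-in for a GPU-to-GPU transfer."""
    sum(range(1000))


samples = time_iterations(fake_transfer, reps=100)
print(f"min={min(samples):.2e}s max={max(samples):.2e}s over {len(samples)} iterations")
```

Reporting the minimum and maximum side by side, rather than only a mean, is the simplest way to expose the spread that an averaged timer would hide.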

Three systems were profiled, scaling up to 4,096 GPUs:

Alps — NVIDIA H100 (GH200) + Slingshot-11 Dragonfly, CSCS, early-access Santis partition.
Leonardo — NVIDIA A100 + InfiniBand HDR Dragonfly+, EuroHPC/CINECA, Booster partition.
LUMI — AMD MI250X (8 GCDs/node) + Slingshot-11 Dragonfly, EuroHPC/CSC, LUMI-G partition.

Performance tuning involved sweeping environment variables: NCCL_IGNORE_CPU_AFFINITY, NCCL_NET_GDR_LEVEL, and NCCL_NCHANNELS_PER_PEER for *CCL (RCCL on LUMI reads the same NCCL_* variables); MPICH_GPU_IPC_THRESHOLD and MPICH_GPU_ALLREDUCE_BLK_SIZE for Cray MPICH; HSA_ENABLE_SDMA for ROCm; LD_LIBRARY_PATH so that GDRCopy and UCX are picked up; and NCCL_IB_SL and UCX_IB_SL for InfiniBand service-level selection. Tuning required cooperation with HPC site teams and with Cray/HPE, NVIDIA, and AMD engineers, and took several days of investigation. The findings are distilled into eight numbered observations.

Experimental Setup

Parameter	Value
Systems	Alps (270 PFlop/s, #6), Leonardo (240 PFlop/s, #7), LUMI (380 PFlop/s, #5) — all June-2024 Top500
GPUs/node	4 H100 (Alps), 4 A100 (Leonardo), 4 MI250X = 8 GCDs (LUMI)
NICs/node	4× Cassini-1 200 Gb/s (Alps, LUMI), 4× ConnectX-6 100 Gb/s (Leonardo)
Intra-node fabric	NVLink 4.0 6×200 Gb/s = 1.2 Tb/s (Alps); NVLink 3.0 4×200 Gb/s = 800 Gb/s (Leonardo); Infinity Fabric 1–4×400 Gb/s (LUMI)
Inter-node fabric	Slingshot-11 Dragonfly (Alps, LUMI), InfiniBand HDR Dragonfly+ 23 groups × 180 nodes (Leonardo)
Mechanisms	Trivial Staging, D2D Copy, `*CCL`, GPU-Aware MPI
Software	Cray MPICH 8.1.28 + CUDA 12.3 (Alps); Open MPI 4.1.4 + UCX 1.13.0 + CUDA 12.1 (Leonardo); Cray MPICH 8.1.27 + ROCm 5.7.1 (LUMI)
Scale	Up to 512 nodes / 2,048 GPUs on Alps; up to 256 nodes / 1,024 GPUs on Leonardo; up to 4,096 GPUs on LUMI
Repetitions	100–1,000 per transfer size; max time / min goodput across ranks
Timer	`MPI_Wtime`, 25 ns resolution on LUMI/Leonardo, 30 ns on Alps
Metric	Unidirectional goodput (Gb/s) and latency (µs)
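
The "max time / min goodput across ranks" convention in the table can be sketched as follows. The function name `goodput_gbps` and the sample timings are hypothetical, and the use of decimal gigabits (1 Gb = 10^9 bits) is an assumption of this sketch rather than something the summary states.

```python
from typing import Sequence


def goodput_gbps(message_bytes: int, per_rank_seconds: Sequence[float]) -> float:
    """Goodput in Gb/s for one iteration, charged to the slowest rank.

    Taking the maximum time across ranks yields the minimum goodput,
    so the reported number is a conservative one.
    """
    slowest = max(per_rank_seconds)
    return message_bytes * 8 / slowest / 1e9  # assumption: 1 Gb = 1e9 bits


# Hypothetical example: a 1 GiB transfer where the slowest rank needs 25 ms.
example = goodput_gbps(2**30, [0.020, 0.025])
print(f"{example:.1f} Gb/s")
```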

Headline Quantitative Results

Tuning gains (Obs. 1)

Knob	System	Effect
`NCCL_IGNORE_CPU_AFFINITY=1`	Alps, LUMI	up to 1.6× alltoall, 6× allreduce (≥2 nodes)
`NCCL_NET_GDR_LEVEL=3`	Alps	2× alltoall, 3× allreduce
`NCCL_NCHANNELS_PER_PEER=32`	LUMI	3.5× P2P
`MPICH_GPU_IPC_THRESHOLD=1`	Alps	2× for transfers < 4 KiB
`MPICH_GPU_ALLREDUCE_BLK_SIZE=128 MiB`	Alps	+50% on single-node allreduce
`HSA_ENABLE_SDMA=0`	LUMI	up to 3×
Fix `LD_LIBRARY_PATH` for GDRCopy/UCX	Leonardo	up to 6× small msg
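
One way to keep such per-system settings manageable is a small lookup that is merged into the job environment. The sketch below is illustrative only: `TUNED_ENV` and `tuned_environment` are hypothetical names, the values are copied from the table above, and expressing the 128 MiB block size in bytes is an assumption. Leonardo is omitted because its fix is a site-specific `LD_LIBRARY_PATH` that the summary does not spell out.

```python
import os
from typing import Dict, Mapping, Optional

MIB = 1024 * 1024

# Values taken from the tuning table; the grouping by system is this sketch's choice.
TUNED_ENV: Dict[str, Dict[str, str]] = {
    "alps": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NET_GDR_LEVEL": "3",
        "MPICH_GPU_IPC_THRESHOLD": "1",
        # Assumption: the variable takes a size in bytes.
        "MPICH_GPU_ALLREDUCE_BLK_SIZE": str(128 * MIB),
    },
    "lumi": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NCHANNELS_PER_PEER": "32",
        "HSA_ENABLE_SDMA": "0",
    },
}


def tuned_environment(system: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: the current environment) with tuned knobs applied."""
    env = dict(os.environ if base is None else base)
    env.update(TUNED_ENV[system.lower()])
    return env


lumi_env = tuned_environment("LUMI", base={})
print(lumi_env)
```

The returned dictionary could be passed as the `env` argument of `subprocess.run` when launching a benchmark, which keeps the tuning explicit and reproducible rather than buried in shell profiles.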

Intra-node P2P (Obs. 2, 3)

GPU-Aware MPI achieves the highest goodput on every system. On Leonardo, GPU-Aware MPI medium-message goodput is up to 2× higher than NCCL. Trivial staging is up to an order of magnitude slower than every other mechanism. On LUMI, RCCL achieves less than half the goodput of MPI/D2D for some pairs (e.g. GPU 0 ↔ 5) because RCCL estimates available bandwidth from hop count rather than path diversity, which is adequate for collectives but underutilises the fabric for sparse P2P. Small-message latency is similar across mechanisms on Alps but differs on Leonardo and LUMI: Leonardo benefits from GDRCopy, while LUMI uses an optimised CPU-to-HBM memcpy for small same-node buffers (NVIDIA does not allow CPU load/store to GPU memory, so Alps does not benefit).

Intra-node collectives (Obs. 4)

*CCL beats GPU-Aware MPI in most cases, except small collectives on LUMI where GPU-Aware MPI is up to 3× faster. On Alps/Leonardo (full mesh) the optimal allreduce algorithm is a pipelined ternary tree with the expected peak equal to the sum of bandwidth of all outgoing GPU links. On LUMI (partial mesh) Rabenseifner with four edge-disjoint bidirectional rings on 400 Gb/s Infinity Fabric links gives an 800 Gb/s expected peak. Alltoall expected peaks: equal to GPU injection bandwidth on Alps/Leonardo; on LUMI the most-loaded link is GCD 1 ↔ GCD 5 (used by 4 paths), so per-pair alltoall peak is 100 Gb/s and per-GPU peak is 600 Gb/s.
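
Two of the expected peaks above follow from simple arithmetic, sketched here in Python. The helper names are hypothetical, and the sketch assumes that the per-link figures in the setup table (6×200 Gb/s on Alps, 4×200 Gb/s on Leonardo) describe the connection between each pair of GPUs in a 4-GPU full mesh, so that each GPU has three such outgoing connections.

```python
def full_mesh_allreduce_peak_gbps(gpus_per_node: int, per_pair_gbps: float) -> float:
    """Sum of the bandwidth of all outgoing links of one GPU in a full mesh."""
    return (gpus_per_node - 1) * per_pair_gbps


def per_pair_alltoall_peak_gbps(link_gbps: float, paths_sharing_link: int) -> float:
    """Bandwidth left to each path when several paths share the most-loaded link."""
    return link_gbps / paths_sharing_link


# Assumption: per-pair bandwidth is links_per_pair x 200 Gb/s, as read from the setup table.
alps_peak = full_mesh_allreduce_peak_gbps(4, 6 * 200)
leonardo_peak = full_mesh_allreduce_peak_gbps(4, 4 * 200)

# LUMI: the GCD 1 <-> GCD 5 link (400 Gb/s) is shared by 4 paths.
lumi_pair_peak = per_pair_alltoall_peak_gbps(400, 4)

print(alps_peak, leonardo_peak, lumi_pair_peak)
```

Under these assumptions the Alps figure matches the 3.6 Tb/s quoted in the problem statement, and the LUMI per-pair figure matches the 100 Gb/s stated above.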

Inter-node P2P (Obs. 5, 6)

MPI outperforms *CCL by up to 10× on small transfers and 3× on large transfers, regardless of buffer location, due to *CCL's GPU kernel-launch overhead. All systems reach 95% of theoretical peak at same-switch distance. Network distance matters more on Leonardo than on the Slingshot systems: on Alps/LUMI, crossing switches raises latency by 28% with only a 1% goodput drop, whereas on Leonardo, crossing groups roughly doubles latency (2.03 µs → 4.23 µs) and cuts goodput by 17% (395 → 328 Gb/s), with a 132 µs maximum latency and a 216 Gb/s minimum goodput. Slingshot's Ethernet-based protocol gives higher base latency than InfiniBand (3.66 µs vs. 1.02 µs, same-switch, host buffers).
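
The Leonardo distance figures can be cross-checked with a few lines of Python. The helper `relative_change` is a hypothetical name; the inputs are the same-switch and different-group values quoted above.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


# Leonardo, same-switch vs. different-group, values from the text above.
latency_ratio = 4.23 / 2.03                  # latency in microseconds
goodput_drop = -relative_change(395, 328)    # goodput in Gb/s

print(f"latency x{latency_ratio:.2f}, goodput -{goodput_drop:.0%}")
```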

Inter-node collectives (Obs. 7)

*CCL outperforms GPU-Aware MPI on large inter-node collectives on every system; the gap shrinks as GPU count grows because intra-node traffic becomes a smaller fraction of the total work. *CCL reaches ~75% efficiency at 1,024 GPUs on Alps and Leonardo (2 MiB alltoall). *CCL allreduce performance drops sharply between 256 and 512 GPUs on Alps and LUMI; this is not an algorithm switch, since the same drop appears with the algorithm fixed. NCCL/RCCL alltoall stalls at 512+ GPUs (also seen with the official nccl-tests/rccl-tests); allreduce is unaffected. On LUMI, RCCL beats MPI by up to 4× on large vectors but MPI is up to 10× faster on small collectives; the crossover is ~32 KiB. On Alps and Leonardo, NCCL outperforms GPU-Aware MPI regardless of message size and node count.
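
The inter-node findings (Obs. 5 to 7) can be condensed into a rough selection rule. The sketch below is a simplification under stated assumptions, not a recommendation from the paper: `pick_internode_library` is a hypothetical helper, the 32 KiB LUMI crossover is treated as a hard threshold although the text gives it only approximately, and the alltoall stall at 512+ GPUs is ignored.

```python
KIB = 1024
LUMI_CROSSOVER_BYTES = 32 * KIB  # approximate crossover reported for LUMI collectives


def pick_internode_library(system: str, operation: str, message_bytes: int) -> str:
    """Return "mpi" or "ccl" for an inter-node operation, following Obs. 5 to 7."""
    system = system.lower()
    if operation == "p2p":
        # MPI outperforms *CCL on inter-node point-to-point at every size.
        return "mpi"
    if operation in ("allreduce", "alltoall"):
        # On LUMI, MPI is faster below the crossover; elsewhere *CCL leads.
        if system == "lumi" and message_bytes < LUMI_CROSSOVER_BYTES:
            return "mpi"
        return "ccl"
    raise ValueError(f"unsupported operation: {operation!r}")


print(pick_internode_library("lumi", "allreduce", 4 * KIB))
print(pick_internode_library("leonardo", "allreduce", 4 * KIB))
```

A table-driven rule like this would have to be re-derived for each system and software release, which is the point the observations make.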

Network noise (Obs. 8)

On Leonardo at 1,024 GPUs, real production noise causes a 20% drop on alltoall and a 50% drop on allreduce. Routing traffic to a non-default IB service level (NCCL_IB_SL, UCX_IB_SL) reduces variability to <1%. The fix is fragile: it works only because production traffic is concentrated on the default SL today. Adaptive routing is enabled on every SL on Leonardo, so the gain is not attributable to enabling/disabling adaptive routing. A long-term solution requires improvements in adaptive-routing algorithms and Slurm-aware placement (Slurm already knows each node's switch and Dragonfly+ group).
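
The service-level mitigation amounts to setting two environment variables to the same non-default value. A minimal sketch follows; `service_level_env` is a hypothetical helper, the default of 0 reflects the summary's statement that production traffic on Leonardo sits on SL 0, and the value 1 in the example is purely illustrative because the summary does not say which alternate SL was used.

```python
from typing import Dict


def service_level_env(service_level: int, default_sl: int = 0) -> Dict[str, str]:
    """Build the NCCL and UCX settings that move traffic to a non-default IB service level."""
    # InfiniBand defines 16 service levels, numbered 0 to 15.
    if not 0 <= service_level <= 15:
        raise ValueError("InfiniBand service levels range from 0 to 15")
    if service_level == default_sl:
        raise ValueError("choose a service level other than the default")
    value = str(service_level)
    return {"NCCL_IB_SL": value, "UCX_IB_SL": value}


# Illustrative value only; the right SL depends on the site's fabric configuration.
sl_env = service_level_env(1)
print(sl_env)
```

Which service levels are usable is decided by the fabric administrators, so any such setting should be agreed with the site before use.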

Limitations

Three architectures only — no fat-tree characterisation; the discussion notes very large fat-trees may have higher latency than Dragonfly/Dragonfly+ due to greater diameter.
Alps was profiled on early-access (Santis partition, 512 nodes); some optimizations are pending and runtime was non-monotonic in message size.
Scale ceilings: Leonardo capped at 1,024 GPUs (256-node user limit); Alps GPU-Aware MPI capped at 2,048 GPUs (512-node early-access); only LUMI MPI ran to 4,096 GPUs.
*CCL alltoall stalls at 512+ GPUs on both NCCL and RCCL — full inter-node alltoall scalability could not be measured.
Production noise was studied only on Leonardo; Alps/LUMI Slingshot were assumed largely noise-immune based on prior work and the data in Sec. V-B.
The non-default-SL mitigation works only because Leonardo currently maps almost all production jobs to SL 0; if other workloads migrated to the same alternate SL, the gain would shrink.

Open Problems

Adaptive routing improvements for InfiniBand Dragonfly+ that mitigate noise without manual SL selection or Slurm-aware placement.
GPU-Aware MPI allreduce has room for improvement, especially the host-GPU interaction during data aggregation (Open MPI on Leonardo runs aggregation on the host, similar to the trivial staging baseline).
*CCL connection management at large scale: the alltoall stall at 512+ GPUs on both NCCL and RCCL likely stems from the number of active connections required for native alltoall.
RCCL bandwidth estimation on partial-mesh intra-node fabrics: it currently uses hop count rather than path diversity, underutilising sparse-pattern P2P (e.g. GPU 0 ↔ 5 on LUMI).
Sharp *CCL allreduce performance drop between 256 and 512 GPUs on Alps and LUMI is unexplained (not an algorithm switch).
Early-access tuning of Alps where runtime is non-monotonic in message size, suggesting the IPC threshold and GPU-Aware MPI kernel policies need further investigation.
Collective-algorithm optimisation gap — measured collective performance is consistently further from the expected peak than measured P2P performance, indicating headroom in the collective implementations themselves (Sec. IV-D).

Note on NCCL Tuning

The paper provides directly actionable NCCL configuration evidence for HPC GPU clusters. Setting NCCL_IGNORE_CPU_AFFINITY=1 improved allreduce by up to 6× and alltoall by up to 1.6× starting at two nodes; NCCL_NET_GDR_LEVEL=3 gave 2× alltoall / 3× allreduce on Alps; NCCL_NCHANNELS_PER_PEER=32 gave 3.5× on LUMI P2P (where the variable is read by RCCL); and NCCL_IB_SL, which maps NCCL traffic to a non-default InfiniBand service level, cut the production-noise penalty on Leonardo from an allreduce degradation of up to 50% to variability below 1% (Obs. 1, Obs. 8). These knobs are large-effect, system-specific, and not knowable from defaults, which is strong empirical support that NCCL tuning is per-system, per-message-size, and per-scale rather than a one-shot decision.