Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
De Sensi, Pichetti, Vella, De Matteis, Ren, Fusco, Turisini, Cesarini, Lust, Trivedi, Roweth, Spiga, Di Girolamo, Hoefler | Sapienza/ETH/Trento/VU/CINECA/HPE/NVIDIA | SC24, Atlanta GA, Nov 17–22 2024 | arXiv:2408.14090v2
Problem
Modern exascale and pre-exascale supercomputers pack up to 8 GPUs per
node, connected by dedicated intra-node networks reaching a few terabits
per second per direction (up to 3.6 Tb/s) and inter-node networks
scaling to tens of thousands of GPUs (up to 75,000). Extracting the full
performance of this hardware in production is non-trivial because (1) intra-node
interconnects span heterogeneous technologies (NVIDIA NVLink, AMD
Infinity Fabric); (2) inter-node fabrics span HPE Slingshot-11 and
NVIDIA InfiniBand HDR with different topologies (Dragonfly vs.
Dragonfly+); (3) the user-facing software is split between GPU-Aware MPI
and *CCL (NCCL on NVIDIA, RCCL on AMD), each with different
defaults and maturity levels; and (4) at scale, network noise from
competing jobs can severely degrade collective performance. Prior
characterisations focus on a single technology, single library, or fewer
than 16 nodes — none compares the full multi-stack landscape at
production scale.
Core Insight
Default software configurations leave large amounts of bandwidth unused
on every system tested. The best configuration depends on transfer
size, communication pattern, library, and node count: *CCL
is the better choice for collectives on most systems and sizes, while
GPU-Aware MPI dominates inter-node point-to-point and small intra-node
transfers. Topology matters too: Slingshot-based Dragonfly is largely
immune to network noise, while Leonardo's InfiniBand HDR Dragonfly+
loses up to 50% of allreduce goodput at 1,024 GPUs to production noise —
partially recoverable by routing traffic to a non-default service level.
Achieving good performance is a per-system, per-message-size, per-scale
tuning problem, not a one-shot choice. The paper distils this empirical
landscape into eight numbered observations spanning tuning, P2P,
collectives, distance effects, and noise.
Method
The authors built a custom point-to-point and collective benchmark
from scratch (because OSU lacks D2D copies and
nccl-tests/rccl-tests lack per-iteration
timing needed for noise studies) and ran it on three SC24-era flagship
systems with four communication mechanisms:
- Trivial Staging: host-pinned bounce buffer baseline (no pipelining).
- Device-to-Device (D2D) Copy: IPC handles for direct async transfers.
- *CCL: NCCL on Alps/Leonardo, RCCL on LUMI.
- GPU-Aware MPI: Cray MPICH on Alps/LUMI, Open MPI (UCX) on Leonardo.
Three systems were profiled, scaling up to 4,096 GPUs:
- Alps — NVIDIA H100 (GH200) + Slingshot-11 Dragonfly, CSCS, early-access Santis partition.
- Leonardo — NVIDIA A100 + InfiniBand HDR Dragonfly+, EuroHPC/CINECA, Booster partition.
- LUMI — AMD MI250X (8 GCDs/node) + Slingshot-11 Dragonfly, EuroHPC/CSC, LUMI-G partition.
Performance tuning involved sweeping environment variables
(NCCL_IGNORE_CPU_AFFINITY, NCCL_NET_GDR_LEVEL,
NCCL_NCHANNELS_PER_PEER,
MPICH_GPU_IPC_THRESHOLD,
MPICH_GPU_ALLREDUCE_BLK_SIZE, HSA_ENABLE_SDMA,
LD_LIBRARY_PATH for GDRCopy/UCX, NCCL_IB_SL
and UCX_IB_SL for InfiniBand service-level selection).
Tuning required cooperation with HPC site teams and Cray/HPE, NVIDIA,
and AMD engineers and took several days of investigation. The results are
distilled into eight numbered observations.
Experimental Setup
| Parameter | Value |
|---|---|
| Systems | Alps (270 PFlop/s, #6), Leonardo (240 PFlop/s, #7), LUMI (380 PFlop/s, #5) — all June-2024 Top500 |
| GPUs/node | 4 H100 (Alps), 4 A100 (Leonardo), 4 MI250X = 8 GCDs (LUMI) |
| NICs/node | 4× Cassini-1 200 Gb/s (Alps, LUMI), 4× ConnectX-6 100 Gb/s (Leonardo) |
| Intra-node fabric | NVLink 4.0 6×200 Gb/s = 1.2 Tb/s (Alps); NVLink 3.0 4×200 Gb/s = 800 Gb/s (Leonardo); IF 1-4×400 Gb/s (LUMI) |
| Inter-node fabric | Slingshot-11 Dragonfly (Alps, LUMI), InfiniBand HDR Dragonfly+ 23 groups × 180 nodes (Leonardo) |
| Mechanisms | Trivial Staging, D2D Copy, *CCL, GPU-Aware MPI |
| Software | Cray MPICH 8.1.28 + CUDA 12.3 (Alps); Open MPI 4.1.4 + UCX 1.13.0 + CUDA 12.1 (Leonardo); Cray MPICH 8.1.27 + ROCm 5.7.1 (LUMI) |
| Scale | Up to 512 nodes / 2,048 GPUs on Alps; up to 256 nodes / 1,024 GPUs on Leonardo; up to 4,096 GPUs on LUMI |
| Repetitions | 100–1,000 per transfer size; max time / min goodput across ranks |
| Timer | MPI_Wtime, 25 ns resolution on LUMI/Leonardo, 30 ns on Alps |
| Metric | Unidirectional goodput (Gb/s) and latency (us) |
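The "max time / min goodput across ranks" convention in the table can be illustrated with a short Python sketch. The function names and the sample timings are hypothetical, and the sketch assumes that Gb/s means 10^9 bits per second; it is not the paper's benchmark code.

```python
def goodput_gbps(message_bytes: int, seconds: float) -> float:
    """Unidirectional goodput in Gb/s, assuming 1 Gb = 10**9 bits."""
    return message_bytes * 8 / seconds / 1e9


def conservative_goodput(message_bytes, per_iteration_rank_times):
    """Return one goodput value per iteration.

    per_iteration_rank_times[i][r] is the time in seconds that rank r
    measured for iteration i. The slowest rank defines the iteration time,
    which is equivalent to reporting the minimum goodput across ranks.
    """
    return [
        goodput_gbps(message_bytes, max(rank_times))
        for rank_times in per_iteration_rank_times
    ]


# Hypothetical timings: 3 iterations of a 1 GiB transfer observed by 2 ranks.
sample_times = [[0.0215, 0.0220], [0.0218, 0.0216], [0.0230, 0.0219]]
per_iteration = conservative_goodput(1 << 30, sample_times)
```

Keeping one value per iteration, rather than a single average, is what makes it possible to study run-to-run variability later on.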
Headline Quantitative Results
Tuning gains (Obs. 1)
| Knob | System | Effect |
|---|---|---|
| NCCL_IGNORE_CPU_AFFINITY=1 | Alps, LUMI | up to 1.6× alltoall, 6× allreduce (≥2 nodes) |
| NCCL_NET_GDR_LEVEL=3 | Alps | 2× alltoall, 3× allreduce |
| NCCL_NCHANNELS_PER_PEER=32 | LUMI | 3.5× P2P |
| MPICH_GPU_IPC_THRESHOLD=1 | Alps | 2× for transfers < 4 KiB |
| MPICH_GPU_ALLREDUCE_BLK_SIZE=128 MiB | Alps | +50% on single-node allreduce |
| HSA_ENABLE_SDMA=0 | LUMI | up to 3× |
| Fix LD_LIBRARY_PATH for GDRCopy/UCX | Leonardo | up to 6× small msg |
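As a minimal sketch of how the knobs in the table could be applied, the Python snippet below groups them per system and merges them into a process environment. The `TUNED_ENV` mapping and the `tuned_environment` helper are hypothetical names; the values come from the table, except that expressing `MPICH_GPU_ALLREDUCE_BLK_SIZE` in bytes is an assumption. The Leonardo fix is omitted because it depends on site-specific library paths that the summary does not give.

```python
import os

MIB = 1024 * 1024

# Knob values from the tuning table; the per-system grouping is illustrative.
TUNED_ENV = {
    "alps": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NET_GDR_LEVEL": "3",
        "MPICH_GPU_IPC_THRESHOLD": "1",
        # Assumption: the 128 MiB block size is passed as a byte count.
        "MPICH_GPU_ALLREDUCE_BLK_SIZE": str(128 * MIB),
    },
    "lumi": {
        "NCCL_IGNORE_CPU_AFFINITY": "1",
        "NCCL_NCHANNELS_PER_PEER": "32",
        "HSA_ENABLE_SDMA": "0",
    },
}


def tuned_environment(system, base=None):
    """Return a copy of the base environment with the system's knobs applied."""
    env = dict(os.environ if base is None else base)
    env.update(TUNED_ENV[system.lower()])
    return env


lumi_env = tuned_environment("LUMI", base={"PATH": "/usr/bin"})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a benchmark, which keeps the tuning explicit and avoids mutating the parent shell.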
Intra-node P2P (Obs. 2, 3)
GPU-Aware MPI achieves the highest goodput on every system. On Leonardo, GPU-Aware MPI medium-message goodput is up to 2× higher than NCCL. Trivial staging is up to one order of magnitude lower than every other mechanism. On LUMI, RCCL achieves less than half the goodput of MPI/D2D in some pairs (e.g. GPU 0 ↔︎ 5) because RCCL estimates available bandwidth from hop count rather than path diversity — fine for collectives but underutilising for sparse P2P. Small-message latency is similar on Alps but differs on Leonardo and LUMI: Leonardo benefits from GDRCopy, while LUMI uses an optimized CPU-to-HBM memcpy for small same-node buffers (NVIDIA does not allow CPU load/store to GPU memory, so Alps does not benefit).
Intra-node collectives (Obs. 4)
*CCL beats GPU-Aware MPI in most cases, except small
collectives on LUMI where GPU-Aware MPI is up to 3× faster. On
Alps/Leonardo (full mesh) the optimal allreduce algorithm is a pipelined
ternary tree with the expected peak equal to the sum of bandwidth of all
outgoing GPU links. On LUMI (partial mesh) Rabenseifner with four
edge-disjoint bidirectional rings on 400 Gb/s IF gives an 800 Gb/s
expected peak. Alltoall expected peaks: equal to GPU injection bandwidth
on Alps/Leonardo; on LUMI the most-loaded link is GCD 1 ↔︎ GCD 5 (used by
4 paths), so per-pair alltoall peak is 100 Gb/s and per-GPU peak is 600
Gb/s.
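The expected-peak arithmetic above can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes that the per-link figures in the setup table (1.2 Tb/s on Alps, 800 Gb/s on Leonardo) are per GPU pair and per direction in a 4-GPU full mesh. Under that assumption the Alps value matches the 3.6 Tb/s quoted in the Problem section; the Leonardo value is derived here and is not stated in the summary.

```python
def full_mesh_injection_gbps(num_gpus, per_pair_gbps):
    """Sum of the bandwidth of all outgoing links of one GPU in a full mesh."""
    return (num_gpus - 1) * per_pair_gbps


def per_pair_alltoall_peak_gbps(link_gbps, paths_sharing_link):
    """Per-pair bound when the most-loaded link is shared by several paths."""
    return link_gbps / paths_sharing_link


alps_peak = full_mesh_injection_gbps(4, 1200)          # 3 peers x 1.2 Tb/s
leonardo_peak = full_mesh_injection_gbps(4, 800)       # 3 peers x 800 Gb/s
lumi_pair_peak = per_pair_alltoall_peak_gbps(400, 4)   # GCD 1 <-> GCD 5 link

print(alps_peak, leonardo_peak, lumi_pair_peak)  # 3600 2400 100.0
```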
Inter-node P2P (Obs. 5, 6)
MPI outperforms *CCL by up to 10× on small transfers and
3× on large transfers, regardless of buffer location, due to
*CCL's GPU kernel-launch overhead. All systems reach 95% of
theoretical peak at same-switch distance. Network distance affects the
systems differently: crossing switches on Alps/LUMI raises latency by 28%
with only a 1% goodput drop, whereas crossing groups on Leonardo doubles
latency (2.03us → 4.23us) and cuts goodput by 17% (395 → 328 Gb/s), with
a 132us maximum latency and a 216 Gb/s minimum goodput. Slingshot's
Ethernet-based protocol gives higher base latency than InfiniBand
(3.66us vs. 1.02us same-switch host-buffer).
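The relative figures quoted for Leonardo and for the Slingshot/InfiniBand comparison follow directly from the raw numbers, as this small Python check shows. The helper names are hypothetical; the inputs are the values reported above.

```python
def ratio(after, before):
    """Multiplicative change from 'before' to 'after'."""
    return after / before


def percent_drop(before, after):
    """Relative decrease from 'before' to 'after', in percent."""
    return (before - after) / before * 100


# Leonardo, same-switch vs. different-group.
latency_ratio = ratio(4.23, 2.03)        # roughly 2x
goodput_drop = percent_drop(395, 328)    # roughly 17%

# Same-switch host-buffer base latency, Slingshot vs. InfiniBand.
slingshot_vs_ib = ratio(3.66, 1.02)
```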
Inter-node collectives (Obs. 7)
*CCL outperforms GPU-Aware MPI on every system; the gap
shrinks as GPU count grows because intra-node communication becomes a smaller
fraction of total work. *CCL reaches ~75% efficiency at
1,024 GPUs on Alps and Leonardo (2 MiB alltoall). *CCL allreduce
performance drops sharply between 256 and 512 GPUs on Alps and
LUMI; this is not an algorithm switch, since the same drop appears with
the algorithm fixed. NCCL/RCCL alltoall stalls at 512+ GPUs (also seen
with the official nccl-tests/rccl-tests); allreduce is
unaffected. On LUMI, RCCL beats MPI up to 4× on large vectors but MPI is
up to 10× faster on small collectives; the crossover is ~32 KiB. On Alps
and Leonardo, NCCL outperforms GPU-Aware MPI regardless of message size
and node count.
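The inter-node collective findings can be condensed into a rule-of-thumb selector. The Python sketch below is only an illustration of Obs. 7 as summarised here: the function name is hypothetical, the 32 KiB crossover on LUMI is approximate, and the alltoall stall at 512+ GPUs is deliberately not modelled.

```python
KIB = 1024

# Approximate LUMI crossover between GPU-Aware MPI and RCCL (about 32 KiB).
LUMI_CROSSOVER_BYTES = 32 * KIB


def pick_collective_library(system, message_bytes):
    """Suggest a library for inter-node collectives on one of the three systems."""
    name = system.lower()
    if name in ("alps", "leonardo"):
        # NCCL was faster regardless of message size and node count.
        return "NCCL"
    if name == "lumi":
        # MPI wins on small collectives, RCCL on large vectors.
        if message_bytes < LUMI_CROSSOVER_BYTES:
            return "GPU-Aware MPI"
        return "RCCL"
    raise ValueError(f"no measurements for system: {system}")


small_on_lumi = pick_collective_library("LUMI", 4 * KIB)
large_on_lumi = pick_collective_library("LUMI", 2 * 1024 * KIB)
```

A selector like this is a starting point for measurement on a given machine, not a substitute for it, since the crossover point may move with library versions and node count.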
Network noise (Obs. 8)
On Leonardo at 1,024 GPUs, real production noise causes a 20% drop on
alltoall and a 50% drop on allreduce. Routing traffic to a non-default
IB service level (NCCL_IB_SL, UCX_IB_SL)
reduces variability to <1%. The fix is fragile: it works only because
production traffic is concentrated on the default SL today. Adaptive
routing is enabled on every SL on Leonardo, so the gain is not
attributable to enabling/disabling adaptive routing. A long-term
solution requires improvements in adaptive-routing algorithms and
Slurm-aware placement (Slurm already knows each node's switch and
Dragonfly+ group).
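A minimal sketch of the service-level mitigation, assuming the job is launched from Python: both `NCCL_IB_SL` and `UCX_IB_SL` are set to the same non-default value. The helper name and the choice of SL 1 are hypothetical, since the summary does not say which alternate service level was used; the default of 0 and the 0-15 range are properties of InfiniBand service levels.

```python
import os

DEFAULT_SL = 0   # production traffic on Leonardo is concentrated here
MAX_SL = 15      # InfiniBand service levels are 4-bit values


def with_service_level(service_level, base=None):
    """Return an environment that routes NCCL and UCX traffic to one IB SL."""
    if not DEFAULT_SL < service_level <= MAX_SL:
        raise ValueError("choose a non-default service level between 1 and 15")
    env = dict(os.environ if base is None else base)
    env["NCCL_IB_SL"] = str(service_level)
    env["UCX_IB_SL"] = str(service_level)
    return env


# Hypothetical choice of SL 1; the right value is site-specific.
env = with_service_level(1, base={})
```

As the text notes, the benefit depends on other jobs staying on the default service level, so any value chosen this way should be agreed with the site operators.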
Limitations
- Three architectures only — no fat-tree characterisation; the discussion notes very large fat-trees may have higher latency than Dragonfly/Dragonfly+ due to greater diameter.
- Alps was profiled on early-access (Santis partition, 512 nodes); some optimizations are pending and runtime was non-monotonic in message size.
- Scale ceilings: Leonardo capped at 1,024 GPUs (256-node user limit); Alps GPU-Aware MPI capped at 2,048 GPUs (512-node early-access); only LUMI MPI ran to 4,096 GPUs.
- *CCL alltoall stalls at 512+ GPUs on both NCCL and RCCL, so full inter-node alltoall scalability could not be measured.
- Production noise was studied only on Leonardo; Alps/LUMI Slingshot were assumed largely noise-immune based on prior work and the data in Sec. V-B.
- The non-default-SL mitigation works only because Leonardo currently maps almost all production jobs to SL 0; if other workloads migrated to the same alternate SL, the gain would shrink.
Open Problems
- Adaptive routing improvements for InfiniBand Dragonfly+ that mitigate noise without manual SL selection or Slurm-aware placement.
- GPU-Aware MPI allreduce has room for improvement, especially the host-GPU interaction during data aggregation (Open MPI on Leonardo runs aggregation on the host, similar to the trivial staging baseline).
- *CCL connection management at large scale: the alltoall stall at 512+ GPUs on both NCCL and RCCL likely stems from the number of active connections required for native alltoall.
- RCCL bandwidth estimation on partial-mesh intra-node fabrics: it currently uses hop count rather than path diversity, underutilising sparse-pattern P2P (e.g. GPU 0 ↔︎ 5 on LUMI).
- Sharp *CCL allreduce performance drop between 256 and 512 GPUs on Alps and LUMI is unexplained (not an algorithm switch).
- Early-access tuning of Alps where runtime is non-monotonic in message size, suggesting the IPC threshold and GPU-Aware MPI kernel policies need further investigation.
- Collective-algorithm optimisation gap — measured collective performance is consistently further from the expected peak than measured P2P performance, indicating headroom in the collective implementations themselves (Sec. IV-D).
Note on NCCL Tuning
The paper provides directly actionable NCCL configuration evidence
for HPC GPU clusters. Setting NCCL_IGNORE_CPU_AFFINITY=1
improved allreduce by up to 6× and alltoall by up to 1.6× starting at
two nodes; NCCL_NET_GDR_LEVEL=3 gave 2× alltoall / 3×
allreduce on Alps; NCCL_NCHANNELS_PER_PEER=32 gave 3.5× on
LUMI P2P; and NCCL_IB_SL (mapping NCCL traffic to a
non-default InfiniBand service level) reduced production-noise
variability from up to 50% allreduce degradation to <1% on Leonardo
(Obs. 1, Obs. 8). These knobs are large-effect, system-specific, and not
knowable from defaults — strong empirical support that NCCL tuning is
per-system, per-message-size, and per-scale rather than a one-shot
decision.