HiCCL: A Hierarchical Collective Communication Library — Detailed Summary
Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken | Stanford, Sandia, SLAC, UIUC, Nvidia | IPDPS 2025
Per-section summary organized by paper structure. Each section includes paragraph-level bullet points.
Abstract
- HPC network architectures have evolved into deep multilevel hierarchies (intra-die, intra-package, intra-node NVLink/IF, multi-NIC inter-node, multi-island), and writing one optimized collective per system per vendor is no longer tractable.
- Existing libraries either lack portability (NCCL, RCCL, OneCCL — vertically integrated per vendor) or lack performance (GPU-aware MPI — generic, fails to saturate hierarchical fabrics).
- HiCCL decouples collective logic from network optimization through a compositional API built on three primitives: multicast, reduction, and fence.
- A specified network hierarchy is used to factorize primitives into point-to-point operations; striping and pipelining optimizations streamline execution.
- HiCCL achieves 17x geomean throughput over specialized GPU-aware MPI, and meets or beats vendor libraries (NCCL, RCCL, OneCCL) — while preserving cross-vendor portability with a single source description.
1. Introduction
Why hierarchy matters:
- Bandwidth in modern clusters is highly non-uniform: NVLink/NVSwitch deliver hundreds of GB/s intra-node, while inter-node Slingshot/IB delivers tens to a few hundred GB/s spread across multiple NICs.
- Naive collective implementations move data redundantly across the slowest level of the hierarchy. Efficient ones move one copy across the slow link and redistribute it within fast levels.
Two gaps in existing libraries:
- Performance gap (MPI): GPU-aware MPI is general but misses hierarchy- specific optimizations and cannot saturate leadership-class fabrics.
- Portability gap (vendor libs): NCCL/RCCL/OneCCL are well-tuned but locked to one vendor; an All-Reduce optimized for Nvidia must be redesigned for AMD or Intel.
HiCCL's positioning: a single user-level description that compiles into hierarchy-aware schedules across vendors, calling whichever backend (NCCL, RCCL, MPI, IPC) is appropriate at each level.
2. Background and Motivation
Hierarchy levels in modern systems:
- GPU dies (e.g., MI250x has 2 dies per package).
- GPU devices within a node.
- NUMA / multi-CPU groupings within a node.
- Nodes connected by inter-node fabric.
- Racks / islands connected by higher-level switches.
Direct vs. hierarchical communication (Figure 1):
- Direct: every cross-node send goes over the slow link individually — slow link is replicated p times.
- Hierarchical: one copy crosses the slow link, then is redistributed inside the destination node using fast intra-node links.
Vendor library limits:
- NCCL implements ring/tree algorithms tuned to Nvidia hardware; switching to AMD requires RCCL with potentially different algorithm choices.
- OneCCL on Intel PVC is significantly behind NCCL/RCCL on its respective hardware (motivating the 12.1x speedup HiCCL achieves over OneCCL).
3. HiCCL Programming Model
3.1 Three Compositional Primitives
| Primitive | Signature | Semantics |
|---|---|---|
| Multicast | M(i, j, d) |
Root GPU i sends d bytes to set of leaf GPUs j |
| Reduction | R(i, j, d, op) |
Leaf GPUs i are reduced (op) to root GPU j |
| Fence | fence() |
Barrier expressing dependency: subsequent primitives wait for prior step |
- All collectives are expressed as compositions of these three.
- Multicasts and reductions registered between fences execute in parallel.
- Fences enforce data dependencies between steps.
3.2 Programming via
HiCCL::Comm<T>
- A
HiCCL::Comm<T>object holds the compositional spec. - Users register primitives, optionally separated by
fence()for multi-step collectives. - Example — All-Reduce of size N over p GPUs:
- Step 1: register p reductions (each GPU i collects a 1/p slice into itself from all others) → effectively a Reduce-Scatter.
fence().- Step 2: register p multicasts (each GPU broadcasts its 1/p slice to all others) → effectively an All-Gather.
3.3 Library API Surface
- HiCCL exposes a small, user-level C++ API.
- The same source description compiles for Nvidia, AMD, and Intel backends — no vendor-conditional code in the user program.
4. Optimization Engine
4.1 The Five Optimization Parameters (Action Space)
| # | Parameter | Type | What it controls |
|---|---|---|---|
| 1 | Hierarchy factorization | Vector of ints (f1, f2, ..., fk) | How p GPUs are grouped at each level (e.g., {2, 6, 2} for 24 GPUs) |
| 2 | Backend library per level | Vector of enums (L1, L2, ..., Lk) | Which library handles each level (IPC / NCCL / RCCL / OneCCL / MPI) |
| 3 | Striping factor s | Int | Number of parallel stripes (multi-NIC / multi-rail exploitation) |
| 4 | Ring size n | Int | Virtual ring topology size across nodes |
| 5 | Pipeline depth m | Int | Number of chunks for inter-level overlap |
Critical quote: "HiCCL does not automatically select these parameters... however, we found that we were able to reuse the same description of the network hierarchy for all collective communication operations on a particular machine."
4.2 Compilation / Lowering Pipeline
- Factorization: each high-level M/R primitive is recursively broken down along the hierarchy vector — a global multicast becomes a tree of nested multicasts at each level.
- Mapping: each level is assigned a backend library;
the corresponding library calls (e.g.,
ncclBroadcast,MPI_Isend,cudaMemcpyPeerAsync) are emitted. - Scheduling: HiCCL builds a DAG of point-to-point sends/recvs honoring fences and per-level dependencies.
- Striping: payload is split into s parallel stripes, each routed through a different NIC/path; required to saturate multi-rail nodes.
- Pipelining: payload further partitioned into m channels; channels are issued in warm-up, steady-state, wind-down phases so cheap intra-node hops are overlapped with expensive inter-node hops.
4.3 Backend Integration
- HiCCL is not a transport. It schedules and emits
calls to existing backends:
- Inter-node: MPI, NCCL, RCCL, OneCCL.
- Intra-node: IPC primitives
(
cudaMemcpy, HIP IPC, Intel Level Zero IPC) for direct GPU-to-GPU copies bypassing host memory.
- This is why HiCCL is portable: at each level of any machine, some backend exists, and HiCCL emits to it.
4.4 Pipelining Mechanics
- Pipeline depth m trades latency for throughput: small m → low latency, low utilization; large m → high utilization but each chunk is small.
- The paper sweeps m and shows m = 32 is often optimal on Perlmutter.
- Hiding intra-node ops behind inter-node transfer requires steady-state where intra-node time < inter-node time per chunk; this constrains m as a function of bandwidth ratios at adjacent levels.
5. Composition Examples
5.1 All-Reduce Composition (Figure 4)
- DAG: 3 reductions → fence → 3 multicasts (for p=3 GPUs).
- Generalizes: All-Reduce = Reduce-Scatter (R primitives) + fence + All-Gather (M primitives).
5.2 Other Standard Collectives
| Collective | Composition |
|---|---|
| Broadcast | Single M primitive (recursively factored) |
| Reduce | Single R primitive (recursively factored) |
| Reduce-Scatter | p R primitives in one step |
| All-Gather | p M primitives in one step |
| All-Reduce | Reduce-Scatter + fence + All-Gather |
| Scatter | p M primitives, root = source |
| Gather | p R primitives, root = sink |
| All-to-All | p^2 personalized M primitives |
5.3 Tree Structures (Figure 5)
- For p=24 GPUs, multiple factorizations are valid:
- {24} — flat, 1 level.
- {4, 6} — 4 nodes, 6 GPUs each.
- {2, 6, 2} — 2 racks, 6 nodes, 2 GPUs each.
- {2, 2, 3, 2} — 4-level hierarchy.
- Choice depends on physical topology and bandwidth per level.
6. Evaluation
6.1 Hardware Platforms
| System | Vendor | GPUs/node | NICs/node | NIC bandwidth |
|---|---|---|---|---|
| Delta | Nvidia A100 | 4 | 1 | 25 GB/s Slingshot |
| Perlmutter | Nvidia A100 | 4 | 4 | 100 GB/s Slingshot |
| Frontier | AMD MI250x (16 dies) | 8 | 4 | 100 GB/s Slingshot |
| Aurora | Intel PVC | 12 | 8 | 200 GB/s Slingshot |
6.2 Methodology
- Collectives tested: All-Reduce, All-Gather, Reduce-Scatter, Broadcast, Reduce, Gather, Scatter, All-to-All.
- Message sizes: focus on large-message (> 1 MB) throughput regime.
- Baselines: Cray MPI / OpenMPI / MPICH (GPU-aware), NCCL (Delta, Perlmutter), RCCL (Frontier), OneCCL (Aurora).
- Metric: effective throughput in GB/s at the transport level.
6.3 Headline Results
- vs. GPU-aware MPI: 12.5x – 48.02x speedup, 17x geomean.
- vs. NCCL (Delta): 1.26x.
- vs. RCCL (Frontier): 1.55x.
- vs. OneCCL (Aurora): 12.1x — vendor library is far behind.
- Approaches theoretical interconnect bandwidth bound on each system.
6.4 Pipeline Depth Sweep
- Sweeping m demonstrates the textbook pipelining tradeoff: throughput rises with m up to a point where intra-node overhead is fully hidden, then saturates.
- m = 32 is typical optimum on Perlmutter for All-Reduce.
6.5 Scaling (Figure 10)
- All-Reduce scales to 256 nodes.
- At extreme scale, pipelining becomes more valuable: more hierarchy levels amplify the cost of unhidden serial work.
6.6 Auto-tuning Discussion
- HiCCL does not automatically select parameters in this paper.
- Parameters are largely machine-dependent (factorization, backend mapping) and are chosen once per machine; per-collective tuning of (s, m) is the active knob.
7. Limitations and Future Work
- Latency regimes: HiCCL's optimization targets large-message throughput; small-message latency can still favor latency-tuned MPI implementations, particularly above 256 nodes where serial fence cost dominates.
- Manual parameter selection: the five knobs (factorization, backend mapping, striping, ring size, pipeline depth) are user-supplied. The paper notes that automated selection / auto-tuning is the natural next step.
- Reduction kernels: HiCCL uses generic GPU reduction kernels; NCCL's stream-fused reductions retain a small advantage in specific scenarios.
8. Key Figures
| Figure | Content |
|---|---|
| Fig 1 | Direct vs. hierarchical communication illustration |
| Fig 4 | All-Reduce DAG: 3 reductions + fence + 3 multicasts |
| Fig 5 | Tree factorization options for p = 24 GPUs |
| Fig 8 | Headline throughput bars: HiCCL approaches hardware bounds across all 4 systems |
| Fig 10 | All-Reduce scaling up to 256 nodes; effect of pipelining |
9. Decision Points (Action Space) — Distilled
For DynamICCL/RL purposes, HiCCL effectively defines an action space with 5 dimensions, applied per collective on a given machine:
| Dimension | Cardinality | Coupling |
|---|---|---|
| Hierarchy factorization (f1...fk) | Combinatorial in p | Machine-dependent; usually fixed per cluster |
| Backend mapping (L1...Lk) | small (3-5 backends per level) | Per-level |
| Striping factor s | small int (1..#NICs) | Per-collective |
| Ring size n | small int | Per-collective |
| Pipeline depth m | int (e.g., 1..64) | Per-collective; tightly coupled to message size |
Key insight: HiCCL keeps factorization and backend-mapping "machine-fixed" and treats (s, n, m) as the per-collective tuning surface. This neatly separates machine description from workload-specific decisions — exactly the split DynamICCL needs.
Relevance to DynamICCL — Mapping Table
DynamICCL is an RL-based NCCL configuration optimizer that selects per- collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters.
Direct mapping: HiCCL knobs → DynamICCL action space
| HiCCL concept | DynamICCL analog | Mapping notes |
|---|---|---|
| Hierarchy factorization (f1..fk) | Topology embedding in observation | Fixed per cluster; passed as observation feature, not learned |
| Backend mapping per level | Algorithm choice per level (Ring/Tree per scope) | Agent-2 outputs per-level algorithm; same shape as HiCCL's L vector |
| Striping factor s | Multi-rail / multi-NIC selection | Becomes an action only on multi-NIC HPC fabrics; degenerate on 1GbE |
| Ring size n | NCCL ring size / nChannels coupling | Maps directly to nChannels structure in NCCL |
| Pipeline depth m | NCCL chunkSize / pipelining depth | Coupled with nChannels; trades latency vs. throughput |
| Multicast primitive M | Per-phase Broadcast/All-Gather subroutine | Can be timed independently for dense reward signal |
| Reduction primitive R | Per-phase Reduce/Reduce-Scatter subroutine | Same — phase-level reward decomposition |
| Fence | Phase boundary in collective | Natural reward checkpoint |
Lessons for DynamICCL
| # | Lesson | Why it matters |
|---|---|---|
| 1 | Per-level factored action space | NCCL's flat (algo, proto, nChannels, numThreads) collapses hierarchy info; HiCCL shows actions should be per-hierarchy-level. Agent-2 can output a vector of per-level decisions, dramatically shrinking effective action-space cardinality |
| 2 | Pipeline depth m as a first-class knob | NCCL's nChannels and chunkSize are functionally a pipeline-depth knob; HiCCL's m=32 sweet spot validates that this is a small-integer search dimension, not a large continuous one |
| 3 | Striping factor s for multi-rail exploitation | On 1GbE Chameleon clusters this is degenerate, but on multi-NIC HPC nodes it becomes an essential action dimension — relevant for DynamICCL's eventual port to Frontier-class clusters |
| 4 | Compositional decomposition for reward shaping | M/R/fence decomposition lets DynamICCL reward each phase of an All-Reduce separately (Reduce-Scatter time + All-Gather time) instead of only end-to-end completion time → denser, less variance-prone reward |
| 5 | Manual knobs in HiCCL = open problem DynamICCL solves | HiCCL explicitly leaves auto-tuning to future work and observes that knobs are stable per machine; this is exactly the niche an RL controller fills, with the option to specialize per (collective, message-size) bucket |
| 6 | Machine description vs. workload decision split | HiCCL fixes factorization+backend per machine and tunes (s, n, m) per call; DynamICCL should mirror this: cluster topology in observation, per-collective action only over the dynamic knobs |
| 7 | Backend portability informs ablation | HiCCL emits to NCCL/RCCL/OneCCL/MPI/IPC at each level; DynamICCL, currently NCCL-only, can frame this as a future action dimension (which library at each level) |
| 8 | Throughput vs. latency regime split | HiCCL targets throughput; DynamICCL must distinguish small-message latency-bound regimes from large-message throughput regimes — possibly via separate policies or message-size as observation feature |