HiCCL: A Hierarchical Collective Communication Library — Detailed Summary

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken | Stanford, Sandia, SLAC, UIUC, Nvidia | IPDPS 2025

Per-section summary organized by paper structure. Each section includes paragraph-level bullet points.

Abstract

HPC network architectures have evolved into deep multilevel hierarchies (intra-die, intra-package, intra-node NVLink/IF, multi-NIC inter-node, multi-island), and writing one optimized collective per system per vendor is no longer tractable.
Existing libraries either lack portability (NCCL, RCCL, OneCCL — vertically integrated per vendor) or lack performance (GPU-aware MPI — generic, fails to saturate hierarchical fabrics).
HiCCL decouples collective logic from network optimization through a compositional API built on three primitives: multicast, reduction, and fence.
A specified network hierarchy is used to factorize primitives into point-to-point operations; striping and pipelining optimizations streamline execution.
HiCCL achieves 17x geomean throughput over specialized GPU-aware MPI, and meets or beats vendor libraries (NCCL, RCCL, OneCCL) — while preserving cross-vendor portability with a single source description.

1. Introduction

Why hierarchy matters:

Bandwidth in modern clusters is highly non-uniform: NVLink/NVSwitch deliver hundreds of GB/s intra-node, while inter-node Slingshot/IB delivers tens to a few hundred GB/s spread across multiple NICs.
Naive collective implementations move data redundantly across the slowest level of the hierarchy. Efficient ones move one copy across the slow link and redistribute it within fast levels.

Two gaps in existing libraries:

Performance gap (MPI): GPU-aware MPI is general but misses hierarchy- specific optimizations and cannot saturate leadership-class fabrics.
Portability gap (vendor libs): NCCL/RCCL/OneCCL are well-tuned but locked to one vendor; an All-Reduce optimized for Nvidia must be redesigned for AMD or Intel.

HiCCL's positioning: a single user-level description that compiles into hierarchy-aware schedules across vendors, calling whichever backend (NCCL, RCCL, MPI, IPC) is appropriate at each level.

2. Background and Motivation

Hierarchy levels in modern systems:

GPU dies (e.g., MI250x has 2 dies per package).
GPU devices within a node.
NUMA / multi-CPU groupings within a node.
Nodes connected by inter-node fabric.
Racks / islands connected by higher-level switches.

Direct vs. hierarchical communication (Figure 1):

Direct: every cross-node send goes over the slow link individually — slow link is replicated p times.
Hierarchical: one copy crosses the slow link, then is redistributed inside the destination node using fast intra-node links.

Vendor library limits:

NCCL implements ring/tree algorithms tuned to Nvidia hardware; switching to AMD requires RCCL with potentially different algorithm choices.
OneCCL on Intel PVC is significantly behind NCCL/RCCL on its respective hardware (motivating the 12.1x speedup HiCCL achieves over OneCCL).

3. HiCCL Programming Model

3.1 Three Compositional Primitives

Primitive	Signature	Semantics
Multicast	`M(i, j, d)`	Root GPU i sends d bytes to set of leaf GPUs j
Reduction	`R(i, j, d, op)`	Leaf GPUs i are reduced (op) to root GPU j
Fence	`fence()`	Barrier expressing dependency: subsequent primitives wait for prior step

All collectives are expressed as compositions of these three.
Multicasts and reductions registered between fences execute in parallel.
Fences enforce data dependencies between steps.

3.2 Programming via `HiCCL::Comm<T>`

A HiCCL::Comm<T> object holds the compositional spec.
Users register primitives, optionally separated by fence() for multi-step collectives.
Example — All-Reduce of size N over p GPUs:
- Step 1: register p reductions (each GPU i collects a 1/p slice into itself from all others) → effectively a Reduce-Scatter.
- fence().
- Step 2: register p multicasts (each GPU broadcasts its 1/p slice to all others) → effectively an All-Gather.

3.3 Library API Surface

HiCCL exposes a small, user-level C++ API.
The same source description compiles for Nvidia, AMD, and Intel backends — no vendor-conditional code in the user program.

4. Optimization Engine

4.1 The Five Optimization Parameters (Action Space)

#	Parameter	Type	What it controls
1	Hierarchy factorization	Vector of ints (f1, f2, ..., fk)	How p GPUs are grouped at each level (e.g., {2, 6, 2} for 24 GPUs)
2	Backend library per level	Vector of enums (L1, L2, ..., Lk)	Which library handles each level (IPC / NCCL / RCCL / OneCCL / MPI)
3	Striping factor s	Int	Number of parallel stripes (multi-NIC / multi-rail exploitation)
4	Ring size n	Int	Virtual ring topology size across nodes
5	Pipeline depth m	Int	Number of chunks for inter-level overlap

Critical quote: "HiCCL does not automatically select these parameters... however, we found that we were able to reuse the same description of the network hierarchy for all collective communication operations on a particular machine."

4.2 Compilation / Lowering Pipeline

Factorization: each high-level M/R primitive is recursively broken down along the hierarchy vector — a global multicast becomes a tree of nested multicasts at each level.
Mapping: each level is assigned a backend library; the corresponding library calls (e.g., ncclBroadcast, MPI_Isend, cudaMemcpyPeerAsync) are emitted.
Scheduling: HiCCL builds a DAG of point-to-point sends/recvs honoring fences and per-level dependencies.
Striping: payload is split into s parallel stripes, each routed through a different NIC/path; required to saturate multi-rail nodes.
Pipelining: payload further partitioned into m channels; channels are issued in warm-up, steady-state, wind-down phases so cheap intra-node hops are overlapped with expensive inter-node hops.

4.3 Backend Integration

HiCCL is not a transport. It schedules and emits calls to existing backends:
- Inter-node: MPI, NCCL, RCCL, OneCCL.
- Intra-node: IPC primitives (cudaMemcpy, HIP IPC, Intel Level Zero IPC) for direct GPU-to-GPU copies bypassing host memory.
This is why HiCCL is portable: at each level of any machine, some backend exists, and HiCCL emits to it.

4.4 Pipelining Mechanics

Pipeline depth m trades latency for throughput: small m → low latency, low utilization; large m → high utilization but each chunk is small.
The paper sweeps m and shows m = 32 is often optimal on Perlmutter.
Hiding intra-node ops behind inter-node transfer requires steady-state where intra-node time < inter-node time per chunk; this constrains m as a function of bandwidth ratios at adjacent levels.

5. Composition Examples

5.1 All-Reduce Composition (Figure 4)

DAG: 3 reductions → fence → 3 multicasts (for p=3 GPUs).
Generalizes: All-Reduce = Reduce-Scatter (R primitives) + fence + All-Gather (M primitives).

5.2 Other Standard Collectives

Collective	Composition
Broadcast	Single M primitive (recursively factored)
Reduce	Single R primitive (recursively factored)
Reduce-Scatter	p R primitives in one step
All-Gather	p M primitives in one step
All-Reduce	Reduce-Scatter + fence + All-Gather
Scatter	p M primitives, root = source
Gather	p R primitives, root = sink
All-to-All	p^2 personalized M primitives

5.3 Tree Structures (Figure 5)

For p=24 GPUs, multiple factorizations are valid:
- {24} — flat, 1 level.
- {4, 6} — 4 nodes, 6 GPUs each.
- {2, 6, 2} — 2 racks, 6 nodes, 2 GPUs each.
- {2, 2, 3, 2} — 4-level hierarchy.
Choice depends on physical topology and bandwidth per level.

6. Evaluation

6.1 Hardware Platforms

System	Vendor	GPUs/node	NICs/node	NIC bandwidth
Delta	Nvidia A100	4	1	25 GB/s Slingshot
Perlmutter	Nvidia A100	4	4	100 GB/s Slingshot
Frontier	AMD MI250x (16 dies)	8	4	100 GB/s Slingshot
Aurora	Intel PVC	12	8	200 GB/s Slingshot

6.2 Methodology

Collectives tested: All-Reduce, All-Gather, Reduce-Scatter, Broadcast, Reduce, Gather, Scatter, All-to-All.
Message sizes: focus on large-message (> 1 MB) throughput regime.
Baselines: Cray MPI / OpenMPI / MPICH (GPU-aware), NCCL (Delta, Perlmutter), RCCL (Frontier), OneCCL (Aurora).
Metric: effective throughput in GB/s at the transport level.

6.3 Headline Results

vs. GPU-aware MPI: 12.5x – 48.02x speedup, 17x geomean.
vs. NCCL (Delta): 1.26x.
vs. RCCL (Frontier): 1.55x.
vs. OneCCL (Aurora): 12.1x — vendor library is far behind.
Approaches theoretical interconnect bandwidth bound on each system.

6.4 Pipeline Depth Sweep

Sweeping m demonstrates the textbook pipelining tradeoff: throughput rises with m up to a point where intra-node overhead is fully hidden, then saturates.
m = 32 is typical optimum on Perlmutter for All-Reduce.

6.5 Scaling (Figure 10)

All-Reduce scales to 256 nodes.
At extreme scale, pipelining becomes more valuable: more hierarchy levels amplify the cost of unhidden serial work.

6.6 Auto-tuning Discussion

HiCCL does not automatically select parameters in this paper.
Parameters are largely machine-dependent (factorization, backend mapping) and are chosen once per machine; per-collective tuning of (s, m) is the active knob.

7. Limitations and Future Work

Latency regimes: HiCCL's optimization targets large-message throughput; small-message latency can still favor latency-tuned MPI implementations, particularly above 256 nodes where serial fence cost dominates.
Manual parameter selection: the five knobs (factorization, backend mapping, striping, ring size, pipeline depth) are user-supplied. The paper notes that automated selection / auto-tuning is the natural next step.
Reduction kernels: HiCCL uses generic GPU reduction kernels; NCCL's stream-fused reductions retain a small advantage in specific scenarios.

8. Key Figures

Figure	Content
Fig 1	Direct vs. hierarchical communication illustration
Fig 4	All-Reduce DAG: 3 reductions + fence + 3 multicasts
Fig 5	Tree factorization options for p = 24 GPUs
Fig 8	Headline throughput bars: HiCCL approaches hardware bounds across all 4 systems
Fig 10	All-Reduce scaling up to 256 nodes; effect of pipelining

9. Decision Points (Action Space) — Distilled

For DynamICCL/RL purposes, HiCCL effectively defines an action space with 5 dimensions, applied per collective on a given machine:

Dimension	Cardinality	Coupling
Hierarchy factorization (f1...fk)	Combinatorial in p	Machine-dependent; usually fixed per cluster
Backend mapping (L1...Lk)	small (3-5 backends per level)	Per-level
Striping factor s	small int (1..#NICs)	Per-collective
Ring size n	small int	Per-collective
Pipeline depth m	int (e.g., 1..64)	Per-collective; tightly coupled to message size

Key insight: HiCCL keeps factorization and backend-mapping "machine-fixed" and treats (s, n, m) as the per-collective tuning surface. This neatly separates machine description from workload-specific decisions — exactly the split DynamICCL needs.

Relevance to DynamICCL — Mapping Table

DynamICCL is an RL-based NCCL configuration optimizer that selects per- collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters.

Direct mapping: HiCCL knobs → DynamICCL action space

HiCCL concept	DynamICCL analog	Mapping notes
Hierarchy factorization (f1..fk)	Topology embedding in observation	Fixed per cluster; passed as observation feature, not learned
Backend mapping per level	Algorithm choice per level (Ring/Tree per scope)	Agent-2 outputs per-level algorithm; same shape as HiCCL's L vector
Striping factor s	Multi-rail / multi-NIC selection	Becomes an action only on multi-NIC HPC fabrics; degenerate on 1GbE
Ring size n	NCCL ring size / nChannels coupling	Maps directly to nChannels structure in NCCL
Pipeline depth m	NCCL chunkSize / pipelining depth	Coupled with nChannels; trades latency vs. throughput
Multicast primitive M	Per-phase Broadcast/All-Gather subroutine	Can be timed independently for dense reward signal
Reduction primitive R	Per-phase Reduce/Reduce-Scatter subroutine	Same — phase-level reward decomposition
Fence	Phase boundary in collective	Natural reward checkpoint

Lessons for DynamICCL

#	Lesson	Why it matters
1	Per-level factored action space	NCCL's flat (algo, proto, nChannels, numThreads) collapses hierarchy info; HiCCL shows actions should be per-hierarchy-level. Agent-2 can output a vector of per-level decisions, dramatically shrinking effective action-space cardinality
2	Pipeline depth m as a first-class knob	NCCL's nChannels and chunkSize are functionally a pipeline-depth knob; HiCCL's m=32 sweet spot validates that this is a small-integer search dimension, not a large continuous one
3	Striping factor s for multi-rail exploitation	On 1GbE Chameleon clusters this is degenerate, but on multi-NIC HPC nodes it becomes an essential action dimension — relevant for DynamICCL's eventual port to Frontier-class clusters
4	Compositional decomposition for reward shaping	M/R/fence decomposition lets DynamICCL reward each phase of an All-Reduce separately (Reduce-Scatter time + All-Gather time) instead of only end-to-end completion time → denser, less variance-prone reward
5	Manual knobs in HiCCL = open problem DynamICCL solves	HiCCL explicitly leaves auto-tuning to future work and observes that knobs are stable per machine; this is exactly the niche an RL controller fills, with the option to specialize per (collective, message-size) bucket
6	Machine description vs. workload decision split	HiCCL fixes factorization+backend per machine and tunes (s, n, m) per call; DynamICCL should mirror this: cluster topology in observation, per-collective action only over the dynamic knobs
7	Backend portability informs ablation	HiCCL emits to NCCL/RCCL/OneCCL/MPI/IPC at each level; DynamICCL, currently NCCL-only, can frame this as a future action dimension (which library at each level)
8	Throughput vs. latency regime split	HiCCL targets throughput; DynamICCL must distinguish small-message latency-bound regimes from large-message throughput regimes — possibly via separate policies or message-size as observation feature