HiCCL: A Hierarchical Collective Communication Library

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken | Stanford, Sandia, SLAC, UIUC, Nvidia | IPDPS 2025

Problem

Modern GPU clusters expose deep, vendor-specific communication hierarchies: GPU dies, multi-die packages, intra-node NVLink/Infinity Fabric, multi-NIC inter-node links (Slingshot, IB), and multi-island fabrics. Two gaps make collective communication painful at this scale. (1) GPU-aware MPI implementations (MPICH, OpenMPI) cannot saturate leadership-class interconnects because their collective algorithms are not co-designed with hierarchical GPU+NIC topologies. (2) Vendor libraries (NCCL, RCCL, OneCCL) deliver high performance but are vertically integrated with one vendor's hardware, so a collective tuned for Nvidia's NCCL cannot be ported to AMD or Intel without manual redesign. Users are forced to choose between portability and performance.

Core Insight

Decouple what a collective does (its logical communication pattern) from how it executes on a specific topology. Express any collective as a composition of three primitives — multicast, reduction, fence — and let a mechanical optimization engine factor that composition across the machine's hierarchy into pipelined point-to-point operations that call existing backends (NCCL, RCCL, OneCCL, MPI, IPC) at each level.

Method

HiCCL exposes a compositional API on a HiCCL::Comm<T> object:

M(i, j, d) — multicast: root i sends d bytes to leaves j (set).
R(i, j, d, op) — reduction: leaves i are reduced into root j with op.
fence() — synchronization barrier expressing data dependencies between steps.

A user defines a collective by registering primitives. Example: All-Reduce = several R primitives (Reduce-Scatter) + fence() + several M primitives (All-Gather).

The compiler takes the high-level composition plus a machine description and applies five optimization knobs:

Hierarchy factorization vector (e.g., 24 GPUs = {2 nodes, 6 devices, 2 dies}) — how the GPUs are grouped at each level.
Backend library per level vector (e.g., IPC for die-to-die, NCCL intra-node, MPI inter-node).
Striping factor s — number of parallel stripes to exploit multiple NICs/GPUs per node (multi-rail).
Ring size n — controls virtual ring formation across nodes.
Pipeline depth m — number of chunks the payload is split into to overlap intra-node and inter-node stages.

The compiler factors each multicast/reduction recursively along the hierarchy, schedules a DAG of point-to-point sends/recvs, partitions the payload into m channels, and issues them in warm-up / steady-state / wind-down stages so lower-level (faster) hops are hidden behind higher-level (slower) hops.

Results

Evaluated on four leadership systems:

Delta (A100, 1 x 25 GB/s Slingshot)
Perlmutter (A100, 4 x 100 GB/s Slingshot)
Frontier (MI250x, 4 x 100 GB/s Slingshot)
Aurora (PVC, 8 x 200 GB/s Slingshot)

Headline numbers:

17x geomean throughput over GPU-aware MPI (range 12.5x – 48x).
1.26x over NCCL on Delta.
1.55x over RCCL on Frontier.
12.1x over OneCCL on Aurora.
Approaches theoretical interconnect bandwidth bounds for All-Reduce, All-Gather, Reduce-Scatter, Broadcast, Reduce, Gather, Scatter, All-to-All.
Scales to 256 nodes; pipelining (m=32 typical on Perlmutter) is required to hide intra-node cost behind inter-node transfer.

Limitations

Optimizations target large-message throughput; latency-bound regimes (small messages, extreme scale > 256 nodes) can still favor latency-tuned MPI.
The five tuning knobs (factorization, backend mapping, striping, ring size, pipeline depth) are set by the user, not auto-tuned. The paper notes that knob values are often consistent for a given machine, so auto-tuning is flagged as future work.
Reduction kernels are generic GPU kernels; NCCL's stream-fused reductions retain a slight edge in some cases.

Relevance to DynamICCL

HiCCL formalizes the hierarchy-aware action space that DynamICCL needs. Where NCCL exposes a flat (algorithm, protocol, nChannels, numThreads) tuple, HiCCL shows that the right decision space on hierarchical clusters is per-level: at each level of the topology hierarchy, an agent must pick an algorithm, a backend, a striping count, and a chunking depth. Five concrete takeaways:

Per-level action factoring: instead of one global config, DynamICCL's Agent-2 can output a vector — one (algorithm, protocol, chunkSize) decision per hierarchy level (intra-node ring vs. inter-node tree, etc.). This dramatically shrinks the search space versus a flat product.
Pipeline depth m as a first-class knob: HiCCL shows m = 32 is often optimal and trades intra-node vs. inter-node overlap. NCCL's nChannels and chunkSize have similar semantics; DynamICCL should treat them as a coupled pipelining parameter, not independent variables.
Striping factor s as multi-rail control: on Chameleon's 1GbE clusters striping is irrelevant, but on multi-NIC HPC fabrics it would become an action dimension Agent-2 can learn over.
Compositional primitives as a reward decomposer: HiCCL's M/R/fence decomposition allows per-primitive timing measurement. DynamICCL can analogously decompose an All-Reduce into Reduce-Scatter + All-Gather phases and reward each phase, providing denser feedback than end-to-end completion time alone.
Open knob auto-tuning: HiCCL explicitly leaves the five knobs to the user. This is precisely the gap an RL approach fills; DynamICCL extends HiCCL's design space with a learned controller.