HiCCL: A Hierarchical Collective Communication Library
Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken | Stanford, Sandia, SLAC, UIUC, Nvidia | IPDPS 2025
Problem
Modern GPU clusters expose deep, vendor-specific communication hierarchies: GPU dies, multi-die packages, intra-node NVLink/Infinity Fabric, multi-NIC inter-node links (Slingshot, IB), and multi-island fabrics. Two gaps make collective communication painful at this scale. (1) GPU-aware MPI implementations (MPICH, OpenMPI) cannot saturate leadership-class interconnects because their collective algorithms are not co-designed with hierarchical GPU+NIC topologies. (2) Vendor libraries (NCCL, RCCL, OneCCL) deliver high performance but are vertically integrated with one vendor's hardware, so a collective tuned for Nvidia's NCCL cannot be ported to AMD or Intel without manual redesign. Users are forced to choose between portability and performance.
Core Insight
Decouple what a collective does (its logical communication pattern) from how it executes on a specific topology. Express any collective as a composition of three primitives — multicast, reduction, fence — and let a mechanical optimization engine factor that composition across the machine's hierarchy into pipelined point-to-point operations that call existing backends (NCCL, RCCL, OneCCL, MPI, IPC) at each level.
Method
HiCCL exposes a compositional API on a
HiCCL::Comm<T> object:
M(i, j, d)— multicast: root i sends d bytes to leaves j (set).R(i, j, d, op)— reduction: leaves i are reduced into root j with op.fence()— synchronization barrier expressing data dependencies between steps.
A user defines a collective by registering primitives. Example: All-Reduce = several R primitives (Reduce-Scatter) + fence() + several M primitives (All-Gather).
The compiler takes the high-level composition plus a machine description and applies five optimization knobs:
- Hierarchy factorization vector (e.g., 24 GPUs = {2 nodes, 6 devices, 2 dies}) — how the GPUs are grouped at each level.
- Backend library per level vector (e.g., IPC for die-to-die, NCCL intra-node, MPI inter-node).
- Striping factor s — number of parallel stripes to exploit multiple NICs/GPUs per node (multi-rail).
- Ring size n — controls virtual ring formation across nodes.
- Pipeline depth m — number of chunks the payload is split into to overlap intra-node and inter-node stages.
The compiler factors each multicast/reduction recursively along the hierarchy, schedules a DAG of point-to-point sends/recvs, partitions the payload into m channels, and issues them in warm-up / steady-state / wind-down stages so lower-level (faster) hops are hidden behind higher-level (slower) hops.
Results
Evaluated on four leadership systems:
- Delta (A100, 1 x 25 GB/s Slingshot)
- Perlmutter (A100, 4 x 100 GB/s Slingshot)
- Frontier (MI250x, 4 x 100 GB/s Slingshot)
- Aurora (PVC, 8 x 200 GB/s Slingshot)
Headline numbers:
- 17x geomean throughput over GPU-aware MPI (range 12.5x – 48x).
- 1.26x over NCCL on Delta.
- 1.55x over RCCL on Frontier.
- 12.1x over OneCCL on Aurora.
- Approaches theoretical interconnect bandwidth bounds for All-Reduce, All-Gather, Reduce-Scatter, Broadcast, Reduce, Gather, Scatter, All-to-All.
- Scales to 256 nodes; pipelining (m=32 typical on Perlmutter) is required to hide intra-node cost behind inter-node transfer.
Limitations
- Optimizations target large-message throughput; latency-bound regimes (small messages, extreme scale > 256 nodes) can still favor latency-tuned MPI.
- The five tuning knobs (factorization, backend mapping, striping, ring size, pipeline depth) are set by the user, not auto-tuned. The paper notes that knob values are often consistent for a given machine, so auto-tuning is flagged as future work.
- Reduction kernels are generic GPU kernels; NCCL's stream-fused reductions retain a slight edge in some cases.
Relevance to DynamICCL
HiCCL formalizes the hierarchy-aware action space that DynamICCL needs. Where NCCL exposes a flat (algorithm, protocol, nChannels, numThreads) tuple, HiCCL shows that the right decision space on hierarchical clusters is per-level: at each level of the topology hierarchy, an agent must pick an algorithm, a backend, a striping count, and a chunking depth. Five concrete takeaways:
- Per-level action factoring: instead of one global config, DynamICCL's Agent-2 can output a vector — one (algorithm, protocol, chunkSize) decision per hierarchy level (intra-node ring vs. inter-node tree, etc.). This dramatically shrinks the search space versus a flat product.
- Pipeline depth m as a first-class knob: HiCCL shows m = 32 is often optimal and trades intra-node vs. inter-node overlap. NCCL's nChannels and chunkSize have similar semantics; DynamICCL should treat them as a coupled pipelining parameter, not independent variables.
- Striping factor s as multi-rail control: on Chameleon's 1GbE clusters striping is irrelevant, but on multi-NIC HPC fabrics it would become an action dimension Agent-2 can learn over.
- Compositional primitives as a reward decomposer: HiCCL's M/R/fence decomposition allows per-primitive timing measurement. DynamICCL can analogously decompose an All-Reduce into Reduce-Scatter + All-Gather phases and reward each phase, providing denser feedback than end-to-end completion time alone.
- Open knob auto-tuning: HiCCL explicitly leaves the five knobs to the user. This is precisely the gap an RL approach fills; DynamICCL extends HiCCL's design space with a learned controller.