HiCCL: A Hierarchical Collective Communication Library

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken | Stanford, Sandia, SLAC, UIUC, Nvidia | IPDPS 2025


Problem

Modern GPU clusters expose deep, vendor-specific communication hierarchies: GPU dies, multi-die packages, intra-node NVLink/Infinity Fabric, multi-NIC inter-node links (Slingshot, InfiniBand), and multi-island fabrics. Two gaps make collective communication painful at this scale. (1) GPU-aware MPI implementations (MPICH, Open MPI) cannot saturate leadership-class interconnects because their collective algorithms are not co-designed with hierarchical GPU+NIC topologies. (2) Vendor libraries (NCCL, RCCL, OneCCL) deliver high performance but are vertically integrated with one vendor's hardware, so a collective tuned for Nvidia's NCCL cannot be ported to AMD or Intel without manual redesign. Users are forced to choose between portability and performance.


Core Insight

Decouple what a collective does (its logical communication pattern) from how it executes on a specific topology. Express any collective as a composition of three primitives — multicast, reduction, fence — and let a mechanical optimization engine factor that composition across the machine's hierarchy into pipelined point-to-point operations that call existing backends (NCCL, RCCL, OneCCL, MPI, IPC) at each level.


Method

HiCCL exposes a compositional API on a HiCCL::Comm<T> object:

A user defines a collective by registering primitives on the communicator. Example: All-Reduce = several reduction (R) primitives that together form a Reduce-Scatter, then fence(), then several multicast (M) primitives that together form an All-Gather.
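The composition above can be illustrated with a small single-process simulation. This is a minimal sketch, not HiCCL's actual API: the class ToyComm and its methods add_reduction, add_multicast, add_fence, and run are hypothetical names chosen for illustration, and buffers are plain Python lists rather than GPU memory.

```python
from typing import Callable, List


class ToyComm:
    """Toy stand-in for a compositional communicator (hypothetical, NOT the HiCCL API)."""

    def __init__(self, num_ranks: int, count: int):
        self.send = [[0.0] * count for _ in range(num_ranks)]
        self.recv = [[0.0] * count for _ in range(num_ranks)]
        # Each inner list is one step; a fence closes the current step.
        self.steps: List[List[Callable[[], None]]] = [[]]

    def add_reduction(self, src_ranks, dst_rank, offset, length):
        """Register: sum send[src][offset:offset+length] over src_ranks into recv[dst_rank]."""
        def op():
            for k in range(offset, offset + length):
                self.recv[dst_rank][k] = sum(self.send[r][k] for r in src_ranks)
        self.steps[-1].append(op)

    def add_multicast(self, src_rank, dst_ranks, offset, length):
        """Register: copy recv[src_rank][offset:offset+length] to every rank in dst_ranks."""
        def op():
            chunk = self.recv[src_rank][offset:offset + length]
            for r in dst_ranks:
                self.recv[r][offset:offset + length] = chunk
        self.steps[-1].append(op)

    def add_fence(self):
        """Primitives registered after the fence run only after all earlier ones."""
        self.steps.append([])

    def run(self):
        for step in self.steps:
            for op in step:
                op()


def build_allreduce(p: int, chunk: int) -> ToyComm:
    comm = ToyComm(p, p * chunk)
    ranks = list(range(p))
    for i in ranks:  # Reduce-Scatter: rank i ends up owning reduced chunk i
        comm.add_reduction(ranks, i, i * chunk, chunk)
    comm.add_fence()
    for i in ranks:  # All-Gather: rank i multicasts its chunk to the others
        comm.add_multicast(i, [r for r in ranks if r != i], i * chunk, chunk)
    return comm


comm = build_allreduce(p=4, chunk=2)
for r in range(4):
    comm.send[r] = [float(r + 1)] * 8  # rank r contributes the value r + 1
comm.run()
print(comm.recv[0])
```

In this toy run every rank's receive buffer holds the element-wise sum of all send buffers. The fence matters because each multicast reads a chunk that the preceding reduction must already have produced.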

The compiler takes the high-level composition plus a machine description and applies five optimization knobs:

  1. Hierarchy factorization vector (e.g., 24 GPUs = {2 nodes, 6 devices, 2 dies}) — how the GPUs are grouped at each level.
  2. Per-level backend library vector (e.g., IPC for die-to-die, NCCL intra-node, MPI inter-node).
  3. Striping factor s — number of parallel stripes to exploit multiple NICs/GPUs per node (multi-rail).
  4. Ring size n — controls virtual ring formation across nodes.
  5. Pipeline depth m — number of chunks the payload is split into to overlap intra-node and inter-node stages.
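As a rough illustration, the five knobs can be collected into one configuration record and sanity-checked. This is a sketch under stated assumptions: the Knobs class, its field names, and rank_to_coords are hypothetical and not part of HiCCL. It also assumes exactly one backend per hierarchy level and that ranks are numbered with the innermost level varying fastest. The values given for s, n, and m are placeholders, not recommendations.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple


@dataclass
class Knobs:
    """Hypothetical container for the five knobs (not HiCCL's data structure)."""
    hierarchy: List[int]   # group sizes, outermost level first
    backends: List[str]    # assumed: one backend name per hierarchy level
    stripes: int           # striping factor s
    ring_size: int         # ring size n
    pipeline_depth: int    # pipeline depth m

    def validate(self, num_gpus: int) -> None:
        if prod(self.hierarchy) != num_gpus:
            raise ValueError("hierarchy factors must multiply to the GPU count")
        if len(self.backends) != len(self.hierarchy):
            raise ValueError("assumed: exactly one backend per hierarchy level")
        if min(self.stripes, self.ring_size, self.pipeline_depth) < 1:
            raise ValueError("s, n, and m must be positive")


def rank_to_coords(rank: int, hierarchy: List[int]) -> Tuple[int, ...]:
    """Map a flat rank to per-level coordinates (innermost level varies fastest)."""
    coords = []
    for size in reversed(hierarchy):
        rank, c = divmod(rank, size)
        coords.append(c)
    return tuple(reversed(coords))


# The 24-GPU example from the list: 2 nodes x 6 devices x 2 dies.
knobs = Knobs(hierarchy=[2, 6, 2], backends=["MPI", "NCCL", "IPC"],
              stripes=1, ring_size=1, pipeline_depth=1)  # placeholder s, n, m
knobs.validate(num_gpus=24)
coords = rank_to_coords(17, knobs.hierarchy)  # (node, device, die)
print(coords)
```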

The compiler factors each multicast/reduction recursively along the hierarchy, schedules a DAG of point-to-point sends/recvs, partitions the payload into m channels, and issues them in warm-up / steady-state / wind-down stages so lower-level (faster) hops are hidden behind higher-level (slower) hops.
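The warm-up / steady-state / wind-down structure can be sketched with a generic software-pipelining schedule. This is an illustration, not HiCCL's scheduler: pipeline_schedule is a hypothetical helper, and it assumes every stage takes one time step per chunk, whereas real intra-node and inter-node hops differ in cost.

```python
from typing import List, Tuple


def pipeline_schedule(num_stages: int, m: int) -> List[Tuple[str, List[Tuple[int, int]]]]:
    """Return, per time step, the phase name and the active (stage, chunk) pairs.

    At step t, stage k works on chunk t - k, so stage k+1 handles a chunk one
    step after stage k has finished with it.
    """
    schedule = []
    for t in range(m + num_stages - 1):
        active = [(stage, t - stage)
                  for stage in range(num_stages)
                  if 0 <= t - stage < m]
        if len(active) == num_stages:
            phase = "steady-state"   # every stage is busy: full overlap
        elif t < num_stages - 1:
            phase = "warm-up"        # later stages are still waiting for input
        else:
            phase = "wind-down"      # earlier stages have run out of chunks
        schedule.append((phase, active))
    return schedule


# Two stages (e.g. an inter-node hop feeding an intra-node hop), m = 4 chunks.
schedule = pipeline_schedule(num_stages=2, m=4)
for t, (phase, active) in enumerate(schedule):
    print(t, phase, active)
```

Under the unit-cost assumption, this schedule takes m + num_stages - 1 steps instead of m * num_stages, which is the overlap that a larger m buys.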


Results

Evaluated on four leadership systems:

Headline numbers:


Limitations


Relevance to DynamICCL

HiCCL formalizes the hierarchy-aware action space that DynamICCL needs. Where NCCL exposes a flat (algorithm, protocol, nChannels, numThreads) tuple, HiCCL shows that the right decision space on hierarchical clusters is per-level: at each level of the topology hierarchy, an agent must pick an algorithm, a backend, a striping count, and a chunking depth. Five concrete takeaways:

  1. Per-level action factoring: instead of one global config, DynamICCL's Agent-2 can output a vector with one (algorithm, protocol, chunkSize) decision per hierarchy level (intra-node ring vs. inter-node tree, etc.). With one small output head per level, the policy's output size grows with the sum of the per-level option counts rather than with their flat product.
  2. Pipeline depth m as a first-class knob: HiCCL shows m = 32 is often optimal; the depth governs how well intra-node and inter-node transfers overlap. NCCL's nChannels and chunkSize have similar semantics; DynamICCL should treat them as a coupled pipelining parameter, not independent variables.
  3. Striping factor s as multi-rail control: on Chameleon's 1GbE clusters striping is irrelevant, but on multi-NIC HPC fabrics it would become an action dimension Agent-2 can learn over.
  4. Compositional primitives as a reward decomposer: HiCCL's M/R/fence decomposition allows per-primitive timing measurement. DynamICCL can analogously decompose an All-Reduce into Reduce-Scatter + All-Gather phases and reward each phase, providing denser feedback than end-to-end completion time alone.
  5. Open knob auto-tuning: HiCCL explicitly leaves the five knobs to the user. This is precisely the gap an RL approach fills; DynamICCL extends HiCCL's design space with a learned controller.
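To make the per-level factoring in takeaway 1 concrete, the arithmetic below compares the size of the joint action space with the number of outputs a factored policy needs. It is a back-of-the-envelope sketch: the option counts (2 algorithms, 3 protocols, 4 chunk sizes, 3 levels) are hypothetical placeholders, and the helper names are not taken from DynamICCL or HiCCL.

```python
from math import prod
from typing import Dict, List


def joint_actions(options_per_level: List[Dict[str, int]]) -> int:
    """Size of the flat product: every combination of every choice at every level."""
    return prod(prod(level.values()) for level in options_per_level)


def factored_outputs(options_per_level: List[Dict[str, int]]) -> int:
    """Total outputs if each choice at each level gets its own small categorical head."""
    return sum(sum(level.values()) for level in options_per_level)


# Hypothetical option counts, identical at all three hierarchy levels.
levels = [{"algorithm": 2, "protocol": 3, "chunk_size": 4}] * 3

joint = joint_actions(levels)      # (2 * 3 * 4) ** 3
heads = factored_outputs(levels)   # 3 * (2 + 3 + 4)
print(joint, heads)
```

The set of reachable configurations is the same in both cases. What changes is how many outputs the agent has to score at each decision, which is the quantity that matters for learning.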