HiCCL: A Hierarchical Collective Communication Library — Detailed Summary

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken | Stanford, Sandia, SLAC, UIUC, Nvidia | IPDPS 2025

Per-section summary organized by paper structure. Each section includes paragraph-level bullet points.


Abstract


1. Introduction

Why hierarchy matters:

Two gaps in existing libraries:

  1. Performance gap (MPI): GPU-aware MPI is general but misses hierarchy- specific optimizations and cannot saturate leadership-class fabrics.
  2. Portability gap (vendor libs): NCCL/RCCL/OneCCL are well-tuned but locked to one vendor; an All-Reduce optimized for Nvidia must be redesigned for AMD or Intel.

HiCCL's positioning: a single user-level description that compiles into hierarchy-aware schedules across vendors, calling whichever backend (NCCL, RCCL, MPI, IPC) is appropriate at each level.


2. Background and Motivation

Hierarchy levels in modern systems:

Direct vs. hierarchical communication (Figure 1):

Vendor library limits:


3. HiCCL Programming Model

3.1 Three Compositional Primitives

Primitive Signature Semantics
Multicast M(i, j, d) Root GPU i sends d bytes to set of leaf GPUs j
Reduction R(i, j, d, op) Leaf GPUs i are reduced (op) to root GPU j
Fence fence() Barrier expressing dependency: subsequent primitives wait for prior step

3.2 Programming via HiCCL::Comm<T>

3.3 Library API Surface


4. Optimization Engine

4.1 The Five Optimization Parameters (Action Space)

# Parameter Type What it controls
1 Hierarchy factorization Vector of ints (f1, f2, ..., fk) How p GPUs are grouped at each level (e.g., {2, 6, 2} for 24 GPUs)
2 Backend library per level Vector of enums (L1, L2, ..., Lk) Which library handles each level (IPC / NCCL / RCCL / OneCCL / MPI)
3 Striping factor s Int Number of parallel stripes (multi-NIC / multi-rail exploitation)
4 Ring size n Int Virtual ring topology size across nodes
5 Pipeline depth m Int Number of chunks for inter-level overlap

Critical quote: "HiCCL does not automatically select these parameters... however, we found that we were able to reuse the same description of the network hierarchy for all collective communication operations on a particular machine."

4.2 Compilation / Lowering Pipeline

  1. Factorization: each high-level M/R primitive is recursively broken down along the hierarchy vector — a global multicast becomes a tree of nested multicasts at each level.
  2. Mapping: each level is assigned a backend library; the corresponding library calls (e.g., ncclBroadcast, MPI_Isend, cudaMemcpyPeerAsync) are emitted.
  3. Scheduling: HiCCL builds a DAG of point-to-point sends/recvs honoring fences and per-level dependencies.
  4. Striping: payload is split into s parallel stripes, each routed through a different NIC/path; required to saturate multi-rail nodes.
  5. Pipelining: payload further partitioned into m channels; channels are issued in warm-up, steady-state, wind-down phases so cheap intra-node hops are overlapped with expensive inter-node hops.

4.3 Backend Integration

4.4 Pipelining Mechanics


5. Composition Examples

5.1 All-Reduce Composition (Figure 4)

5.2 Other Standard Collectives

Collective Composition
Broadcast Single M primitive (recursively factored)
Reduce Single R primitive (recursively factored)
Reduce-Scatter p R primitives in one step
All-Gather p M primitives in one step
All-Reduce Reduce-Scatter + fence + All-Gather
Scatter p M primitives, root = source
Gather p R primitives, root = sink
All-to-All p^2 personalized M primitives

5.3 Tree Structures (Figure 5)


6. Evaluation

6.1 Hardware Platforms

System Vendor GPUs/node NICs/node NIC bandwidth
Delta Nvidia A100 4 1 25 GB/s Slingshot
Perlmutter Nvidia A100 4 4 100 GB/s Slingshot
Frontier AMD MI250x (16 dies) 8 4 100 GB/s Slingshot
Aurora Intel PVC 12 8 200 GB/s Slingshot

6.2 Methodology

6.3 Headline Results

6.4 Pipeline Depth Sweep

6.5 Scaling (Figure 10)

6.6 Auto-tuning Discussion


7. Limitations and Future Work


8. Key Figures

Figure Content
Fig 1 Direct vs. hierarchical communication illustration
Fig 4 All-Reduce DAG: 3 reductions + fence + 3 multicasts
Fig 5 Tree factorization options for p = 24 GPUs
Fig 8 Headline throughput bars: HiCCL approaches hardware bounds across all 4 systems
Fig 10 All-Reduce scaling up to 256 nodes; effect of pipelining

9. Decision Points (Action Space) — Distilled

For DynamICCL/RL purposes, HiCCL effectively defines an action space with 5 dimensions, applied per collective on a given machine:

Dimension Cardinality Coupling
Hierarchy factorization (f1...fk) Combinatorial in p Machine-dependent; usually fixed per cluster
Backend mapping (L1...Lk) small (3-5 backends per level) Per-level
Striping factor s small int (1..#NICs) Per-collective
Ring size n small int Per-collective
Pipeline depth m int (e.g., 1..64) Per-collective; tightly coupled to message size

Key insight: HiCCL keeps factorization and backend-mapping "machine-fixed" and treats (s, n, m) as the per-collective tuning surface. This neatly separates machine description from workload-specific decisions — exactly the split DynamICCL needs.


Relevance to DynamICCL — Mapping Table

DynamICCL is an RL-based NCCL configuration optimizer that selects per- collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters.

Direct mapping: HiCCL knobs → DynamICCL action space

HiCCL concept DynamICCL analog Mapping notes
Hierarchy factorization (f1..fk) Topology embedding in observation Fixed per cluster; passed as observation feature, not learned
Backend mapping per level Algorithm choice per level (Ring/Tree per scope) Agent-2 outputs per-level algorithm; same shape as HiCCL's L vector
Striping factor s Multi-rail / multi-NIC selection Becomes an action only on multi-NIC HPC fabrics; degenerate on 1GbE
Ring size n NCCL ring size / nChannels coupling Maps directly to nChannels structure in NCCL
Pipeline depth m NCCL chunkSize / pipelining depth Coupled with nChannels; trades latency vs. throughput
Multicast primitive M Per-phase Broadcast/All-Gather subroutine Can be timed independently for dense reward signal
Reduction primitive R Per-phase Reduce/Reduce-Scatter subroutine Same — phase-level reward decomposition
Fence Phase boundary in collective Natural reward checkpoint

Lessons for DynamICCL

# Lesson Why it matters
1 Per-level factored action space NCCL's flat (algo, proto, nChannels, numThreads) collapses hierarchy info; HiCCL shows actions should be per-hierarchy-level. Agent-2 can output a vector of per-level decisions, dramatically shrinking effective action-space cardinality
2 Pipeline depth m as a first-class knob NCCL's nChannels and chunkSize are functionally a pipeline-depth knob; HiCCL's m=32 sweet spot validates that this is a small-integer search dimension, not a large continuous one
3 Striping factor s for multi-rail exploitation On 1GbE Chameleon clusters this is degenerate, but on multi-NIC HPC nodes it becomes an essential action dimension — relevant for DynamICCL's eventual port to Frontier-class clusters
4 Compositional decomposition for reward shaping M/R/fence decomposition lets DynamICCL reward each phase of an All-Reduce separately (Reduce-Scatter time + All-Gather time) instead of only end-to-end completion time → denser, less variance-prone reward
5 Manual knobs in HiCCL = open problem DynamICCL solves HiCCL explicitly leaves auto-tuning to future work and observes that knobs are stable per machine; this is exactly the niche an RL controller fills, with the option to specialize per (collective, message-size) bucket
6 Machine description vs. workload decision split HiCCL fixes factorization+backend per machine and tunes (s, n, m) per call; DynamICCL should mirror this: cluster topology in observation, per-collective action only over the dynamic knobs
7 Backend portability informs ablation HiCCL emits to NCCL/RCCL/OneCCL/MPI/IPC at each level; DynamICCL, currently NCCL-only, can frame this as a future action dimension (which library at each level)
8 Throughput vs. latency regime split HiCCL targets throughput; DynamICCL must distinguish small-message latency-bound regimes from large-message throughput regimes — possibly via separate policies or message-size as observation feature