HiCCL — Architecture and Design Analysis
Paper: HiCCL: A Hierarchical Collective Communication Library Venue: IPDPS 2025 (IEEE International Parallel and Distributed Processing Symposium) Authors: Mert Hidayetoglu (Stanford / Nvidia), Simon Garcia de Gonzalo (Sandia), Elliott Slaughter (SLAC), Pinku Surana (SLAC), Wen-mei Hwu (UIUC / Nvidia), William Gropp (UIUC), Alex Aiken (Stanford) Code: https://github.com/merthidayetoglu/HiCCL Analyst: Vishwakarma Date: 2026-04-28
Table of Contents
- System Overview Block Diagram
- The IR / Primitive Layer (Multicast, Reduction, Fence)
- Collective Composition Grammar (Single-step + Multi-step)
- Hierarchical Factorization & Optimization Architecture
- Control Flow / Data Flow Diagrams
- Performance Model and Cost Equations
- Trade-off Tables: HiCCL vs. Flat NCCL
- What to Borrow for DynamICCL
- Analogy
- Summary Table
1. System Overview Block Diagram
+--------------------------------------------------------------------+
| HiCCL System Architecture |
| |
| +-----------------------------------------------------------+ |
| | Application layer (PyTorch / scientific app) | |
| | declares a collective (Allreduce, Broadcast, ...) | |
| +----------------------------+------------------------------+ |
| | C++ API call |
| v |
| +-----------------------------------------------------------+ |
| | HiCCL Compositional API (Section III) | |
| | | |
| | +---------------------+ +-------------------------+ | |
| | | Comm<T>:: | | Three IR primitives: | | |
| | | add_multicast(...) | | M(i, j, d) | | |
| | | add_reduction(...) | | R(i, j, d, op) | | |
| | | add_fence() | | fence (sync barrier) | | |
| | +----------+----------+ +-------------------------+ | |
| | | persistent communicator (memoized plan) | |
| +--------------+--------------------------------------------+ |
| v |
| +-----------------------------------------------------------+ |
| | Hierarchical Optimizer (Section IV) -- run at init() | |
| | | |
| | Optimization parameters (5): | |
| | 1) hierarchy = vector<int> (factors of p) | |
| | 2) library = vector<Library> per level | |
| | 3) stripe(s) -- multi-NIC striping factor | |
| | 4) ring(n) -- ring nodes count | |
| | 5) pipeline(m) -- pipeline depth | |
| | | |
| | Produces: tree/ring/ring+tree virtual topology + | |
| | per-level library binding + pipeline schedule | |
| +-----------------------------+-----------------------------+ |
| | factorized P2P plan |
| v |
| +-----------------------------------------------------------+ |
| | Backend Execution (Section V) -- start() / wait() | |
| | | |
| | +----------+ +----------+ +----------+ +-------------+ | |
| | | MPI P2P | | NCCL P2P | | RCCL P2P | | OneCCL/IPC | | |
| | | (OMPI / | | (Send/ | | (HIP) | | (CUDA IPC, | | |
| | | MPICH) | | Recv) | | | | HIP IPC, | | |
| | | | | | | | | L0 IPC) | | |
| | +----+-----+ +----+-----+ +----+-----+ +------+------+ | |
| | | | | | | |
| | v v v v | |
| | Slingshot NVLink XGMI intra-die | |
| | IB / RoCE (Nvidia) (AMD) (PVC tile) | |
| +-----------------------------------------------------------+ |
+--------------------------------------------------------------------+
^ Fig 1: HiCCL four-layer architecture. The compositional API
decouples *what* the collective does (IR primitives) from
*how* it executes (per-level library + topology). Only
nonblocking point-to-point operations from existing
libraries are used as the bottom-most building block.
Interpretation. HiCCL inverts the traditional collective library model. NCCL/RCCL/MPI each ship a closed set of monolithic collective implementations specialised to one vendor's hardware. HiCCL instead ships a thin compositional layer that consumes any vendor library purely through its point-to-point primitives, then assembles those into hierarchical multi-step collectives. The architectural consequence is that the same HiCCL collective spec runs on Delta (Nvidia A100), Perlmutter (Nvidia A100, 4 NICs), Frontier (AMD MI250X), and Aurora (Intel PVC) by changing only the per-level library binding and the hierarchy factorization vector.
2. The IR / Primitive Layer (Multicast, Reduction, Fence)
HiCCL's entire collective universe is expressed in three primitives. This is the "narrow waist" of the design -- everything above is high-level collectives, everything below is point-to-point send/recv on some vendor library.
+-------------------------------------------------------------------+
| HiCCL IR -- three primitives, no others |
| |
| (1) MULTICAST M(i, j, d) |
| one-to-many, root rank i sends d bytes to leaf vector j |
| |
| root i |
| +---------- d ----------> j[0] |
| | |
| +---------- d ----------> j[1] |
| | |
| +---------- d ----------> j[2] |
| |
| Special case: |j| = 1 => point-to-point send |
| |
| (2) REDUCTION R(i, j, d, op) |
| many-to-one, leaf vector i reduces d bytes into root j |
| with op in {sum, max, min, logical_or, ...} |
| |
| leaf i[0] ---- d ----+ |
| | |
| leaf i[1] ---- d ----+--[ op ]----> root j |
| | |
| leaf i[2] ---- d ----+ |
| |
| Special case: |i| = 1 => point-to-point send (no op) |
| |
| (3) FENCE (no arguments) |
| Logical synchronization barrier between successive |
| multicast/reduction registrations. Expresses the data |
| dependency: "primitives registered after this fence may |
| read the result of primitives registered before this fence." |
| It is purely a *dependency declaration*, not a runtime |
| synchronization. Pipelined execution (Section IV-E) |
| respects fences without adding global barriers. |
+-------------------------------------------------------------------+
^ Fig 2: The three HiCCL IR primitives. Multicast and reduction
each form a one-root-many-leaf tree; fence separates
multi-step collectives into dependent stages.
C++ registration API (Listing 1 in paper):
void Comm<T>::add_multicast(T *sendbuf, T *recvbuf,
size_t count,
int i, // root rank
std::vector<int> j_vec); // leaf ranks
void Comm<T>::add_reduction(T *sendbuf, T *recvbuf,
size_t count,
std::vector<int> i_vec, // leaf ranks
int j, // root rank
HiCCL::op op); // sum / max / ...
void Comm<T>::add_fence();
All registered primitives execute in parallel by default within a single step. The fence is the only mechanism by which the user (or compiler) introduces sequencing.
Why three primitives suffice. Multicast covers Broadcast/Scatter/Gather/ Allgather/Alltoall; reduction covers Reduce/Reduce-scatter/Allreduce when composed with multicast across a fence; fence enables multi-step expression. The paper proves (Table II) this is sufficient for all eight standard MPI collectives.
3. Collective Composition Grammar
3a. Single-step compositions (Table II, "Single" column)
+-------------------------------------------------------------------+
| Collective Composition (parallel, no fence) |
| ---------- -------------------------------- |
| Broadcast M(i, U, dp) |
| Reduce R(U, j, dp, op) |
| All-gather sum_j M(j, U, d) |
| Reduce-scatter sum_j R(U, j, d, op) |
| All-reduce sum_j R(U, j, dp, op) // redundant! |
| Scatter sum_j M(i, j, d) |
| Gather sum_i M(i, j, d) |
| All-to-all sum_i sum_j M(i, j, d) // p^2 prims |
| |
| U = vector of all p ranks; sum = parallel composition operator |
+-------------------------------------------------------------------+
^ Fig 3: Single-step realizations. All-reduce in one step moves
d*p^2 bytes -- bandwidth-suboptimal -- which motivates the
multi-step formulation below.
3b. Multi-step compositions (Table II, "Multiple" column)
The "." operator denotes a fence-separated sequence (right to left):
+-------------------------------------------------------------------+
| Collective Composition (with fences) |
| ---------- ----------------------------- |
| Broadcast Gather . Scatter |
| Reduce Gather . Reduce-scatter |
| All-gather Broadcast . Gather |
| Reduce-scatter Scatter . Reduce |
| All-reduce Broadcast . Reduce (2-step) |
| All-reduce All-gather . Reduce-scatter (2-step) |
+-------------------------------------------------------------------+
^ Fig 4: Multi-step realizations. The All-reduce = AllGather o
ReduceScatter form moves only 2*d*(p-1) bytes vs. d*p^2 in
the single-step form, restoring bandwidth optimality.
Worked example: All-reduce on 3 processes (paper Fig. 4):
Proc 0 Proc 1 Proc 2
[1, 2, 3] [4, 5, 6] [7, 8, 9]
| | |
| Phase 1: Reduce-scatter (3 R primitives in parallel)
| | |
| R0(U, 0, d, +) for slice [0] // sum -> P0
| R1(U, 1, d, +) for slice [1] // sum -> P1
| R2(U, 2, d, +) for slice [2] // sum -> P2
v v v
[12, *, *] [*, 15, *] [*, *, 18]
| | |
| ----- fence ----- (data dependency declared)
| | |
| Phase 2: All-gather (3 M primitives in parallel)
| | |
| M0(0, U, d) : P0 multicasts its slice to all
| M1(1, U, d) : P1 multicasts its slice to all
| M2(2, U, d) : P2 multicasts its slice to all
v v v
[12, 15, 18] [12, 15, 18] [12, 15, 18]
| | |
start() . . . . . . . . . . . . . . wait()
^ Fig 5: All-reduce as ReduceScatter . AllGather. The dashed
broadcast edges in Fig. 4 of the paper can be omitted for
in-place implementation (recvbuf reused).
Listing 2 sample HiCCL program (verbatim from paper):
using namespace HiCCL;
Comm<float> comm();
// step 1) register Reduce-scatter using primitives
for (int j = 0; j < numproc; j++)
comm.add_reduction(sendbuf + j*count, recvbuf + j*count,
count, U, j, op::sum);
// step 2) register fence to express data dependency
comm.add_fence();
// step 3) register All-gather using primitives
for (int i = 0; i < numproc; i++)
comm.add_multicast(recvbuf + i*count, recvbuf + i*count,
count, i, others);
// optimization parameters for Aurora
std::vector<int> hierarchy{numproc/12, 6, 2};
std::vector<Library> library{MPI, IPC, IPC};
int stripe(1);
int ring(1);
int pipeline(16);
// initialization
comm.init(hierarchy, library, ring, stripe, pipeline);
comm.start(); // nonblocking
// ... do other things ...
comm.wait(); // block until safe to reuse buffers
4. Hierarchical Factorization & Optimization Architecture
After the compositional graph is registered, HiCCL applies four
mechanical optimizations at init() time. None depend on the
specific collective: they all operate on the IR primitives.
+-------------------------------------------------------------------+
| HiCCL Optimization Pipeline (run once at init()) |
| |
| IR graph (multicast/reduction primitives + fences) |
| | |
| v |
| (A) HIERARCHICAL TREE STRUCTURE |
| Recursively decompose each M/R primitive into sub-trees, |
| one per level of the network hierarchy. |
| hierarchy = {f1, f2, ..., fL} (factors of p) |
| Each level binds to a specific point-to-point library. |
| | |
| v |
| (B) MULTI-NIC STRIPING (s = stripe factor) |
| Split each inter-node primitive into s parallel copies, |
| each handled by a different NIC. All NICs participate. |
| Implements sum_{striped roots} R(0, j, d/s) etc. |
| | |
| v |
| (C) HYBRID RING + TREE TOPOLOGY (n = ring nodes) |
| At the inter-node level, choose between: |
| - tree (latency-optimal, log(n) hops) |
| - ring (bandwidth-optimal, n-1 hops) |
| - ring+tree (ring across nodes, tree within) |
| | |
| v |
| (D) PIPELINING (m = pipeline depth) |
| Slice each primitive's payload into m sub-chunks, |
| overlap stages along the dependency graph. Hides |
| intra-node cost behind inter-node cost (or vice versa). |
| Requires m >= number of pipeline stages (4 for tree, |
| 5 for ring) for full overlap. |
| | |
| v |
| Final: dependency graph of P2P calls + per-level library |
| binding + pipeline schedule, memoized in communicator. |
+-------------------------------------------------------------------+
^ Fig 6: The five optimization parameters and four optimization
passes. Parameters 1, 3, 4 (hierarchy, stripe, ring) are
machine-architecture dependent; parameter 2 (library) is
software-stack dependent; parameter 5 (pipeline) is
message-length dependent.
4a. Hierarchy is a virtual topology, not the physical one
HiCCL's hierarchy is a vector of integer factors of p (the total GPU count). For p=24 the paper shows six valid factorizations (Fig. 5):
{3, 8} -- 2 levels, 3 outer * 8 inner
{4, 6} -- 2 levels, 4 outer * 6 inner
{3, 2, 4} -- 3 levels
{2, 2, 6} -- 3 levels
{3, 2, 2, 2} -- 4 levels (deepest hierarchy)
{2, 2, 2, 3} -- 4 levels
The factorization is a virtual communication hierarchy that need not match the physical one -- though performance is best when they align. This decoupling is what gives HiCCL portability: the same primitive registration runs on any machine by changing the hierarchy vector.
4b. Per-level library binding (mixed-library execution)
+-----------------------------------------------------------------+
| Hierarchy {2, 2, 3} --> Library {MPI, NCCL, IPC} |
| |
| Level 1 (red) -----> MPI (across nodes) |
| Level 2 (yellow) ----> NCCL (across NVLink islands in node) |
| Level 3 (green) -----> IPC (within NVLink island) |
| |
| Leaf level 4 (blue) --> GPU endpoints |
+-----------------------------------------------------------------+
^ Fig 7: HiCCL uses a different library at each level of the
hierarchy. The library is chosen for the level where it
delivers the best point-to-point performance.
Final factorizations the paper used (Table V):
+----------------+--------+-----------+-----------------------+
| System | Topo | Hierarchy | Library per level |
+----------------+--------+-----------+-----------------------+
| Delta | Tree | {2,2,4} | {NCCL, NCCL, IPC} |
| Perlmutter | Ring+T | {4,4} | {NCCL, IPC} |
| Frontier | Tree | {2,2,4,2} | {MPI, MPI, IPC, IPC} |
| Frontier | Ring+T | {4,4,2} | {MPI, IPC, IPC} |
| Aurora | Tree | {2,2,6,2} | {MPI, MPI, IPC, IPC} |
| Aurora | Ring+T | {4,6,2} | {MPI, IPC, IPC} |
+----------------+--------+-----------+-----------------------+
^ Fig 8: Per-system factorizations. Note that NCCL is unavailable
on AMD/Intel and is replaced with MPI at top levels;
IPC dominates intra-node on all systems.
4c. State machine of an optimized collective
[register primitives]
|
v
+---------------+
| COMPOSED | (IR graph in memory)
+-------+-------+
| comm.init(hierarchy, library, ring,
| stripe, pipeline)
v
+---------------+
| OPTIMIZED | (factorized P2P plan
| | memoized)
+-------+-------+
| comm.start()
v
+---------------+
| EXECUTING | (pipelined P2P calls
| | on GPU streams)
+-------+-------+
| comm.wait()
v
+---------------+
| COMPLETE | (buffers safe to reuse)
+---------------+
| (next call: reuse memoized plan)
v
back to EXECUTING
^ Fig 9: HiCCL persistent communicator state machine. The
OPTIMIZED state is reached once and reused across all
subsequent calls -- amortizing planning over thousands
of collective invocations.
5. Control Flow / Data Flow Diagrams
5a. Control flow -- registration and execution
START
|
v
1. User registers primitives:
comm.add_multicast(...)
comm.add_reduction(...)
comm.add_fence()
|
v
2. User specifies optimization parameters:
hierarchy = {f1, ..., fL}
library = {Lib1, ..., LibL}
stripe(s), ring(n), pipeline(m)
|
v
3. comm.init(...) [one-time, "memoization"]
|
+---> (a) Build hierarchical tree per primitive (Sec IV-B)
|
+---> (b) Apply striping s (Sec IV-C)
|
+---> (c) Decide ring/tree/ring+tree at inter-node level
| (Sec IV-D), driven by parameter n
|
+---> (d) Build pipelined dependency graph with depth m
| (Sec IV-E)
|
+---> Persist final P2P call plan in Comm<T>
|
v
4. comm.start() [nonblocking]
|
+---> Issue pipelined P2P calls on GPU streams
| using each level's bound library
|
v
5. (concurrent host work)
|
v
6. comm.wait() [block on completion]
|
v
DONE
^ Fig 10: HiCCL control flow. Steps 1-3 happen once; steps 4-6
repeat for each collective invocation, reusing the plan.
5b. Data flow -- striped tree broadcast on 4 nodes x 3 GPUs
Hierarchy {4, 3}, stripe(3), tree topology.
Root is GPU 0 on Node N0. Multicast d bytes to all 12 GPUs.
Stage 0 (intra-node striping at root):
N0: g0 ====== d/3 ======> g1 [orange]
N0: g0 ====== d/3 ======> g2 [orange]
Stage 1 (inter-node hop, NIC-parallel):
N0:g0 ==d/3==> N1:g0 (NIC 0 -- forward stripe) [green]
N0:g1 ==d/3==> N2:g1 (NIC 1 -- forward stripe) [green]
N0:g2 ==d/3==> N3:g2 (NIC 2 -- forward stripe) [green]
Stage 2 (inter-node hop, completing tree):
N1:g0 ==d/3==> N2:g0 (and so on for the tree fanout) [green]
N1:g0 ==d/3==> N3:g0
...
Stage 3 (intra-node redistribution at every leaf node):
Each receiving node's "staging GPU" multicasts d/3 to its
two siblings via IPC, in parallel across the three stripes.
N1: g0 ==d/3==> g1,g2 (all in parallel) [red]
N2: g0 ==d/3==> g0,g2
N3: g0 ==d/3==> g0,g1
Total bytes per NIC = d/3 (one forward stripe + outgoing fanout).
All 4 NICs active simultaneously -> ~4x BW vs. single-NIC scheme.
^ Fig 11: 4-stage striped tree broadcast. Each color tracks one
stripe of d/3 bytes. With pipelining (m=4), stages 0-3
overlap, hiding intra-node cost behind inter-node cost.
5c. Pipelined execution overlap (paper Fig. 7)
m=1 (no pipelining):
|--MPI--|--NCCL--|--IPC--| stage 0 fully serializes
|--MPI--|--NCCL--|--IPC--| stage 1
|--MPI--|--NCCL--|--IPC--|
^^^^^ slowest stage is intra-node (orange) and inter-node (cyan)
serially
m=2 (depth-2 pipeline):
|--MPI--|--NCCL--|--IPC--|
|--MPI--|--NCCL--|--IPC--| 50% overlap
m=5 (full pipeline for ring topology):
|M|N|I|
|M|N|I|
|M|N|I|
|M|N|I|
|M|N|I| middle stages fully overlapped;
only warm-up + wind-down are partial
^ Fig 12: Pipeline depth scaling. For ring topology, m>=5 stages
(4 for tree) are required to fully overlap inter-node and
intra-node communication. Beyond that, throughput is
bound by the slowest stage alone.
6. Performance Model and Cost Equations
The paper provides closed-form cost models (Eqs. 1-2):
Ring (n nodes, pipeline depth m):
d
t_ring = ( alpha + ----- ) * (n + m - 2) + O(d/m)
k*f*m
^^^^^^^
intra-node overhead
shrinks with 1/m
Tree (n nodes, pipeline depth m):
d
t_tree = ( alpha m + --- ) * log n + O(d/m)
k*f
alpha = per-hop network latency
k = #NICs per node
f = NIC bandwidth (GB/s)
d = message length
n = #conceptual ring/tree nodes
m = pipeline depth
Asymptotic throughput upper bounds (Table III):
+-----------+-------------------+------------------------+-----------+
| Collective| Throughput bound | Form | Note |
+-----------+-------------------+------------------------+-----------+
| Broadcast | k*f | independent of p | per NIC |
| Reduce | k*f | independent of p | per NIC |
| Gather | k*f * p/(p-g) | p:GPUs, g:GPUs/node | - |
| Scatter | k*f * p/(p-g) | mirrors Gather | - |
| All-gather| k*f * p/(p-g) | - | - |
| ReduceSc. | k*f * p/(p-g) | - | - |
| All-reduce| k*f * p / 2(p-g) | factor 2 vs scatter | - |
| All-to-all| k*f * p / g(p-g) | g GPUs/node | - |
+-----------+-------------------+------------------------+-----------+
^ Fig 13: HiCCL theoretical upper bounds (Table III). These are
the empirical bounds reached by HiCCL in Fig. 8 of the
paper (the rectangular frames around HiCCL bars).
7. Trade-off Tables: HiCCL vs. Flat NCCL
7a. Architectural trade-offs
| Dimension | Flat NCCL | Hierarchical HiCCL | Winner (DynamICCL) |
|---|---|---|---|
| Algorithm selection | Single algorithm per collective | Per-level algorithm (tree/ring/ring+tree) | HiCCL |
| Vendor portability | Nvidia GPUs only | Nvidia + AMD + Intel via P2P API | HiCCL |
| NIC utilization (multi-NIC) | One NIC per process bound | All k NICs via stripe(s) | HiCCL |
| Memoization of plan | Internal, opaque | Persistent communicator, reusable plan | HiCCL |
| Pipelining across levels | Tied to algorithm | Generalized, depth m configurable | HiCCL |
| Algorithm count | ~5 (Ring/Tree/CollNet/NVLS/PAT) | Compositional infinity over 3 primitives | HiCCL (expressive) |
| Tuning surface | (algo, proto, nCh, nThreads) | (hierarchy[], library[], s, n, m) | HiCCL (richer) |
| Code-generation requirement | None (precompiled) | None (interpreted at init) | Tie |
| Custom-tuned per-vendor perf | Highest peak on Nvidia | 1.05-1.55x competitive vs. NCCL/RCCL/OneCCL | NCCL (peak only) |
| Geomean speedup vs. MPI | N/A (NCCL is not MPI) | 12.52x / 14.22x / 9.76x / 48.02x | HiCCL |
| Configuration space size | ~5 algos x 3 protos x 8 ch x... | hierarchy^L x library^L x s x n x m | HiCCL (much larger) |
For DynamICCL, prefer HiCCL's hierarchical view
because flat NCCL's configuration tuple
(algo, proto, nChannels, numThreads) is exactly the
single-level projection of HiCCL's
(hierarchy, library, stripe, ring, pipeline). To cover the
full optimization space that HiCCL exposes, DynamICCL Agent-2 must
extend its action space to be hierarchical: emit a per-level decision
rather than a single global decision.
7b. Composition vs. specialization trade-offs
| Dimension | Specialized algorithm (NCCL Ring) | Composed algorithm (HiCCL) | Winner (DynamICCL) |
|---|---|---|---|
| Performance ceiling | Hand-tuned for one topology | Approaches theoretical bound | HiCCL |
| Performance floor | Optimized only for known topo. | Bound by slowest level's library | NCCL |
| Adaptability to new hardware | Requires NCCL release | Change hierarchy[] vector | HiCCL |
| Number of design decisions | 1 (which algo) | L levels * (algo, library) each | NCCL (simpler) |
| Reusability of optimizations | Per-collective | Universal (all collectives benefit) | HiCCL |
| RL action space cardinality | ~120 configs (538) | ~10^4-10^6 hierarchical configs | HiCCL (harder RL) |
| Per-level locality of policy | Not exposed | Native -- per-level head possible | HiCCL |
Implication for DynamICCL: The exploding action space when going hierarchical (5b column) is the central tension. A flat 120-action policy is trivial; a hierarchical 10^5-action policy needs structural decomposition -- specifically, a separate per-level policy head that factorizes the joint over levels. This is the critical architectural insight HiCCL forces onto DynamICCL.
7c. Optimization composability
| Optimization | Independent of collective? | Independent of hardware? | Per-level? |
|---|---|---|---|
| Hierarchical tree | Yes (works on any M/R primitive) | No (factor vector is HW-specific) | Yes |
| Multi-NIC striping | Yes | No (depends on k NICs/node) | Inter-node only |
| Ring/Tree/Hybrid | Yes | Mostly (depends on N nodes) | Inter-node only |
| Pipelining | Yes | No (depends on intra/inter ratio) | Cross-level |
| Library binding | Yes | No (depends on installed stack) | Yes |
The pattern: optimizations are orthogonal across collectives but coupled to hardware. This is the architectural reason HiCCL ports cleanly across machines while requiring per-machine tuning of the 5 parameters.
7d. Empirical performance summary (Fig. 8, Table V)
+-----------------+--------+----------+-----------+----------+--------+
| Geomean speedup | Delta |Perlmutter| Frontier | Aurora | Avg |
+-----------------+--------+----------+-----------+----------+--------+
| HiCCL vs. MPI | 12.52x | 14.22x | 9.76x | 48.02x | 17x |
| HiCCL vs. NCCL | 1.26x | 1.05x | 1.55x* | 12.01x** | 1.27x |
+-----------------+--------+----------+-----------+----------+--------+
* vs. RCCL on Frontier (no NCCL on AMD)
** vs. OneCCL on Aurora (no NCCL on Intel)
Per-optimization incremental speedup (geomean, Fig. 8):
flat baseline 1.00x
+ hierarchical opts +4.08x (red -> orange)
+ multi-NIC striping +3.62x to +4.76x on multi-NIC nodes
+1.29x on single-NIC Delta
+ pipelining up to 2.72x extra on Perlmutter
(full saturation of intra-node BW)
8. What to Borrow for DynamICCL
DynamICCL's Agent-2 currently selects
(algo, proto, nChannels, numThreads) as a flat
tuple per collective. HiCCL's architectural argument is that this flat
view is fundamentally the projection of a richer hierarchical
configuration onto a single dimension. Borrowing from HiCCL means
lifting Agent-2 into a hierarchical decision-maker.
8.1 Hierarchical Action Space Decomposition
HiCCL's pattern: A collective is parameterized by a
vector of per-level decisions:
hierarchy = {f1, f2, ..., fL},
library = {Lib1, ..., LibL}. Each level is independently
chosen.
DynamICCL application: Agent-2 should emit a per-level action rather than a single global action. For a 2-level cluster (intra-node, inter-node):
Agent-2 output structure:
level 1 head (intra-node):
action_intra = (algo_intra in {ring, tree, ipc-direct},
proto_intra in {ll, ll128, simple},
nCh_intra in {1, 2, 4, 8})
level 2 head (inter-node):
action_inter = (algo_inter in {ring, tree, ring+tree},
proto_inter in {ll128, simple},
nCh_inter in {1, 2, 4, 8})
meta head:
stripe_factor s in {1, 2, 4, k_NICs}
pipeline_depth m in {1, 4, 8, 16, 32}
This is a factored MDP: instead of a 10^5-cardinality joint action space, each head emits 30-100 actions and the joint is sampled compositionally. The RL training cost scales linearly with L (number of levels) instead of exponentially.
8.2 Three-Primitive IR as a Structural Prior
HiCCL's pattern: All eight standard collectives reduce to a sequence of multicast / reduction / fence primitives. The implementation surface area is 3 (primitives) + 4 (optimizations) = 7 things to get right.
DynamICCL application: Agent-2's reward / state model should treat the collective as a graph of primitives rather than an opaque op. Specifically, for an Allreduce:
- recognize it as
AllGather . ReduceScatter(two primitives + fence); - emit a separate config for each primitive within the same collective, because reduction has different bandwidth/latency characteristics than multicast.
This unlocks decisions like "use LL128 for the AllGather phase but
Simple for the ReduceScatter phase" which the current flat tuner cannot
express. Concretely, the state vector should include
primitive_type in {multicast, reduction, fence} as an input
feature, so the policy can specialize.
8.3 Persistent Memoized Plan = Cached RL Inference
HiCCL's pattern: The communicator memoizes its
optimized plan at init(). Subsequent calls reuse it -- zero
replanning overhead.
DynamICCL application: Agent-2 should be a
plan-emitting agent, not a per-call agent. After
Agent-1 (Trigger) detects a regime change, Agent-2 emits a hierarchical
plan that the runtime caches in the communicator. Until the next
trigger, the runtime serves from cache. This is identical to HiCCL's
persistent communicator pattern, and matches DynamICCL's existing
"stale-but-valid KV cache" design (memory note 2026-03-18, line 95). The
cache key is
(collective_type, hierarchy_signature, msg_size_bin).
8.4 Pipeline
Depth m as a First-Class State Feature
HiCCL's pattern: Pipeline depth m
directly trades latency vs. intra-node hiding. Eq. 1 shows
t_ring ~ d/(k*f*m) -- latency drops with 1/m until
amortized. But excessive m hurts small messages because the per-stage
warm-up dominates (Fig. 9 shows m=128 worse than m=4 for small
buffers).
DynamICCL application: Add
pipeline_depth_m as a learnable action dimension AND
current_msg_size_bin as a state feature, because the
optimal m is monotonically increasing in message size. The state-action
coupling is explicit: for msg_size < 64KiB,
m=1 is optimal; for msg_size > 16MiB,
m=32 saturates. Agent-2 should learn this shape.
Recommended action discretization:
m in {1, 2, 4, 8, 16, 32, 64, 128} (8 choices) -- matches
the HiCCL paper's sweep range exactly.
8.5 Multi-NIC
Striping s -- The Forgotten Knob
HiCCL's pattern: Striping factor s
parallelizes inter-node primitives across all k NICs of a multi-NIC
node. On Perlmutter (4 NICs), striping delivers 3.62x extra throughput.
NCCL ties one process to one NIC and cannot do this without manual
configuration.
DynamICCL application: Add
nic_stripe_factor s to Agent-2's action space. State
features needed: num_nics_per_node (cluster topology
descriptor) and is_inter_node_phase (binary). Reward
shaping: for is_inter_node=true AND msg_size > 1MiB,
give a large positive reward when s == num_nics_per_node,
because under-striping leaves NIC bandwidth on the table.
This is a knob NCCL does not natively expose, so DynamICCL effectively becomes a hybrid HiCCL/NCCL tuner: it can choose to delegate NIC striping to NCCL (single-NIC default) or to override with multi-NIC striping when the hardware supports it.
8.6 Per-Level Library Binding as a Vendor-Aware State Feature
HiCCL's pattern: Library =
{MPI, NCCL, IPC} per level. NCCL is unused at top levels on
Frontier (AMD) and Aurora (Intel) because RCCL/OneCCL have inferior
point-to-point throughput at scale, but IPC dominates intra-node on all
platforms.
DynamICCL application: The state vector needs:
cluster_vendor in {nvidia, amd, intel, mixed}
intra_node_libs in 2^{ipc, nccl, mpi} (bitmask)
inter_node_libs in 2^{nccl, rccl, oneccl, mpi}
The action space then has a library_per_level head. On a
pure-Nvidia DGX cluster, the policy collapses to "always NCCL"; on
Frontier the policy must learn "MPI for inter-node, IPC for intra-node."
The same agent generalizes across cluster vendors -- exactly the HiCCL
portability contract.
8.7 Asymptotic Cost Model as Reward Shaping Oracle
HiCCL's pattern: Equations 1-2 give closed-form predictions. The observed bandwidth in Fig. 8 is within ~5% of these bounds across all four machines.
DynamICCL application: Use the HiCCL cost model as a simulator for Agent-2 pretraining (analogous to Pensieve's chunk-level simulator):
T_pred(d, hierarchy, library, s, n, m) =
alpha_inter * (n + m - 2)
+ d / (k * f_inter * m * s) // inter-node term
+ alpha_intra * log(g) // intra-node tree depth
+ d / (g * f_intra * m) // intra-node term
Fit (alpha_inter, alpha_intra, f_inter, f_intra) from a
one-time profiling sweep on the target cluster. Agent-2 is pre-trained
against this oracle (thousands of episodes/sec), then fine-tuned online
against actual collectives. This collapses sample complexity by 100x --
the same multiplier Pensieve achieved with its chunk-level simulator
(memory note 2026-03-17, line 76).
8.8 Action Masking for Invalid Hierarchical Configurations
HiCCL's pattern: Not every
(hierarchy, library) combination is valid on every machine.
NCCL is unavailable on AMD; IPC requires same-node ranks; ring requires
n >= 2. HiCCL silently rejects invalid configs at
init().
DynamICCL application: Mirror NCCL's action-masking pattern from 0011_Demystifying_NCCL.md (section 5.1). Build a per-cluster action validity matrix:
+-------------+---------+--------+---------+--------+
| Level/Lib | NCCL | RCCL | OneCCL | IPC |
+-------------+---------+--------+---------+--------+
| Intra-node | YES | YES | YES | YES |
| Inter-node | YES(NV) | YES(AMD)| YES(I) | NO |
+-------------+---------+--------+---------+--------+
At policy forward pass, mask invalid actions to -inf
logits before softmax. This is the same trick the NCCL paper
uses for (algo, proto) constraints (Tree only supports
Simple, etc.), now extended to a 2D hierarchy-by-library matrix.
8.9 Two-Stage All-reduce Decomposition as Multi-Step Reward Attribution
HiCCL's pattern: All-reduce is
AllGather . ReduceScatter -- two phases separated by a
fence. Each phase has its own (potentially different) optimization
parameters.
DynamICCL application: When Agent-2 emits a config for an Allreduce, it should emit two configs -- one per phase. The reward signal for the collective then decomposes:
r_total = -(t_RS + t_AG)
r_RS = -t_RS (credit for ReduceScatter phase config)
r_AG = -t_AG (credit for AllGather phase config)
This is per-phase credit assignment, dramatically improving learning efficiency vs. attributing the full collective time to a single monolithic action. The HiCCL primitive graph gives DynamICCL exactly the structure needed to do this attribution correctly.
8.10 "Compose, don't specialize" as the Meta-Architectural Principle
The deepest lesson HiCCL teaches is structural. NCCL ships ~5 specialized algorithms; HiCCL ships 3 primitives + 4 optimizations and recovers all the algorithms compositionally, often better than the specialists. For DynamICCL this implies:
- Don't enumerate algorithms; enumerate primitives + transforms. Agent-2's action space should be over (primitive_type, level, library, stripe, pipeline) -- not over a fixed list of NCCL algorithm names.
- Composition is the policy class with the highest expressivity per parameter. A 5-dimensional hierarchical action vector has higher capacity than a flat 120-action softmax.
- Generalization across hardware comes from compositional structure, not from training on more hardware. HiCCL ports to AMD/Intel by changing parameters, not by retraining. DynamICCL should aim for the same property: train on Nvidia, deploy on AMD by changing topology features.
9. Analogy
HiCCL is to NCCL what LLVM IR is to hand-written
assembly. NCCL ships hand-tuned assembly for each (collective,
hardware) pair -- fast on the hardware it knew about at compile time,
and unable to retarget. HiCCL ships an intermediate representation
(multicast / reduction / fence) plus a small set of mechanical
optimization passes (factorize / stripe / ring-or-tree / pipeline), and
at init() time it lowers the IR to whichever vendor's point-to-point
primitives happen to be installed. The "hierarchy vector" is the target
triple; the "library vector" is the backend choice; the "stripe / ring /
pipeline" parameters are the optimization flags. NCCL's algorithm
enumeration corresponds to LLVM having only -O0,
-O1, ... HiCCL's compositional design is what
-march=native plus per-pass tuning gets you. DynamICCL
Agent-2 in this analogy is the PGO (profile-guided optimizer):
it observes runtime behavior and adjusts the IR-to-machine lowering
decisions at every call.
10. Summary Table
| Pattern | HiCCL origin (paper section / fig) | DynamICCL application |
|---|---|---|
| 3-primitive IR (M, R, fence) | Sec III.A, Fig. 3, Listing 1 | State feature: primitive_type per phase |
| Single-step vs multi-step composition | Sec III.B-C, Table II, Fig. 4 | Decompose Allreduce into ReduceScatter . AllGather phases |
| Hierarchy vector = factorization of p | Sec IV.B, Fig. 5, Listing 2 line 13 | Per-level action head; one per level of factorization |
| Per-level library binding | Sec IV.B, Listing 2 line 14, Table V | library-per-level head; vendor-aware state features |
| Multi-NIC striping factor s | Sec IV.C, Fig. 6 | nic_stripe action; reward = +k*f utilization for inter-node large |
| Hybrid Ring+Tree at inter-node level | Sec IV.D, Fig. 6 | algo head per level; ring for n>=4, tree for small n |
| Pipeline depth m | Sec IV.E, Fig. 7, Eq. 1-2 | pipeline_depth m action; coupled to msg_size_bin state |
| Persistent communicator memoization | Sec V.B | Stale-but-valid KV cache for Agent-2 plans (already in DynamICCL) |
| Cost model t_ring, t_tree | Eq. 1-2, Sec IV-F | Pretraining oracle for Agent-2 (Pensieve-style fast simulator) |
| Theoretical upper bound table | Table III, frames in Fig. 8 | Reward normalization: r = -t / t_theoretical_bound (in [0, 1]) |
| Action masking on invalid (algo, lib) | Implicit (Sec V.A library availability) | Mask invalid (level, library) pairs at policy forward pass |
| 17x geomean speedup vs MPI | Sec VIII (Conclusion) | Establishes ceiling -- DynamICCL targets the residual 1.05-1.55x |
| Scaling tested up to 256 nodes / 2048 GPUs | Sec VI.E, Fig. 10 | Validates that hierarchical policies remain valid at scale |