HiCCL — Architecture and Design Analysis

Paper: HiCCL: A Hierarchical Collective Communication Library Venue: IPDPS 2025 (IEEE International Parallel and Distributed Processing Symposium) Authors: Mert Hidayetoglu (Stanford / Nvidia), Simon Garcia de Gonzalo (Sandia), Elliott Slaughter (SLAC), Pinku Surana (SLAC), Wen-mei Hwu (UIUC / Nvidia), William Gropp (UIUC), Alex Aiken (Stanford) Code: https://github.com/merthidayetoglu/HiCCL Analyst: Vishwakarma Date: 2026-04-28


Table of Contents

  1. System Overview Block Diagram
  2. The IR / Primitive Layer (Multicast, Reduction, Fence)
  3. Collective Composition Grammar (Single-step + Multi-step)
  4. Hierarchical Factorization & Optimization Architecture
  5. Control Flow / Data Flow Diagrams
  6. Performance Model and Cost Equations
  7. Trade-off Tables: HiCCL vs. Flat NCCL
  8. What to Borrow for DynamICCL
  9. Analogy
  10. Summary Table

1. System Overview Block Diagram

+--------------------------------------------------------------------+
|                       HiCCL System Architecture                    |
|                                                                    |
|  +-----------------------------------------------------------+     |
|  |  Application layer (PyTorch / scientific app)             |     |
|  |    declares a collective (Allreduce, Broadcast, ...)      |     |
|  +----------------------------+------------------------------+     |
|                               | C++ API call                       |
|                               v                                    |
|  +-----------------------------------------------------------+     |
|  |  HiCCL Compositional API (Section III)                    |     |
|  |                                                           |     |
|  |   +---------------------+   +-------------------------+   |     |
|  |   | Comm<T>::            |  |  Three IR primitives:   |   |     |
|  |   |  add_multicast(...)  |  |    M(i, j, d)           |   |     |
|  |   |  add_reduction(...)  |  |    R(i, j, d, op)       |   |     |
|  |   |  add_fence()         |  |    fence (sync barrier) |   |     |
|  |   +----------+----------+   +-------------------------+   |     |
|  |              | persistent communicator (memoized plan)    |     |
|  +--------------+--------------------------------------------+     |
|                 v                                                  |
|  +-----------------------------------------------------------+     |
|  |  Hierarchical Optimizer (Section IV) -- run at init()     |     |
|  |                                                           |     |
|  |  Optimization parameters (5):                             |     |
|  |    1) hierarchy = vector<int> (factors of p)              |     |
|  |    2) library   = vector<Library> per level               |     |
|  |    3) stripe(s) -- multi-NIC striping factor              |     |
|  |    4) ring(n)   -- ring nodes count                       |     |
|  |    5) pipeline(m) -- pipeline depth                       |     |
|  |                                                           |     |
|  |  Produces: tree/ring/ring+tree virtual topology +         |     |
|  |           per-level library binding + pipeline schedule   |     |
|  +-----------------------------+-----------------------------+     |
|                                | factorized P2P plan                |
|                                v                                    |
|  +-----------------------------------------------------------+     |
|  |  Backend Execution (Section V) -- start() / wait()        |     |
|  |                                                           |     |
|  |   +----------+ +----------+ +----------+ +-------------+ |     |
|  |   | MPI P2P  | | NCCL P2P | | RCCL P2P | | OneCCL/IPC  | |     |
|  |   | (OMPI /  | | (Send/   | | (HIP)    | | (CUDA IPC,  | |     |
|  |   |  MPICH)  | | Recv)    | |          | |  HIP IPC,   | |     |
|  |   |          | |          | |          | |  L0 IPC)    | |     |
|  |   +----+-----+ +----+-----+ +----+-----+ +------+------+ |     |
|  |        |            |            |              |        |     |
|  |        v            v            v              v        |     |
|  |   Slingshot       NVLink        XGMI         intra-die   |     |
|  |   IB / RoCE       (Nvidia)      (AMD)        (PVC tile)  |     |
|  +-----------------------------------------------------------+     |
+--------------------------------------------------------------------+
^ Fig 1: HiCCL four-layer architecture. The compositional API
        decouples *what* the collective does (IR primitives) from
        *how* it executes (per-level library + topology). Only
        nonblocking point-to-point operations from existing
        libraries are used as the bottom-most building block.

Interpretation. HiCCL inverts the traditional collective library model. NCCL/RCCL/MPI each ship a closed set of monolithic collective implementations specialised to one vendor's hardware. HiCCL instead ships a thin compositional layer that consumes any vendor library purely through its point-to-point primitives, then assembles those into hierarchical multi-step collectives. The architectural consequence is that the same HiCCL collective spec runs on Delta (Nvidia A100), Perlmutter (Nvidia A100, 4 NICs), Frontier (AMD MI250X), and Aurora (Intel PVC) by changing only the per-level library binding and the hierarchy factorization vector.


2. The IR / Primitive Layer (Multicast, Reduction, Fence)

HiCCL's entire collective universe is expressed in three primitives. This is the "narrow waist" of the design -- everything above is high-level collectives, everything below is point-to-point send/recv on some vendor library.

+-------------------------------------------------------------------+
|       HiCCL IR -- three primitives, no others                     |
|                                                                   |
|  (1) MULTICAST   M(i, j, d)                                       |
|      one-to-many, root rank i sends d bytes to leaf vector j      |
|                                                                   |
|         root i                                                    |
|           +---------- d ----------> j[0]                          |
|           |                                                       |
|           +---------- d ----------> j[1]                          |
|           |                                                       |
|           +---------- d ----------> j[2]                          |
|                                                                   |
|      Special case: |j| = 1  =>  point-to-point send               |
|                                                                   |
|  (2) REDUCTION   R(i, j, d, op)                                   |
|      many-to-one, leaf vector i reduces d bytes into root j       |
|      with op in {sum, max, min, logical_or, ...}                  |
|                                                                   |
|         leaf i[0] ---- d ----+                                    |
|                              |                                    |
|         leaf i[1] ---- d ----+--[ op ]----> root j                |
|                              |                                    |
|         leaf i[2] ---- d ----+                                    |
|                                                                   |
|      Special case: |i| = 1  =>  point-to-point send (no op)       |
|                                                                   |
|  (3) FENCE  (no arguments)                                        |
|      Logical synchronization barrier between successive           |
|      multicast/reduction registrations. Expresses the data        |
|      dependency: "primitives registered after this fence may      |
|      read the result of primitives registered before this fence." |
|      It is purely a *dependency declaration*, not a runtime       |
|      synchronization. Pipelined execution (Section IV-E)          |
|      respects fences without adding global barriers.              |
+-------------------------------------------------------------------+
^ Fig 2: The three HiCCL IR primitives. Multicast and reduction
        each form a one-root-many-leaf tree; fence separates
        multi-step collectives into dependent stages.

C++ registration API (Listing 1 in paper):

void Comm<T>::add_multicast(T *sendbuf, T *recvbuf,
                            size_t count,
                            int i,                 // root rank
                            std::vector<int> j_vec); // leaf ranks

void Comm<T>::add_reduction(T *sendbuf, T *recvbuf,
                            size_t count,
                            std::vector<int> i_vec, // leaf ranks
                            int j,                  // root rank
                            HiCCL::op op);          // sum / max / ...

void Comm<T>::add_fence();

All registered primitives execute in parallel by default within a single step. The fence is the only mechanism by which the user (or compiler) introduces sequencing.

Why three primitives suffice. Multicast covers Broadcast/Scatter/Gather/ Allgather/Alltoall; reduction covers Reduce/Reduce-scatter/Allreduce when composed with multicast across a fence; fence enables multi-step expression. The paper proves (Table II) this is sufficient for all eight standard MPI collectives.


3. Collective Composition Grammar

3a. Single-step compositions (Table II, "Single" column)

+-------------------------------------------------------------------+
|  Collective       Composition (parallel, no fence)                |
|  ----------       --------------------------------                |
|  Broadcast        M(i, U, dp)                                     |
|  Reduce           R(U, j, dp, op)                                 |
|  All-gather       sum_j M(j, U, d)                                |
|  Reduce-scatter   sum_j R(U, j, d, op)                            |
|  All-reduce       sum_j R(U, j, dp, op)        // redundant!      |
|  Scatter          sum_j M(i, j, d)                                |
|  Gather           sum_i M(i, j, d)                                |
|  All-to-all       sum_i sum_j M(i, j, d)       // p^2 prims       |
|                                                                   |
|  U = vector of all p ranks; sum = parallel composition operator   |
+-------------------------------------------------------------------+
^ Fig 3: Single-step realizations. All-reduce in one step moves
        d*p^2 bytes -- bandwidth-suboptimal -- which motivates the
        multi-step formulation below.

3b. Multi-step compositions (Table II, "Multiple" column)

The "." operator denotes a fence-separated sequence (right to left):

+-------------------------------------------------------------------+
|  Collective       Composition (with fences)                       |
|  ----------       -----------------------------                   |
|  Broadcast        Gather . Scatter                                |
|  Reduce           Gather . Reduce-scatter                         |
|  All-gather       Broadcast . Gather                              |
|  Reduce-scatter   Scatter . Reduce                                |
|  All-reduce       Broadcast . Reduce             (2-step)         |
|  All-reduce       All-gather . Reduce-scatter    (2-step)         |
+-------------------------------------------------------------------+
^ Fig 4: Multi-step realizations. The All-reduce = AllGather o
        ReduceScatter form moves only 2*d*(p-1) bytes vs. d*p^2 in
        the single-step form, restoring bandwidth optimality.

Worked example: All-reduce on 3 processes (paper Fig. 4):

   Proc 0           Proc 1            Proc 2
  [1, 2, 3]        [4, 5, 6]         [7, 8, 9]
      |                |                  |
      |  Phase 1: Reduce-scatter (3 R primitives in parallel)
      |                |                  |
      |  R0(U, 0, d, +) for slice [0]                          // sum -> P0
      |  R1(U, 1, d, +) for slice [1]                          // sum -> P1
      |  R2(U, 2, d, +) for slice [2]                          // sum -> P2
      v                v                  v
   [12, *, *]      [*, 15, *]         [*, *, 18]
      |                |                  |
      |  ----- fence -----   (data dependency declared)
      |                |                  |
      |  Phase 2: All-gather (3 M primitives in parallel)
      |                |                  |
      |  M0(0, U, d)   :  P0 multicasts its slice to all
      |  M1(1, U, d)   :  P1 multicasts its slice to all
      |  M2(2, U, d)   :  P2 multicasts its slice to all
      v                v                  v
   [12, 15, 18]    [12, 15, 18]      [12, 15, 18]
      |                |                  |
   start() . . . . . . . . . . . . . . wait()
^ Fig 5: All-reduce as ReduceScatter . AllGather. The dashed
        broadcast edges in Fig. 4 of the paper can be omitted for
        in-place implementation (recvbuf reused).

Listing 2 sample HiCCL program (verbatim from paper):

using namespace HiCCL;
Comm<float> comm();
// step 1) register Reduce-scatter using primitives
for (int j = 0; j < numproc; j++)
    comm.add_reduction(sendbuf + j*count, recvbuf + j*count,
                       count, U, j, op::sum);
// step 2) register fence to express data dependency
comm.add_fence();
// step 3) register All-gather using primitives
for (int i = 0; i < numproc; i++)
    comm.add_multicast(recvbuf + i*count, recvbuf + i*count,
                       count, i, others);
// optimization parameters for Aurora
std::vector<int> hierarchy{numproc/12, 6, 2};
std::vector<Library> library{MPI, IPC, IPC};
int stripe(1);
int ring(1);
int pipeline(16);
// initialization
comm.init(hierarchy, library, ring, stripe, pipeline);

comm.start();   // nonblocking
// ... do other things ...
comm.wait();    // block until safe to reuse buffers

4. Hierarchical Factorization & Optimization Architecture

After the compositional graph is registered, HiCCL applies four mechanical optimizations at init() time. None depend on the specific collective: they all operate on the IR primitives.

+-------------------------------------------------------------------+
|      HiCCL Optimization Pipeline (run once at init())             |
|                                                                   |
|   IR graph (multicast/reduction primitives + fences)              |
|       |                                                           |
|       v                                                           |
|   (A) HIERARCHICAL TREE STRUCTURE                                 |
|       Recursively decompose each M/R primitive into sub-trees,    |
|       one per level of the network hierarchy.                     |
|       hierarchy = {f1, f2, ..., fL}  (factors of p)               |
|       Each level binds to a specific point-to-point library.      |
|       |                                                           |
|       v                                                           |
|   (B) MULTI-NIC STRIPING  (s = stripe factor)                     |
|       Split each inter-node primitive into s parallel copies,     |
|       each handled by a different NIC. All NICs participate.      |
|       Implements sum_{striped roots} R(0, j, d/s) etc.            |
|       |                                                           |
|       v                                                           |
|   (C) HYBRID RING + TREE TOPOLOGY  (n = ring nodes)               |
|       At the inter-node level, choose between:                    |
|         - tree (latency-optimal, log(n) hops)                     |
|         - ring (bandwidth-optimal, n-1 hops)                      |
|         - ring+tree (ring across nodes, tree within)              |
|       |                                                           |
|       v                                                           |
|   (D) PIPELINING  (m = pipeline depth)                            |
|       Slice each primitive's payload into m sub-chunks,           |
|       overlap stages along the dependency graph. Hides            |
|       intra-node cost behind inter-node cost (or vice versa).     |
|       Requires m >= number of pipeline stages (4 for tree,        |
|       5 for ring) for full overlap.                               |
|       |                                                           |
|       v                                                           |
|   Final: dependency graph of P2P calls + per-level library        |
|          binding + pipeline schedule, memoized in communicator.   |
+-------------------------------------------------------------------+
^ Fig 6: The five optimization parameters and four optimization
        passes. Parameters 1, 3, 4 (hierarchy, stripe, ring) are
        machine-architecture dependent; parameter 2 (library) is
        software-stack dependent; parameter 5 (pipeline) is
        message-length dependent.

4a. Hierarchy is a virtual topology, not the physical one

HiCCL's hierarchy is a vector of integer factors of p (the total GPU count). For p=24 the paper shows six valid factorizations (Fig. 5):

   {3, 8}     -- 2 levels, 3 outer * 8 inner
   {4, 6}     -- 2 levels, 4 outer * 6 inner
   {3, 2, 4}  -- 3 levels
   {2, 2, 6}  -- 3 levels
   {3, 2, 2, 2} -- 4 levels (deepest hierarchy)
   {2, 2, 2, 3} -- 4 levels

The factorization is a virtual communication hierarchy that need not match the physical one -- though performance is best when they align. This decoupling is what gives HiCCL portability: the same primitive registration runs on any machine by changing the hierarchy vector.

4b. Per-level library binding (mixed-library execution)

+-----------------------------------------------------------------+
|  Hierarchy {2, 2, 3}  -->  Library {MPI, NCCL, IPC}             |
|                                                                 |
|  Level 1 (red)   ----->  MPI    (across nodes)                  |
|  Level 2 (yellow) ---->  NCCL   (across NVLink islands in node) |
|  Level 3 (green) ----->  IPC    (within NVLink island)          |
|                                                                 |
|  Leaf level 4 (blue) -->  GPU endpoints                         |
+-----------------------------------------------------------------+
^ Fig 7: HiCCL uses a different library at each level of the
        hierarchy. The library is chosen for the level where it
        delivers the best point-to-point performance.

Final factorizations the paper used (Table V):

+----------------+--------+-----------+-----------------------+
| System         | Topo   | Hierarchy | Library per level     |
+----------------+--------+-----------+-----------------------+
| Delta          | Tree   | {2,2,4}   | {NCCL, NCCL, IPC}     |
| Perlmutter     | Ring+T | {4,4}     | {NCCL, IPC}           |
| Frontier       | Tree   | {2,2,4,2} | {MPI, MPI, IPC, IPC}  |
| Frontier       | Ring+T | {4,4,2}   | {MPI, IPC, IPC}       |
| Aurora         | Tree   | {2,2,6,2} | {MPI, MPI, IPC, IPC}  |
| Aurora         | Ring+T | {4,6,2}   | {MPI, IPC, IPC}       |
+----------------+--------+-----------+-----------------------+
^ Fig 8: Per-system factorizations. Note that NCCL is unavailable
        on AMD/Intel and is replaced with MPI at top levels;
        IPC dominates intra-node on all systems.

4c. State machine of an optimized collective

                 [register primitives]
                        |
                        v
                +---------------+
                |  COMPOSED     |   (IR graph in memory)
                +-------+-------+
                        | comm.init(hierarchy, library, ring,
                        |           stripe, pipeline)
                        v
                +---------------+
                |  OPTIMIZED    |   (factorized P2P plan
                |               |    memoized)
                +-------+-------+
                        | comm.start()
                        v
                +---------------+
                |  EXECUTING    |   (pipelined P2P calls
                |               |    on GPU streams)
                +-------+-------+
                        | comm.wait()
                        v
                +---------------+
                |  COMPLETE     |   (buffers safe to reuse)
                +---------------+
                        | (next call: reuse memoized plan)
                        v
                  back to EXECUTING
^ Fig 9: HiCCL persistent communicator state machine. The
        OPTIMIZED state is reached once and reused across all
        subsequent calls -- amortizing planning over thousands
        of collective invocations.

5. Control Flow / Data Flow Diagrams

5a. Control flow -- registration and execution

  START
    |
    v
1. User registers primitives:
     comm.add_multicast(...)
     comm.add_reduction(...)
     comm.add_fence()
    |
    v
2. User specifies optimization parameters:
     hierarchy = {f1, ..., fL}
     library   = {Lib1, ..., LibL}
     stripe(s), ring(n), pipeline(m)
    |
    v
3. comm.init(...) [one-time, "memoization"]
    |
    +---> (a) Build hierarchical tree per primitive (Sec IV-B)
    |
    +---> (b) Apply striping s (Sec IV-C)
    |
    +---> (c) Decide ring/tree/ring+tree at inter-node level
    |         (Sec IV-D), driven by parameter n
    |
    +---> (d) Build pipelined dependency graph with depth m
    |         (Sec IV-E)
    |
    +---> Persist final P2P call plan in Comm<T>
    |
    v
4. comm.start()         [nonblocking]
    |
    +---> Issue pipelined P2P calls on GPU streams
    |     using each level's bound library
    |
    v
5. (concurrent host work)
    |
    v
6. comm.wait()          [block on completion]
    |
    v
  DONE
^ Fig 10: HiCCL control flow. Steps 1-3 happen once; steps 4-6
        repeat for each collective invocation, reusing the plan.

5b. Data flow -- striped tree broadcast on 4 nodes x 3 GPUs

  Hierarchy {4, 3}, stripe(3), tree topology.
  Root is GPU 0 on Node N0. Multicast d bytes to all 12 GPUs.

  Stage 0 (intra-node striping at root):
  N0: g0 ====== d/3 ======> g1                                 [orange]
  N0: g0 ====== d/3 ======> g2                                 [orange]

  Stage 1 (inter-node hop, NIC-parallel):
  N0:g0 ==d/3==> N1:g0       (NIC 0 -- forward stripe)         [green]
  N0:g1 ==d/3==> N2:g1       (NIC 1 -- forward stripe)         [green]
  N0:g2 ==d/3==> N3:g2       (NIC 2 -- forward stripe)         [green]

  Stage 2 (inter-node hop, completing tree):
  N1:g0 ==d/3==> N2:g0       (and so on for the tree fanout)   [green]
  N1:g0 ==d/3==> N3:g0
  ...

  Stage 3 (intra-node redistribution at every leaf node):
  Each receiving node's "staging GPU" multicasts d/3 to its
  two siblings via IPC, in parallel across the three stripes.
  N1: g0 ==d/3==> g1,g2  (all in parallel)                     [red]
  N2: g0 ==d/3==> g0,g2
  N3: g0 ==d/3==> g0,g1

  Total bytes per NIC = d/3 (one forward stripe + outgoing fanout).
  All 4 NICs active simultaneously -> ~4x BW vs. single-NIC scheme.
^ Fig 11: 4-stage striped tree broadcast. Each color tracks one
        stripe of d/3 bytes. With pipelining (m=4), stages 0-3
        overlap, hiding intra-node cost behind inter-node cost.

5c. Pipelined execution overlap (paper Fig. 7)

  m=1 (no pipelining):
   |--MPI--|--NCCL--|--IPC--|     stage 0 fully serializes
                    |--MPI--|--NCCL--|--IPC--|   stage 1
                                     |--MPI--|--NCCL--|--IPC--|
   ^^^^^ slowest stage is intra-node (orange) and inter-node (cyan)
                                                          serially

  m=2 (depth-2 pipeline):
   |--MPI--|--NCCL--|--IPC--|
            |--MPI--|--NCCL--|--IPC--|     50% overlap

  m=5 (full pipeline for ring topology):
   |M|N|I|
    |M|N|I|
     |M|N|I|
      |M|N|I|
       |M|N|I|     middle stages fully overlapped;
                   only warm-up + wind-down are partial
^ Fig 12: Pipeline depth scaling. For ring topology, m>=5 stages
        (4 for tree) are required to fully overlap inter-node and
        intra-node communication. Beyond that, throughput is
        bound by the slowest stage alone.

6. Performance Model and Cost Equations

The paper provides closed-form cost models (Eqs. 1-2):

  Ring (n nodes, pipeline depth m):
                          d
    t_ring  = ( alpha + ----- ) * (n + m - 2)  +  O(d/m)
                        k*f*m
                                      ^^^^^^^
                                      intra-node overhead
                                      shrinks with 1/m

  Tree (n nodes, pipeline depth m):
                          d
    t_tree  = ( alpha m + --- ) * log n  +  O(d/m)
                          k*f

  alpha = per-hop network latency
  k     = #NICs per node
  f     = NIC bandwidth (GB/s)
  d     = message length
  n     = #conceptual ring/tree nodes
  m     = pipeline depth

Asymptotic throughput upper bounds (Table III):

+-----------+-------------------+------------------------+-----------+
| Collective| Throughput bound  | Form                   | Note      |
+-----------+-------------------+------------------------+-----------+
| Broadcast | k*f               | independent of p       | per NIC   |
| Reduce    | k*f               | independent of p       | per NIC   |
| Gather    | k*f * p/(p-g)     | p:GPUs, g:GPUs/node   | -         |
| Scatter   | k*f * p/(p-g)     | mirrors Gather         | -         |
| All-gather| k*f * p/(p-g)     | -                     | -         |
| ReduceSc. | k*f * p/(p-g)     | -                     | -         |
| All-reduce| k*f * p / 2(p-g)  | factor 2 vs scatter    | -         |
| All-to-all| k*f * p / g(p-g)  | g GPUs/node            | -         |
+-----------+-------------------+------------------------+-----------+
^ Fig 13: HiCCL theoretical upper bounds (Table III). These are
        the empirical bounds reached by HiCCL in Fig. 8 of the
        paper (the rectangular frames around HiCCL bars).

7. Trade-off Tables: HiCCL vs. Flat NCCL

7a. Architectural trade-offs

Dimension Flat NCCL Hierarchical HiCCL Winner (DynamICCL)
Algorithm selection Single algorithm per collective Per-level algorithm (tree/ring/ring+tree) HiCCL
Vendor portability Nvidia GPUs only Nvidia + AMD + Intel via P2P API HiCCL
NIC utilization (multi-NIC) One NIC per process bound All k NICs via stripe(s) HiCCL
Memoization of plan Internal, opaque Persistent communicator, reusable plan HiCCL
Pipelining across levels Tied to algorithm Generalized, depth m configurable HiCCL
Algorithm count ~5 (Ring/Tree/CollNet/NVLS/PAT) Compositional infinity over 3 primitives HiCCL (expressive)
Tuning surface (algo, proto, nCh, nThreads) (hierarchy[], library[], s, n, m) HiCCL (richer)
Code-generation requirement None (precompiled) None (interpreted at init) Tie
Custom-tuned per-vendor perf Highest peak on Nvidia 1.05-1.55x competitive vs. NCCL/RCCL/OneCCL NCCL (peak only)
Geomean speedup vs. MPI N/A (NCCL is not MPI) 12.52x / 14.22x / 9.76x / 48.02x HiCCL
Configuration space size ~5 algos x 3 protos x 8 ch x... hierarchy^L x library^L x s x n x m HiCCL (much larger)

For DynamICCL, prefer HiCCL's hierarchical view because flat NCCL's configuration tuple (algo, proto, nChannels, numThreads) is exactly the single-level projection of HiCCL's (hierarchy, library, stripe, ring, pipeline). To cover the full optimization space that HiCCL exposes, DynamICCL Agent-2 must extend its action space to be hierarchical: emit a per-level decision rather than a single global decision.

7b. Composition vs. specialization trade-offs

Dimension Specialized algorithm (NCCL Ring) Composed algorithm (HiCCL) Winner (DynamICCL)
Performance ceiling Hand-tuned for one topology Approaches theoretical bound HiCCL
Performance floor Optimized only for known topo. Bound by slowest level's library NCCL
Adaptability to new hardware Requires NCCL release Change hierarchy[] vector HiCCL
Number of design decisions 1 (which algo) L levels * (algo, library) each NCCL (simpler)
Reusability of optimizations Per-collective Universal (all collectives benefit) HiCCL
RL action space cardinality ~120 configs (538) ~10^4-10^6 hierarchical configs HiCCL (harder RL)
Per-level locality of policy Not exposed Native -- per-level head possible HiCCL

Implication for DynamICCL: The exploding action space when going hierarchical (5b column) is the central tension. A flat 120-action policy is trivial; a hierarchical 10^5-action policy needs structural decomposition -- specifically, a separate per-level policy head that factorizes the joint over levels. This is the critical architectural insight HiCCL forces onto DynamICCL.

7c. Optimization composability

Optimization Independent of collective? Independent of hardware? Per-level?
Hierarchical tree Yes (works on any M/R primitive) No (factor vector is HW-specific) Yes
Multi-NIC striping Yes No (depends on k NICs/node) Inter-node only
Ring/Tree/Hybrid Yes Mostly (depends on N nodes) Inter-node only
Pipelining Yes No (depends on intra/inter ratio) Cross-level
Library binding Yes No (depends on installed stack) Yes

The pattern: optimizations are orthogonal across collectives but coupled to hardware. This is the architectural reason HiCCL ports cleanly across machines while requiring per-machine tuning of the 5 parameters.

7d. Empirical performance summary (Fig. 8, Table V)

+-----------------+--------+----------+-----------+----------+--------+
| Geomean speedup | Delta  |Perlmutter|  Frontier | Aurora   | Avg    |
+-----------------+--------+----------+-----------+----------+--------+
| HiCCL vs. MPI   | 12.52x | 14.22x   |  9.76x    | 48.02x   | 17x    |
| HiCCL vs. NCCL  |  1.26x |  1.05x   |  1.55x*   | 12.01x** | 1.27x  |
+-----------------+--------+----------+-----------+----------+--------+
*  vs. RCCL on Frontier (no NCCL on AMD)
** vs. OneCCL on Aurora (no NCCL on Intel)

Per-optimization incremental speedup (geomean, Fig. 8):

   flat baseline           1.00x
  + hierarchical opts     +4.08x  (red -> orange)
  + multi-NIC striping    +3.62x to +4.76x on multi-NIC nodes
                          +1.29x on single-NIC Delta
  + pipelining            up to 2.72x extra on Perlmutter
                          (full saturation of intra-node BW)

8. What to Borrow for DynamICCL

DynamICCL's Agent-2 currently selects (algo, proto, nChannels, numThreads) as a flat tuple per collective. HiCCL's architectural argument is that this flat view is fundamentally the projection of a richer hierarchical configuration onto a single dimension. Borrowing from HiCCL means lifting Agent-2 into a hierarchical decision-maker.

8.1 Hierarchical Action Space Decomposition

HiCCL's pattern: A collective is parameterized by a vector of per-level decisions: hierarchy = {f1, f2, ..., fL}, library = {Lib1, ..., LibL}. Each level is independently chosen.

DynamICCL application: Agent-2 should emit a per-level action rather than a single global action. For a 2-level cluster (intra-node, inter-node):

  Agent-2 output structure:

  level 1 head (intra-node):
     action_intra = (algo_intra in {ring, tree, ipc-direct},
                     proto_intra in {ll, ll128, simple},
                     nCh_intra in {1, 2, 4, 8})

  level 2 head (inter-node):
     action_inter = (algo_inter in {ring, tree, ring+tree},
                     proto_inter in {ll128, simple},
                     nCh_inter in {1, 2, 4, 8})

  meta head:
     stripe_factor s in {1, 2, 4, k_NICs}
     pipeline_depth m in {1, 4, 8, 16, 32}

This is a factored MDP: instead of a 10^5-cardinality joint action space, each head emits 30-100 actions and the joint is sampled compositionally. The RL training cost scales linearly with L (number of levels) instead of exponentially.

8.2 Three-Primitive IR as a Structural Prior

HiCCL's pattern: All eight standard collectives reduce to a sequence of multicast / reduction / fence primitives. The implementation surface area is 3 (primitives) + 4 (optimizations) = 7 things to get right.

DynamICCL application: Agent-2's reward / state model should treat the collective as a graph of primitives rather than an opaque op. Specifically, for an Allreduce:

This unlocks decisions like "use LL128 for the AllGather phase but Simple for the ReduceScatter phase" which the current flat tuner cannot express. Concretely, the state vector should include primitive_type in {multicast, reduction, fence} as an input feature, so the policy can specialize.

8.3 Persistent Memoized Plan = Cached RL Inference

HiCCL's pattern: The communicator memoizes its optimized plan at init(). Subsequent calls reuse it -- zero replanning overhead.

DynamICCL application: Agent-2 should be a plan-emitting agent, not a per-call agent. After Agent-1 (Trigger) detects a regime change, Agent-2 emits a hierarchical plan that the runtime caches in the communicator. Until the next trigger, the runtime serves from cache. This is identical to HiCCL's persistent communicator pattern, and matches DynamICCL's existing "stale-but-valid KV cache" design (memory note 2026-03-18, line 95). The cache key is (collective_type, hierarchy_signature, msg_size_bin).

8.4 Pipeline Depth m as a First-Class State Feature

HiCCL's pattern: Pipeline depth m directly trades latency vs. intra-node hiding. Eq. 1 shows t_ring ~ d/(k*f*m) -- latency drops with 1/m until amortized. But excessive m hurts small messages because the per-stage warm-up dominates (Fig. 9 shows m=128 worse than m=4 for small buffers).

DynamICCL application: Add pipeline_depth_m as a learnable action dimension AND current_msg_size_bin as a state feature, because the optimal m is monotonically increasing in message size. The state-action coupling is explicit: for msg_size < 64KiB, m=1 is optimal; for msg_size > 16MiB, m=32 saturates. Agent-2 should learn this shape.

Recommended action discretization: m in {1, 2, 4, 8, 16, 32, 64, 128} (8 choices) -- matches the HiCCL paper's sweep range exactly.

8.5 Multi-NIC Striping s -- The Forgotten Knob

HiCCL's pattern: Striping factor s parallelizes inter-node primitives across all k NICs of a multi-NIC node. On Perlmutter (4 NICs), striping delivers 3.62x extra throughput. NCCL ties one process to one NIC and cannot do this without manual configuration.

DynamICCL application: Add nic_stripe_factor s to Agent-2's action space. State features needed: num_nics_per_node (cluster topology descriptor) and is_inter_node_phase (binary). Reward shaping: for is_inter_node=true AND msg_size > 1MiB, give a large positive reward when s == num_nics_per_node, because under-striping leaves NIC bandwidth on the table.

This is a knob NCCL does not natively expose, so DynamICCL effectively becomes a hybrid HiCCL/NCCL tuner: it can choose to delegate NIC striping to NCCL (single-NIC default) or to override with multi-NIC striping when the hardware supports it.

8.6 Per-Level Library Binding as a Vendor-Aware State Feature

HiCCL's pattern: Library = {MPI, NCCL, IPC} per level. NCCL is unused at top levels on Frontier (AMD) and Aurora (Intel) because RCCL/OneCCL have inferior point-to-point throughput at scale, but IPC dominates intra-node on all platforms.

DynamICCL application: The state vector needs:

  cluster_vendor       in {nvidia, amd, intel, mixed}
  intra_node_libs      in 2^{ipc, nccl, mpi}      (bitmask)
  inter_node_libs      in 2^{nccl, rccl, oneccl, mpi}

The action space then has a library_per_level head. On a pure-Nvidia DGX cluster, the policy collapses to "always NCCL"; on Frontier the policy must learn "MPI for inter-node, IPC for intra-node." The same agent generalizes across cluster vendors -- exactly the HiCCL portability contract.

8.7 Asymptotic Cost Model as Reward Shaping Oracle

HiCCL's pattern: Equations 1-2 give closed-form predictions. The observed bandwidth in Fig. 8 is within ~5% of these bounds across all four machines.

DynamICCL application: Use the HiCCL cost model as a simulator for Agent-2 pretraining (analogous to Pensieve's chunk-level simulator):

  T_pred(d, hierarchy, library, s, n, m) =
      alpha_inter * (n + m - 2)
    + d / (k * f_inter * m * s)         // inter-node term
    + alpha_intra * log(g)              // intra-node tree depth
    + d / (g * f_intra * m)              // intra-node term

Fit (alpha_inter, alpha_intra, f_inter, f_intra) from a one-time profiling sweep on the target cluster. Agent-2 is pre-trained against this oracle (thousands of episodes/sec), then fine-tuned online against actual collectives. This collapses sample complexity by 100x -- the same multiplier Pensieve achieved with its chunk-level simulator (memory note 2026-03-17, line 76).

8.8 Action Masking for Invalid Hierarchical Configurations

HiCCL's pattern: Not every (hierarchy, library) combination is valid on every machine. NCCL is unavailable on AMD; IPC requires same-node ranks; ring requires n >= 2. HiCCL silently rejects invalid configs at init().

DynamICCL application: Mirror NCCL's action-masking pattern from 0011_Demystifying_NCCL.md (section 5.1). Build a per-cluster action validity matrix:

  +-------------+---------+--------+---------+--------+
  | Level/Lib   | NCCL    | RCCL   | OneCCL  | IPC    |
  +-------------+---------+--------+---------+--------+
  | Intra-node  | YES     | YES    | YES     | YES    |
  | Inter-node  | YES(NV) | YES(AMD)| YES(I) | NO     |
  +-------------+---------+--------+---------+--------+

At policy forward pass, mask invalid actions to -inf logits before softmax. This is the same trick the NCCL paper uses for (algo, proto) constraints (Tree only supports Simple, etc.), now extended to a 2D hierarchy-by-library matrix.

8.9 Two-Stage All-reduce Decomposition as Multi-Step Reward Attribution

HiCCL's pattern: All-reduce is AllGather . ReduceScatter -- two phases separated by a fence. Each phase has its own (potentially different) optimization parameters.

DynamICCL application: When Agent-2 emits a config for an Allreduce, it should emit two configs -- one per phase. The reward signal for the collective then decomposes:

  r_total = -(t_RS + t_AG)
  r_RS    = -t_RS    (credit for ReduceScatter phase config)
  r_AG    = -t_AG    (credit for AllGather phase config)

This is per-phase credit assignment, dramatically improving learning efficiency vs. attributing the full collective time to a single monolithic action. The HiCCL primitive graph gives DynamICCL exactly the structure needed to do this attribution correctly.

8.10 "Compose, don't specialize" as the Meta-Architectural Principle

The deepest lesson HiCCL teaches is structural. NCCL ships ~5 specialized algorithms; HiCCL ships 3 primitives + 4 optimizations and recovers all the algorithms compositionally, often better than the specialists. For DynamICCL this implies:


9. Analogy

HiCCL is to NCCL what LLVM IR is to hand-written assembly. NCCL ships hand-tuned assembly for each (collective, hardware) pair -- fast on the hardware it knew about at compile time, and unable to retarget. HiCCL ships an intermediate representation (multicast / reduction / fence) plus a small set of mechanical optimization passes (factorize / stripe / ring-or-tree / pipeline), and at init() time it lowers the IR to whichever vendor's point-to-point primitives happen to be installed. The "hierarchy vector" is the target triple; the "library vector" is the backend choice; the "stripe / ring / pipeline" parameters are the optimization flags. NCCL's algorithm enumeration corresponds to LLVM having only -O0, -O1, ... HiCCL's compositional design is what -march=native plus per-pass tuning gets you. DynamICCL Agent-2 in this analogy is the PGO (profile-guided optimizer): it observes runtime behavior and adjusts the IR-to-machine lowering decisions at every call.


10. Summary Table

Pattern HiCCL origin (paper section / fig) DynamICCL application
3-primitive IR (M, R, fence) Sec III.A, Fig. 3, Listing 1 State feature: primitive_type per phase
Single-step vs multi-step composition Sec III.B-C, Table II, Fig. 4 Decompose Allreduce into ReduceScatter . AllGather phases
Hierarchy vector = factorization of p Sec IV.B, Fig. 5, Listing 2 line 13 Per-level action head; one per level of factorization
Per-level library binding Sec IV.B, Listing 2 line 14, Table V library-per-level head; vendor-aware state features
Multi-NIC striping factor s Sec IV.C, Fig. 6 nic_stripe action; reward = +k*f utilization for inter-node large
Hybrid Ring+Tree at inter-node level Sec IV.D, Fig. 6 algo head per level; ring for n>=4, tree for small n
Pipeline depth m Sec IV.E, Fig. 7, Eq. 1-2 pipeline_depth m action; coupled to msg_size_bin state
Persistent communicator memoization Sec V.B Stale-but-valid KV cache for Agent-2 plans (already in DynamICCL)
Cost model t_ring, t_tree Eq. 1-2, Sec IV-F Pretraining oracle for Agent-2 (Pensieve-style fast simulator)
Theoretical upper bound table Table III, frames in Fig. 8 Reward normalization: r = -t / t_theoretical_bound (in [0, 1])
Action masking on invalid (algo, lib) Implicit (Sec V.A library availability) Mask invalid (level, library) pairs at policy forward pass
17x geomean speedup vs MPI Sec VIII (Conclusion) Establishes ceiling -- DynamICCL targets the residual 1.05-1.55x
Scaling tested up to 256 nodes / 2048 GPUs Sec VI.E, Fig. 10 Validates that hierarchical policies remain valid at scale