R2CCL — Architecture and Design Analysis

Paper:   Reliable and Resilient Collective Communication Library for LLM Training and Serving
Authors: Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu (University of Maryland, College Park)
Venue:   arXiv:2512.25059v1 [cs.DC], 31 Dec 2025
Code:    https://github.com/r2cc-project/R-2CCL
Analyst: Vishwakarma
Date:    2026-04-28

R2 = Reliable and Resilient (per title, abstract, and §3 overview). The paper positions R2CCL as a fault-tolerant drop-in replacement for NCCL/RCCL that exploits multi-NIC redundancy already present in modern GPU servers to avoid full job restart on NIC/cable/QP failure.


Table of Contents

  1. System Overview Block Diagram
  2. Component Architectures (Bilateral Failure Awareness, Live Migration, R2CCL-Balance, R2CCL-AllReduce, Multi-Failure Recursive Decomposition)
  3. Annotated Flow — Failure Detection -> Hot Repair -> Online Re-optimization
  4. Trade-off Analysis
  5. What to Borrow for DynamICCL
  6. Summary Table

1. System Overview Block Diagram

+-------------------------------------------------------------------+
|                    R2CCL System Architecture                      |
|        (drop-in extension of NCCL 2.23.4, ~3K LoC C++)            |
|                                                                   |
|  +---------------------------------------------------------+      |
|  |          Application Layer (PyTorch / vLLM)             |      |
|  |   ncclAllReduce / ncclSend / ncclRecv  (unmodified)     |      |
|  +-------------------------+-------------------------------+      |
|                            | collective invocation                |
|                            v                                      |
|  +---------------------------------------------------------+      |
|  |        R2CCL Planner (host-side, plugin layer)          |      |
|  |                                                         |      |
|  |  +---------------------+   +-------------------------+  |      |
|  |  | Health-aware        |   | Strategy selector       |  |      |
|  |  | algo & schedule     |==>| (alpha-beta cost model) |  |      |
|  |  | dispatch            |   | Standard | Balance |    |  |      |
|  |  | (NCCL enqueue hook) |   | AllReduce | Recursive   |  |      |
|  |  +---------+-----------+   +-----+-------------------+  |      |
|  |            |                     |                      |      |
|  +------------|---------------------|----------------------+      |
|               |                     |                             |
|               v                     v                             |
|  +---------------------------------------------------------+      |
|  |     R2CCL Net Plugin (ncclNet hook + proxy thread)      |      |
|  |                                                         |      |
|  |  +-----------------+   +-------------------+   +------+ |      |
|  |  | Bilateral OOB   |   | Probe QP pool     |   | Per- | |      |
|  |  | notifier (MPI / |<->| (zero-byte RDMA   |   | chan | |      |
|  |  |  TCP, non-data  |   |  Write triangul.) |   | fail-| |      |
|  |  |  NIC)           |   +-------------------+   | over | |      |
|  |  +-----------------+                           | list | |      |
|  |                                                | (PCIe| |      |
|  |  +----------------------+   +---------------+  |  ord)| |      |
|  |  | Multi-NIC GPU buffer |   | DMA-buffer    |  +------+ |      |
|  |  | registration (every  |   | rollback (last|           |      |
|  |  | GPU buf <-> all NICs)|   | acked chunk)  |           |      |
|  |  +----------+-----------+   +-------+-------+           |      |
|  +-------------|-----------------------|-------------------+      |
|                |                       |                          |
|                v                       v                          |
|  +---------------------------------------------------------+      |
|  |              NCCL Core (unmodified, intercepted)        |      |
|  |   Ring/Tree algorithms + channel partitioning           |      |
|  +-------------------------+-------------------------------+      |
|                            |                                      |
|                            v                                      |
|  +---------------------------------------------------------+      |
|  |         Multi-NIC Hardware (8 x CX-7 400Gbps + NVLink)  |      |
|  |                                                         |      |
|  |  +------+  +------+      +------+  +------+             |      |
|  |  | NIC0 |  | NIC1 | .... | NICk |  | NIC7 |             |      |
|  |  +--+---+  +--+---+      +--+---+  +--+---+             |      |
|  |     |         |             |         |                 |      |
|  |  PCIe + NVLink fabric (PXN proxy fwd available)         |      |
|  +---------------------------------------------------------+      |
+-------------------------------------------------------------------+
^ Fig 1: R2CCL inserts a Planner + Net-Plugin pair between
  PyTorch/vLLM and NCCL Core. The plugin owns three new
  control-plane primitives: bilateral OOB notification, probe-QP
  triangulation, and pre-registered backup connections. The planner
  owns failure-aware schedule synthesis (Balance / AllReduce /
  Recursive).

The architectural commitment is that R2CCL does not modify the NCCL collective kernel. It intercepts at two stable boundaries: the ncclNet plugin (transport) and NCCL's enqueue/planning layer (host). This is the same insertion-point philosophy as DynamICCL — the tuner is a plugin, NCCL core stays untouched. Consequence: R2CCL inherits all of NCCL's per-channel ring/tree machinery and re-uses the channel abstraction to express its multi-phase schedules.


2. Component Architectures

2.1 Bilateral Failure Awareness + Triangulation Localizer

  Steady state                                Failure event
  ============                                =============

  Server A                  Server B            Server A             Server B
  +--------+                +--------+          +--------+           +--------+
  | NCCL   |  RDMA data     | NCCL   |          | NCCL   | X--data-X | NCCL   |
  | proxy  | <===========>  | proxy  |          | proxy  |  (CQ err) | proxy  |
  +---+----+                +----+---+          +---+----+           +---+----+
      |                          |                  | (1) detect       |
      |  OOB bootstrap ch.       |                  |     CQ/QP error  |
      |  (MPI over non-data NIC) |                  |                  |
      +========================= +                  +======OOB notify==>
            (always idle)                                  "I see error"

                                                  +---------------------------+
                                                  | (2) Both sides probe via  |
                                                  |     dedicated probe-QP    |
                                                  |     pool (zero-byte RDMA  |
                                                  |     Write to peer + aux)  |
                                                  +-------------+-------------+
                                                                |
                                                                v
                                                  +---------------------------+
                                                  | Three-point triangulation |
                                                  |  - A->B fails, A->aux ok  |
                                                  |    => B-side NIC dead     |
                                                  |  - B->A fails, B->aux ok  |
                                                  |    => A-side NIC dead     |
                                                  |  - both fail              |
                                                  |    => link / cable broken |
                                                  +---------------------------+
                                                                |
                                                                v
                                                  +---------------------------+
                                                  | (3) Broadcast verdict to  |
                                                  |     all ranks via OOB     |
                                                  +---------------------------+
^ Fig 2: Bilateral OOB awareness + 3-point triangulation. RDMA's
  asymmetric error visibility (only one side sees CQ error) is fixed
  by an OOB notification on a non-datapath NIC. The probe-QP pool is
  isolated from data QPs so probe traffic never queues behind stalled
  bulk transfers. Auxiliary NIC distinguishes single-endpoint NIC
  failure from link failure.

The OOB channel is the load-bearing primitive. Without it, R2CCL would fall back to NCCL's existing timeout-based detection (minutes). With it, detection is millisecond-scale because the asymmetric-visibility problem (only one side gets a CQE error) is resolved by an explicit peer notification on a separate NIC.
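The triangulation table in Fig 2 can be read as a small decision function. The sketch below is one possible reading, not the paper's code: it assumes each side reports two boolean probe outcomes (peer and auxiliary witness) and treats an endpoint's NIC as alive whenever its auxiliary probe succeeds. The names `Verdict` and `triangulate` are illustrative.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    NIC_A = "A-side NIC dead"
    NIC_B = "B-side NIC dead"
    LINK = "link / cable broken"
    UNKNOWN = "inconclusive"


def triangulate(a_to_b: bool, a_to_aux: bool,
                b_to_a: bool, b_to_aux: bool) -> Verdict:
    """Classify a fault from four zero-byte probe outcomes (True = probe succeeded)."""
    if a_to_b and b_to_a:
        return Verdict.HEALTHY   # data path answers in both directions
    if a_to_aux and not b_to_aux:
        return Verdict.NIC_B     # A reaches the witness, B does not
    if b_to_aux and not a_to_aux:
        return Verdict.NIC_A     # B reaches the witness, A does not
    if a_to_aux and b_to_aux:
        return Verdict.LINK      # both NICs work, only the A<->B path is down
    return Verdict.UNKNOWN       # neither side reaches the witness (assumed case)
```

The `UNKNOWN` branch is an addition for completeness; Fig 2 does not say what happens when neither side can reach the auxiliary NIC.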

2.2 Live Migration (Hot Repair)

  +------------------------------------------------------------------+
  |              Live Migration Data Path                            |
  |                                                                  |
  |  Sender state (rolled back to last completion)                   |
  |    +--+--+--+--+--+--+--+--+                                     |
  |    |C0|C1|C2|C3|C4|C5|XX|XX|   <- C0..C4 polled completion       |
  |    +--+--+--+--+--+--+--+--+      C5 in flight, C6+ not started  |
  |     ^                                                            |
  |     |  rewind to first chunk without CQE  (= C5)                 |
  |                                                                  |
  |  Pre-registered backup NIC list (PCIe-distance ordered):         |
  |    [NIC_primary, NIC_b1, NIC_b2, ... NIC_bk]                     |
  |                                                                  |
  |   Step A. failed NIC = primary -> select NIC_b1                  |
  |   Step B. all GPU buffers were *multi-registered* at init        |
  |           => no on-demand registration cost                      |
  |   Step C. proxy reissues C5..C7 over NIC_b1                      |
  |   Step D. if NIC_b1 fails later -> next in chain (NIC_b2)        |
  |                                                                  |
  |  Receiver state (already-written chunks are safe):               |
  |    +--+--+--+--+--+--+--+--+                                     |
  |    |C0|C1|C2|C3|C4|??|  |  |   GPU kernels read AFTER completion,|
  |    +--+--+--+--+--+--+--+--+   so partial writes harmless.       |
  |                                Reset RX to last confirmed = C4.  |
  +------------------------------------------------------------------+
^ Fig 3: Live migration via multi-NIC pre-registration + DMA rollback.
  The two technical enablers are (i) registering each GPU buffer with
  ALL NICs at communicator init (eager, off the recovery path), and
  (ii) rolling back to the last completed chunk because NCCL never
  consumes recv buffers before completion is polled.

This part of the paper introduces what amount to two new control-plane primitives over NCCL: multi_register(buf, [NIC0..NICk]) and rollback_to_last_ack(channel). Neither is available in stock NCCL, which ties each buffer to one NIC and treats mid-collective faults as fatal.
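A minimal sketch of the rollback-plus-failover bookkeeping from Fig 3, written as plain Python rather than NCCL proxy code. `ChannelState` and its fields are hypothetical names; the only behavior taken from the figure is "rewind to the first chunk without a polled completion, then reissue on the next NIC in the chain".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChannelState:
    num_chunks: int
    failover_chain: List[str]   # PCIe-distance ordered, primary NIC first
    posted: int = 0             # chunks handed to the active NIC
    completed: int = 0          # chunks whose completion has been polled
    nic_index: int = 0          # position of the active NIC in the chain

    @property
    def active_nic(self) -> str:
        return self.failover_chain[self.nic_index]

    def rollback_and_failover(self) -> range:
        """Switch to the next backup NIC and return the chunk ids to reissue."""
        if self.nic_index + 1 >= len(self.failover_chain):
            raise RuntimeError("failover chain exhausted")
        self.nic_index += 1
        self.posted = self.completed   # rewind to first chunk without a completion
        return range(self.completed, self.num_chunks)


# Fig 3 scenario: C0..C4 completed, C5 in flight when the primary NIC fails.
ch = ChannelState(num_chunks=8,
                  failover_chain=["NIC_primary", "NIC_b1", "NIC_b2"],
                  posted=6, completed=5)
to_reissue = list(ch.rollback_and_failover())
```

In this scenario `to_reissue` covers C5..C7 and the active NIC becomes `NIC_b1`, matching Step C of the figure.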

2.3 R2CCL-Balance — Topology-Aware Reroute

  Before failure                 R2CCL-HotRepair (naive)         R2CCL-Balance
  =================              ========================        ==================

  GPUg                            GPUg                            GPUg
   | NIC0 (failed)                  | (skipped)                     | (skipped)
   | NIC1 ===>                      | NIC1 =====2x====>             | NIC1 =1.33x=>
   | NIC2 ===>      D bytes/NIC     | NIC2 =====1x====>             | NIC2 =1.33x=>
   | NIC3 ===>                      | NIC3 =====1x====>             | NIC3 =1.33x=>
                                  bottleneck on NIC1            evenly redistributed
                                  ~46% throughput loss           proportional to BW

  Rerouting decision (per-flow, topology-aware):
   if backup_NIC same NUMA AND PCIe headroom > flow_demand:
       -> direct PCIe forwarding from GPUg
   elif PCIe+QPI cost < NVLink-PXN cost:
       -> PCIe + CPU-interconnect path
   else:
       -> PXN: NVLink to proxy GPU co-located with target NIC
^ Fig 4: R2CCL-Balance treats remaining NICs as a shared pool and
  redistributes the failed flow's share proportionally to live-NIC
  bandwidth, picking the lower-cost PCIe-vs-PXN path per flow.
  Applies to ALL collectives except AllReduce (which gets §2.4).
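The proportional redistribution and the per-flow path rule in Fig 4 can be sketched as two small functions. This is an illustration under simplifying assumptions: bandwidths and demand are in arbitrary consistent units, the cost inputs are already computed by some other component, and the function and label names are hypothetical.

```python
from typing import Dict, Set


def balance_shares(nic_bw: Dict[str, float], failed: Set[str],
                   total_demand: float) -> Dict[str, float]:
    """Split total_demand across healthy NICs in proportion to their bandwidth."""
    live = {nic: bw for nic, bw in nic_bw.items() if nic not in failed and bw > 0}
    if not live:
        raise ValueError("no healthy NIC left")
    pool = sum(live.values())
    return {nic: total_demand * bw / pool for nic, bw in live.items()}


def choose_path(same_numa: bool, pcie_headroom: float, flow_demand: float,
                pcie_qpi_cost: float, pxn_cost: float) -> str:
    """Per-flow path choice, following the three-branch rule in Fig 4."""
    if same_numa and pcie_headroom > flow_demand:
        return "pcie_direct"
    if pcie_qpi_cost < pxn_cost:
        return "pcie_cpu_interconnect"
    return "pxn"


# Four equal NICs, each originally carrying 1 unit; NIC0 fails.
shares = balance_shares({"NIC0": 1.0, "NIC1": 1.0, "NIC2": 1.0, "NIC3": 1.0},
                        failed={"NIC0"}, total_demand=4.0)
```

With three equal survivors each NIC ends up with 4/3 of its original load, which is the 1.33x figure shown in the Balance column.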

2.4 R2CCL-AllReduce — Bandwidth-Asymmetric Schedule Synthesis

  Standard Ring AllReduce on 4 servers, NIC failure on Server 4:
  ==============================================================
                                           total per-server xfer
   S1 ==> S2 ==> S3 ==> S4 (slow)             ~2D
                          \____back to S1     (S4 = bottleneck, all wait)


  R2CCL-AllReduce decomposition:
  ==============================
   Phase 1 (concurrent):
   +-------------------------------------------------------------+
   | Global AllReduce  (S1..S4, throttled by S4's reduced BW)     |
   |     [S1] -> [S2] -> [S3] -> [S4] -> [S1]   carries (1-Y)*D   |
   +-------------------------------------------------------------+
   +-------------------------------------------------------------+
   | Partial AllReduce (S1..S3 only, full speed)                  |
   |     [S1] -> [S2] -> [S3] -> [S1]           carries Y*D       |
   +-------------------------------------------------------------+

   Phase 2 (pipelined custom broadcast):
   +-------------------------------------------------------------+
   | [S4] ----> [S1] -> [S2] -> [S3] -> [S4]   delivers partial-  |
   |     init                              fwd  AllReduce result  |
   |                                            back to S4        |
   +-------------------------------------------------------------+

   Closed-form completion time (D=B=1, X = lost BW fraction on S4,
   Y = fraction of D sent through the partial ring):
       T1(Y) = a*(1-Y)/(1-X)     a = 2(ng-1)/(ng)           (global ring)
       T2(Y) = b*Y/X             b = 2((n-1)g-1)/((n-1)g)   (partial ring)
       T3(Y) = Y/X                                          (broadcast to S4)
       T(Y)  = max(T1, T2) + T3

   Crossover threshold (Appendix A):
       X <  ng/(3ng-2)   ->  use STANDARD Ring AllReduce
       X >= ng/(3ng-2)   ->  use R2CCL-AllReduce
       Practical rule:    X < 1/3 -> standard;  X >= 1/3 -> R2CCL-AR
^ Fig 5: R2CCL-AllReduce splits AllReduce into a global stage (paid
  at the slow rate) plus a partial stage (full-speed on healthy nodes)
  plus a custom broadcast back to the degraded node. The optimal data
  partition Y* and the strategy crossover X* are CLOSED FORM, evaluated
  at runtime via NCCL's alpha-beta model.

The key engineering insight is that the AllReduce schedule is parameterized by a single scalar X = lost-bandwidth fraction on the slowest node. The optimal partition Y* and the crossover threshold to fall back to standard Ring are both algebraic. This is a model-based runtime decision (alpha-beta cost) — not an RL decision. R2CCL deliberately stays in the closed-form regime here.
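The closed-form decision can be checked numerically. The sketch below evaluates the T1/T2/T3 expressions from Fig 5 as written, reading n as the number of servers and g as GPUs per server (an assumption based on the ring factor 2(ng-1)/(ng)), and reading Y as the fraction routed through the partial ring, since T(0) reduces to the standard ring time a/(1-X). The partition that equalizes T1 and T2 is derived here from those expressions; function names are illustrative.

```python
def ring_factor(ranks: int) -> float:
    """Ring AllReduce volume factor 2(p-1)/p for p ranks."""
    return 2.0 * (ranks - 1) / ranks


def completion_time(y: float, x: float, n: int, g: int) -> float:
    """T(Y) = max(T1, T2) + T3 with D = B = 1; requires 0 < x < 1 and n >= 2."""
    a, b = ring_factor(n * g), ring_factor((n - 1) * g)
    t1 = a * (1.0 - y) / (1.0 - x)   # global ring at the degraded node's rate
    t2 = b * y / x                   # partial ring on bandwidth the slow node lacks
    t3 = y / x                       # broadcast of the partial result
    return max(t1, t2) + t3


def optimal_partition(x: float, n: int, g: int) -> float:
    """Y that makes T1 equal T2."""
    a, b = ring_factor(n * g), ring_factor((n - 1) * g)
    return a * x / (a * x + b * (1.0 - x))


def crossover(n: int, g: int) -> float:
    """X* = ng / (3ng - 2)."""
    return n * g / (3.0 * n * g - 2.0)


def choose_allreduce(x: float, n: int, g: int) -> str:
    return "standard_ring" if x < crossover(n, g) else "r2ccl_allreduce"


# 4 servers x 8 GPUs, half of the slow server's bandwidth lost.
y_star = optimal_partition(0.5, 4, 8)
```

For n = 4, g = 8 the threshold is 32/94, slightly above the 1/3 rule of thumb; at X = 0.5 the split schedule beats the standard ring, and at X = 0.2 it does not.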

2.5 Multi-Failure: Topology-Aware Re-ranking + Recursive Decomposition

  Before re-ranking (rail topology, S1 lost rail r,  S2 lost rail r'):
                                                                       
   S1     S2     S3              shared rails between S1 and S2
   [X]    .      .                = rails \ {r, r'}
   .      [X]    .                = NARROW intersection
   .      .      .                local load balancing fails

  Bridge-Based Re-ranking (Algorithm 1, Appendix D):
   - identify pairs (u,v) with |S_u ∩ S_v| < B_global
   - find a "bridge" node w with broad rail connectivity
   - relocate w to sit between u and v in the logical ring

  After re-ranking:
                                                                       
   S1     S3     S2     ...      S3 has wide rail set,
   [X]    OK     [X]              bridges S1 <-> S2 path

  Recursive AllReduce decomposition (multi-bottleneck):
   1. Build global ring at slowest node's rate (handles ALL nodes)
   2. Peel off slowest node, build sub-ring on remaining (n-1)
   3. Peel off next-slowest, build sub-sub-ring on (n-2)
   4. ...recurse until residual bandwidth variance is acceptable
   5. Reduction phases of all rings run IN PARALLEL
   6. Broadcast phases impose dependency: slower rings wait for
      partial results from faster sub-rings (fast nodes finish their
      own broadcast concurrently with slow nodes' work)
^ Fig 6: Multi-failure handling. Re-ranking is a logical (not
  physical) graph rewrite that inserts a bridge node where rail-
  intersection bandwidth is too narrow. Recursive decomposition
  generalizes the dual-ring (global + partial) idea to a
  multi-tier ring spectrum, one tier per bandwidth class.
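The peel-off recursion can be sketched as a loop that emits one ring per bandwidth class. This is a simplified illustration, not Algorithm 1 or the paper's scheduler: it assumes each tier's ring runs at the bandwidth increment above the previous tier (generalizing the two-ring split of §2.4), and it groups nodes whose bandwidths differ by less than `tol` into the same class. The function name and tolerance parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def bandwidth_tiers(node_bw: Dict[str, float],
                    tol: float = 1e-9) -> List[Tuple[List[str], float]]:
    """Return (ring members, ring rate) pairs, slowest and widest ring first."""
    remaining = dict(node_bw)
    used = 0.0                      # bandwidth already claimed by wider rings
    tiers: List[Tuple[List[str], float]] = []
    while len(remaining) >= 2:      # a ring needs at least two nodes
        floor = min(remaining.values())
        rate = floor - used
        if rate > tol:
            tiers.append((sorted(remaining), rate))
        used = floor
        # Peel off every node that has no bandwidth left above this tier.
        remaining = {n: bw for n, bw in remaining.items() if bw - floor > tol}
    return tiers


tiers = bandwidth_tiers({"S1": 1.0, "S2": 1.0, "S3": 0.75, "S4": 0.5})
```

For this input the sketch yields three rings: all four servers at rate 0.5, then S1..S3 at 0.25, then S1 and S2 at 0.25.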

3. Annotated Flow — Detect -> Repair -> Re-optimize

  START: collective in flight on channel c, NIC n is the data path
    |
    v
  (1) NCCL proxy poll loop sees CQE error / QP error / WQE flush on n
    |
    v
  (2) R2CCL net plugin INTERCEPTS the error
      (NCCL would otherwise crash the process here)
    |
    v
  (3) Bilateral OOB notify peer over MPI/TCP on non-datapath NIC
    |
    v
  (4) Probe phase: zero-byte RDMA Write
        -- to peer NIC
        -- to auxiliary NIC (third-party witness)
    |
    +-- 3-point triangulation ----> verdict in {local NIC, peer NIC, link}
    |
    v
  (5) Broadcast verdict to all ranks via OOB
    |
    v
  (6) HOT REPAIR (transport layer):
       a. proxy purges outstanding WQEs on channel c
       b. rollback sender to first chunk without CQE
       c. reset receiver to last confirmed chunk
       d. select next NIC in pre-registered failover chain
          (PCIe-distance ordered: same-NUMA preferred)
       e. retransmit residual chunks over backup NIC
    |
    v
  (7) ONLINE RE-OPTIMIZATION (planner layer):
       evaluate alpha-beta cost model with new (B_remaining, X)
       |
       +-- coll != AllReduce ------> R2CCL-Balance schedule
       |                              (rebalance DMA across healthy NICs)
       |
       +-- coll == AllReduce
       |     |
       |     +-- X < 1/3 ----------> standard Ring (live)
       |     +-- X >= 1/3 ---------> R2CCL-AllReduce (split D-Y, Y)
       |
       +-- multi-failure ---------> bridge-based re-rank +
                                     recursive AllReduce decomposition
    |
    v
  (8) Periodic re-probe healed NICs (NIC reset, cable re-insert)
       -- adapt probe frequency based on observed failure/recovery rate
       -- on recovery, reverse migration: restore primary NIC if faster
    |
    v
  CONTINUE collective traffic on new schedule.
^ Fig 7: End-to-end control flow from in-flight failure to seamless
  resumption. Steps 1-6 = transport-layer hot repair (Section 4 of
  paper). Step 7 = host-side online re-optimization (Sections 5-6).
  Step 8 = adaptive re-probing (Section 4.2).

The flow exposes a clean two-tier control hierarchy: the proxy thread owns the microsecond-to-millisecond decisions (detect, probe, rollback, NIC swap); the planner owns the collective-schedule decisions (Balance vs. AllReduce vs. Recursive). DynamICCL's two-agent architecture (Trigger + Config) maps onto this same split.


4. Trade-off Analysis

4.1 Recovery Strategy

  Dimension            | Checkpoint Restart | AdapCC (between-coll) | DejaVu (KV replicate)   | R2CCL (in-flight)    | Winner (DynamICCL)
  ---------------------+--------------------+-----------------------+-------------------------+----------------------+-------------------
  Recovery latency     | 17-68 min          | between collectives   | seconds (state restore) | milliseconds         | R2CCL
  Mid-collective fault | aborts job         | aborts collective     | restarts request        | survives in-flight   | R2CCL
  Application changes  | none               | none                  | KV replication required | none (drop-in)       | R2CCL
  Memory overhead      | checkpoint storage | low                   | KV replicas             | mapping entries only | R2CCL
  Training overhead    | 12.7% job duration | 8.65%                 | n/a                     | <1%                  | R2CCL
  Inference overhead   | 35s per restart    | n/a                   | 14-33% (replication)    | <3% (0.71-1.58%)     | R2CCL
  Failure scope        | any                | non-mid-collective    | NIC failure             | NIC/QP/link only     | R2CCL

For DynamICCL, prefer R2CCL because in-flight survival is the correct invariant for an RL agent that must continue gathering reward signal across episodes. Restart-based fault tolerance interrupts the trajectory in a way that breaks credit assignment.

4.2 Failure Detection Mechanism

  Dimension                | NCCL default (timeout)  | R2CCL bilateral OOB + probe-triangulation
  -------------------------+-------------------------+-------------------------------------------
  Detection latency        | minutes                 | milliseconds
  Asymmetric visibility    | broken (one side hangs) | explicit peer notification
  Localization granularity | none                    | local-NIC vs peer-NIC vs link distinguishable
  Probe-channel isolation  | n/a                     | dedicated probe QP pool (no HOL blocking)
  Adaptivity to recovery   | none                    | adaptive re-probe frequency

The probe-QP-pool isolation is the same architectural pattern as NCCL's separate forward QP (bulk) + reverse QP (CTS) split documented in 0011_Demystifying_NCCL.md — small control messages must never share queues with bulk traffic, or HOL blocking destroys their latency advantage.

4.3 Schedule Synthesis Strategy

  Dimension              | Solver (TACCL/TE-CCL) | Closed-form (R2CCL alpha-beta) | Online RL (DynamICCL)
  -----------------------+-----------------------+--------------------------------+--------------------------------
  Decision latency       | seconds (Gurobi)      | microseconds                   | tens of microseconds (NN)
  Adapts to failure      | offline only          | yes (recompute X, Y*)          | yes (LSTM regime change)
  Generality             | high                  | only Ring AllReduce + Balance  | learned across action space
  Optimality guarantee   | optimal               | provably optimal in 2-class    | learned, no guarantee
  Cluster generalization | re-solve per topology | analytic in (n, g, X)          | requires retraining or transfer

Winner depends on regime. R2CCL chooses closed-form for the fault-recovery hot path because (a) the 2-class bandwidth model (slow node + healthy nodes) is small enough to admit an analytic solution, and (b) the recovery decision must be made in microseconds on the proxy thread. DynamICCL operates at a different timescale (per-collective config selection) where RL is appropriate.

4.4 Multi-NIC Buffer Registration

  Dimension              | Lazy registration on failover | R2CCL eager multi-registration
  -----------------------+-------------------------------+-------------------------------
  Init-time cost         | low                           | tens of ms per NIC (one-time)
  Recovery-time cost     | tens of ms per buffer (FATAL) | zero (already mapped)
  Steady-state HBM       | low                           | ~same (mapping entries only)
  Composability with PXN | hard                          | same buffer accessible by all

Eager multi-registration is the mirror image of NCCLX's lazy allocation flags. NCCLX defers allocation to save HBM, on the assumption that most code paths are never exercised. R2CCL registers eagerly because the failure path must be available with zero added latency. The two designs are complementary: lazy for cold paths, eager for hot recovery paths.


5. What to Borrow for DynamICCL

5.1 New State Features for Agent-2 (NIC-Health Plane)

R2CCL exposes a structured failure-state space that DynamICCL has not previously encoded. Add to Agent-2's state vector:

  s_health = {
    nic_health_vector:   one bool per NIC on this node      (e.g., 8 dims)
    bandwidth_loss_X:    scalar in [0, 1] (NIC-failure fraction on slowest node)
    num_concurrent_fail: integer count of currently-failed NICs cluster-wide
    rail_intersection:   |S_u INTERSECT S_v| / |full rail set|    (for adjacent ranks)
    recent_failover_age: time since last hot-repair event         (seconds)
    recent_failure_rate: EMA of failure events / hour
  }

The bandwidth-loss fraction X is the single most important new feature because R2CCL's closed-form crossover X* = ng/(3ng-2) tells DynamICCL exactly when to switch AllReduce strategy. Agent-2 should not try to learn this from scratch — it should consume X as a state feature and use the closed-form as a warm-start prior.
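One way to carry s_health into Agent-2 is a flat feature vector. The sketch below is illustrative only: the class and method names are hypothetical, and `loss_fraction` assumes all NICs on a node have equal bandwidth, so that one failed NIC out of eight gives X = 0.125.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthState:
    nic_health: List[bool]            # one flag per NIC on this node
    bandwidth_loss_x: float           # X on the slowest node, in [0, 1]
    num_concurrent_fail: int          # failed NICs cluster-wide
    rail_intersection: float          # shared rails / full rail set
    recent_failover_age_s: float      # seconds since last hot repair
    recent_failure_rate_per_h: float  # EMA of failures per hour

    def to_vector(self) -> List[float]:
        """Flatten to a fixed-order feature vector for the policy network."""
        return [float(h) for h in self.nic_health] + [
            self.bandwidth_loss_x,
            float(self.num_concurrent_fail),
            self.rail_intersection,
            self.recent_failover_age_s,
            self.recent_failure_rate_per_h,
        ]


def loss_fraction(nic_health: List[bool]) -> float:
    """Lost-bandwidth fraction for one node, assuming equal-bandwidth NICs."""
    return 1.0 - sum(nic_health) / len(nic_health)


health = [False] + [True] * 7        # NIC0 down, NIC1..NIC7 up
state = HealthState(health, loss_fraction(health), 1, 0.875, 12.0, 0.1)
```

The example values for the last three fields are placeholders chosen for the demonstration, not measurements.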

5.2 New Action Dimension — Failure-Aware Schedule Mode

  action_failure_mode in {
    standard,          // unchanged Ring/Tree (X < 1/3)
    R2CCL_Balance,     // proportional NIC redistribution (any coll)
    R2CCL_AllReduce,   // dual-stage AllReduce (X >= 1/3)
    Recursive_AllRed   // multi-bottleneck recursive decomposition
  }

This becomes a new categorical head on Agent-2's policy network, conditioned on s_health.bandwidth_loss_X and coll_type. The policy collapses to standard when nic_health_vector is all-good, and the trigger agent enables the head only when failures are detected. This mirrors NCCLX's hierarchical PATH dispatch (see notes.md NCCLX borrows): outer head selects mode, inner head selects mode-specific parameters.

5.3 Closed-Form alpha-beta Cost as Reward Normalizer / Warm-Start

R2CCL already uses NCCL's alpha-beta model to decide between strategies at runtime. DynamICCL should:

  1. Use the alpha-beta predicted completion time T_pred(coll, msg, X) as the reward normalizer: r_t = -t_observed / T_pred. This removes the dependence on message size and bandwidth-loss fraction and gives a unitless reward, with r_t <= -1 whenever the observed time is no better than the prediction.
  2. Use Y* = X + X(1-X)/(X + (g(n-1)-1)*n) (paper Appendix A) as a warm-start for any data-partition-related action dimension. The RL refines around the analytic optimum rather than searching from scratch.

This is the same "predictive baseline + RL correction" pattern noted from the Wickramasinghe & Lumsdaine survey (notes.md): use Rabenseifner closed-form ms* to seed nChannels, let RL refine.
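The normalizer in item 1 is a one-line function; the short example shows why it removes scale. This is a sketch with a hypothetical function name, and the timings are made-up inputs rather than measurements.

```python
def normalized_reward(t_observed: float, t_pred: float) -> float:
    """r_t = -t_observed / T_pred; unitless, more negative means slower than predicted."""
    if t_pred <= 0.0:
        raise ValueError("predicted completion time must be positive")
    return -t_observed / t_pred


# A 20% slowdown scores the same for a millisecond-scale and a second-scale collective.
small_msg = normalized_reward(t_observed=1.2e-3, t_pred=1.0e-3)
large_msg = normalized_reward(t_observed=1.2, t_pred=1.0)
```

Both calls give a reward close to -1.2, so the agent sees relative slowdown rather than absolute duration.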

5.4 Bilateral OOB Notification = Trigger Agent Cross-Rank Synchronization

R2CCL's bilateral OOB primitive solves the asymmetric visibility problem: only one side sees a failure CQE. DynamICCL has the same problem at the regime-detection layer: only one rank's CUSUM may trip first when congestion onsets. Borrow the pattern:
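A minimal sketch of how that pattern could look on the trigger side, assuming an in-process stand-in for the out-of-band channel. `OobBus` and `RankTrigger` are hypothetical names, and the one-sided CUSUM update is a generic form, not DynamICCL's actual detector.

```python
from typing import Callable, List


class OobBus:
    """In-process stand-in for an out-of-band control channel."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[int, str], None]] = []

    def subscribe(self, callback: Callable[[int, str], None]) -> None:
        self._subscribers.append(callback)

    def broadcast(self, sender: int, event: str) -> None:
        for callback in self._subscribers:
            callback(sender, event)


class RankTrigger:
    def __init__(self, rank: int, bus: OobBus, threshold: float) -> None:
        self.rank, self.bus, self.threshold = rank, bus, threshold
        self.cusum = 0.0
        self.regime_changed = False
        bus.subscribe(self._on_notice)

    def observe(self, deviation: float) -> None:
        """One-sided CUSUM update; notify every rank the first time it trips."""
        self.cusum = max(0.0, self.cusum + deviation)
        if self.cusum > self.threshold and not self.regime_changed:
            self.regime_changed = True
            self.bus.broadcast(self.rank, "regime_change")

    def _on_notice(self, sender: int, event: str) -> None:
        if event == "regime_change":
            self.regime_changed = True   # adopt the peer's verdict without waiting


bus = OobBus()
ranks = [RankTrigger(r, bus, threshold=3.0) for r in range(3)]
ranks[1].observe(5.0)   # only rank 1 sees the congestion onset
```

After the single observation on rank 1, all three ranks report a regime change even though ranks 0 and 2 never accumulated any CUSUM mass themselves.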

5.5 Probe-QP Pool = Dedicated Telemetry Channel for Agent-1

R2CCL's three-point triangulation uses zero-byte RDMA Writes on a dedicated probe QP pool, isolated from data QPs. DynamICCL's congestion detector should follow this exactly:
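The value of isolation can be illustrated with a back-of-the-envelope model: a probe that shares a queue with bulk data waits for the backlog to drain, while a probe on its own queue does not. The function below is a toy serialization-delay model with hypothetical names, and the 5 microsecond base round trip is an assumed placeholder, not a figure from the paper.

```python
def probe_latency_us(bulk_backlog_bytes: int, link_gbps: float,
                     shared_queue: bool, probe_rtt_us: float = 5.0) -> float:
    """Probe latency in microseconds: base RTT plus head-of-line wait, if any."""
    if not shared_queue:
        return probe_rtt_us
    # bytes * 8 bits / (Gbps * 1e9 bit/s) seconds, converted to microseconds.
    drain_us = bulk_backlog_bytes * 8 / (link_gbps * 1e3)
    return probe_rtt_us + drain_us


backlog = 64 * 2**20                       # 64 MiB queued ahead of the probe
shared = probe_latency_us(backlog, 400.0, shared_queue=True)
dedicated = probe_latency_us(backlog, 400.0, shared_queue=False)
```

Under these assumptions the shared-queue probe waits roughly 1.3 ms behind the backlog on a 400 Gbps link, while the dedicated-queue probe stays at the base round trip.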

5.6 Pre-Registered Failover Chain = Agent-2 Action Cache

R2CCL pre-registers GPU buffers with all NICs at init, ordered by PCIe distance. The recovery action is then a cheap O(1) lookup. This is exactly the stale-but-valid plan cache pattern from DynamICCL's Phase 2 design (notes.md):
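A sketch of the lookup side, assuming the cache is built once from the PCIe-distance-ordered chain and then queried at failure time. The function name and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Optional


def build_action_cache(chain: List[str]) -> Dict[str, Optional[str]]:
    """Map each NIC to its successor in the PCIe-distance-ordered chain.

    The last NIC maps to None, meaning no backup is left.
    """
    return {nic: (chain[i + 1] if i + 1 < len(chain) else None)
            for i, nic in enumerate(chain)}


# Built once at init; recovery is then a single dictionary lookup.
cache = build_action_cache(["NIC_primary", "NIC_b1", "NIC_b2"])
next_nic = cache["NIC_primary"]
```

The expensive part (ordering NICs and registering buffers) happens before any failure, so the recovery-time action is a constant-time read.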

5.7 DMA-Buffer Rollback = RL Episode Boundary Definition

R2CCL's DMA rollback is enabled by an NCCL invariant: receive buffers are not consumed by GPU kernels before completion is polled. This yields a clean episode boundary for the RL agent: each chunk-completion polled = one (s, a, r, s') tuple committed; any chunk in-flight at the time of failure is rolled back and re-issued as a fresh action. The translation:
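One way to express this on the RL side is a buffer that commits a transition only when a chunk's completion is polled and discards everything still pending on rollback. The class and field names below are hypothetical, and the string states and actions are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Transition:
    state: Any
    action: Any
    reward: float
    next_state: Any


class ChunkEpisodeBuffer:
    def __init__(self) -> None:
        self.committed: List[Transition] = []
        self._pending: Dict[int, Tuple[Any, Any]] = {}

    def issue(self, chunk_id: int, state: Any, action: Any) -> None:
        """Record an action taken for a chunk that is now in flight."""
        self._pending[chunk_id] = (state, action)

    def complete(self, chunk_id: int, reward: float, next_state: Any) -> None:
        """Polled completion: commit one (s, a, r, s') tuple."""
        state, action = self._pending.pop(chunk_id)
        self.committed.append(Transition(state, action, reward, next_state))

    def rollback(self) -> List[int]:
        """Failure: drop in-flight chunks; the caller reissues them as fresh actions."""
        rolled_back = sorted(self._pending)
        self._pending.clear()
        return rolled_back


buf = ChunkEpisodeBuffer()
for cid in range(3):
    buf.issue(cid, state=f"s{cid}", action="nic_primary")
buf.complete(0, reward=-1.0, next_state="s1")
buf.complete(1, reward=-1.1, next_state="s2")
reissue = buf.rollback()   # chunk 2 was in flight when the NIC failed
```

Only the two completed chunks contribute training tuples; the in-flight chunk leaves no partial transition behind.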

5.8 Bridge-Based Re-ranking = Logical Hierarchy Re-shuffling

R2CCL's bridge-based logical re-ranking (Algorithm 1) generalizes the HiCCL insight already in notes.md: the communication hierarchy is virtual, not physical. DynamICCL's Agent-2 can therefore have a re-rank action head that picks logical neighbor ordering for ring algorithms, combined with a hierarchical action head per bandwidth tier for the recursive AllReduce decomposition (§2.5).

Together these expand Agent-2's action space from "knob picker" to "logical-topology architect" — closer to HiCCL's compositional design philosophy.

5.9 SimAI-Style Large-Scale Simulator for Pre-Training

R2CCL evaluates at 1024 GPUs via SimAI (cycle-accurate network + collective sim, NSDI'25). DynamICCL inherits the recommendation (already aligned with the Pensieve / HiCCL chunk-level simulator borrows in notes.md): pre-train Agent-2 on a SimAI-like simulator, then fine-tune on the physical testbed. R2CCL's failure-injection mode (single NIC failure -> 12.5% bandwidth loss; 1-10 concurrent failures across 64 servers) gives a ready-made curriculum for training Agent-2 on the failure-aware action heads from §5.2.

5.10 Failure-Mode Action Mask (Hard Constraints)

Just as 0011_Demystifying_NCCL.md produced an algorithm-protocol compatibility action mask, R2CCL produces a failure-mode action mask:

  Coll type        | R2CCL-Balance | R2CCL-AllReduce | Standard
  -----------------+---------------+-----------------+----------
  AllReduce        |     OK        |       OK        |   OK
  ReduceScatter    |     OK        |     INVALID     |   OK
  AllGather        |     OK        |     INVALID     |   OK
  Broadcast        |     OK        |     INVALID     |   OK
  Reduce           |     OK        |     INVALID     |   OK
  Send/Recv (P2P)  |     OK        |     INVALID     |   OK
  All-to-All       |     OK        |     INVALID     |   OK
^ Action mask: R2CCL-AllReduce only applies to AllReduce; R2CCL-Balance
  applies to all collectives. Encode as a hard mask on Agent-2's
  failure_mode head.
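The mask table translates directly into a logit mask applied before the argmax (or softmax) of the failure_mode head. The sketch below uses plain dictionaries instead of tensors to stay dependency-free; the mode labels and function names are hypothetical.

```python
import math
from typing import Dict

MODES = ("standard", "r2ccl_balance", "r2ccl_allreduce")


def is_valid(coll_type: str, mode: str) -> bool:
    """R2CCL-AllReduce is valid only for AllReduce; other modes are always valid."""
    return mode != "r2ccl_allreduce" or coll_type == "AllReduce"


def masked_logits(logits: Dict[str, float], coll_type: str) -> Dict[str, float]:
    """Replace logits of invalid modes with -inf so they can never be selected."""
    return {m: (logits[m] if is_valid(coll_type, m) else -math.inf) for m in MODES}


def pick_mode(logits: Dict[str, float], coll_type: str) -> str:
    masked = masked_logits(logits, coll_type)
    return max(masked, key=masked.get)


logits = {"standard": 0.1, "r2ccl_balance": 0.2, "r2ccl_allreduce": 5.0}
mode_for_allgather = pick_mode(logits, "AllGather")
mode_for_allreduce = pick_mode(logits, "AllReduce")
```

Even with a strong preference for the AllReduce-specific schedule, the AllGather call falls back to the best valid mode.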

Analogy

R2CCL is a multi-engine airliner with hot-swappable engines and a flight computer that re-balances thrust on engine failure. The eight NICs are the engines (some are reachable through PCIe, others via NVLink PXN proxies — like wing-mounted vs. tail-mounted engines with different fuel routings). The OOB notification is the cockpit's MASTER CAUTION light wired to a separate avionics bus so it survives the failure of any single engine. The probe-QP pool is the auxiliary pitot/static system used to sanity-check sensor readings. The DMA-buffer rollback is the autopilot's ability to revert to the last confirmed waypoint rather than crash on a corrupted leg. R2CCL-Balance is symmetric thrust redistribution (every working engine carries 1/k more); R2CCL-AllReduce is the asymmetric "shed load on the failing engine, full thrust on the rest, then resync" maneuver. The recursive decomposition is what you do when multiple engines have different damage levels — you stratify the fleet into bandwidth tiers and let each tier fly at its own speed, synchronizing only at the broadcast handoff.


6. Summary Table

  Pattern                               | R2CCL origin      | DynamICCL application
  --------------------------------------+-------------------+------------------------------------------------------------
  Bilateral OOB failure notification    | §4.1              | Cross-rank trigger broadcast on dedicated channel
  Three-point probe triangulation       | §4.2              | Trigger Agent on isolated probe-QP pool
  Multi-NIC GPU buffer pre-registration | §4.3 Tech I       | Action cache pre-computed at init time
  DMA buffer rollback                   | §4.3 Tech II      | RL episode boundary = polled-completion timestamp
  PCIe-distance-ordered NIC chain       | §4.3, §7          | Ranked failover-action list per cache cell
  alpha-beta closed-form crossover X*   | §5.2 + Appendix A | Warm-start prior + reward normalizer for Agent-2
  R2CCL-Balance proportional reroute    | §5.1              | New action head: failure_mode in {standard, balance, ...}
  R2CCL-AllReduce dual-stage            | §5.2              | Hard action mask: only valid for AllReduce
  Bridge-based logical re-ranking       | §6 + Appendix D   | Logical-neighbor re-rank action for ring algos
  Recursive AllReduce decomposition     | §6                | Hierarchical action head per bandwidth tier
  SimAI failure-injection curriculum    | §8.1              | Pre-train Agent-2 with single + multi-NIC failures
  Adaptive probe re-frequency           | §4.2              | Trigger Agent learns re-probe cadence via LSTM
  NIC health vector as state feature    | §3 (overview)     | New state dim: per-NIC bool + bandwidth_loss_X
  Failure-mode action mask              | §3, Table 1       | Hard mask: R2CCL-AllReduce only on AllReduce coll