R2CCL — Architecture and Design Analysis
Paper: Reliable and Resilient Collective Communication Library for LLM Training and Serving
Authors: Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu (University of Maryland, College Park)
Venue: arXiv:2512.25059v1 [cs.DC], 31 Dec 2025
Code: https://github.com/r2cc-project/R-2CCL
Analyst: Vishwakarma
Date: 2026-04-28
R2 = Reliable and Resilient (per title, abstract, and §3 overview). The paper positions R2CCL as a fault-tolerant drop-in replacement for NCCL/RCCL that exploits multi-NIC redundancy already present in modern GPU servers to avoid full job restart on NIC/cable/QP failure.
Table of Contents
- System Overview Block Diagram
- Component Architectures (Bilateral Failure Awareness, Live Migration, R2CCL-Balance, R2CCL-AllReduce, Multi-Failure Recursive Decomposition)
- Annotated Flow — Failure Detection -> Hot Repair -> Online Re-optimization
- Trade-off Analysis
- What to Borrow for DynamICCL
- Summary Table
1. System Overview Block Diagram
+-------------------------------------------------------------------+
| R2CCL System Architecture |
| (drop-in extension of NCCL 2.23.4, ~3K LoC C++) |
| |
| +---------------------------------------------------------+ |
| | Application Layer (PyTorch / vLLM) | |
| | ncclAllReduce / ncclSend / ncclRecv (unmodified) | |
| +-------------------------+-------------------------------+ |
| | collective invocation |
| v |
| +---------------------------------------------------------+ |
| | R2CCL Planner (host-side, plugin layer) | |
| | | |
| | +---------------------+ +-------------------------+ | |
| | | Health-aware | | Strategy selector | | |
| | | algo & schedule |==>| (alpha-beta cost model) | | |
| | | dispatch | | Standard | Balance | | | |
| | | (NCCL enqueue hook) | | AllReduce | Recursive| | | |
| | +---------+-----------+ +-----+-------------------+ | |
| | | | | |
| +------------|---------------------|---------------------+ |
| | | |
| v v |
| +---------------------------------------------------------+ |
| | R2CCL Net Plugin (ncclNet hook + proxy thread) | |
| | | |
| | +-----------------+ +-------------------+ +------+ | |
| | | Bilateral OOB | | Probe QP pool | | Per- | | |
| | | notifier (MPI / |<->| (zero-byte RDMA | | chan | | |
| | | TCP, non-data | | Write triangul.) | | fail-| | |
| | | NIC) | +-------------------+ | over | | |
| | +-----------------+ | list | | |
| | | (PCIe| | |
| | +----------------------+ +---------------+ | ord)| | |
| | | Multi-NIC GPU buffer | | DMA-buffer | +------+ | |
| | | registration (all | | rollback (last| | |
| | | NICs <-> all GPU bufs)| | acked chunk) | | |
| | +----------+-----------+ +-------+-------+ | |
| +-------------|---------------------- |------------------+ |
| | | |
| v v |
| +---------------------------------------------------------+ |
| | NCCL Core (unmodified, intercepted) | |
| | Ring/Tree algorithms + channel partitioning | |
| +-------------------------+-------------------------------+ |
| | |
| v |
| +---------------------------------------------------------+ |
| | Multi-NIC Hardware (8 x CX-7 400Gbps + NVLink) | |
| | | |
| | +------+ +------+ +------+ +------+ | |
| | | NIC0 | | NIC1 | .... | NICk | | NIC7 | | |
| | +--+---+ +--+---+ +--+---+ +--+---+ | |
| | | | | | | |
| | PCIe + NVLink fabric (PXN proxy fwd available) | |
| +---------------------------------------------------------+ |
+-------------------------------------------------------------------+
^ Fig 1: R2CCL inserts a Planner + Net-Plugin pair between
PyTorch/vLLM and NCCL Core. The plugin owns three new control-
plane primitives: bilateral OOB notification, probe-QP triangu-
lation, and pre-registered backup connections. The planner owns
failure-aware schedule synthesis (Balance / AllReduce / Recursive).
The architectural commitment is that R2CCL does not modify the NCCL collective kernel. It intercepts at two stable boundaries: the ncclNet plugin (transport) and NCCL's enqueue/planning layer (host). This is the same insertion-point philosophy as DynamICCL: the tuner is a plugin, and NCCL core stays untouched. Consequence: R2CCL inherits all of NCCL's per-channel ring/tree machinery and reuses the channel abstraction to express its multi-phase schedules.
2. Component Architectures
2.1 Bilateral Failure Awareness + Triangulation Localizer
Steady state Failure event
============ =============
Server A Server B Server A Server B
+--------+ +--------+ +--------+ +--------+
| NCCL | RDMA data | NCCL | | NCCL | X--data-X | NCCL |
| proxy | <===========> | proxy | | proxy | (CQ err) | proxy |
+---+----+ +----+---+ +---+----+ +---+----+
| | | (1) detect |
| OOB bootstrap ch. | | CQ/QP error |
| (MPI over non-data NIC) | | |
+========================= + +======OOB notify==>
(always idle) "I see error"
+---------------------------+
| (2) Both sides probe via |
| dedicated probe-QP |
| pool (zero-byte RDMA |
| Write to peer + aux) |
+-------------+-------------+
|
v
+---------------------------+
| Three-point triangulation |
| - A->B fails, A->aux ok |
| => B-side NIC dead |
| - B->A fails, B->aux ok |
| => A-side NIC dead |
| - both fail |
| => link / cable broken |
+---------------------------+
|
v
+---------------------------+
| (3) Broadcast verdict to |
| all ranks via OOB |
+---------------------------+
^ Fig 2: Bilateral OOB awareness + 3-point triangulation. RDMA's
asymmetric error visibility (only one side sees CQ error) is fixed
by an OOB notification on a non-datapath NIC. The probe-QP pool is
isolated from data QPs so probe traffic never queues behind stalled
bulk transfers. Auxiliary NIC distinguishes single-endpoint NIC
failure from link failure.
The OOB channel is the load-bearing primitive. Without it, R2CCL would fall back to NCCL's existing timeout-based detection (minutes). With it, detection is millisecond-scale because the asymmetric-visibility problem (only one side gets a CQE error) is resolved by an explicit peer notification on a separate NIC.
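The triangulation rules in Fig 2 can be sketched as a small decision function. This is one reading of the figure, not the paper's code: the names `Verdict` and `triangulate` are hypothetical, and the sketch assumes that a side whose auxiliary probe also fails is the side with the dead NIC, while two healthy auxiliary probes point at the link.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    LOCAL_NIC = "A-side NIC dead"
    PEER_NIC = "B-side NIC dead"
    LINK = "link / cable broken"
    INCONCLUSIVE = "inconclusive"


def triangulate(a_to_b_ok: bool, b_to_a_ok: bool,
                a_to_aux_ok: bool, b_to_aux_ok: bool) -> Optional[Verdict]:
    """Classify a failure from zero-byte probe outcomes (assumed semantics).

    Returns None when the A<->B path answers probes in both directions.
    """
    if a_to_b_ok and b_to_a_ok:
        return None  # data path is healthy, nothing to localize
    if a_to_aux_ok and not b_to_aux_ok:
        return Verdict.PEER_NIC   # A reaches the witness, B reaches nothing
    if b_to_aux_ok and not a_to_aux_ok:
        return Verdict.LOCAL_NIC  # B reaches the witness, A reaches nothing
    if a_to_aux_ok and b_to_aux_ok:
        return Verdict.LINK       # both NICs alive, only the A<->B path is dead
    return Verdict.INCONCLUSIVE   # neither side can reach the witness


print(triangulate(False, False, True, False))
```

The auxiliary NIC acts as a third-party witness: without it, a failed A-to-B probe cannot distinguish a dead endpoint from a dead cable.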
2.2 Live Migration (Hot Repair)
+------------------------------------------------------------------+
| Live Migration Data Path |
| |
| Sender state (rolled back to last completion) |
| +--+--+--+--+--+--+--+--+ |
| |C0|C1|C2|C3|C4|C5|XX|XX| <- C0..C4 polled completion |
| +--+--+--+--+--+--+--+--+ C5 in flight, C6+ not started |
| ^ |
| | rewind to first chunk without CQE (= C5) |
| |
| Pre-registered backup NIC list (PCIe-distance ordered): |
| [NIC_primary, NIC_b1, NIC_b2, ... NIC_bk] |
| |
| Step A. failed NIC = primary -> select NIC_b1 |
| Step B. all GPU buffers were *multi-registered* at init |
| => no on-demand registration cost |
| Step C. proxy reissues C5..C7 over NIC_b1 |
| Step D. if NIC_b1 fails later -> next in chain (NIC_b2) |
| |
| Receiver state (already-written chunks are safe): |
| +--+--+--+--+--+--+--+--+ |
| |C0|C1|C2|C3|C4|??| | | GPU kernels read AFTER completion,|
| +--+--+--+--+--+--+--+--+ so partial writes harmless. |
| Reset RX to last confirmed = C4. |
+------------------------------------------------------------------+
^ Fig 3: Live migration via multi-NIC pre-registration + DMA rollback.
The two technical enablers are (i) registering each GPU buffer with
ALL NICs at communicator init (eager, off the recovery path), and
(ii) rolling back to the last completed chunk because NCCL never
consumes recv buffers before completion is polled.
This is the section where the paper introduces what amount to two new control-plane primitives over NCCL: multi_register(buf, [NIC0..NICk]) and rollback_to_last_ack(channel). Both operations are unavailable in stock NCCL because stock NCCL ties one buffer to one NIC and treats mid-collective faults as fatal.
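A minimal sketch of the rollback-and-failover bookkeeping from Fig 3, assuming per-chunk completion flags and an ordered backup list are available to the proxy. The function names (`first_unacked`, `next_backup`, `plan_retransmit`) are hypothetical and only model the decision, not the RDMA operations.

```python
from typing import List, Optional, Sequence, Set, Tuple


def first_unacked(completed: Sequence[bool]) -> int:
    """Index of the first chunk with no polled completion (the rewind point)."""
    for index, done in enumerate(completed):
        if not done:
            return index
    return len(completed)


def next_backup(chain: Sequence[str], failed: Set[str]) -> Optional[str]:
    """First NIC in the PCIe-distance-ordered chain that has not failed."""
    for nic in chain:
        if nic not in failed:
            return nic
    return None


def plan_retransmit(completed: Sequence[bool], chain: Sequence[str],
                    failed: Set[str]) -> Tuple[str, List[int]]:
    """Pick the backup NIC and the chunk indices to reissue over it."""
    nic = next_backup(chain, failed)
    if nic is None:
        raise RuntimeError("no healthy NIC left in the failover chain")
    start = first_unacked(completed)
    return nic, list(range(start, len(completed)))


# Fig 3 scenario: C0..C4 completed, C5 in flight, C6..C7 not started.
completed = [True] * 5 + [False] * 3
chain = ["NIC_primary", "NIC_b1", "NIC_b2"]
print(plan_retransmit(completed, chain, {"NIC_primary"}))
```

For the Fig 3 state this selects NIC_b1 and reissues chunks 5 through 7. Because buffers are registered with every NIC at init, the swap itself needs no registration work.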
2.3 R2CCL-Balance — Topology-Aware Reroute
Before failure R2CCL-HotRepair (naive) R2CCL-Balance
================= ======================== ==================
GPUg GPUg GPUg
| NIC0 (failed) | (skipped) | (skipped)
| NIC1 ===> | NIC1 =====2x====> | NIC1 =1.33x=>
| NIC2 ===> D bytes/NIC | NIC2 =====1x====> | NIC2 =1.33x=>
| NIC3 ===> | NIC3 =====1x====> | NIC3 =1.33x=>
bottleneck on NIC1 evenly redistributed
~46% throughput loss proportional to BW
Rerouting decision (per-flow, topology-aware):
if backup_NIC same NUMA AND PCIe headroom > flow_demand:
-> direct PCIe forwarding from GPUg
elif PCIe+QPI cost < NVLink-PXN cost:
-> PCIe + CPU-interconnect path
else:
-> PXN: NVLink to proxy GPU co-located with target NIC
^ Fig 4: R2CCL-Balance treats remaining NICs as a shared pool and
redistributes the failed flow's share proportionally to live-NIC
bandwidth, picking the lower-cost PCIe-vs-PXN path per flow.
Primary path for ALL collectives except AllReduce (which gets §2.4).
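The proportional redistribution in Fig 4 reduces to a few lines of arithmetic. The sketch below is illustrative only: `balance_shares` is a hypothetical helper, and it ignores the PCIe-versus-PXN path choice, which the figure handles per flow.

```python
from typing import Dict, Set


def balance_shares(total_bytes: float, bandwidth: Dict[str, float],
                   failed: Set[str]) -> Dict[str, float]:
    """Split total_bytes across live NICs in proportion to their bandwidth."""
    live = {nic: bw for nic, bw in bandwidth.items() if nic not in failed}
    live_total = sum(live.values())
    if live_total <= 0:
        raise ValueError("no live NIC bandwidth left")
    return {nic: total_bytes * bw / live_total for nic, bw in live.items()}


# Four equal NICs, each carrying D = 1.0 before NIC0 fails.
nics = {f"NIC{i}": 400.0 for i in range(4)}
shares = balance_shares(4.0, nics, failed={"NIC0"})
print({nic: round(load, 2) for nic, load in shares.items()})
```

With four equal NICs each survivor ends up with 4/3 of its original load, which is the 1.33x shown in the figure. The naive hot-repair path instead places the whole failed share on one backup NIC.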
2.4 R2CCL-AllReduce — Bandwidth-Asymmetric Schedule Synthesis
Standard Ring AllReduce on 4 servers, NIC failure on Server 4:
==============================================================
total per-server xfer
S1 ==> S2 ==> S3 ==> S4 (slow) ~2D
\____back to S1 (S4 = bottleneck, all wait)
R2CCL-AllReduce decomposition:
==============================
Phase 1 (concurrent):
+-------------------------------------------------------------+
| Global AllReduce (S1..S4, throttled by S4's reduced BW) |
| [S1] -> [S2] -> [S3] -> [S4] -> [S1] carries (1-Y)*D bytes |
+-------------------------------------------------------------+
+-------------------------------------------------------------+
| Partial AllReduce (S1..S3 only, full speed) |
| [S1] -> [S2] -> [S3] -> [S1] carries Y*D bytes |
+-------------------------------------------------------------+
Phase 2 (pipelined custom broadcast):
+-------------------------------------------------------------+
| [S4] ----> [S1] -> [S2] -> [S3] -> [S4] delivers partial- |
| init fwd AllReduce result |
| back to S4 |
+-------------------------------------------------------------+
Closed-form completion time (D=B=1, X = lost BW fraction on S4):
T1(Y) = a*(1-Y)/(1-X) a = 2(ng-1)/(ng)
T2(Y) = b*Y/X b = 2((n-1)g-1)/((n-1)g)
T3(Y) = Y/X
T(Y) = max(T1, T2) + T3
Crossover threshold (Appendix A):
X < ng/(3ng-2) -> use STANDARD Ring AllReduce
X >= ng/(3ng-2) -> use R2CCL-AllReduce
Practical rule: X < 1/3 -> standard; X >= 1/3 -> R2CCL-AR
^ Fig 5: R2CCL-AllReduce splits AllReduce into a global stage (paid
at the slow rate) plus a partial stage (full-speed on healthy nodes)
plus a custom broadcast back to the degraded node. The optimal data
partition Y* and the strategy crossover X* are CLOSED FORM, evaluated
at runtime via NCCL's alpha-beta model.
The key engineering insight is that the AllReduce schedule is parameterized by a single scalar X = lost-bandwidth fraction on the slowest node. The optimal partition Y* and the crossover threshold to fall back to standard Ring are both algebraic. This is a model-based runtime decision (alpha-beta cost) — not an RL decision. R2CCL deliberately stays in the closed-form regime here.
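The closed-form expressions in Fig 5 are easy to check numerically. The sketch below transcribes T1, T2, T3, the crossover ng/(3ng-2), and the Appendix A expression Y* = X + X(1-X)/(X + (g(n-1)-1)*n) as quoted in this analysis. Function names are hypothetical, and Y is read as the fraction of data sent through the partial ring, which is what the T1/T2 formulas imply.

```python
def ring_coeff(p: int) -> float:
    """Ring AllReduce volume coefficient 2(p-1)/p for p participants."""
    return 2.0 * (p - 1) / p


def completion_time(y: float, x: float, n: int, g: int) -> float:
    """T(Y) = max(T1, T2) + T3 with D = B = 1 and 0 < x < 1."""
    a = ring_coeff(n * g)          # global ring over all n*g GPUs
    b = ring_coeff((n - 1) * g)    # partial ring over the healthy servers
    t1 = a * (1.0 - y) / (1.0 - x)
    t2 = b * y / x
    t3 = y / x
    return max(t1, t2) + t3


def optimal_split(x: float, n: int, g: int) -> float:
    """Y* as quoted from Appendix A; it equalizes T1 and T2."""
    return x + x * (1.0 - x) / (x + (g * (n - 1) - 1) * n)


def crossover(n: int, g: int) -> float:
    """Lost-bandwidth fraction at which the split schedule starts to win."""
    ng = n * g
    return ng / (3.0 * ng - 2.0)


def choose_strategy(x: float, n: int, g: int) -> str:
    return "r2ccl_allreduce" if x >= crossover(n, g) else "standard"


n, g, x = 4, 8, 0.5
y_star = optimal_split(x, n, g)
print(round(crossover(n, g), 4), round(y_star, 4))
print(completion_time(y_star, x, n, g) < ring_coeff(n * g) / (1.0 - x))
```

Setting Y = 0 recovers the standard ring time a/(1-X), so the crossover is simply the point where the slope of T at Y = 0 turns negative. That is why it tends to 1/3 as ng grows.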
2.5 Multi-Failure: Topology-Aware Re-ranking + Recursive Decomposition
Before re-ranking (rail topology, S1 lost rail r, S2 lost rail r'):
S1 S2 S3 shared rails between S1 and S2
[X] . . = rails \ {r, r'}
. [X] . = NARROW intersection
. . . local load balancing fails
Bridge-Based Re-ranking (Algorithm 1, Appendix D):
- identify pairs (u,v) with |S_u ∩ S_v| < B_global
- find a "bridge" node w with broad rail connectivity
- relocate w to sit between u and v in the logical ring
After re-ranking:
S1 S3 S2 ... S3 has wide rail set,
[X] OK [X] bridges S1 <-> S2 path
Recursive AllReduce decomposition (multi-bottleneck):
1. Build global ring at slowest node's rate (handles ALL nodes)
2. Peel off slowest node, build sub-ring on remaining (n-1)
3. Peel off next-slowest, build sub-sub-ring on (n-2)
4. ...recurse until residual bandwidth variance is acceptable
5. Reduction phases of all rings run IN PARALLEL
6. Broadcast phases impose dependency: slower rings wait for
partial results from faster sub-rings (fast nodes finish their
own broadcast concurrently with slow nodes' work)
^ Fig 6: Multi-failure handling. Re-ranking is a logical (not
physical) graph rewrite that inserts a bridge node where rail-
intersection bandwidth is too narrow. Recursive decomposition
generalizes the dual-ring (global + partial) idea to a
multi-tier ring spectrum, one tier per bandwidth class.
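The peel-off recursion in steps 1 to 4 can be sketched as follows. This is an illustration under stated assumptions, not Algorithm 1 or the paper's implementation: `build_ring_tiers` is a hypothetical name, and the stopping rule (max/min bandwidth ratio within a tolerance) stands in for the unspecified "acceptable variance" test.

```python
from typing import Dict, List, Tuple


def build_ring_tiers(bandwidth: Dict[str, float],
                     tolerance: float = 1.1) -> List[Tuple[List[str], float]]:
    """Return (members, bottleneck bandwidth) per ring tier, global ring first.

    Each tier drops the slowest member of the previous one. Recursion stops
    when the remaining members are within `tolerance` of each other
    (assumed criterion) or fewer than two nodes remain.
    """
    remaining = sorted(bandwidth, key=bandwidth.get)  # slowest node first
    tiers: List[Tuple[List[str], float]] = []
    while len(remaining) >= 2:
        rates = [bandwidth[node] for node in remaining]
        tiers.append((list(remaining), min(rates)))
        if max(rates) <= tolerance * min(rates):
            break  # residual spread is acceptable
        remaining = remaining[1:]  # peel off the slowest node
    return tiers


cluster = {"S1": 100.0, "S2": 200.0, "S3": 400.0, "S4": 400.0}
for members, bottleneck in build_ring_tiers(cluster):
    print(members, bottleneck)
```

For this example the result is a global ring over all four servers, a sub-ring without S1, and a final ring over S3 and S4: one tier per bandwidth class.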
3. Annotated Flow — Detect -> Repair -> Re-optimize
START: collective in flight on channel c, NIC n is the data path
|
v
(1) NCCL proxy poll loop sees CQE error / QP error / WQE flush on n
|
v
(2) R2CCL net plugin INTERCEPTS the error
(NCCL would otherwise crash the process here)
|
v
(3) Bilateral OOB notify peer over MPI/TCP on non-datapath NIC
|
v
(4) Probe phase: zero-byte RDMA Write
-- to peer NIC
-- to auxiliary NIC (third-party witness)
|
+-- 3-point triangulation ----> verdict in {local NIC, peer NIC, link}
|
v
(5) Broadcast verdict to all ranks via OOB
|
v
(6) HOT REPAIR (transport layer):
a. proxy purges outstanding WQEs on channel c
b. rollback sender to first chunk without CQE
c. reset receiver to last confirmed chunk
d. select next NIC in pre-registered failover chain
(PCIe-distance ordered: same-NUMA preferred)
e. retransmit residual chunks over backup NIC
|
v
(7) ONLINE RE-OPTIMIZATION (planner layer):
evaluate alpha-beta cost model with new (B_remaining, X)
|
+-- coll != AllReduce ------> R2CCL-Balance schedule
| (rebalance DMA across healthy NICs)
|
+-- coll == AllReduce
| |
| +-- X < 1/3 ----------> standard Ring (live)
| +-- X >= 1/3 ---------> R2CCL-AllReduce (split D-Y, Y)
|
+-- multi-failure ---------> bridge-based re-rank +
recursive AllReduce decomposition
|
v
(8) Periodic re-probe healed NICs (NIC reset, cable re-insert)
-- adapt probe frequency based on observed failure/recovery rate
-- on recovery, reverse migration: restore primary NIC if faster
|
v
CONTINUE collective traffic on new schedule.
^ Fig 7: End-to-end control flow from in-flight failure to seamless
resumption. Steps 1-6 = transport-layer hot repair (Section 4 of
paper). Step 7 = host-side online re-optimization (Sections 5-6).
Step 8 = adaptive re-probing (Section 4.2).
The flow exposes a clean two-tier control hierarchy: the proxy thread owns the microsecond-to-millisecond decisions (detect, probe, rollback, NIC swap); the planner owns the collective-schedule decisions (Balance vs. AllReduce vs. Recursive). DynamICCL's two-agent architecture (Trigger + Config) maps onto this same split.
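Step (7) of the flow amounts to a short dispatch function. The sketch below is a hedged reading of Fig 7: `select_schedule` and its string labels are hypothetical, the 1/3 threshold is the practical rule from Fig 5, and the precedence between the multi-failure branch and the AllReduce branch is an assumption, since the figure lists them side by side.

```python
def select_schedule(coll: str, lost_fraction: float,
                    degraded_servers: int, threshold: float = 1.0 / 3.0) -> str:
    """Pick a schedule family after hot repair (assumed precedence)."""
    if degraded_servers == 0:
        return "standard"            # nothing failed, keep NCCL's plan
    if coll != "allreduce":
        return "balance"             # rebalance across healthy NICs
    if degraded_servers > 1:
        return "rerank+recursive"    # bridge re-rank, recursive decomposition
    if lost_fraction < threshold:
        return "standard"            # loss too small for the split to pay off
    return "r2ccl_allreduce"


print(select_schedule("allreduce", 0.5, 1))
print(select_schedule("allgather", 0.5, 1))
```

Keeping this logic as a pure function of (collective type, X, failure count) is what allows the planner to re-evaluate it cheaply after every health change.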
4. Trade-off Analysis
4.1 Recovery Strategy
| Dimension | Checkpoint Restart | AdapCC (between-coll) | DejaVu (KV replicate) | R2CCL (in-flight) | Winner (DynamICCL) |
|---|---|---|---|---|---|
| Recovery latency | 17-68 min | between collectives | seconds (state restore) | milliseconds | R2CCL |
| Mid-collective fault | aborts job | aborts collective | restarts request | survives in-flight | R2CCL |
| Application changes | none | none | KV replication required | none (drop-in) | R2CCL |
| Memory overhead | checkpoint storage | low | KV replicas | mapping entries only | R2CCL |
| Training overhead | 12.7% job duration | 8.65% | n/a | <1% | R2CCL |
| Inference overhead | 35s per restart | n/a | 14-33% (replication) | <3% (0.71-1.58%) | R2CCL |
| Failure scope | any | non-mid-collective | NIC failure | NIC/QP/link only | R2CCL |
For DynamICCL, prefer R2CCL because in-flight survival is the correct invariant for an RL agent that must continue gathering reward signal across episodes. Restart-based fault tolerance interrupts the trajectory in a way that breaks credit assignment.
4.2 Failure Detection Mechanism
| Dimension | NCCL default (timeout) | R2CCL bilateral OOB + probe-triangulation |
|---|---|---|
| Detection latency | minutes | milliseconds |
| Asymmetric visibility | broken (one side hangs) | explicit peer notification |
| Localization granularity | none | local-NIC vs peer-NIC vs link distinguishable |
| Probe-channel isolation | n/a | dedicated probe QP pool (no HOL blocking) |
| Adaptivity to recovery | none | adaptive re-probe frequency |
The probe-QP-pool isolation is the same architectural pattern as NCCL's separate forward QP (bulk) + reverse QP (CTS) split documented in 0011_Demystifying_NCCL.md: small control messages must never share queues with bulk traffic, or HOL blocking destroys their latency advantage.
4.3 Schedule Synthesis Strategy
| Dimension | Solver (TACCL/TE-CCL) | Closed-form (R2CCL alpha-beta) | Online RL (DynamICCL) |
|---|---|---|---|
| Decision latency | seconds (Gurobi) | microseconds | tens of microseconds (NN) |
| Adapts to failure | offline only | yes (recompute X, Y*) | yes (LSTM regime change) |
| Generality | high | only Ring AllReduce + Balance | learned across action space |
| Optimality guarantee | optimal | provably optimal in 2-class | learned, no guarantee |
| Cluster generalization | re-solve per topology | analytic in (n, g, X) | requires retraining or transfer |
Winner depends on regime. R2CCL chooses closed-form for the fault-recovery hot path because (a) the 2-class bandwidth model (slow node + healthy nodes) is small enough to admit an analytic solution, and (b) the recovery decision must be made in microseconds on the proxy thread. DynamICCL operates at a different timescale (per-collective config selection) where RL is appropriate.
4.4 Multi-NIC Buffer Registration
| Dimension | Lazy registration on failover | R2CCL eager multi-registration |
|---|---|---|
| Init-time cost | low | tens of ms per NIC (one-time) |
| Recovery-time cost | tens of ms per buffer (FATAL) | zero (already mapped) |
| Steady-state HBM | low | ~same (mapping entries only) |
| Composability with PXN | hard | same buffer accessible by all |
Eager multi-registration is the architectural twin of NCCLX's lazy allocation flags — but applied in the opposite direction. NCCLX defers allocation to save HBM under the assumption that most code paths are not exercised. R2CCL eager-registers because the failure path must be exercised at zero latency. The two designs are complementary: lazy for cold paths, eager for hot recovery paths.
5. What to Borrow for DynamICCL
5.1 New State Features for Agent-2 (NIC-Health Plane)
R2CCL exposes a structured failure-state space that DynamICCL has not previously encoded. Add to Agent-2's state vector:
s_health = {
nic_health_vector: one bool per NIC on this node (e.g., 8 dims)
bandwidth_loss_X: scalar in [0, 1] (NIC-failure fraction on slowest node)
num_concurrent_fail: integer count of currently-failed NICs cluster-wide
rail_intersection: |S_u INTERSECT S_v| / |full rail set| (for adjacent ranks)
recent_failover_age: time since last hot-repair event (seconds)
recent_failure_rate: EMA of failure events / hour
}
The bandwidth-loss fraction X is the single most important new feature because R2CCL's closed-form crossover X* = ng/(3ng-2) tells DynamICCL exactly when to switch AllReduce strategy. Agent-2 should not try to learn this from scratch — it should consume X as a state feature and use the closed-form as a warm-start prior.
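One possible encoding of s_health is sketched below. The field names follow the listing above. The class name `HealthState`, the `loss_fraction` helper, and the flat-vector layout are assumptions for illustration, with no feature normalization applied.

```python
from dataclasses import dataclass
from typing import List


def loss_fraction(nic_health: List[bool]) -> float:
    """Fraction of NICs down on a node, assuming equal-bandwidth NICs."""
    return sum(1 for ok in nic_health if not ok) / len(nic_health)


@dataclass
class HealthState:
    nic_health_vector: List[bool]   # one flag per NIC on this node
    bandwidth_loss_x: float         # X on the slowest node, in [0, 1]
    num_concurrent_fail: int        # failed NICs cluster-wide
    rail_intersection: float        # shared-rail ratio for adjacent ranks
    recent_failover_age: float      # seconds since the last hot repair
    recent_failure_rate: float      # EMA of failure events per hour

    def to_vector(self) -> List[float]:
        """Flatten to a float vector for the policy network input."""
        flags = [1.0 if ok else 0.0 for ok in self.nic_health_vector]
        return flags + [
            self.bandwidth_loss_x,
            float(self.num_concurrent_fail),
            self.rail_intersection,
            self.recent_failover_age,
            self.recent_failure_rate,
        ]


health = [False] + [True] * 7  # NIC0 down on an 8-NIC node
state = HealthState(health, loss_fraction(health), 1, 0.875, 2.5, 0.1)
print(len(state.to_vector()), state.bandwidth_loss_x)
```

A single failed NIC out of eight gives X = 0.125, the 12.5% bandwidth loss used in the failure-injection setup described in §5.9.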
5.2 New Action Dimension — Failure-Aware Schedule Mode
action_failure_mode in {
standard, // unchanged Ring/Tree (X < 1/3)
R2CCL_Balance, // proportional NIC redistribution (any coll)
R2CCL_AllReduce, // dual-stage AllReduce (X >= 1/3)
Recursive_AllRed // multi-bottleneck recursive decomposition
}
This becomes a new categorical head on Agent-2's policy network, conditioned on s_health.bandwidth_loss_X and coll_type. The policy collapses to standard when nic_health_vector is all-good, and the trigger agent enables it only when failures are detected. This mirrors NCCLX's hierarchical PATH dispatch (see notes.md NCCLX borrows): outer head selects mode, inner head selects mode-specific parameters.
5.3 Closed-Form alpha-beta Cost as Reward Normalizer / Warm-Start
R2CCL already uses NCCL's alpha-beta model to decide between strategies at runtime. DynamICCL should:
- Use the alpha-beta predicted completion time T_pred(coll, msg, X) as the reward normalizer: r_t = -t_observed / T_pred. This removes msg-size and bandwidth-loss-fraction scale dependence and gives a unitless reward in [-large, -1], provided T_pred is a lower bound on the observed time.
- Use Y* = X + X(1-X)/(X + (g(n-1)-1)*n) (paper Appendix A) as a warm-start for any data-partition-related action dimension. The RL refines around the analytic optimum rather than searching from scratch.
This is the same "predictive baseline + RL correction" pattern noted from the Wickramasinghe & Lumsdaine survey (notes.md): use Rabenseifner closed-form ms* to seed nChannels, let RL refine.
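A minimal sketch of the normalizer, assuming the textbook alpha-beta cost of Ring AllReduce, 2(p-1)·alpha + (2(p-1)/p)·msg/B, with the bandwidth scaled by (1-X). This is not necessarily the exact model NCCL or R2CCL evaluates, and both function names are hypothetical.

```python
def ring_allreduce_time(p: int, msg_bytes: float, bandwidth_bytes_per_s: float,
                        alpha_s: float = 0.0, lost_fraction: float = 0.0) -> float:
    """Textbook alpha-beta estimate for Ring AllReduce on p ranks."""
    if not 0.0 <= lost_fraction < 1.0:
        raise ValueError("lost_fraction must be in [0, 1)")
    effective_bw = bandwidth_bytes_per_s * (1.0 - lost_fraction)
    steps = 2 * (p - 1)  # reduce-scatter steps plus all-gather steps
    return steps * alpha_s + (steps / p) * msg_bytes / effective_bw


def normalized_reward(t_observed: float, t_pred: float) -> float:
    """r_t = -t_observed / T_pred, unitless and independent of message size."""
    if t_pred <= 0.0:
        raise ValueError("t_pred must be positive")
    return -t_observed / t_pred


t_pred = ring_allreduce_time(p=4, msg_bytes=4e9, bandwidth_bytes_per_s=1e9)
print(t_pred, normalized_reward(9.0, t_pred))
```

A reward of -1.5 then reads as "50% slower than the model predicts", regardless of whether the message was 1 MB or 4 GB.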
5.4 Bilateral OOB Notification = Trigger Agent Cross-Rank Synchronization
R2CCL's bilateral OOB primitive solves the asymmetric visibility problem: only one side sees a failure CQE. DynamICCL has the same problem at the regime-detection layer: only one rank's CUSUM may trip first when congestion onsets. Borrow the pattern:
- Reserve a dedicated control-plane NIC / channel (e.g., MPI bootstrap network, already present in NCCL) for cross-rank trigger signal aggregation. Never share with data-plane traffic.
- When any rank's Trigger Agent fires, broadcast via OOB to all peers before any rank commits a config change.
- This implements gossip-style trigger propagation with O(N) messages on a non-datapath channel — cheap and aligns with the Distributed Systems 4th Ed. NACK-suppression / hierarchical coordinator pattern from the existing notes.md.
5.5 Probe-QP Pool = Dedicated Telemetry Channel for Agent-1
R2CCL's three-point triangulation uses zero-byte RDMA Writes on a dedicated probe QP pool, isolated from data QPs. DynamICCL's congestion detector should follow this exactly:
- Reserve an isolated probe-QP pool for the Trigger Agent's micro-benchmark probes (Hopper-style power-of-two-choices exploration).
- Probes never queue behind stalled bulk transfers.
- Probe frequency adapts based on observed congestion / recovery patterns (R2CCL §4.2: "adapting probe frequency based on observed failure and recovery patterns"). DynamICCL can learn this adaptation via Agent-1's LSTM.
5.6 Pre-Registered Failover Chain = Agent-2 Action Cache
R2CCL pre-registers GPU buffers with all NICs at init, ordered by PCIe distance. The recovery action is then a cheap O(1) lookup. This is exactly the stale-but-valid plan cache pattern from DynamICCL's Phase 2 design (notes.md):
- At communicator init, Agent-2 evaluates its policy for every (coll_type, msg_size_bin, P_bin, NIC_health_state) quad and caches the resulting action in a quad-tree (Wickramasinghe & Lumsdaine, Section 3.3; already in notes.md).
- On the hot path, lookup is O(1); a cache miss falls back to conservative defaults.
- The R2CCL borrow is the PCIe-distance-ordered fallback chain: for each cell, store not just the optimal config but a ranked list so successive hot-swaps are also pre-decided.
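The cache-with-ranked-fallback idea can be sketched as below. For brevity a plain dict keyed by the 4-tuple stands in for the quad-tree, and `PlanCache`, its methods, and the config strings are all hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

Key = Tuple[str, int, int, Hashable]  # (coll_type, msg_size_bin, P_bin, NIC_health_state)


class PlanCache:
    """Per-cell ranked config lists, filled at init and read on the hot path."""

    def __init__(self, default: str) -> None:
        self._cells: Dict[Key, List[str]] = {}
        self._default = default  # conservative fallback on a cache miss

    def put(self, key: Key, ranked_configs: List[str]) -> None:
        self._cells[key] = list(ranked_configs)  # best config first

    def lookup(self, key: Key, excluded: FrozenSet[str] = frozenset()) -> str:
        """Best cached config not in `excluded`, else the default."""
        for config in self._cells.get(key, ()):
            if config not in excluded:
                return config
        return self._default


cache = PlanCache(default="conservative")
key = ("allreduce", 3, 2, "nic0_down")
cache.put(key, ["plan_a", "plan_b"])
print(cache.lookup(key), cache.lookup(key, excluded=frozenset({"plan_a"})))
```

Storing a ranked list rather than a single action mirrors the failover chain: if the first choice becomes invalid after a second failure, the next entry is already decided.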
5.7 DMA-Buffer Rollback = RL Episode Boundary Definition
R2CCL's DMA rollback is enabled by an NCCL invariant: receive buffers are not consumed by GPU kernels before completion is polled. This yields a clean episode boundary for the RL agent: each chunk-completion polled = one (s, a, r, s') tuple committed; any chunk in-flight at the time of failure is rolled back and re-issued as a fresh action. The translation:
- Agent-2's reward must only be credited on polled completion of a collective, never on enqueue. Otherwise a failed-then-rolled-back chunk would corrupt the reward signal.
- The Trigger Agent's congestion detector must also use polled-completion timestamps as its time axis, not enqueue timestamps.
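The commit-on-completion rule can be sketched as a small transition buffer. `EpisodeBuffer` and its callbacks are hypothetical names, and how the callbacks would be wired to proxy events is left open.

```python
from typing import Any, Dict, List, Tuple

Transition = Tuple[Any, Any, float, Any]  # (s, a, r, s')


class EpisodeBuffer:
    """Holds actions as pending at enqueue; commits them only on completion."""

    def __init__(self) -> None:
        self._pending: Dict[int, Tuple[Any, Any]] = {}
        self.committed: List[Transition] = []

    def on_enqueue(self, op_id: int, state: Any, action: Any) -> None:
        self._pending[op_id] = (state, action)  # no reward is credited yet

    def on_completion(self, op_id: int, reward: float, next_state: Any) -> None:
        state, action = self._pending.pop(op_id)
        self.committed.append((state, action, reward, next_state))

    def on_rollback(self, op_id: int) -> None:
        # A rolled-back operation never yields a transition; it will be
        # re-enqueued as a fresh action on the backup NIC.
        self._pending.pop(op_id, None)


buf = EpisodeBuffer()
buf.on_enqueue(1, "s0", "a0")
buf.on_enqueue(2, "s1", "a1")
buf.on_completion(1, -1.2, "s1")
buf.on_rollback(2)
print(buf.committed)
```

Only the completed operation produces a training tuple. The rolled-back one leaves no trace in the replay data, so failed attempts cannot distort the reward signal.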
5.8 Bridge-Based Re-ranking = Logical Hierarchy Re-shuffling
R2CCL's bridge-based logical re-ranking (Algorithm 1) generalizes the HiCCL insight already in notes.md: the communication hierarchy is virtual, not physical. DynamICCL's Agent-2 can therefore have a re-rank action head that picks logical neighbor ordering for ring algorithms. Combine with:
- HiCCL's per-level library binding (state: cluster_vendor)
- R2CCL's rail-intersection bandwidth metric (state: rail_intersection)
- Bridge-node selection as an action
Together these expand Agent-2's action space from "knob picker" to "logical-topology architect" — closer to HiCCL's compositional design philosophy.
5.9 SimAI-Style Large-Scale Simulator for Pre-Training
R2CCL evaluates at 1024 GPUs via SimAI (cycle-accurate network + collective sim, NSDI'25). DynamICCL inherits the recommendation (already aligned with the Pensieve / HiCCL chunk-level simulator borrows in notes.md): pre-train Agent-2 on SimAI-like simulator, fine-tune on physical testbed. R2CCL's failure-injection mode (single NIC failure -> 12.5% bandwidth loss; multi-failure 1-10 across 64 servers) gives a ready-made curriculum for training Agent-2 on the failure-aware action heads from §5.2.
5.10 Failure-Mode Action Mask (Hard Constraints)
Just as 0011_Demystifying_NCCL.md produced an algorithm-protocol compatibility action mask, R2CCL produces a failure-mode action mask:
Coll type | R2CCL-Balance | R2CCL-AllReduce | Standard
-----------------+---------------+-----------------+----------
AllReduce | OK | OK | OK
ReduceScatter | OK | INVALID | OK
AllGather | OK | INVALID | OK
Broadcast | OK | INVALID | OK
Reduce | OK | INVALID | OK
Send/Recv (P2P) | OK | INVALID | OK
All-to-All | OK | INVALID | OK
^ Action mask: R2CCL-AllReduce only applies to AllReduce; R2CCL-Balance
applies to all collectives. Encode as a hard mask on Agent-2's
failure_mode head.
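Applying the table as a hard mask might look like the sketch below. The mode order follows the table's columns, while `failure_mode_mask`, `masked_choice`, and the lowercase collective names are hypothetical.

```python
from typing import List, Sequence

FAILURE_MODES = ("balance", "r2ccl_allreduce", "standard")  # table column order


def failure_mode_mask(coll_type: str) -> List[bool]:
    """True where the mode is valid; R2CCL-AllReduce needs an AllReduce."""
    return [True, coll_type == "allreduce", True]


def masked_choice(logits: Sequence[float], mask: Sequence[bool]) -> str:
    """Greedy pick over the valid modes only."""
    best = None
    for index, (logit, valid) in enumerate(zip(logits, mask)):
        if valid and (best is None or logit > logits[best]):
            best = index
    if best is None:
        raise ValueError("mask excludes every mode")
    return FAILURE_MODES[best]


logits = [0.1, 5.0, 0.2]  # the policy strongly prefers r2ccl_allreduce
print(masked_choice(logits, failure_mode_mask("allreduce")))
print(masked_choice(logits, failure_mode_mask("allgather")))
```

For AllGather the preferred mode is masked out, so the next-best valid mode is chosen instead. This guarantees the agent never emits an invalid schedule, even early in training.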
Analogy
R2CCL is a multi-engine airliner with hot-swappable engines and a flight computer that re-balances thrust on engine failure. The eight NICs are the engines (some are reachable through PCIe, others via NVLink PXN proxies, like wing-mounted vs. tail-mounted engines with different fuel routings). The OOB notification is the cockpit's MASTER CAUTION light wired to a separate avionics bus so it survives the failure of any single engine. The probe-QP pool is the auxiliary pitot/static system used to sanity-check sensor readings. The DMA-buffer rollback is the autopilot's ability to revert to the last confirmed waypoint rather than crash on a corrupted leg. R2CCL-Balance is symmetric thrust redistribution (each of the k surviving engines carries 1/k more); R2CCL-AllReduce is the asymmetric "shed load on the failing engine, full thrust on the rest, then resync" maneuver. The recursive decomposition is what you do when multiple engines have different damage levels: you stratify the fleet into bandwidth tiers and let each tier fly at its own speed, synchronizing only at the broadcast handoff.
6. Summary Table
| Pattern | R2CCL origin | DynamICCL application |
|---|---|---|
| Bilateral OOB failure notification | §4.1 | Cross-rank trigger broadcast on dedicated channel |
| Three-point probe triangulation | §4.2 | Trigger Agent on isolated probe-QP pool |
| Multi-NIC GPU buffer pre-registration | §4.3 Tech I | Action cache pre-computed at init time |
| DMA buffer rollback | §4.3 Tech II | RL episode boundary = polled-completion timestamp |
| PCIe-distance-ordered NIC chain | §4.3, §7 | Ranked failover-action list per cache cell |
| alpha-beta closed-form crossover X* | §5.2 + Appendix A | Warm-start prior + reward normalizer for Agent-2 |
| R2CCL-Balance proportional reroute | §5.1 | New action head: failure_mode in {standard, balance, ...} |
| R2CCL-AllReduce dual-stage | §5.2 | Hard action mask: only valid for AllReduce |
| Bridge-based logical re-ranking | §6 + Appendix D | Logical-neighbor re-rank action for ring algos |
| Recursive AllReduce decomposition | §6 | Hierarchical action head per bandwidth tier |
| SimAI failure-injection curriculum | §8.1 | Pre-train Agent-2 with single + multi-NIC failures |
| Adaptive probe re-frequency | §4.2 | Trigger Agent learns re-probe cadence via LSTM |
| NIC health vector as state feature | §3 (overview) | New state dim: per-NIC bool + bandwidth_loss_X |
| Failure-mode action mask | §3, Table 1 | Hard mask: R2CCL-AllReduce only on AllReduce coll |