R^2CCL: Reliable and Resilient Collective Communication Library for LLM Training and Serving — Detailed Summary

Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu | University of Maryland, College Park | arXiv:2512.25059v1 [cs.DC], Dec 2025 / Jan 2026

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.


Abstract


1. Introduction

Cost of network failures at scale:

The redundancy gap:

Approach summary:


2. Background

GPU-cluster networking:

Why current CCLs are fragile:

Failure types:


3. R^2CCL Overview

Three-step recovery loop:

  1. Detect and localize: completion-queue-entry (CQE) error codes, out-of-band (OOB) peer notification, and three-point RDMA triangulation together determine whether the fault lies in the local NIC, the cable, or the remote peer.
  2. Migrate: switch in-flight traffic to a pre-registered backup NIC, using DMA-buffer rollback so that no data is lost.
  3. Optimize: re-derive the collective schedule (algorithm choice, rank order, partition ratio) for the new bandwidth profile.
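
The three steps above can be sketched as a small control loop. This is a minimal illustration, not the paper's implementation: the `Nic` record, the "witness" endpoint used for triangulation, the localization rule, and the bandwidth-proportional split are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Fault(Enum):
    LOCAL_NIC = auto()
    CABLE = auto()
    REMOTE_PEER = auto()


@dataclass
class Nic:
    name: str               # hypothetical device name
    bandwidth_gbps: float
    healthy: bool = True


def localize(local_reaches_witness: bool, peer_reaches_witness: bool) -> Fault:
    """Assumed triangulation rule: probe a third 'witness' endpoint from both ends."""
    if not local_reaches_witness:
        return Fault.LOCAL_NIC
    if not peer_reaches_witness:
        return Fault.REMOTE_PEER
    # Both ends reach the witness, so the path between them is the suspect.
    return Fault.CABLE


def migrate(failover_chain: list[Nic], failed: Nic) -> Nic:
    """Mark the failed NIC unhealthy and return the first healthy backup in chain order."""
    failed.healthy = False
    for nic in failover_chain:
        if nic.healthy:
            return nic
    raise RuntimeError("no healthy backup NIC left")


def rebalance(nics: list[Nic]) -> dict[str, float]:
    """Assumed policy: split traffic across healthy NICs in proportion to bandwidth."""
    healthy = [n for n in nics if n.healthy]
    total = sum(n.bandwidth_gbps for n in healthy)
    return {n.name: n.bandwidth_gbps / total for n in healthy}


def recover(nics, failed, local_reaches_witness, peer_reaches_witness):
    """Detect -> migrate -> optimize. This sketch always fails over the local side."""
    fault = localize(local_reaches_witness, peer_reaches_witness)
    backup = migrate(nics, failed)
    shares = rebalance(nics)
    return fault, backup.name, shares
```

For example, calling `recover` on three NICs where the first one cannot reach the witness yields `Fault.LOCAL_NIC`, selects the second NIC as backup, and redistributes traffic over the two survivors.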

Plugin integration:


4. Failure Detection and Mitigation

Bilateral failure awareness:

Three-point triangulation:

GPU-NIC multi-registration:

DMA-buffer rollback:


5. Optimize Scheduling: Single-Failure Case

5.1 R^2CCL-Balance

5.2 R^2CCL-AllReduce (multi-phase pipelined)

5.3 Switching rule


6. Optimize Scheduling: Multi-Failure Case

Topology-aware logical re-ranking:

Recursive AllReduce decomposition:


7. Implementation

Integration with NCCL:

NCCL internals touched:


8. Evaluation

8.1 Setup

Hardware:

Workloads:

Baselines:

Metrics:

8.2 Headline Results

Workload             | Baseline | R^2CCL overhead under single NIC fault | Speedup vs. competitor
Megatron-LM training | NCCL     | <1%                                    | ~12x faster recovery vs. AdapCC
vLLM inference       | NCCL     | <3%                                    | ~47x faster recovery vs. DejaVu

Sustained throughput under fault: up to 93% of fault-free.


10. Conclusions


Limitations


Action / State Surface Summary (for DynamICCL mapping)

Element            | R^2CCL's Decision Surface
Algorithm          | Ring, Tree, R^2CCL-AllReduce (multi-phase)
Protocol           | Simple, LL, LL128 (inherited from NCCL)
nChannels          | parallel channel count
numThreads         | per-proxy / per-kernel thread count
Chunk size         | pipelining granularity
Backup-QP pool     | sleeping QPs on every NIC pair
Failover chain     | NIC ordering by PCIe/NUMA proximity
Stall timeout      | CQE poll deadline before declaring NIC dead
Data partition Y   | split between Stage A and Stage B in R^2CCL-AllReduce
Logical rank order | re-rankable at runtime for ring topology

Observation Signals R^2CCL Uses:

  - Per-NIC available bandwidth
  - CQE error codes (transport / work-request)
  - OOB peer status notifications
  - RDMA triangulation probes
  - Link-up / link-down events
  - Per-NIC PCIe/NUMA distance to GPU
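
For the DynamICCL mapping, the decision surface and observation signals listed above could be captured as plain records. The sketch below is only illustrative: the class and field names, the integer distance unit, and the bandwidth tie-break in `failover_chain` are assumptions, not definitions taken from R^2CCL or NCCL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Per-collective configuration tuple (field names are hypothetical)."""
    algorithm: str    # "Ring", "Tree", or "R2CCL-AllReduce"
    protocol: str     # "Simple", "LL", or "LL128"
    n_channels: int   # parallel channel count
    num_threads: int  # per-kernel thread count
    chunk_size: int   # pipelining granularity in bytes


@dataclass
class NicObservation:
    """Per-NIC state signals (field names are hypothetical)."""
    name: str
    available_gbps: float
    link_up: bool
    pcie_numa_distance: int  # assumed unit: smaller means closer to the GPU


def failover_chain(nics: list[NicObservation]) -> list[str]:
    """Order usable NICs by PCIe/NUMA proximity.

    Ties are broken by higher available bandwidth, which is an assumption
    added for this sketch.
    """
    usable = [n for n in nics if n.link_up]
    ranked = sorted(usable, key=lambda n: (n.pcie_numa_distance, -n.available_gbps))
    return [n.name for n in ranked]
```

Keeping `Action` frozen makes it hashable, which is convenient if configurations are used as dictionary keys in a policy or lookup table.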

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer where Agent-2 selects per-collective (algorithm, protocol, nChannels, numThreads) on HPC GPU clusters to minimize collective completion time. R^2CCL is the single most directly relevant systems paper for DynamICCL because it operates on the exact same NCCL action surface but for a different objective (fault survival rather than nominal performance).

Direct structural analogies:

R^2CCL element                                          | DynamICCL analog
(algorithm, protocol, nChannels, numThreads, chunkSize) | DynamICCL Agent-2 action vector (verbatim)
alpha-beta cost model switching Balance vs. AllReduce   | DynamICCL's hand-rule baseline that the RL policy must beat
Per-NIC bandwidth + link health observations            | DynamICCL state input under multi-NIC topology
Online schedule optimizer at ms latency                 | Inference-time deployment budget for DynamICCL Agent-2
ncclNet plugin + proxy hook integration                 | Same plugin point DynamICCL should use to inject configs

Key lessons for DynamICCL:

  1. The action space is validated: R^2CCL confirms that the (algorithm, protocol, nChannels, numThreads, chunkSize) tuple is the correct knob set — no need to expand it for DynamICCL Agent-2.
  2. Online reconfiguration is feasible at production latency: R^2CCL demonstrates millisecond-scale online ring/tree re-derivation. DynamICCL has at least the same time budget per inference step.
  3. R^2CCL's switching rule is the brittle baseline DynamICCL must beat: the closed-form alpha-beta switch between Balance and AllReduce is exactly the kind of hand-rule that fails to generalize across topologies — perfect target for an RL policy.
  4. Plugin integration pattern is reusable: R^2CCL's ncclNet transport hook plus proxy send/receive instrumentation is the same integration point DynamICCL needs to inject Agent-2's selected configuration without forking NCCL.
  5. State signals overlap: per-NIC available bandwidth and link health are first-class observations for both systems. DynamICCL's state vector should include them.
  6. The two systems compose, not compete: R^2CCL handles the failure axis; DynamICCL handles the nominal-performance axis. A joint deployment would let DynamICCL consume R^2CCL's per-link-bandwidth signal as state input and produce a fault-aware optimal config.
  7. Pre-registration removes failover latency cost: R^2CCL's GPU-NIC multi-registration trick — pay setup cost once, switch instantly later — is a useful design pattern for DynamICCL when deploying RL-selected configs that may need quick rollback.
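
To make the "hand-rule baseline" in lesson 3 concrete, here is a minimal sketch of a closed-form alpha-beta switch. It uses the generic textbook cost models for ring and non-pipelined binary-tree AllReduce, not R^2CCL's own Balance and AllReduce expressions, which this summary does not reproduce. The function names and the example values of alpha and beta are assumptions.

```python
import math


def ring_allreduce_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Textbook ring AllReduce: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * alpha + 2 * ((p - 1) / p) * n_bytes * beta


def tree_allreduce_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Textbook non-pipelined binary tree: reduce then broadcast, full message per hop."""
    depth = math.ceil(math.log2(p))
    return 2 * depth * (alpha + n_bytes * beta)


def choose_algorithm(p: int, n_bytes: float, alpha: float, beta: float) -> str:
    """Hand rule: pick whichever closed-form estimate is smaller."""
    ring = ring_allreduce_time(p, n_bytes, alpha, beta)
    tree = tree_allreduce_time(p, n_bytes, alpha, beta)
    return "Ring" if ring <= tree else "Tree"


# Assumed example parameters: 5 us per-message latency, 50 GB/s per-link bandwidth.
ALPHA = 5e-6          # seconds per message
BETA = 1 / 50e9       # seconds per byte
```

With these assumed parameters and 64 ranks, the rule picks Tree for a 1 KiB message (latency-dominated) and Ring for a 1 GB message (bandwidth-dominated). A rule of this shape depends entirely on how well alpha and beta describe the actual topology, which is the generalization gap an RL policy would aim to close.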