R2CCL: Reliable and Resilient Collective Communication Library for LLM Training and Serving

Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu | University of Maryland, College Park | arXiv:2512.25059, 2026

Problem

Large-scale LLM training and inference jobs running on GPU clusters lose 10%-15% of GPU hours to network failures: NIC faults, cable issues, optical-link flapping, CRC errors. Existing collective communication libraries (NCCL, RCCL) follow an asynchronous error-abort model: a single failed NIC or link during a collective operation crashes the entire job, forcing a fallback to heavyweight checkpoint restore (median recovery ~68 minutes for training) or full request reprocessing (for inference). Modern GPU servers carry redundant per-server NICs (e.g., 8x ConnectX-7 400 Gbps in an H100 box) and heterogeneous paths (PCIe, NVLink, multi-rail InfiniBand) that could carry traffic around a failure, but no production CCL exposes that redundancy.

Core Insight

R^2 stands for Reliable and Resilient. The core idea is to treat the per-server set of NICs as a shared pool and redistribute in-flight collective traffic across the surviving NICs in a topology-aware manner, so that collectives complete losslessly without aborting the job. R^2CCL combines fault-aware live connection migration (roll back to the last acknowledged chunk, then switch to a pre-registered backup NIC) with an online algorithm/partition replanner that adapts the collective schedule to the post-failure bandwidth landscape rather than continuing on a degraded ring.
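The migration half of this idea (rewind to the last acknowledged chunk, then move to the next NIC in a distance-ordered chain) can be sketched in a few lines. This is a toy model under stated assumptions, not R^2CCL's implementation: the `Nic` and `Connection` classes, the integer `distance` field, and the chunk counters are all hypothetical names introduced for illustration, and no real RDMA calls are made.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    name: str
    distance: int          # hypothetical PCIe/NUMA hop count to the GPU; lower is closer
    healthy: bool = True


@dataclass
class Connection:
    chain: list            # NICs ordered by distance, closest first
    active: int = 0        # index into chain of the NIC currently carrying traffic
    next_chunk: int = 0    # index of the next chunk the sender will post
    last_acked: int = -1   # highest chunk index the receiver has acknowledged

    def send(self, count=1):
        """Post `count` chunks on the active NIC (they are not yet acknowledged)."""
        self.next_chunk += count

    def ack(self, chunk):
        """Record a receiver acknowledgement covering every chunk up to `chunk`."""
        self.last_acked = max(self.last_acked, chunk)

    def fail_over(self):
        """Mark the active NIC failed, pick the next healthy NIC, rewind the sender."""
        self.chain[self.active].healthy = False
        for i, nic in enumerate(self.chain):
            if nic.healthy:
                self.active = i
                break
        else:
            raise RuntimeError("no healthy NIC left; the job must abort")
        # Chunks posted after last_acked may have been lost in flight, so resend them.
        self.next_chunk = self.last_acked + 1
        return self.chain[self.active]


def make_connection(nics):
    """Build a connection whose failover chain is sorted by distance."""
    return Connection(chain=sorted(nics, key=lambda n: n.distance))
```

For example, with three NICs at distances 0, 1, and 2, posting five chunks, acknowledging chunk 2, and then calling `fail_over()` selects the distance-1 NIC and resumes sending at chunk 3. Chunks 3 and 4 are resent, which is what makes the switch lossless in this model.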

Method

R^2CCL is a drop-in extension to NCCL (~3K lines of C++) that hooks into the ncclNet plugin layer and the algorithm/planner. It pipelines three stages:

Fault detection and localization. Bilateral error awareness pairs in-band RDMA completion-queue error signals with an out-of-band MPI/TCP notification path so both endpoints learn about a failure simultaneously (no half-open hangs). Three-point triangulation with zero-byte RDMA probes from a third auxiliary node distinguishes local-NIC, remote-NIC, and cable failures.
Lossless live migration. GPU buffers are pre-registered with every available NIC at init time so failover avoids on-demand ibv_reg_mr latency. A DMA-buffer rollback rewinds sender and receiver to the last acknowledged chunk; traffic is steered onto the next NIC in an ordered chain ranked by PCIe/NUMA distance. Cross-NUMA failover uses NVLink-based proxy forwarding to reach a same-NUMA NIC.
Online schedule re-optimization. A failure-aware planner picks one of:
- R^2CCL-Balance: redistribute the failed link's traffic across the remaining healthy NICs in proportion to their bandwidth (used for latency-bound or non-AllReduce collectives).
- R^2CCL-AllReduce: a two-stage pipelined algorithm (global AllReduce throttled by the slow node, concurrent partial AllReduce + tailored Broadcast on the healthy subset) that reduces bottleneck workload from 2D to 1.75D for throughput-bound AllReduce on heterogeneous remaining bandwidth.
- Recursive R^2CCL-AllReduce: peel off sub-rings of healthy nodes for concurrent multi-failure scenarios.
The planner uses an alpha-beta cost model T(Y) = max(T1(Y), T2(Y)) + T3(Y) to pick the optimal data partition ratio Y*. The paper proves a threshold X = ng / (3ng - 2) on lost-bandwidth fraction above which R^2CCL-AllReduce strictly beats vanilla ring.
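Two of the planner's building blocks lend themselves to a short sketch: the bandwidth-proportional split used by R^2CCL-Balance, and the search for the partition ratio Y* that minimizes T(Y) = max(T1(Y), T2(Y)) + T3(Y). The code below is an illustration under assumptions, not the paper's planner. The function names are hypothetical, the stage costs T1, T2, T3 are passed in as caller-supplied callables because their definitions are not given here, and a plain grid search stands in for whatever solver the authors use.

```python
def balance_split(total_bytes, nic_bandwidth_gbps):
    """Split total_bytes across healthy NICs in proportion to bandwidth.

    nic_bandwidth_gbps maps a NIC name to an integer bandwidth (hypothetical units).
    """
    total_bw = sum(nic_bandwidth_gbps.values())
    if total_bw <= 0:
        raise ValueError("no healthy bandwidth available")
    shares = {
        nic: total_bytes * bw // total_bw
        for nic, bw in nic_bandwidth_gbps.items()
    }
    # Integer division can leave a few bytes unassigned; give them to the fastest NIC.
    fastest = max(nic_bandwidth_gbps, key=nic_bandwidth_gbps.get)
    shares[fastest] += total_bytes - sum(shares.values())
    return shares


def best_partition_ratio(t1, t2, t3, steps=1000):
    """Grid-search Y in [0, 1] for the minimum of max(T1(Y), T2(Y)) + T3(Y)."""
    candidates = (i / steps for i in range(steps + 1))
    return min(candidates, key=lambda y: max(t1(y), t2(y)) + t3(y))
```

As a usage example, `balance_split(1200, {"nic0": 400, "nic1": 200})` assigns 800 bytes to `nic0` and 400 to `nic1`. With made-up stage costs `t1 = lambda y: 2 * y`, `t2 = lambda y: 1 - y`, and `t3 = lambda y: 0.0`, the search lands near Y = 1/3, the point where the two concurrent stages finish at the same time. Those toy costs are placeholders and are not the paper's T1, T2, T3.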

Results

Testbed: two 8-GPU H100 servers with ConnectX-7 400 Gbps InfiniBand; workloads include Megatron-LM GPT-3 2.7B/13B, Llama-3 70B/405B, vLLM, NCCL-tests; SimAI simulation up to 1024 GPUs. Baselines: vanilla NCCL (crashes), AdapCC (training-side reconfiguration), DejaVu (inference-side KV-cache replication).

Training overhead under a single NIC failure: < 1%.
Inference overhead under a single NIC failure: < 3%.
~12x lower training overhead than AdapCC.
~47x lower inference overhead than DejaVu.
Sustains up to 93% of pre-failure throughput during the failed period.
~4% overhead with 10 concurrent NIC failures across 512 GPUs (SimAI).

Limitations

Does not handle NVLink/NVSwitch fabric failures or top-of-rack / switch-wide outages.
Does not handle GPU/OS/process crashes -- those still fall back to checkpoint restore.
Requires the OOB management network (MPI/TCP) to remain functional.
Multi-NIC pre-registration of every GPU buffer increases memory-region pressure; may hit hardware MR-count limits on some RDMA NICs.
Large-scale results (up to 1024 GPUs) come from SimAI simulation, not from real runs at that scale.

Relevance to DynamICCL

R^2CCL is the closest published work to a runtime, fault-aware NCCL planner. It lives in the same place DynamICCL would live (between collective enqueue and transport selection) and exposes a directly comparable knob set, but it uses analytical alpha-beta optimization where DynamICCL uses RL.

State-space overlap. R^2CCL's planner observes per-NIC health, per-link bandwidth, and topology distance. DynamICCL's Agent-2 observes the same plus message size, collective type, and recent timing. R^2CCL's "lost-bandwidth fraction X" is a useful state feature for DynamICCL.
Action-space overlap. R^2CCL chooses (algorithm in {Balance, AllReduce, Recursive-AllReduce}, data partition ratio Y, failover NIC chain). DynamICCL chooses (algorithm, protocol, nChannels, numThreads). The algorithm sub-action is structurally identical; DynamICCL can extend R^2CCL's algorithm menu with Ring/Tree/CollNet variants and add the protocol axis.
Online re-planning trigger. R^2CCL only re-plans on detected failure; DynamICCL would generalize the trigger to any change in observed per-collective latency, treating link degradation as a smooth signal rather than a binary event.
Drop-in NCCL integration template. R^2CCL shows the ncclNet plugin layer plus tuner/planner is the right insertion point in modern NCCL versions -- DynamICCL should reuse this exact integration pattern for NCCL 2.18.3.
Cost-model baseline. R^2CCL's alpha-beta T(Y) and the threshold proof give DynamICCL a strong analytical baseline to beat with RL, and a sanity check: when the network is healthy and homogeneous, the RL policy should converge close to R^2CCL's analytical optimum.
Failure-mode coverage gap. R^2CCL's stated limitations (NVLink-fabric faults, switch outages), together with partial degradation that has not yet surfaced as a hard CQ error, are exactly the regimes where a learned policy could exploit subtle latency signals before a hard failure fires.
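The generalized re-planning trigger described above (react to drift in per-collective latency rather than to a binary failure event) could look roughly like the following. This is a hypothetical sketch, not DynamICCL or R^2CCL code: the class name, the EWMA smoothing, and the default values for `alpha`, `threshold`, and `warmup` are all assumptions chosen for illustration.

```python
class LatencyDriftTrigger:
    """Fire a re-plan when smoothed latency rises above a baseline by a relative margin."""

    def __init__(self, alpha=0.2, threshold=0.15, warmup=5):
        self.alpha = alpha          # EWMA smoothing factor (assumed value)
        self.threshold = threshold  # relative slowdown that triggers a re-plan (assumed)
        self.warmup = warmup        # samples observed before the baseline is fixed
        self.baseline = None
        self.ewma = None
        self.count = 0

    def observe(self, latency_us):
        """Feed one per-collective latency sample; return True if a re-plan should fire."""
        self.count += 1
        if self.ewma is None:
            self.ewma = latency_us
        else:
            self.ewma = self.alpha * latency_us + (1 - self.alpha) * self.ewma
        if self.count < self.warmup:
            return False
        if self.baseline is None:
            # First sample after warm-up: record the healthy baseline.
            self.baseline = self.ewma
            return False
        if self.ewma > self.baseline * (1 + self.threshold):
            # Re-anchor the baseline so a single sustained drift fires only once.
            self.baseline = self.ewma
            return True
        return False
```

Because the trigger compares a smoothed value against a relative margin, a single slow collective does not fire it, while a sustained 30% slowdown does after a few samples. A hard-failure path like R^2CCL's would still be needed alongside it, since a dead link produces errors rather than latency samples.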