R2CCL: Reliable and Resilient Collective Communication Library for LLM Training and Serving — Detailed Summary
Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu | University of Maryland, College Park | arXiv:2512.25059v1 [cs.DC], Dec 2025 / Jan 2026
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.
Abstract
- R^2CCL is a fault-tolerant collective communication library for large-scale LLM training and inference, positioned as a drop-in replacement / extension of NCCL/RCCL.
- It targets the 10-15% GPU-hour waste caused by slow recovery from network faults (median ~68 minutes per incident under existing systems).
- It exploits multi-NIC hardware redundancy already present in modern GPU servers (8 NICs/node typical) to provide lossless, low-overhead failover.
- Three primitives: (1) rapid connection migration onto backup NICs, (2) bandwidth-aware load redistribution across surviving paths, (3) resilient collective algorithms that maintain progress without checkpoint rollback (training) or request restart (inference).
1. Introduction
Cost of network failures at scale:
- LLM training and inference run on clusters of tens of thousands of GPUs with multiple NICs per node, so fault rates scale linearly with the amount of hardware.
- A single NIC failure with vanilla NCCL stalls the affected collective, hangs the whole job, and forces a checkpoint rollback (training) or request reprocessing (inference).
- Median recovery time on the order of 68 minutes; cumulative GPU-hour waste 10-15% across the cluster.
The redundancy gap:
- Modern GPU servers already have multiple HCAs and heterogeneous intra-node interconnects (PCIe, NVLink) — the redundancy needed to survive a NIC fault is physically present but unused by NCCL/RCCL.
- The contribution is a CCL that exposes and orchestrates that redundancy.
Approach summary:
- Detect and localize the fault (CQ poll + OOB peer signaling + three-point triangulation).
- Migrate in-flight traffic to a pre-registered backup NIC.
- Re-optimize the collective schedule online so surviving NICs are loaded proportionally to their remaining bandwidth.
2. Background
GPU-cluster networking:
- Two-tier interconnect: intra-node NVLink/NVSwitch + inter-node RDMA fabric (InfiniBand or RoCE).
- Each modern GPU server has 1 NIC per GPU (8 NICs/node typical).
Why current CCLs are fragile:
- NCCL ring/tree topology is computed once at ncclCommInitRank time and is static for the life of the communicator.
- The ncclNet plugin opens RDMA QPs to specific peers up front; on QP error or CQE error, the proxy thread stalls and the communicator becomes unusable.
- Recovery requires job restart from the last checkpoint.
Failure types:
- NIC hardware/port failure, cable/optic failure, ToR-port failure, RDMA transport-level errors, link flapping, CRC errors.
- Out of scope: NVLink/NVSwitch failure, switch-wide outage, GPU/OS/process crash, full network partition.
3. R^2CCL Overview
Three-step recovery loop:
- Detect and localize — CQE error codes plus OOB peer notification plus three-point RDMA triangulation pinpoint whether the fault is a local NIC, a cable, or a remote peer.
- Migrate — switch active traffic to a pre-registered backup NIC using DMA-buffer rollback for losslessness.
- Optimize — re-derive the collective schedule (algorithm choice, ranking, partition ratio) for the new bandwidth profile.
Plugin integration:
- R^2CCL hooks into NCCL's existing ncclNet plugin layer; no fork of NCCL core required.
- Pre-allocates "sleeping" backup QPs and pre-registers GPU buffers with every NIC during ncclCommInitRank to remove on-demand registration latency from the failover path.
4. Failure Detection and Mitigation
Bilateral failure awareness:
- Out-of-band channel (MPI over a management NIC, or TCP) carries failure notifications between peers so both sides agree on which NIC is dead before retransmission begins.
- Avoids split-brain where one side has migrated and the other has not.
Three-point triangulation:
- Zero-byte RDMA writes from the failed node's peer and from an auxiliary third node distinguish "my NIC failed" vs. "your NIC failed" vs. "cable in between failed."
- Triangulation completes in milliseconds.
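
The classification step can be illustrated with a small sketch. The decision table below is an assumption made for illustration, not the paper's exact logic: it takes as input whether a probe from the auxiliary node reached each endpoint of the broken connection, and the names `Fault` and `localize_fault` are hypothetical.

```python
from enum import Enum


class Fault(Enum):
    LOCAL_NIC = "local NIC"
    REMOTE_NIC = "remote NIC"
    LINK = "link between the two NICs"
    UNKNOWN = "undetermined"


def localize_fault(aux_reaches_local: bool, aux_reaches_remote: bool) -> Fault:
    """Classify a failed local<->remote connection from two auxiliary probes.

    Each flag records whether a zero-byte probe from the auxiliary node to
    that endpoint's NIC completed successfully (hypothetical inputs).
    """
    if aux_reaches_local and aux_reaches_remote:
        # Both NICs answer a third party, so the path between them is suspect.
        return Fault.LINK
    if aux_reaches_remote:
        # Only the local NIC is unreachable from the auxiliary node.
        return Fault.LOCAL_NIC
    if aux_reaches_local:
        # Only the remote NIC is unreachable from the auxiliary node.
        return Fault.REMOTE_NIC
    # Neither endpoint answers: the probes alone cannot localize the fault.
    return Fault.UNKNOWN
```

With both probes succeeding, the sketch blames the link; with only one succeeding, it blames the NIC that did not answer.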
GPU-NIC multi-registration:
- At startup, every GPU buffer is registered with every NIC on the node using ibv_reg_mr.
- On failover, no fresh registration is needed — the backup NIC's rkey/lkey is already valid.
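
A toy model of the idea, assuming a simple in-memory table: every buffer is registered once per NIC at startup, so failover reduces to a lookup. `RegistrationTable` and `MemoryRegion` are hypothetical names, and the keys are fake counters standing in for the values a real `ibv_reg_mr` call would return.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class MemoryRegion:
    lkey: int
    rkey: int


class RegistrationTable:
    """Toy stand-in for registering each GPU buffer with every NIC at init."""

    def __init__(self, nics):
        self._nics = list(nics)
        self._keys = count(1)  # fake key generator; real keys come from the verbs layer
        self._table = {}

    def register(self, buffer_id):
        # One registration per (buffer, NIC) pair, paid once at startup.
        for nic in self._nics:
            region = MemoryRegion(next(self._keys), next(self._keys))
            self._table[(buffer_id, nic)] = region

    def lookup(self, buffer_id, nic):
        # Failover path: a dictionary lookup, no new registration.
        return self._table[(buffer_id, nic)]
```

The design trades extra startup time and registration state for a failover path that performs no registration work.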
DMA-buffer rollback:
- The R^2CCL proxy keeps a sliding window of acknowledged chunks.
- On NIC failure, communication state is rewound to the last acknowledged chunk and resumed on the backup NIC — losslessly.
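
The rewind step can be sketched as follows. This is a minimal simulation under assumed semantics (cumulative acknowledgements, chunk-granularity resend); `RollbackSender` and its methods are hypothetical names, not R^2CCL's proxy API.

```python
class RollbackSender:
    """Toy sender that rewinds to the last acknowledged chunk on NIC failure."""

    def __init__(self, chunks, primary, backup):
        self.chunks = list(chunks)
        self.nic = primary
        self.backup = backup
        self.next_to_send = 0  # index of the next chunk to post
        self.acked = 0         # chunks [0, acked) are confirmed delivered
        self.log = []          # (nic, chunk_index) for every post, for inspection

    def send_next(self):
        if self.next_to_send < len(self.chunks):
            self.log.append((self.nic, self.next_to_send))
            self.next_to_send += 1

    def on_ack(self, upto):
        # Receiver confirms chunks [0, upto); the window only moves forward.
        self.acked = max(self.acked, min(upto, self.next_to_send))

    def on_nic_failure(self):
        # Chunks [acked, next_to_send) are presumed lost: rewind and
        # continue on the backup NIC.
        self.nic = self.backup
        self.next_to_send = self.acked
```

If three chunks are posted, two are acknowledged, and the NIC then fails, the third chunk is posted again on the backup NIC and the transfer continues from there.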
5. Optimize Scheduling: Single-Failure Case
5.1 R^2CCL-Balance
- After failover, the failed node's traffic is split across its remaining healthy NICs in proportion to their available bandwidth.
- Latency-optimal for small messages where the bottleneck is per-message overhead rather than aggregate throughput.
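
A proportional split of this kind can be sketched in a few lines. The function name, the dictionary input, and the rule for assigning the rounding remainder are assumptions for illustration.

```python
def split_by_bandwidth(total_bytes: int, bandwidth_gbps: dict) -> dict:
    """Split total_bytes across healthy NICs in proportion to their bandwidth."""
    # A NIC with zero available bandwidth is treated as failed.
    healthy = {nic: bw for nic, bw in bandwidth_gbps.items() if bw > 0}
    if not healthy:
        raise ValueError("no healthy NIC left")
    total_bw = sum(healthy.values())
    shares = {nic: int(total_bytes * bw / total_bw) for nic, bw in healthy.items()}
    # Hand the rounding remainder to the fastest NIC so the shares sum exactly.
    fastest = max(healthy, key=healthy.get)
    shares[fastest] += total_bytes - sum(shares.values())
    return shares
```

For example, 1000 bytes over NICs with 400, 400, 200, and 0 Gbps available yields shares of 400, 400, and 200 bytes, with the failed NIC excluded.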
5.2 R^2CCL-AllReduce (multi-phase pipelined)
- For large messages with substantial bandwidth loss, a balance-only approach leaves bandwidth on the table because the slow node throttles the global pipeline.
- R^2CCL-AllReduce splits the data into two stages that run concurrently:
  - Stage A: a global AllReduce throttled at the slow node's speed.
  - Stage B: a partial AllReduce + Broadcast over the healthy subset.
- Bottleneck workload is reduced from 2D (ring AllReduce) to 1.75D, where D is the message size.
5.3 Switching rule
- The decision between Balance and AllReduce is made by an alpha-beta performance model on (latency, bandwidth, message size).
- Switch to R^2CCL-AllReduce when the bandwidth loss X exceeds ng/(3ng-2), where n is the participant count and g is a bandwidth term.
- The optimal data partition ratio Y between Stage A and Stage B is derived in closed form from the cost model.
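
The threshold test can be written out directly. This sketch evaluates only the ng/(3ng-2) condition on bandwidth loss and ignores the latency and message-size terms of the full alpha-beta model; the function names and the returned strings are hypothetical, and the value passed for `g` in any call is a placeholder because the summary does not define it further.

```python
def allreduce_switch_threshold(n: int, g: float) -> float:
    """Bandwidth-loss threshold ng / (3ng - 2) quoted in the summary."""
    return (n * g) / (3 * n * g - 2)


def choose_strategy(bandwidth_loss: float, n: int, g: float) -> str:
    """Pick R^2CCL-AllReduce above the threshold, R^2CCL-Balance otherwise."""
    if bandwidth_loss > allreduce_switch_threshold(n, g):
        return "R2CCL-AllReduce"
    return "R2CCL-Balance"
```

For n = 2 and g = 8 the threshold is 16/46, about 0.35, and it approaches 1/3 as ng grows. Under this rule a loss of one NIC in eight (X = 0.125) stays on Balance.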
6. Optimize Scheduling: Multi-Failure Case
Topology-aware logical re-ranking:
- When concurrent failures fragment the logical ring (adjacent ranks both lose NICs), R^2CCL inserts a high-connectivity "bridge node" into the rank order to re-stitch the ring.
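
One way to picture the re-stitching is a search for a rank order in which every pair of ring neighbors can still talk. The connectivity model here is an assumption, not the paper's: two nodes count as connectable only if they share at least one healthy NIC index. The exhaustive search is for illustration only, and `find_ring_order` is a hypothetical name.

```python
from itertools import permutations


def find_ring_order(healthy_rails: dict):
    """Return a ring order in which every adjacent pair shares a healthy rail.

    healthy_rails maps node name -> set of healthy NIC indices. The search is
    exhaustive, so it only suits a handful of nodes. Returns None if no such
    ring exists.
    """
    nodes = sorted(healthy_rails)
    if not nodes:
        return []
    first, rest = nodes[0], nodes[1:]
    for perm in permutations(rest):
        order = [first, *perm]
        # Pair each node with its successor, wrapping around to close the ring.
        pairs = zip(order, order[1:] + order[:1])
        if all(healthy_rails[a] & healthy_rails[b] for a, b in pairs):
            return order
    return None
```

With A holding only rail 0, B only rail 1, and C and D holding both, A and B cannot be neighbors; the search returns A, C, B, D, placing the fully connected nodes between them as bridges.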
Recursive AllReduce decomposition:
- For severe concurrent multi-NIC failures, R^2CCL recursively peels off sub-rings of healthy nodes, runs AllReduce within each sub-ring, then combines results.
- This generalizes Stage A/Stage B into a recursive decomposition.
7. Implementation
Integration with NCCL:
- Hooks: ncclNet plugin (transport), proxy send/receive loop (data plane), bootstrap (control plane).
- Backup connections: pre-established but inactive QPs on every NIC pair.
- CQ monitoring: continuous poll for completion-queue error codes (transport errors, work-request errors).
- OOB channel: MPI or TCP control bus on a designated management NIC.
NCCL internals touched:
- ncclCommInitRank: extended to register buffers with all NICs and allocate backup QPs.
- ncclNet plugin: error path now triggers migration rather than abort.
- Proxy thread: maintains the DMA-buffer rollback window.
- Ring/Tree topology: re-derivable at runtime by the schedule optimizer rather than fixed at init.
8. Evaluation
8.1 Setup
Hardware:
- Two physical servers, each with 8x NVIDIA H100 GPUs and 8x Mellanox ConnectX-7 400 Gbps NICs.
- Large-scale results obtained via SimAI simulation up to 1024 GPUs.
Workloads:
- Training: Megatron-LM GPT-3 2.7B and 13B; DeepSpeed-Chat RLHF on a 175B model.
- Inference: vLLM serving Llama-3.1 70B and 405B, OPT-66B, BLOOM-176B.
Baselines:
- Vanilla NCCL (no fault tolerance).
- AdapCC (reconfigures NCCL between training rounds).
- DejaVu (inference-side KV-cache replication for fault tolerance).
Metrics:
- Throughput (tokens/sec for both training and inference).
- Overhead percentage under fault.
- TTFT (time-to-first-token) and TPOT (time-per-output-token) for inference latency.
- nccl-tests bus bandwidth microbenchmarks.
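
For reference, these metrics can be computed as below. The TTFT and TPOT formulas follow their common definitions, and the AllReduce bus-bandwidth correction factor 2(n-1)/n is the one documented for nccl-tests; the function names and timestamp inputs are illustrative.

```python
def ttft(request_time: float, token_times: list) -> float:
    """Time-to-first-token: delay from request arrival to the first output token."""
    return token_times[0] - request_time


def tpot(token_times: list) -> float:
    """Time-per-output-token: mean gap between consecutive output tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def allreduce_bus_bandwidth(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """nccl-tests AllReduce bus bandwidth: algbw * 2(n-1)/n, in bytes/second."""
    algbw = size_bytes / seconds  # algorithm bandwidth
    return algbw * 2 * (n_ranks - 1) / n_ranks
```

The bus-bandwidth correction makes AllReduce results comparable across rank counts, which is why it is the figure usually quoted from nccl-tests.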
8.2 Headline Results
| Workload | Baseline | R^2CCL overhead under single NIC fault | Speedup vs. competitor |
|---|---|---|---|
| Megatron-LM training | NCCL | <1% | ~12x faster recovery vs. AdapCC |
| vLLM inference | NCCL | <3% | ~47x faster recovery vs. DejaVu |
- Sustained throughput under fault reaches up to 93% of the fault-free level.
- nccl-tests microbenchmarks show R^2CCL preserves bus bandwidth within a few percent of fault-free NCCL after a NIC drop.
- Multi-failure scenarios tested up to roughly half of NICs failed; the recursive AllReduce decomposition keeps progress alive.
9. Related Work
- Checkpointing and rollback: high cost, coarse granularity.
- Offline communication synthesis (TACCL, etc.): pre-computes optimal schedule but cannot react to runtime faults.
- AdapCC: between-round reconfiguration — limited to training, slow.
- DejaVu: inference-only KV-cache replication — high storage cost.
- R^2CCL is presented as the first system to combine fault tolerance and online scheduling in the same CCL plugin for both training and inference.
10. Conclusions
- R^2CCL sustains large-scale LLM workloads through routine network faults without rolling back checkpoints or restarting requests.
- Plugin-level integration into NCCL means existing PyTorch / Megatron / vLLM stacks benefit transparently.
- Future directions include support for process-level fault tolerance and intra-node NVLink fault recovery.
Limitations
- Cannot recover from process-level crashes, OS failures, or GPU hardware failures.
- Does not handle intra-node NVLink/NVSwitch faults.
- Assumes at least one healthy path between every pair of communicating nodes — no full-network-partition recovery.
- Evaluation on a 2-node H100 testbed; full-cluster numbers come from SimAI simulation rather than from a real 1024-GPU run.
Action / State Surface Summary (for DynamICCL mapping)
| Element | R^2CCL's Decision Surface |
|---|---|
| Algorithm | Ring, Tree, R^2CCL-AllReduce (multi-phase) |
| Protocol | Simple, LL, LL128 (inherited from NCCL) |
| nChannels | parallel channel count |
| numThreads | per-proxy / per-kernel thread count |
| Chunk size | pipelining granularity |
| Backup-QP pool | sleeping QPs on every NIC pair |
| Failover chain | NIC ordering by PCIe/NUMA proximity |
| Stall timeout | CQE poll deadline before declaring NIC dead |
| Data partition Y | split between Stage A and Stage B in R^2CCL-AllReduce |
| Logical rank order | re-rankable at runtime for ring topology |

| Observation Signals R^2CCL Uses |
|---|
| Per-NIC available bandwidth |
| CQE error codes (transport / work-request) |
| OOB peer status notifications |
| RDMA triangulation probes |
| Link-up / link-down events |
| Per-NIC PCIe/NUMA distance to GPU |
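
The decision surface and observation signals tabulated above could be carried in two small record types. This is a hypothetical sketch for the DynamICCL mapping: the class names, field names, and validation rules are assumptions, and only the algorithm and protocol values are taken from the tables.

```python
from dataclasses import dataclass

ALGORITHMS = ("Ring", "Tree", "R2CCL-AllReduce")
PROTOCOLS = ("Simple", "LL", "LL128")


@dataclass(frozen=True)
class CollectiveConfig:
    algorithm: str
    protocol: str
    n_channels: int   # parallel channel count
    num_threads: int  # per-kernel thread count
    chunk_size: int   # bytes; pipelining granularity

    def __post_init__(self):
        if self.algorithm not in ALGORITHMS:
            raise ValueError(f"unknown algorithm: {self.algorithm}")
        if self.protocol not in PROTOCOLS:
            raise ValueError(f"unknown protocol: {self.protocol}")
        if min(self.n_channels, self.num_threads, self.chunk_size) <= 0:
            raise ValueError("counts and sizes must be positive")


@dataclass(frozen=True)
class NicObservation:
    available_gbps: float  # per-NIC available bandwidth
    link_up: bool          # link-up / link-down state
    numa_distance: int     # PCIe/NUMA distance to the GPU (unit is an assumption)
```

Validating at construction time keeps an out-of-range choice from reaching the point where a configuration is applied.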
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer where Agent-2 selects per-collective (algorithm, protocol, nChannels, numThreads) on HPC GPU clusters to minimize collective completion time. R^2CCL is the single most directly relevant systems paper for DynamICCL because it operates on the exact same NCCL action surface but for a different objective (fault survival rather than nominal performance).
Direct structural analogies:
| R^2CCL element | DynamICCL analog |
|---|---|
| (algorithm, protocol, nChannels, numThreads, chunkSize) | DynamICCL Agent-2 action vector — verbatim |
| alpha-beta cost model switching Balance vs. AllReduce | DynamICCL's hand-rule baseline that the RL policy must beat |
| Per-NIC bandwidth + link health observations | DynamICCL state input under multi-NIC topology |
| Online schedule optimizer at ms latency | Inference-time deployment budget for DynamICCL Agent-2 |
| ncclNet plugin + proxy hook integration | Same plugin point DynamICCL should use to inject configs |
Key lessons for DynamICCL:
- The action space is validated: R^2CCL confirms that the (algorithm, protocol, nChannels, numThreads, chunkSize) tuple is the correct knob set — no need to expand it for DynamICCL Agent-2.
- Online reconfiguration is feasible at production latency: R^2CCL demonstrates millisecond-scale online ring/tree re-derivation. DynamICCL has at least the same time budget per inference step.
- R^2CCL's switching rule is the brittle baseline DynamICCL must beat: the closed-form alpha-beta switch between Balance and AllReduce is exactly the kind of hand-rule that fails to generalize across topologies — perfect target for an RL policy.
- Plugin integration pattern is reusable: R^2CCL's ncclNet transport hook plus proxy send/receive instrumentation is the same integration point DynamICCL needs to inject Agent-2's selected configuration without forking NCCL.
- State signals overlap: per-NIC available bandwidth and link health are first-class observations for both systems. DynamICCL's state vector should include them.
- The two systems compose, not compete: R^2CCL handles the failure axis; DynamICCL handles the nominal-performance axis. A joint deployment would let DynamICCL consume R^2CCL's per-link-bandwidth signal as state input and produce a fault-aware optimal config.
- Pre-registration removes failover latency cost: R^2CCL's GPU-NIC multi-registration trick — pay setup cost once, switch instantly later — is a useful design pattern for DynamICCL when deploying RL-selected configs that may need quick rollback.