R2CCL: Reliable and Resilient Collective Communication Library for LLM Training and Serving — Detailed Summary

Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu | University of Maryland, College Park | arXiv:2512.25059v1 [cs.DC], Dec 2025 / Jan 2026

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.

Abstract

R^2CCL is a fault-tolerant collective communication library for large-scale LLM training and inference, positioned as a drop-in replacement / extension of NCCL/RCCL.
It targets the 10-15% GPU-hour waste caused by slow recovery from network faults (median ~68 minutes per incident under existing systems).
It exploits multi-NIC hardware redundancy already present in modern GPU servers (8 NICs/node typical) to provide lossless, low-overhead failover.
Three primitives: (1) rapid connection migration onto backup NICs, (2) bandwidth-aware load redistribution across surviving paths, (3) resilient collective algorithms that maintain progress without checkpoint rollback (training) or request restart (inference).

1. Introduction

Cost of network failures at scale:

LLM training and inference run on clusters of tens of thousands of GPUs with multiple NICs per node, so fault rates scale linearly with the amount of hardware.
A single NIC failure with vanilla NCCL stalls the affected collective, hangs the whole job, and forces a checkpoint rollback (training) or request reprocessing (inference).
Median recovery time on the order of 68 minutes; cumulative GPU-hour waste 10-15% across the cluster.

The redundancy gap:

Modern GPU servers already have multiple HCAs and heterogeneous intra-node interconnects (PCIe, NVLink) — the redundancy needed to survive a NIC fault is physically present but unused by NCCL/RCCL.
The contribution is a CCL that exposes and orchestrates that redundancy.

Approach summary:

Detect and localize the fault (CQ poll + OOB peer signaling + three-point triangulation).
Migrate in-flight traffic to a pre-registered backup NIC.
Re-optimize the collective schedule online so surviving NICs are loaded proportionally to their remaining bandwidth.

2. Background

GPU-cluster networking:

Two-tier interconnect: intra-node NVLink/NVSwitch + inter-node RDMA fabric (InfiniBand or RoCE).
Each modern GPU server has 1 NIC per GPU (8 NICs/node typical).

Why current CCLs are fragile:

NCCL ring/tree topology is computed once at ncclCommInitRank time and is static for the life of the communicator.
The ncclNet plugin opens RDMA QPs to specific peers up front; on QP error or CQE error, the proxy thread stalls and the communicator becomes unusable.
Recovery requires job restart from the last checkpoint.

Failure types:

NIC hardware/port failure, cable/optic failure, ToR-port failure, RDMA transport-level errors, link flapping, CRC errors.
Out of scope: NVLink/NVSwitch failure, switch-wide outage, GPU/OS/process crash, full network partition.

3. R^2CCL Overview

Three-step recovery loop:

Detect and localize — CQE error codes plus OOB peer notification plus three-point RDMA triangulation pinpoint whether the fault is a local NIC, a cable, or a remote peer.
Migrate — switch active traffic to a pre-registered backup NIC using DMA-buffer rollback for losslessness.
Optimize — re-derive the collective schedule (algorithm choice, ranking, partition ratio) for the new bandwidth profile.

Plugin integration:

R^2CCL hooks into NCCL's existing ncclNet plugin layer; no fork of NCCL core required.
Pre-allocates "sleeping" backup QPs and pre-registers GPU buffers with every NIC during ncclCommInitRank to remove on-demand registration latency from the failover path.

4. Failure Detection and Mitigation

Bilateral failure awareness:

Out-of-band channel (MPI over a management NIC, or TCP) carries failure notifications between peers so both sides agree on which NIC is dead before retransmission begins.
Avoids split-brain where one side has migrated and the other has not.
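
The agreement step above can be sketched as a toy handshake. This is a minimal illustration, not R^2CCL's implementation: the `Endpoint` class, its method names, and the NIC label `mlx5_0` are hypothetical, and direct method calls stand in for messages on the out-of-band channel.

```python
class Endpoint:
    """One side of a connection; method calls stand in for OOB messages."""

    def __init__(self, name):
        self.name = name
        self.dead = set()       # NIC ids this endpoint has marked dead
        self.confirmed = set()  # NIC ids the peer is known to have marked dead

    def detect(self, nic, peer):
        """Local detection: mark the NIC dead and notify the peer."""
        self.dead.add(nic)
        peer.receive_notice(nic, self)

    def receive_notice(self, nic, sender):
        """Peer reported a failure: adopt it and acknowledge."""
        self.dead.add(nic)
        self.confirmed.add(nic)  # the sender already knows
        sender.receive_ack(nic)

    def receive_ack(self, nic):
        self.confirmed.add(nic)

    def may_retransmit(self, nic):
        """Retransmission is allowed only once both sides agree."""
        return nic in self.dead and nic in self.confirmed


a, b = Endpoint("a"), Endpoint("b")
a.detect("mlx5_0", b)
print(a.may_retransmit("mlx5_0"), b.may_retransmit("mlx5_0"))
```

An endpoint that has marked a NIC dead but has not yet heard back from its peer stays blocked, which is the split-brain case the notes describe.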

Three-point triangulation:

Zero-byte RDMA writes from the failed node's peer and an auxiliary third node distinguish "my NIC failed" vs. "your NIC failed" vs. "cable in between failed."
Triangulation completes in milliseconds.
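
One way to read the three-way distinction is as a small decision table over probe outcomes. The sketch below is an assumption about how such a table could look, not the paper's exact procedure: it supposes the direct path between the two endpoints has already failed and that an auxiliary node then probes each endpoint. The function name and return labels are hypothetical.

```python
def localize(aux_reaches_local: bool, aux_reaches_remote: bool) -> str:
    """Classify a fault given that the direct local<->remote path has failed.

    Each flag says whether the auxiliary node's zero-byte probe reached
    that endpoint's NIC.
    """
    if aux_reaches_local and aux_reaches_remote:
        return "path"          # both NICs answer, so the fault lies between them
    if aux_reaches_remote:
        return "local_nic"     # only the local NIC is unreachable
    if aux_reaches_local:
        return "remote_nic"    # only the remote NIC is unreachable
    return "inconclusive"      # neither answers; the auxiliary itself may be cut off


print(localize(True, False))
```

The `inconclusive` branch is included because a single auxiliary node cannot rule out its own connectivity problem; a real system would presumably retry with a different auxiliary.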

GPU-NIC multi-registration:

At startup, every GPU buffer is registered with every NIC on the node using ibv_reg_mr.
On failover, no fresh registration is needed — the backup NIC's rkey/lkey is already valid.
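
The pay-once, look-up-later pattern can be shown with a small in-memory table. This sketch does not call `ibv_reg_mr`; the `RegistrationTable` and `MemoryRegion` names are hypothetical, and the keys are dummy integers standing in for the values a real registration would return.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class MemoryRegion:
    nic: str
    buffer_id: int
    lkey: int
    rkey: int


class RegistrationTable:
    """Keeps one registration per (buffer, NIC) pair, created at startup."""

    def __init__(self, nics):
        self._nics = list(nics)
        self._table = {}
        self._keys = count(1)  # stand-in for keys issued by the verbs layer

    def register_everywhere(self, buffer_id):
        """Startup cost: register the buffer with every NIC on the node."""
        for nic in self._nics:
            region = MemoryRegion(nic, buffer_id, next(self._keys), next(self._keys))
            self._table[(buffer_id, nic)] = region

    def lookup(self, buffer_id, nic):
        """Failover path: a dictionary lookup, no new registration."""
        return self._table[(buffer_id, nic)]


table = RegistrationTable(["nic0", "nic1"])
table.register_everywhere(buffer_id=7)
backup = table.lookup(7, "nic1")
print(backup.nic, backup.rkey)
```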

DMA-buffer rollback:

The R^2CCL proxy keeps a sliding window of acknowledged chunks.
On NIC failure, communication state is rewound to the last acknowledged chunk and resumed on the backup NIC — losslessly.
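
A minimal sketch of the rewind step, assuming cumulative acknowledgements and fixed-size chunks identified by index. The `RollbackWindow` class is hypothetical and tracks only indices; the actual proxy would also manage the underlying DMA buffers.

```python
class RollbackWindow:
    """Tracks which chunks were posted and which were acknowledged."""

    def __init__(self, num_chunks):
        self.num_chunks = num_chunks
        self.next_to_send = 0  # index of the first chunk not yet posted
        self.acked = 0         # every chunk below this index is acknowledged

    def send(self):
        """Post the next chunk and return its index."""
        if self.next_to_send >= self.num_chunks:
            raise IndexError("all chunks already posted")
        idx = self.next_to_send
        self.next_to_send += 1
        return idx

    def ack(self, idx):
        """Cumulative acknowledgement of all chunks up to and including idx."""
        self.acked = max(self.acked, idx + 1)

    def rollback(self):
        """Rewind to the first unacknowledged chunk; return what must be resent."""
        resend = list(range(self.acked, self.next_to_send))
        self.next_to_send = self.acked
        return resend


window = RollbackWindow(num_chunks=8)
for _ in range(5):
    window.send()          # chunks 0..4 are in flight
window.ack(2)              # chunks 0..2 confirmed before the NIC fails
print(window.rollback())   # chunks to replay on the backup NIC
```

Because only unacknowledged chunks are replayed, nothing is lost and nothing already confirmed is sent twice.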

5. Optimize Scheduling: Single-Failure Case

5.1 R^2CCL-Balance

After failover, the failed node's traffic is split across its remaining healthy NICs in proportion to their available bandwidth.
Latency-optimal for small messages where the bottleneck is per-message overhead rather than aggregate throughput.
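
The proportional split can be sketched in a few lines. The function name, the NIC labels, and the choice to hand the rounding remainder to the fastest NIC are assumptions made for illustration; bandwidths are taken to be integers (for example Gbps).

```python
def split_by_bandwidth(total_bytes, nic_bandwidth):
    """Divide total_bytes across healthy NICs in proportion to bandwidth."""
    healthy = {nic: bw for nic, bw in nic_bandwidth.items() if bw > 0}
    if not healthy:
        raise ValueError("no healthy NIC left on this node")
    total_bw = sum(healthy.values())
    shares = {nic: total_bytes * bw // total_bw for nic, bw in healthy.items()}
    # Integer division can leave a few bytes unassigned; give them to the fastest NIC.
    remainder = total_bytes - sum(shares.values())
    fastest = max(healthy, key=healthy.get)
    shares[fastest] += remainder
    return shares


# nic3 has failed (0 Gbps); the other three run at different speeds.
print(split_by_bandwidth(1000, {"nic0": 400, "nic1": 200, "nic2": 100, "nic3": 0}))
```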

5.2 R^2CCL-AllReduce (multi-phase pipelined)

For large messages with substantial bandwidth loss, a balance-only approach leaves bandwidth on the table because the slow node throttles the global pipeline.
R^2CCL-AllReduce splits the data:
- Stage A: a global AllReduce throttled at the slow node's speed.
- Stage B: a partial AllReduce + Broadcast over the healthy subset, running concurrently with Stage A.
The bottleneck workload drops from 2D for ring AllReduce to 1.75D, where D denotes the size of the data being reduced.
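
For context on the 2D figure, the standard ring AllReduce volume can be written out. This assumes D denotes the message size and n the number of ranks; the 1.75D value is taken from the summary above and is not re-derived here.

```latex
V_{\text{ring}}
  = \underbrace{\frac{n-1}{n}\,D}_{\text{reduce-scatter}}
  + \underbrace{\frac{n-1}{n}\,D}_{\text{all-gather}}
  = \frac{2(n-1)}{n}\,D
  \;\xrightarrow{\;n \to \infty\;}\; 2D,
\qquad
\frac{2D - 1.75D}{2D} = 12.5\%.
```

Under that reading, the two-stage scheme relieves the slowest node of one eighth of the traffic it would carry in a plain ring.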

5.3 Switching rule

The decision between Balance and AllReduce is made by an alpha-beta performance model on (latency, bandwidth, message size).
Switch to R^2CCL-AllReduce when bandwidth loss X exceeds ng/(3ng-2), where n is participant count and g a bandwidth term.
The optimal data partition ratio Y between Stage A and Stage B is derived in closed form from the cost model.
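
The threshold test above is simple enough to write down directly. In this sketch, `n` and `g` are passed through exactly as the summary describes them, the bandwidth loss X is assumed to be a fraction between 0 and 1, and the function names and returned labels are hypothetical. The closed form for Y is not reproduced because the summary does not state it.

```python
def switch_threshold(n, g):
    """Bandwidth-loss threshold ng / (3ng - 2) from the switching rule."""
    denom = 3 * n * g - 2
    if denom <= 0:
        raise ValueError("n * g must be large enough for a positive denominator")
    return n * g / denom


def choose_algorithm(bandwidth_loss, n, g):
    """Pick the multi-phase AllReduce only when the loss exceeds the threshold."""
    if bandwidth_loss > switch_threshold(n, g):
        return "R2CCL-AllReduce"
    return "R2CCL-Balance"


print(round(switch_threshold(2, 8), 4))   # 16 / 46
print(choose_algorithm(0.5, 2, 8))
print(choose_algorithm(0.125, 2, 8))
```

As the product ng grows, the threshold approaches 1/3 from above, so for large jobs the rule amounts to switching once roughly a third of the bandwidth is gone.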

6. Optimize Scheduling: Multi-Failure Case

Topology-aware logical re-ranking:

When concurrent failures fragment the logical ring (adjacent ranks both lose NICs), R^2CCL inserts a high-connectivity "bridge node" into the rank order to re-stitch the ring.
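
A toy version of the re-ranking idea, under stated assumptions: a node counts as degraded when its healthy NIC count is below a hypothetical `min_healthy` cutoff, and the ring is rebuilt by placing one well-connected node after each degraded node. This interleaving is an illustrative stand-in, not the paper's algorithm, and all names are invented for the example.

```python
def has_adjacent_degraded(ring, healthy_nics, min_healthy):
    """True if two neighbouring ranks on the ring (with wrap-around) are both degraded."""
    n = len(ring)
    return any(
        healthy_nics[ring[i]] < min_healthy
        and healthy_nics[ring[(i + 1) % n]] < min_healthy
        for i in range(n)
    )


def rerank_with_bridges(ring, healthy_nics, min_healthy):
    """Reorder the ring so every degraded node is followed by a bridge node."""
    if not has_adjacent_degraded(ring, healthy_nics, min_healthy):
        return list(ring)  # nothing to repair
    degraded = [r for r in ring if healthy_nics[r] < min_healthy]
    # Best-connected nodes are used as bridges first (the sort is stable).
    bridges = sorted(
        (r for r in ring if healthy_nics[r] >= min_healthy),
        key=lambda r: healthy_nics[r],
        reverse=True,
    )
    if len(degraded) > len(bridges):
        raise ValueError("not enough well-connected nodes to separate degraded ones")
    order = []
    for node, bridge in zip(degraded, bridges):
        order += [node, bridge]
    order += bridges[len(degraded):]
    return order


nics = {0: 8, 1: 4, 2: 4, 3: 8, 4: 8, 5: 8}   # nodes 1 and 2 each lost NICs
print(rerank_with_bridges([0, 1, 2, 3, 4, 5], nics, min_healthy=8))
```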

Recursive AllReduce decomposition:

For severe concurrent multi-NIC failures, R^2CCL recursively peels off sub-rings of healthy nodes, runs AllReduce within each sub-ring, then combines results.
This generalizes Stage A/Stage B into a recursive decomposition.
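
One possible reading of the recursion is a layer-peeling scheme: run a layer over all nodes at the slowest node's rate, subtract that rate, drop nodes with nothing left, and repeat on the rest. The sketch below implements only that bookkeeping. It is an interpretation offered for illustration, not the paper's algorithm, and the function name and bandwidth units are assumptions.

```python
def peel_layers(bandwidth):
    """Return a list of (participants, rate) layers, widest layer first.

    bandwidth maps node name -> surviving bandwidth (any consistent unit).
    """
    layers = []
    remaining = dict(bandwidth)
    while len(remaining) >= 2:          # a sub-ring needs at least two nodes
        rate = min(remaining.values())
        if rate > 0:
            layers.append((sorted(remaining), rate))
        # Nodes whose bandwidth is used up leave; the rest keep the surplus.
        remaining = {
            node: bw - rate for node, bw in remaining.items() if bw - rate > 0
        }
    return layers


for participants, rate in peel_layers({"a": 8, "b": 8, "c": 4, "d": 2}):
    print(participants, rate)
```

With two bandwidth levels this reduces to the Stage A / Stage B split: one layer over every node at the slow rate and one layer over the healthy subset using the surplus.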

7. Implementation

Integration with NCCL:

Hooks: ncclNet plugin (transport), proxy send/receive loop (data plane), bootstrap (control plane).
Backup connections: pre-established but inactive QPs on every NIC pair.
CQ monitoring: continuous poll for completion-queue error codes (transport errors, work-request errors).
OOB channel: MPI or TCP control bus on a designated management NIC.

NCCL internals touched:

ncclCommInitRank: extended to register buffers with all NICs and allocate backup QPs.
ncclNet plugin: error path now triggers migration rather than abort.
Proxy thread: maintains the DMA-buffer rollback window.
Ring/Tree topology: re-derivable at runtime by the schedule optimizer rather than fixed at init.

8. Evaluation

8.1 Setup

Hardware:

Two physical servers, each with 8x NVIDIA H100 GPUs and 8x Mellanox ConnectX-7 400 Gbps NICs.
Large-scale results obtained via SimAI simulation up to 1024 GPUs.

Workloads:

Training: Megatron-LM GPT-3 2.7B and 13B; DeepSpeed-Chat RLHF on a 175B model.
Inference: vLLM serving Llama-3.1 70B and 405B, OPT-66B, BLOOM-176B.

Baselines:

Vanilla NCCL (no fault tolerance).
AdapCC (reconfigures NCCL between training rounds).
DejaVu (inference-side KV-cache replication for fault tolerance).

Metrics:

Throughput (tokens/sec for both training and inference).
Overhead percentage under fault.
TTFT (time-to-first-token) and TPOT (time-per-output-token) for inference latency.
nccl-tests bus bandwidth microbenchmarks.
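
For readers unfamiliar with the bus bandwidth metric, nccl-tests derives it from the measured algorithm bandwidth (bytes divided by time) by applying a collective-specific factor, which for AllReduce is 2(n-1)/n. A minimal sketch with a hypothetical function name:

```python
def allreduce_bus_bandwidth(size_bytes, seconds, n_ranks):
    """Bus bandwidth in bytes/second for an AllReduce, nccl-tests style."""
    if n_ranks < 2:
        raise ValueError("AllReduce bus bandwidth needs at least two ranks")
    alg_bw = size_bytes / seconds               # algorithm bandwidth
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# 1 GB reduced across 16 ranks in 10 ms.
bw = allreduce_bus_bandwidth(1e9, 0.01, 16)
print(f"{bw / 1e9:.1f} GB/s")
```

The factor makes the number comparable across rank counts, so a drop in bus bandwidth after a NIC failure reflects lost link capacity rather than a change in job size.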

8.2 Headline Results

Workload	Baseline	R^2CCL overhead under single NIC fault	Speedup vs. competitor
Megatron-LM training	NCCL	<1%	~12x faster recovery vs. AdapCC
vLLM inference	NCCL	<3%	~47x faster recovery vs. DejaVu
Sustained throughput under fault	—	up to 93% of fault-free	—

nccl-tests microbenchmarks show R^2CCL preserves bus bandwidth within a few percent of fault-free NCCL after a NIC drop.
Multi-failure scenarios were tested with up to roughly half of the NICs failed; the recursive AllReduce decomposition keeps the job making progress.

9. Related Work

Checkpointing and rollback: high cost, coarse granularity.
Offline communication synthesis (TACCL, etc.): pre-computes optimal schedule but cannot react to runtime faults.
AdapCC: between-round reconfiguration — limited to training, slow.
DejaVu: inference-only KV-cache replication — high storage cost.
R^2CCL is the first to combine fault-tolerance and online scheduling in the same CCL plugin for both training and inference.

10. Conclusions

R^2CCL sustains large-scale LLM workloads through routine network faults without rolling back checkpoints or restarting requests.
Plugin-level integration into NCCL means existing PyTorch / Megatron / vLLM stacks benefit transparently.
Future directions include support for process-level fault tolerance and intra-node NVLink fault recovery.

Limitations

Cannot recover from process-level crashes, OS failures, or GPU hardware failures.
Does not handle intra-node NVLink/NVSwitch faults.
Assumes at least one healthy path between every pair of communicating nodes — no full-network-partition recovery.
Evaluation on a 2-node H100 testbed; full-cluster numbers come from SimAI simulation rather than from a real 1024-GPU run.

Action / State Surface Summary (for DynamICCL mapping)

Element	R^2CCL's Decision Surface
Algorithm	Ring, Tree, R^2CCL-AllReduce (multi-phase)
Protocol	Simple, LL, LL128 (inherited from NCCL)
nChannels	parallel channel count
numThreads	per-proxy / per-kernel thread count
Chunk size	pipelining granularity
Backup-QP pool	sleeping QPs on every NIC pair
Failover chain	NIC ordering by PCIe/NUMA proximity
Stall timeout	CQE poll deadline before declaring NIC dead
Data partition Y	split between Stage A and Stage B in R^2CCL-AllReduce
Logical rank order	re-rankable at runtime for ring topology

Observation Signals R^2CCL Uses
Per-NIC available bandwidth
CQE error codes (transport / work-request)
OOB peer status notifications
RDMA triangulation probes
Link-up / link-down events
Per-NIC PCIe/NUMA distance to GPU

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer where Agent-2 selects per-collective (algorithm, protocol, nChannels, numThreads) on HPC GPU clusters to minimize collective completion time. R^2CCL is the single most directly relevant systems paper for DynamICCL because it operates on the exact same NCCL action surface but for a different objective (fault survival rather than nominal performance).

Direct structural analogies:

R^2CCL element	DynamICCL analog
(algorithm, protocol, nChannels, numThreads, chunkSize)	DynamICCL Agent-2 action vector — verbatim
alpha-beta cost model switching Balance vs. AllReduce	DynamICCL's hand-rule baseline that the RL policy must beat
Per-NIC bandwidth + link health observations	DynamICCL state input under multi-NIC topology
Online schedule optimizer at ms latency	Inference-time deployment budget for DynamICCL Agent-2
ncclNet plugin + proxy hook integration	Same plugin point DynamICCL should use to inject configs

Key lessons for DynamICCL:

The action space is validated: R^2CCL confirms that the (algorithm, protocol, nChannels, numThreads, chunkSize) tuple is the correct knob set — no need to expand it for DynamICCL Agent-2.
Online reconfiguration is feasible at production latency: R^2CCL demonstrates millisecond-scale online ring/tree re-derivation. DynamICCL has at least the same time budget per inference step.
R^2CCL's switching rule is the brittle baseline DynamICCL must beat: the closed-form alpha-beta switch between Balance and AllReduce is exactly the kind of hand-rule that fails to generalize across topologies — perfect target for an RL policy.
Plugin integration pattern is reusable: R^2CCL's ncclNet transport hook plus proxy send/receive instrumentation is the same integration point DynamICCL needs to inject Agent-2's selected configuration without forking NCCL.
State signals overlap: per-NIC available bandwidth and link health are first-class observations for both systems. DynamICCL's state vector should include them.
The two systems compose, not compete: R^2CCL handles the failure axis; DynamICCL handles the nominal-performance axis. A joint deployment would let DynamICCL consume R^2CCL's per-link-bandwidth signal as state input and produce a fault-aware optimal config.
Pre-registration removes failover latency cost: R^2CCL's GPU-NIC multi-registration trick — pay setup cost once, switch instantly later — is a useful design pattern for DynamICCL when deploying RL-selected configs that may need quick rollback.