The Big Send-off: Scalable and Performant Collectives for Deep Learning (PCCL)

Siddharth Singh, Keshav Pradeep, Mahua Singh, Cunyang Wei, Abhinav Bhatele | University of Maryland, IIT Guwahati | arXiv (cs.DC) 2026

Problem

Large-scale deep learning training (DeepSpeed ZeRO-3, FSDP, DDP) issues moderate-to-large collective messages (tens of MBs to GBs) across thousands of GPUs. On modern supercomputers (Frontier, Perlmutter) the vendor collective libraries — Cray-MPICH, NCCL, and RCCL — fail to maintain ideal scaling at high GCD/GPU counts. Cray-MPICH funnels all write traffic through a single NIC and all read traffic through another, leaving most of the four available Slingshot-11 NICs idle, and performs reductions on the CPU. NCCL and RCCL only implement the ring algorithm for all-gather and reduce-scatter, whose latency grows linearly O(p) in the number of ranks. RCCL also exhibits reliability problems (failures, performance variability) and forces traffic into the NIC's slow software "overflow" path at scale (200x higher overflow counter than PCCL). The result is a widening communication bottleneck for LLM training.

Core Insight

Decompose every global collective into a hierarchical two-level form (inter-node + intra-node), pick a logarithmic-latency recursive algorithm for the inter-node phase, and use a learned SVM dispatcher to choose the best backend per (message size, GPU count) at runtime — concurrently exploiting all NICs by issuing one inter-node sub-collective per local-rank GPU.

Method

PCCL (Performant Collective Communication Library) implements every collective as three phases:

Inter-node phase: rank-aligned sub-communicators (e.g., all rank-0 GPUs, all rank-1 GPUs, ...) run inter-node sub-collectives concurrently. PCCL implements two backends here — PCCL_ring (bandwidth-optimal, linear latency) and PCCL_rec (recursive halving / doubling, logarithmic latency) — both built directly on MPI point-to-point send/recv to keep traffic on the Cassini NIC's hardware-accelerated priority list.
Intra-node phase: vendor library (NCCL on NVIDIA, RCCL on AMD) handles the intra-node collective over NVLink / Infinity Fabric.
Device-local shuffle: a CUDA/HIP kernel transposes per-GPU buffers into the correct global element order.

Inter-node reductions are offloaded to GPU kernels rather than the CPU. By running one inter-node sub-collective per local rank, PCCL spreads inter-node traffic across all four NICs simultaneously instead of pinning to one.

The adaptive dispatcher is a small SVM classifier with input features (message size, GPU count). It outputs one of {PCCL_rec, PCCL_ring, native MPI, NCCL/RCCL}. The SVM is trained offline on benchmark sweeps over message sizes 1 MB - 1024 MB and GPU counts 4 - 2048; reported dispatch accuracy is 75% - 95.4% depending on collective and machine. At runtime PCCL queries the SVM and routes the call to the predicted backend. This is the system's only learning component; the algorithms themselves are deterministic.

The library is implemented in C++ with CUDA/HIP kernels and exposes a Pybind11 Python API for direct use from PyTorch and DeepSpeed.

Results

Evaluated on Frontier (AMD MI250X, Slingshot-11) and Perlmutter (NVIDIA A100, Slingshot-11):

Reduce-scatter on Frontier, 2048 GCDs: up to 168x speedup over RCCL.
All-gather on Frontier, 2048 GCDs: up to 33x speedup over RCCL.
All-reduce on Frontier, 2048 GCDs: up to 10x speedup over RCCL.
DeepSpeed ZeRO-3 training (GPT-7B/13B): up to 4.9x end-to-end training speedup over RCCL.
PyTorch DDP training (GPT-1.3B): up to 2.4x speedup.
NIC overflow counter lpe_net_match_overflow_0: PCCL reduces by 200x vs. RCCL, confirming traffic stays on the hardware fast-path.
SVM dispatcher accuracy: 75% - 95.4% across collectives and machines.

Limitations

Evaluated only on Slingshot-11; InfiniBand evaluation is future work.
At small GPU counts with very large messages, vendor libraries (RCCL) can still beat PCCL — the SVM dispatcher is required to fall back to them.
The "resilience" angle is qualitative: the paper motivates choosing MPI point-to-point because RCCL has been observed to fail at scale, but it does not implement explicit fault detection, multi-path failover, retransmission, or checkpoint/restart machinery. Resilience here means "uses a more reliable backend" rather than "tolerates failures during a collective".
Architecture-aware topology mapping is left as future work.

Relevance to DynamICCL

PCCL is the closest work in spirit to DynamICCL: both use a learned model to pick collective configuration at runtime instead of trusting NCCL/RCCL's built-in static decisions. Direct lessons:

(Algorithm, backend) is a real, evaluated decision dimension: PCCL shows that picking ring vs. recursive at the inter-node level alone yields 168x gains. DynamICCL's algorithm-selection action (Ring / Tree / etc.) is the same decision lifted into NCCL's tuner-plugin API.
Decision features (message size, GPU count) are sufficient for high accuracy: the SVM hits 75 - 95% with just two features. This is a strong prior for DynamICCL's state design — message size and rank count are first-class observations; richer history may be over-engineering for the coarse algorithm/protocol choice.
Hierarchical decomposition implies a factored action space: PCCL chooses inter-node algorithm and intra-node algorithm independently. DynamICCL can mirror this with a factored action head — separate logits for inter-node algorithm and intra-node protocol — which is more sample-efficient than one big joint softmax.
Concurrent rank-aligned sub-collectives = an nChannels analog: PCCL gets multi-NIC utilization by running one sub-collective per local rank simultaneously. NCCL's nChannels plays the same role on a single host (one channel per SM cluster, ideally one per NIC). DynamICCL's nChannels action should be informed by the same physical reality — match the channel count to the available NICs/links per node.
Offline-trained classifier, runtime inference: PCCL's SVM is trained once on a benchmark sweep and queried per call. DynamICCL can adopt the same deployment shape — offline RL training over Chameleon trace replays, then a frozen policy queried inside the tuner plugin.
The "resilience" framing is mostly backend-choice, not active recovery: if DynamICCL adds a true fault-tolerance dimension (retry, fallback path, path migration), it would extend beyond what PCCL does and would need new instrumentation hooks in NCCL.

PCCL validates the central DynamICCL hypothesis at petascale: a tiny learned dispatcher beats vendor heuristics by an order of magnitude on real LLM training. DynamICCL extends this to richer NCCL knobs (protocol, nChannels, numThreads) and uses RL instead of supervised classification, allowing optimization of latency/goodput targets that lack labeled "best backend" ground truth.