The Big Send-off: Scalable and Performant Collectives for Deep Learning (PCCL) — Detailed Summary
Siddharth Singh, Keshav Pradeep, Mahua Singh, Cunyang Wei, Abhinav Bhatele | University of Maryland, IIT Guwahati | arXiv (cs.DC) 2026
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.
Abstract
- Communication overhead is a primary scaling bottleneck for distributed deep learning on modern GPU supercomputers (Frontier, Perlmutter).
- Vendor collective libraries — Cray-MPICH, NCCL, and RCCL — fail to maintain ideal scaling at high GCD/GPU counts for the moderate-to-large messages characteristic of LLM training (ZeRO-3, FSDP, DDP).
- PCCL (Performant Collective Communication Library) introduces a hierarchical two-level decomposition of all-gather, reduce-scatter, and all-reduce, plus an SVM-based runtime dispatcher that selects the best backend per call.
- On 2048 GCDs of Frontier, PCCL achieves up to 168x speedup for reduce-scatter and 33x for all-gather over RCCL, translating to 4.9x faster DeepSpeed ZeRO-3 training and 2.4x faster DDP training.
1. Introduction / Motivation
- Distributed DL workloads (parameter sharding, gradient all-reduce) issue collectives ranging from tens of megabytes to gigabytes — far from the small-message regime that classical HPC collective libraries optimize.
- As LLMs scale to thousands of GPUs, communication time grows from a small fraction of step time to dominate it; closing this gap is a first-order performance lever.
- The authors observe three categories of failure in the existing stack:
- Cray-MPICH on Slingshot-11 routes all writes through NIC-0 and all reads through NIC-3, leaving NICs 1 and 2 idle, and performs reductions on the CPU host.
- NCCL and RCCL implement only the ring algorithm for all-gather and reduce-scatter; ring latency grows linearly O(p), so log-latency algorithms become preferable beyond a moderate scale.
- RCCL exhibits reliability issues at large GPU counts (failures, performance variability) and forces traffic onto the NIC's software overflow path, evidenced by a 200x higher lpe_net_match_overflow_0 counter relative to PCCL.
2. Background
- Sharded Data Parallel (SDP) (ZeRO-3, FSDP): the model is sharded across ranks; each forward/backward step issues an all-gather to materialize parameters and a reduce-scatter to redistribute gradient shards.
- Distributed Data Parallel (DDP): each rank holds a full copy; gradients are all-reduced once per step.
- Ring algorithm: bandwidth-optimal, sends a chunk around a logical ring; latency is O(p) hops per chunk, problematic at high p.
- Recursive doubling (all-gather) / halving (reduce-scatter): log p latency, slightly higher per-step bandwidth, much better at scale.
- The interplay of these algorithms with hardware (NIC count per node, inter-node fabric, intra-node link) determines real performance.
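The latency difference between the two algorithm families can be sketched with the textbook alpha-beta cost model. This is a minimal illustration and not the paper's model: `ALPHA`, `BETA`, and `N_BYTES` are assumed values, and for power-of-two rank counts this idealized model gives both algorithms the same bandwidth term, so only the latency term differs. Real networks add contention and distance effects that the model ignores.

```python
import math

def ring_allgather_time(p, n_bytes, alpha, beta):
    """Ring all-gather: p-1 steps, each moving n_bytes/p per rank."""
    return (p - 1) * alpha + (p - 1) / p * n_bytes * beta

def recursive_doubling_allgather_time(p, n_bytes, alpha, beta):
    """Recursive doubling: log2(p) steps, exchanged data doubles each step."""
    return math.log2(p) * alpha + (p - 1) / p * n_bytes * beta

# Assumed parameters, chosen only for illustration (not measured).
ALPHA = 5e-6           # per-message latency in seconds
BETA = 1 / 25e9        # seconds per byte, i.e. a 25 GB/s link
N_BYTES = 64 * 2**20   # 64 MB of gathered output

def compare(p_values=(8, 64, 512, 2048)):
    """Return (p, ring_seconds, recursive_seconds) for each rank count."""
    return [
        (p,
         ring_allgather_time(p, N_BYTES, ALPHA, BETA),
         recursive_doubling_allgather_time(p, N_BYTES, ALPHA, BETA))
        for p in p_values
    ]

if __name__ == "__main__":
    for p, ring, rec in compare():
        print(f"p={p:5d}  ring={ring * 1e3:8.3f} ms  recursive={rec * 1e3:8.3f} ms")
```

Under these assumptions the ring's latency term grows with p - 1 while the recursive term grows with log2(p), which is the scaling argument made in the bullets above.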
3. Current State of Collective Libraries
- Profiling Cray-MPICH on Frontier: communication time fails to flatten with rank count even though the per-rank message shrinks. Hardware counters reveal that the NICs are used unevenly: traffic concentrates on part of the node's four NICs, which become the bottleneck, while the remainder sit idle.
- Profiling NCCL/RCCL: communication time grows roughly linearly with rank count for all-gather and reduce-scatter — the signature of the ring algorithm being the only available choice.
- Profiling RCCL specifically: a high lpe_net_match_overflow_0 counter indicates the NIC is dropping into the slow software overflow list; this is associated with both lower throughput and observed reliability issues at high GCD counts.
- Implication: a successful library must (i) use multiple NICs concurrently, (ii) use log-latency algorithms inter-node, (iii) keep traffic on the NIC's hardware priority list.
4. Design of PCCL
4.1 Hierarchical Two-Level Decomposition
PCCL replaces a single global collective with three phases:
- Inter-node phase: define p_local rank-aligned sub-communicators — sub-comm i contains the rank-i GPU from every node. All p_local sub-collectives execute concurrently. Because each rank-aligned sub-collective is pinned to a different local GPU, and each local GPU has an affinity to a different NIC, the system uses all NICs simultaneously.
- Intra-node phase: each node's local GPUs run an intra-node collective over NVLink / Infinity Fabric using the vendor library (NCCL or RCCL), which is well-tuned for the on-node fabric.
- Device-local shuffle: a CUDA/HIP kernel transposes the per-GPU buffer into the correct global element order, since the two-level decomposition permutes the natural global indexing.
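The three phases can be traced with a small pure-Python simulation of an all-gather. This is a sketch under simplifying assumptions, not PCCL's implementation: each rank contributes a one-element chunk equal to its global rank, global ranks are assumed to be numbered node by node, and the function name is hypothetical.

```python
def hierarchical_allgather(num_nodes, gpus_per_node):
    """Simulate the three phases; returns (pre-shuffle, post-shuffle) buffers."""
    p = num_nodes * gpus_per_node
    # Assumed numbering: global rank = node * gpus_per_node + local index.
    chunks = [[r] for r in range(p)]

    # Phase 1: inter-node all-gather inside each rank-aligned sub-communicator.
    # Sub-comm `local` holds the GPU with that local index on every node.
    after_inter = {}
    for local in range(gpus_per_node):
        gathered = []
        for node in range(num_nodes):
            gathered += chunks[node * gpus_per_node + local]
        for node in range(num_nodes):
            after_inter[(node, local)] = list(gathered)

    # Phase 2: intra-node all-gather across the local GPUs of each node.
    after_intra = {}
    for node in range(num_nodes):
        gathered = []
        for local in range(gpus_per_node):
            gathered += after_inter[(node, local)]
        for local in range(gpus_per_node):
            after_intra[(node, local)] = list(gathered)

    # Phase 3: device-local shuffle. The buffer is laid out [local][node];
    # transposing it to [node][local] restores global rank order.
    shuffled = {}
    for key, buf in after_intra.items():
        shuffled[key] = [buf[local * num_nodes + node]
                         for node in range(num_nodes)
                         for local in range(gpus_per_node)]
    return after_intra, shuffled

if __name__ == "__main__":
    before, after = hierarchical_allgather(num_nodes=2, gpus_per_node=4)
    print("before shuffle:", before[(0, 0)])
    print("after shuffle: ", after[(0, 0)])
```

With 2 nodes of 4 GPUs, the buffer before the shuffle interleaves chunks from the two nodes, which shows why the final transpose is needed.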
4.2 Inter-Node Backends
- PCCL_ring: bandwidth-optimal ring built directly on MPI point-to-point send/recv, with vector reduction performed by a GPU kernel rather than the CPU.
- PCCL_rec: recursive halving (reduce-scatter) and recursive doubling (all-gather), giving O(log p) latency.
- Both backends use MPI_Send / MPI_Recv on GPU buffers (CUDA-aware / ROCm-aware MPI), which keeps Cassini NIC traffic on the hardware-accelerated priority list rather than the slow software overflow list.
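The communication pattern of recursive doubling can be illustrated with a simulation that uses no MPI at all. This is a sketch of the general algorithm, not PCCL_rec's code: each pairwise swap below stands in for a paired send/receive on GPU buffers, the rank count is assumed to be a power of two, and the function name is hypothetical.

```python
def recursive_doubling_allgather(p):
    """Simulate all-gather on p ranks (p a power of two); returns (buffers, steps)."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    # buffers[rank] maps the owning rank of a chunk to that chunk's data.
    buffers = [{r: f"chunk{r}"} for r in range(p)]
    steps = 0
    distance = 1
    while distance < p:
        # Every rank swaps everything it currently holds with the partner
        # whose rank differs in exactly the bit selected by `distance`.
        new_buffers = []
        for rank in range(p):
            partner = rank ^ distance
            merged = dict(buffers[rank])
            merged.update(buffers[partner])
            new_buffers.append(merged)
        buffers = new_buffers
        distance *= 2
        steps += 1
    return buffers, steps

if __name__ == "__main__":
    bufs, steps = recursive_doubling_allgather(8)
    print(f"{steps} steps; rank 5 holds chunks from ranks {sorted(bufs[5])}")
```

The held data doubles at every step, so 8 ranks finish in 3 exchange rounds where a ring would need 7.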
4.3 Adaptive Dispatcher (SVM)
- A lightweight SVM classifier sits between the user API and the backends.
- Input features: (message size, GPU count). Two features only.
- Output classes: {PCCL_rec, PCCL_ring, native MPI, NCCL/RCCL}.
- Trained offline on a benchmark sweep over message sizes 1 MB - 1024 MB and GPU counts 4 - 2048.
- Reported dispatch accuracy: 75% - 95.4% across collectives (all-gather, reduce-scatter, all-reduce) and machines (Frontier, Perlmutter).
- At runtime PCCL queries the SVM and routes the call. The SVM is the only learned component; the algorithms themselves are deterministic.
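The offline-label, runtime-route flow described above can be sketched as follows. Two things here are assumptions rather than the paper's method: the timings are made-up numbers (not measurements), and a 1-nearest-neighbour lookup is swapped in for the trained SVM so the example needs only the standard library. The log2 feature scaling and all function names are also hypothetical.

```python
import math

MB = 2**20
BACKENDS = ("PCCL_rec", "PCCL_ring", "native MPI", "NCCL/RCCL")

# Made-up timings in seconds, for illustration only; these are not measurements.
SYNTHETIC_TIMINGS = {
    (16 * MB, 8): dict(zip(BACKENDS, (0.004, 0.003, 0.006, 0.002))),
    (16 * MB, 2048): dict(zip(BACKENDS, (0.005, 0.020, 0.030, 0.400))),
    (1024 * MB, 8): dict(zip(BACKENDS, (0.090, 0.070, 0.120, 0.060))),
    (1024 * MB, 2048): dict(zip(BACKENDS, (0.110, 0.080, 0.500, 2.000))),
}

def features(msg_bytes, gpu_count):
    """Assumed scaling: log2 of both inputs so the two ranges are comparable."""
    return (math.log2(msg_bytes), math.log2(gpu_count))

def label_sweep(timings):
    """Label each benchmark point with its fastest measured backend."""
    samples = []
    for (msg_bytes, gpu_count), per_backend in timings.items():
        fastest = min(per_backend, key=per_backend.get)
        samples.append((features(msg_bytes, gpu_count), fastest))
    return samples

def dispatch(samples, msg_bytes, gpu_count):
    """1-nearest-neighbour stand-in for querying the trained classifier."""
    query = features(msg_bytes, gpu_count)
    _, backend = min(samples, key=lambda s: math.dist(s[0], query))
    return backend

if __name__ == "__main__":
    samples = label_sweep(SYNTHETIC_TIMINGS)
    print(dispatch(samples, 32 * MB, 1024))
```

A real deployment would replace `dispatch` with the prediction call of the trained SVM; the labeling step stays the same regardless of which classifier is used.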
4.4 Architecture Diagram (ASCII)
+--------------------------------------------------------+
|             User API (PyTorch / DeepSpeed)             |
|        all_gather / reduce_scatter / all_reduce        |
+----------------------------+---------------------------+
                             |
                             v
        +--------------------+--------------------+
        |           Adaptive Dispatcher           |
        |           (SVM: msg_size, p)            |
        +---+----------------+----------------+---+
            |                |                |
     +------+------+  +------+------+  +------+------+
     |  PCCL_rec   |  |  PCCL_ring  |  | NCCL / RCCL |
     |   (log p)   |  | (linear p)  |  |  (vendor)   |
     +------+------+  +------+------+  +------+------+
            |                |                |
            +----------------+----------------+
                             |
                             v
        +--------------------+--------------------+
        |  Phase 1: Inter-node sub-collectives    |
        |  (p_local concurrent groups, MPI P2P)   |
        |  GPU-side reduction kernels (CUDA/HIP)  |
        +--------------------+--------------------+
                             |
                             v
        +--------------------+--------------------+
        |  Phase 2: Intra-node collective         |
        |  (vendor NCCL/RCCL over NVLink/IF)      |
        +--------------------+--------------------+
                             |
                             v
        +--------------------+--------------------+
        |  Phase 3: Device-local shuffle kernel   |
        +-----------------------------------------+
NICs: each rank-aligned inter-node sub-comm pins to a different
local GPU, which has affinity to a different NIC ->
all 4 Slingshot-11 NICs active in parallel.
5. Resilience Framing (clarified)
- The paper's "resilient" claim is qualitative and concentrated in backend choice, not in active fault-tolerance machinery.
- Specifically: PCCL builds inter-node phases on MPI point-to-point primitives because MPI has lower observed performance variability than RCCL at large scale and because RCCL has been reported to fail outright at very large GPU counts.
- There is no explicit retransmission strategy, multi-path failover, checkpoint/restart, redundancy, or erasure coding implemented in PCCL.
- The "performant" claim is fully realized; the "resilient" claim should be read as "uses a more dependable backend" rather than "tolerates faults during a collective".
6. Configuration Knobs
| Knob | Purpose |
|---|---|
| Backend choice | PCCL_rec, PCCL_ring, native MPI, NCCL/RCCL |
| Algorithm at inter-node phase | recursive (log p) vs. ring (linear p) |
| Sub-communicator layout | rank-aligned grouping (p_local groups), determines NIC mapping |
| Adaptive vs. manual | SVM-driven or operator override |
| Message-size feature | first SVM input |
| GPU-count feature | second SVM input |
| Python API | Pybind11 bindings — drop-in for PyTorch/DeepSpeed |
PCCL deliberately exposes a small surface; the SVM picks among a handful of backends.
7. Implementation
- Core library: C++.
- GPU kernels (vector reduction, local shuffle): CUDA (NVIDIA) and HIP (AMD).
- Inter-node communication: MPI point-to-point (CUDA-aware / ROCm-aware MPI).
- Intra-node communication: NCCL (NVIDIA) or RCCL (AMD).
- Python integration: Pybind11.
- Targeted hardware: Slingshot-11 (Frontier, Perlmutter); InfiniBand support is left as future work.
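How recursive halving combines communication with the vector reduction can be shown with a pure-Python simulation of reduce-scatter. This is a sketch of the general algorithm rather than PCCL's C++ code: the element-wise sum stands in for the GPU reduction kernel, the rank count is assumed to be a power of two, and the function name is hypothetical.

```python
def recursive_halving_reduce_scatter(vectors):
    """vectors[r] is rank r's full input; returns each rank's reduced chunk."""
    p = len(vectors)
    n = len(vectors[0])
    if p & (p - 1) or n % p:
        raise ValueError("need power-of-two ranks and a length divisible by p")
    chunk = n // p
    data = [list(v) for v in vectors]   # per-rank working copy
    lo = [0] * p                        # active chunk range [lo, hi) per rank
    hi = [p] * p
    distance = p // 2
    while distance >= 1:
        snapshot = [list(d) for d in data]   # values before this exchange
        for rank in range(p):
            partner = rank ^ distance
            mid = (lo[rank] + hi[rank]) // 2
            # Keep the half that contains this rank's final chunk.
            if rank & distance:
                lo[rank] = mid
            else:
                hi[rank] = mid
            # Reduce the partner's copy of the kept half into ours
            # (stand-in for the GPU vector-reduction kernel).
            for i in range(lo[rank] * chunk, hi[rank] * chunk):
                data[rank][i] = snapshot[rank][i] + snapshot[partner][i]
        distance //= 2
    return [data[r][r * chunk:(r + 1) * chunk] for r in range(p)]

if __name__ == "__main__":
    inputs = [[r * 100 + i for i in range(8)] for r in range(4)]
    print(recursive_halving_reduce_scatter(inputs))
```

Each rank halves the data it is responsible for at every step, so after log2(p) exchanges rank r holds only the fully reduced chunk r.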
8. Evaluation
8.1 Testbeds
- Frontier (OLCF): AMD MI250X GCDs (each MI250X has 2 GCDs), Slingshot-11 interconnect with 4 Cassini NICs per node, ROCm 6.2.4 / 6.4.1.
- Perlmutter (NERSC): NVIDIA A100 GPUs, Slingshot-11 interconnect, CUDA 12.4 / 12.9.
8.2 Workloads
- Standalone collective microbenchmarks: all-gather, reduce-scatter, all-reduce, message sizes 16 MB - 1 GB, GPU counts up to 2048.
- Production training: GPT-7B and GPT-13B with DeepSpeed ZeRO-3; GPT-1.3B with PyTorch DDP.
8.3 Baselines
- Cray-MPICH, NCCL, RCCL.
- Related work HiCCL referenced as a hierarchical predecessor that fails at large scale; ACCLAiM and similar auto-tuners cited as supervised-tuning comparisons.
8.4 Headline Numbers
| Result | Value |
|---|---|
| Reduce-scatter on Frontier, 2048 GCDs | up to 168x vs. RCCL |
| All-gather on Frontier, 2048 GCDs | up to 33x vs. RCCL |
| All-reduce on Frontier, 2048 GCDs | up to 10x vs. RCCL |
| ZeRO-3 GPT-7B/13B training | up to 4.9x end-to-end |
| DDP GPT-1.3B training | up to 2.4x end-to-end |
| NIC overflow counter reduction | 200x lower than RCCL |
| SVM dispatcher accuracy | 75% - 95.4% |
8.5 Diagnostic Findings
- The 200x reduction in lpe_net_match_overflow_0 confirms that PCCL keeps traffic on the Cassini priority list (zero-copy, hardware-matched), whereas RCCL spills to the software overflow path.
- Multi-NIC utilization is achieved by construction: with p_local rank-aligned sub-comms, each local GPU's traffic uses a distinct NIC.
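The "by construction" argument can be checked with a small sketch of the rank-aligned grouping. Several details are assumptions made for illustration: global ranks are numbered node by node, the grouping mimics a communicator split keyed on the local rank, and the GPU-to-NIC affinity is a simple block mapping (for example, 8 GCDs sharing 4 NICs in pairs). Actual affinity maps are hardware-specific.

```python
from collections import defaultdict

def rank_aligned_subcomms(num_nodes, gpus_per_node):
    """Group global ranks by local index (like a split with color = local rank)."""
    groups = defaultdict(list)
    for rank in range(num_nodes * gpus_per_node):
        groups[rank % gpus_per_node].append(rank)
    return dict(groups)

def nic_for_local_rank(local_rank, gpus_per_node, nics_per_node):
    """Assumed affinity: consecutive local GPUs share NICs in equal blocks."""
    if gpus_per_node % nics_per_node:
        raise ValueError("gpus_per_node must be a multiple of nics_per_node")
    return local_rank // (gpus_per_node // nics_per_node)

def active_nics(gpus_per_node, nics_per_node):
    """NICs touched when all rank-aligned sub-collectives run concurrently."""
    return sorted({nic_for_local_rank(local, gpus_per_node, nics_per_node)
                   for local in range(gpus_per_node)})

if __name__ == "__main__":
    print(rank_aligned_subcomms(num_nodes=3, gpus_per_node=4))
    print(active_nics(gpus_per_node=8, nics_per_node=4))
```

Because every local index owns one sub-communicator and all of them run at once, every NIC in the assumed mapping carries traffic, in contrast to a single global collective driven through one device.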
9. Limitations and Future Work (as stated)
- Evaluation limited to Slingshot-11; InfiniBand benchmarks are deferred.
- At small GPU counts with very large messages, vendor libraries can still win, motivating the dispatcher's fallback option.
- No implementation of explicit fault tolerance — only backend reliability selection.
- Topology-aware mapping (e.g., dragonfly group placement) is identified as a natural extension.
10. Adaptive / Learning Logic Summary
| Element | PCCL's Definition |
|---|---|
| Decision frequency | Per collective call |
| Model | SVM (RBF/linear kernel implied) |
| Features | (message_size, GPU_count) |
| Output | one of {PCCL_rec, PCCL_ring, native MPI, NCCL/RCCL} |
| Training data | offline benchmark sweeps, sizes 1 MB - 1024 MB, p = 4 - 2048 |
| Training paradigm | supervised classification (label = fastest measured backend) |
| Online adaptation | none — frozen model at deploy time |
| Reported accuracy | 75% - 95.4% |
The rest of the system is deterministic: algorithms, sub-communicator layout, GPU reduction kernels, and the local shuffle are all fixed.
11. Specific Quotes / Numbers Worth Remembering
- "up to 168x for reduce-scatter, 33x for all-gather and 10x for all-reduce" on 2048 GCDs of Frontier.
- "up to 4.9x speedup over RCCL in DeepSpeed ZeRO-3 training" for GPT-7B/13B.
- "RCCL exhibits 200x higher value for the lpe_net_match_overflow_0 counter, compared to PCCL" (the smoking gun for hardware fast-path use).
- "NCCL and RCCL only support the ring algorithm for all-gather and reduce-scatter ... causing the total communication time to grow linearly with the number of processes."
12. Relevance to DynamICCL
DynamICCL selects (algorithm, protocol, nChannels, numThreads) per collective on HPC GPU clusters via RL, exposed through NCCL's tuner-plugin API. PCCL is the closest existing system in design philosophy — both replace static vendor heuristics with a learned per-call dispatcher.
Direct structural analogies:
| PCCL element | DynamICCL analog |
|---|---|
| Adaptive dispatcher (SVM) | RL policy network |
| Decision per collective call | Decision per collective call |
| Backend choice (rec/ring/MPI/NCCL) | Algorithm choice (Ring/Tree/...) |
| Two-level inter/intra split | NCCL inter-node vs intra-node algo |
| Concurrent rank-aligned sub-comms | nChannels (channels-per-collective) |
| (msg_size, p) features | NCCL state: msg_size, ranks, topology, history |
| Offline sweep -> classifier | Offline RL training -> frozen policy |
| End-to-end speedup (4.9x ZeRO-3) | DynamICCL's target metric |
Mechanisms in PCCL that generalize as DynamICCL action-space dimensions:
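- Algorithm dimension (already in DynamICCL). PCCL's reduce-scatter gain of up to 168x over RCCL shows how much the inter-node algorithm choice (ring vs. recursive) can matter, although multi-NIC use and the NIC's hardware fast path also contribute to that figure. NCCL's Ring/Tree choice is the analog; DynamICCL must include it as a first-class action.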
- Inter-node vs. intra-node decomposition (factored action). PCCL decides inter-node and intra-node algorithms independently. DynamICCL should mirror this by factoring its action head: separate logits for the inter-node algorithm and the intra-node protocol/algorithm. Factored action heads are far more sample-efficient than a flat joint softmax over the cartesian product, especially in RL where data is expensive.
- NIC-parallelism dimension (nChannels). PCCL gets multi-NIC use by running p_local sub-collectives concurrently. NCCL's nChannels is the direct analog on a single host: more channels -> more SM clusters and ideally more NICs in parallel. DynamICCL should bias the nChannels action toward (or factor it through) the hardware NIC count of the node, just as PCCL's design implicitly does.
- Backend / protocol choice as a discrete action (with mask). PCCL's dispatcher uses a small discrete output set. DynamICCL's protocol action (LL / LL128 / Simple) is the same shape and can use the masked-softmax trick from Pensieve to disable invalid combinations on a given message size or topology.
- Two features go a long way. The SVM hits 75 - 95% with just (msg_size, p). This is a strong prior for DynamICCL's state design: message size and rank count should dominate the early state representation; richer history (numPipeOps history, prior chunk timings) matters most for the finer knobs (numThreads, chunkSize) rather than the coarse algorithm choice.
- Offline-train, runtime-infer deployment shape. PCCL trains the SVM once on a benchmark sweep and queries it per call; the runtime path is constant-time. DynamICCL on Chameleon Cloud should target the same shape — train offline on trace replays, freeze, query inside the tuner plugin — to keep per-collective latency overhead negligible.
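The factored action head and the masked softmax mentioned in the bullets above can be sketched in a few lines. This is an illustrative sketch, not DynamICCL's code: the logits are placeholders for a policy network's outputs, the mask that disables LL is a hypothetical example, and the function names are assumptions.

```python
import math

ALGORITHMS = ("Ring", "Tree")
PROTOCOLS = ("LL", "LL128", "Simple")

def masked_softmax(logits, mask):
    """Softmax over entries where mask is True; masked entries get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    peak = max(l for l, ok in zip(logits, mask) if ok)   # for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def factored_policy(algo_logits, proto_logits, proto_mask):
    """Independent heads: P(algorithm, protocol) = P(algorithm) * P(protocol)."""
    p_algo = masked_softmax(algo_logits, [True] * len(algo_logits))
    p_proto = masked_softmax(proto_logits, proto_mask)
    return {(a, pr): pa * pp
            for a, pa in zip(ALGORITHMS, p_algo)
            for pr, pp in zip(PROTOCOLS, p_proto)}

if __name__ == "__main__":
    # Hypothetical mask: treat LL as invalid for this message size.
    joint = factored_policy([0.2, 0.1], [1.0, 0.5, 0.3], [False, True, True])
    for action, prob in joint.items():
        print(action, round(prob, 3))
```

With two heads the network emits 2 + 3 logits instead of 2 x 3 joint ones; the gap between the sum and the product widens as more knobs are added as further heads.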
What DynamICCL adds beyond PCCL:
- More knobs: PCCL only picks a backend; DynamICCL picks (algorithm, protocol, nChannels, numThreads). The action space is much larger and combinatorially harder, justifying RL over supervised classification.
- RL over classification: PCCL needs labeled "best backend" ground truth from sweeps. DynamICCL can optimize a continuous reward (negative collective completion time, or end-to-end step time) and discover trade-offs no labeling scheme exposes — e.g., configurations that are individually slower but reduce contention with concurrent collectives.
- Genuine resilience could differentiate DynamICCL: PCCL's "resilient" is just backend-choice. If DynamICCL adds a true fault-tolerance dimension (retry, fallback path, path migration on observed slowdown), it would go beyond what PCCL does, but it requires new instrumentation hooks in NCCL and a richer state with congestion / loss signals.
Bottom line: PCCL is a large-scale validation of the central DynamICCL hypothesis, namely that a small learned per-call dispatcher can beat vendor heuristics: by up to two orders of magnitude on collective microbenchmarks and by up to 4.9x on real LLM training. DynamICCL extends this to a richer NCCL-tuner action space and uses RL to shed the labeled ground-truth requirement.