The Big Send-off: Scalable and Performant Collectives for Deep Learning (PCCL) — Detailed Summary

Siddharth Singh, Keshav Pradeep, Mahua Singh, Cunyang Wei, Abhinav Bhatele | University of Maryland, IIT Guwahati | arXiv (cs.DC) 2026

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.


Abstract


1. Introduction / Motivation


2. Background


3. Current State of Collective Libraries


4. Design of PCCL

4.1 Hierarchical Two-Level Decomposition

PCCL replaces a single global collective with three phases:

  1. Inter-node phase: define p_local rank-aligned sub-communicators, where p_local is the number of GPUs per node; sub-comm i contains the rank-i GPU from every node. All p_local sub-collectives execute concurrently. Because each rank-aligned sub-collective is pinned to a different local GPU, and each local GPU has an affinity to a different NIC, the system uses all NICs simultaneously.
  2. Intra-node phase: each node's local GPUs run an intra-node collective over NVLink / Infinity Fabric using the vendor library (NCCL or RCCL), which is well-tuned for the on-node fabric.
  3. Device-local shuffle: a CUDA/HIP kernel transposes the per-GPU buffer into the correct global element order, since the two-level decomposition permutes the natural global indexing.
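
The three phases can be illustrated with a toy, single-process simulation of an all-gather. This is a sketch under stated assumptions, not PCCL's code: ranks are numbered node-major (global rank = node * p_local + local index), buffers are plain Python lists, and all function names are hypothetical. It shows why the shuffle is needed: after phases 1 and 2 the chunks sit in (local GPU, node) order and have to be transposed into (node, local GPU) order.

```python
"""Toy simulation of a two-level all-gather: inter-node, intra-node, shuffle."""


def rank_aligned_groups(num_nodes, p_local):
    """Sub-communicator i holds the global ranks of local GPU i on every node."""
    return [[node * p_local + i for node in range(num_nodes)]
            for i in range(p_local)]


def node_groups(num_nodes, p_local):
    """One intra-node group per node, holding that node's local GPUs."""
    return [[node * p_local + i for i in range(p_local)]
            for node in range(num_nodes)]


def all_gather(groups, buffers):
    """Every rank in a group ends with the group's buffers concatenated in group order."""
    out = dict(buffers)
    for group in groups:
        gathered = [x for rank in group for x in buffers[rank]]
        for rank in group:
            out[rank] = list(gathered)
    return out


def local_shuffle(buf, num_nodes, p_local, chunk):
    """Transpose chunk order from (local GPU, node) to (node, local GPU)."""
    out = []
    for node in range(num_nodes):
        for i in range(p_local):
            start = (i * num_nodes + node) * chunk
            out.extend(buf[start:start + chunk])
    return out


def hierarchical_all_gather(num_nodes, p_local, chunk=2):
    """Run the three phases; returns each rank's final buffer."""
    world = num_nodes * p_local
    # Rank r contributes `chunk` elements tagged with their global index.
    buffers = {r: [r * chunk + k for k in range(chunk)] for r in range(world)}
    buffers = all_gather(rank_aligned_groups(num_nodes, p_local), buffers)  # phase 1
    buffers = all_gather(node_groups(num_nodes, p_local), buffers)          # phase 2
    return {r: local_shuffle(b, num_nodes, p_local, chunk)                  # phase 3
            for r, b in buffers.items()}
```

With 3 nodes of 4 GPUs each, `hierarchical_all_gather(3, 4)` leaves every rank holding the elements in global order; skipping `local_shuffle` would leave them grouped by local GPU index instead.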

4.2 Inter-Node Backends

4.3 Adaptive Dispatcher (SVM)

4.4 Architecture Diagram (ASCII)

+--------------------------------------------------------+
|              User API (PyTorch / DeepSpeed)            |
|              all_gather / reduce_scatter / all_reduce  |
+----------------------------+---------------------------+
                             |
                             v
                +------------+--------------+
                |  Adaptive Dispatcher      |
                |  (SVM: msg_size, p)       |
                +---+-----------+-----------+
                    |           |           |
        +-----------+   +-------+----+   +--+----------+
        | PCCL_rec  |   | PCCL_ring  |   | NCCL / RCCL |
        | (log p)   |   | (linear p) |   | (vendor)    |
        +-----+-----+   +-----+------+   +------+------+
              |               |                 |
              +---------------+---------+-------+
                                        |
                                        v
                +-----------------------+----------------+
                |   Phase 1: Inter-node sub-collectives  |
                |   (p_local concurrent groups, MPI P2P) |
                |   GPU-side reduction kernels (CUDA/HIP)|
                +----------------------+-----------------+
                                       |
                                       v
                +----------------------+-----------------+
                |   Phase 2: Intra-node collective       |
                |   (vendor NCCL/RCCL over NVLink/IF)    |
                +----------------------+-----------------+
                                       |
                                       v
                +----------------------+-----------------+
                |   Phase 3: Device-local shuffle kernel |
                +----------------------------------------+

NICs: each rank-aligned inter-node sub-comm pins to a different
local GPU, which has affinity to a different NIC ->
all 4 Slingshot-11 NICs active in parallel.

5. Resilience Framing (clarified)


6. Configuration Knobs

Knob                            Purpose
------------------------------  ----------------------------------------------------------------
Backend choice                  PCCL_rec, PCCL_ring, native MPI, NCCL/RCCL
Algorithm at inter-node phase   recursive (log p) vs. ring (linear p)
Sub-communicator layout         rank-aligned grouping (p_local groups), determines NIC mapping
Adaptive vs. manual             SVM-driven or operator override
Message-size feature            first SVM input
GPU-count feature               second SVM input
Python API                      Pybind11 bindings; drop-in for PyTorch/DeepSpeed

PCCL deliberately exposes a small surface; the SVM picks among a handful of backends.
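
The "log p" and "linear p" labels on the inter-node algorithms can be made concrete with textbook round counts. The sketch below is an illustration, not a measurement from PCCL: it assumes a power-of-two number of ranks in the sub-communicator, counts only communication rounds (not bytes moved), and the function name is hypothetical.

```python
import math


def inter_node_steps(num_nodes, algorithm):
    """Communication rounds for one inter-node sub-collective over num_nodes ranks.

    Assumes num_nodes is a power of two; textbook counts, not PCCL measurements.
    """
    if algorithm == "ring":
        return num_nodes - 1                      # linear in p
    if algorithm == "recursive":
        return math.ceil(math.log2(num_nodes))    # logarithmic in p
    raise ValueError(f"unknown algorithm: {algorithm}")


for nodes in (4, 64, 256):
    print(nodes, inter_node_steps(nodes, "ring"), inter_node_steps(nodes, "recursive"))
```

Round count is only one side of the trade-off: fewer rounds favor latency-bound (small-message) cases, which is one reason a per-call choice between the two is useful.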


7. Implementation


8. Evaluation

8.1 Testbeds

8.2 Workloads

8.3 Baselines

8.4 Headline Numbers

Result                                  Value
--------------------------------------  ----------------------
Reduce-scatter on Frontier, 2048 GCDs   up to 168x vs. RCCL
All-gather on Frontier, 2048 GCDs       up to 33x vs. RCCL
All-reduce on Frontier, 2048 GCDs       up to 10x vs. RCCL
ZeRO-3 GPT-7B/13B training              up to 4.9x end-to-end
DDP GPT-1.3B training                   up to 2.4x end-to-end
NIC overflow counter reduction          200x lower than RCCL
SVM dispatcher accuracy                 75% - 95.4%

8.5 Diagnostic Findings


9. Limitations and Future Work (as stated)


10. Adaptive / Learning Logic Summary

Element              PCCL's Definition
-------------------  ------------------------------------------------------------------
Decision frequency   Per collective call
Model                SVM (RBF or linear kernel implied)
Features             (message_size, GPU_count)
Output               one of {PCCL_rec, PCCL_ring, native MPI, NCCL/RCCL}
Training data        offline benchmark sweeps, sizes 1 MB - 1024 MB, p = 4 - 2048
Training paradigm    supervised classification (label = fastest measured backend)
Online adaptation    none; model is frozen at deploy time
Reported accuracy    75% - 95.4%

The rest of the system is deterministic: algorithms, sub-communicator layout, GPU reduction kernels, and the local shuffle are all fixed.
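
The supervised setup in the table (label = fastest measured backend) can be sketched as a small labeling step over an offline sweep. Everything below is illustrative: the timings are made-up placeholders, the log2 scaling of the two features is an assumption rather than something the summary states, and `make_example` and `accuracy` are hypothetical helper names. The resulting (features, label) pairs are what would be handed to an SVM trainer; the training itself is not shown.

```python
import math

BACKENDS = ("PCCL_rec", "PCCL_ring", "native MPI", "NCCL/RCCL")


def make_example(msg_bytes, num_gpus, timings_ms):
    """Turn one benchmark point into (features, label); label = fastest backend."""
    # Assumption: log2-scale both features so 1 MB..1024 MB and 4..2048 GPUs
    # span comparable ranges.
    features = (math.log2(msg_bytes), math.log2(num_gpus))
    label = min(BACKENDS, key=lambda b: timings_ms[b])
    return features, label


def accuracy(predict, examples):
    """Fraction of examples where predict(features) matches the measured-fastest label."""
    hits = sum(1 for features, label in examples if predict(features) == label)
    return hits / len(examples)


# Placeholder sweep: (message bytes, GPU count, per-backend time in ms).
sweep = [
    (1 << 20, 64,   {"PCCL_rec": 0.8, "PCCL_ring": 2.1, "native MPI": 1.5, "NCCL/RCCL": 1.1}),
    (1 << 20, 2048, {"PCCL_rec": 1.9, "PCCL_ring": 9.0, "native MPI": 4.0, "NCCL/RCCL": 6.0}),
    (1 << 30, 8,    {"PCCL_rec": 40.0, "PCCL_ring": 35.0, "native MPI": 50.0, "NCCL/RCCL": 30.0}),
    (1 << 30, 2048, {"PCCL_rec": 120.0, "PCCL_ring": 95.0, "native MPI": 200.0, "NCCL/RCCL": 400.0}),
]
examples = [make_example(*row) for row in sweep]
```

A dispatcher accuracy figure such as the reported 75% - 95.4% corresponds to `accuracy(model_predict, held_out_examples)` for whatever classifier is trained on these pairs.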


11. Specific Quotes / Numbers Worth Remembering


12. Relevance to DynamICCL

DynamICCL selects (algorithm, protocol, nChannels, numThreads) per collective on HPC GPU clusters via RL, exposed through NCCL's tuner-plugin API. PCCL is the closest existing system in design philosophy — both replace static vendor heuristics with a learned per-call dispatcher.

Direct structural analogies:

PCCL element                         DynamICCL analog
-----------------------------------  ----------------------------------------------
Adaptive dispatcher (SVM)            RL policy network
Decision per collective call         Decision per collective call
Backend choice (rec/ring/MPI/NCCL)   Algorithm choice (Ring/Tree/...)
Two-level inter/intra split          NCCL inter-node vs intra-node algo
Concurrent rank-aligned sub-comms    nChannels (channels-per-collective)
(msg_size, p) features               NCCL state: msg_size, ranks, topology, history
Offline sweep -> classifier          Offline RL training -> frozen policy
End-to-end speedup (4.9x ZeRO-3)     DynamICCL's target metric

Mechanisms in PCCL that generalize as DynamICCL action-space dimensions:

  1. Algorithm dimension (already in DynamICCL). PCCL's largest reported gain, up to 168x over RCCL for reduce-scatter at 2048 GCDs, suggests how much is at stake in the inter-node algorithm choice (ring vs. recursive). NCCL's Ring/Tree choice is the analog; DynamICCL must include it as a first-class action.
  2. Inter-node vs. intra-node decomposition (factored action). PCCL handles the inter-node and intra-node phases separately: the inter-node algorithm is selected (recursive vs. ring), while the intra-node phase is delegated to the vendor library. DynamICCL should mirror this by factoring its action head: separate logits for the inter-node algorithm and the intra-node protocol/algorithm. Factored action heads typically need far fewer samples than a flat joint softmax over the cartesian product, which matters in RL where data is expensive.
  3. NIC-parallelism dimension (nChannels). PCCL gets multi-NIC use by running p_local sub-collectives concurrently. NCCL's nChannels is the direct analog on a single host: more channels -> more SM clusters and ideally more NICs in parallel. DynamICCL should bias the nChannels action toward (or factor it through) the hardware NIC count of the node, just as PCCL's design implicitly does.
  4. Backend / protocol choice as a discrete action (with mask). PCCL's dispatcher uses a small discrete output set. DynamICCL's protocol action (LL / LL128 / Simple) is the same shape and can use the masked-softmax trick from Pensieve to disable invalid combinations on a given message size or topology.
  5. Two features go a long way. The SVM reaches 75% - 95.4% accuracy with just (msg_size, p). This is a strong prior for DynamICCL's state design: message size and rank count should dominate the early state representation; richer history (numPipeOps history, prior chunk timings) matters most for the finer knobs (numThreads, chunkSize) rather than the coarse algorithm choice.
  6. Offline-train, runtime-infer deployment shape. PCCL trains the SVM once on a benchmark sweep and queries it per call; the runtime path is constant-time. DynamICCL on Chameleon Cloud should target the same shape — train offline on trace replays, freeze, query inside the tuner plugin — to keep per-collective latency overhead negligible.
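
Items 2 and 4 above (a factored action head and a masked discrete action) can be sketched together. This is a minimal illustration, not DynamICCL's or PCCL's implementation: the two heads are assumed independent, the logits are arbitrary numbers standing in for a policy network's output, and the rule that masks out LL is hypothetical. Only the names Ring/Tree and LL/LL128/Simple come from NCCL.

```python
import math

ALGORITHMS = ("Ring", "Tree")
PROTOCOLS = ("LL", "LL128", "Simple")


def masked_softmax(logits, mask):
    """Softmax over entries where mask is True; masked entries get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    peak = max(l for l, ok in zip(logits, mask) if ok)   # for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def factored_policy(algo_logits, proto_logits, proto_mask):
    """Joint action probabilities as the product of two independent heads."""
    p_algo = masked_softmax(algo_logits, [True] * len(algo_logits))
    p_proto = masked_softmax(proto_logits, proto_mask)
    return {(a, p): pa * pp
            for a, pa in zip(ALGORITHMS, p_algo)
            for p, pp in zip(PROTOCOLS, p_proto)}


# Hypothetical rule: disallow LL for this (large) message size.
joint = factored_policy(algo_logits=[0.2, -0.1],
                        proto_logits=[1.0, 0.5, 0.3],
                        proto_mask=[False, True, True])
```

Here the policy emits 2 + 3 = 5 logits instead of 2 x 3 = 6 for a flat joint head; the gap widens quickly as more dimensions (nChannels, numThreads) are added, at the cost of assuming the heads can be chosen independently.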

What DynamICCL adds beyond PCCL:

Bottom line: PCCL is a large-scale validation (up to 2048 GCDs on Frontier) of the central DynamICCL hypothesis: a tiny learned per-call dispatcher can beat vendor heuristics, by up to two orders of magnitude on individual collectives and by up to 4.9x end-to-end on real LLM training. DynamICCL extends this to a richer NCCL-tuner action space and uses RL to shed the labeled ground-truth requirement.