SCCL: Synthesizing Optimal Collective Algorithms

Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi | Australian National University / University of Utah / Microsoft Research | PPoPP '21 | DOI: 10.1145/3437801.3441620

Problem

Collective communication (AllReduce, AllGather, AllToAll, Broadcast, Reduce) is the Amdahl's-Law bottleneck of data-parallel deep learning: for an 8.3B parameter Megatron model, ~30% of training time is spent in AllReduce on medium (10-100 MB) buffers. Modern accelerator servers (DGX-1, AMD MI50 boxes, IPU pods, TPU pods) ship with irregular, asymmetric interconnects (NVLink mesh + PCIe islands + xGMI bridges) that no longer fit the regular ring/tree topologies that classic algorithms target. Vendor libraries (NCCL, RCCL) hand-tune one or two algorithms per collective; for novel topologies, new collectives, or unusual buffer sizes, the result is leaving large amounts of bandwidth and latency on the table. Manually re-deriving optimal schedules for every (topology, collective, size) triple is intractable.

Core Insight

Collective-algorithm discovery can be cast as a program-synthesis problem over a tractable class of k-synchronous schedules and discharged to an SMT solver, producing the entire Pareto-frontier between latency-optimal and bandwidth-optimal algorithms automatically — and the resulting synthesized kernels match or beat hand-tuned NCCL/RCCL kernels at speeds up to 6.8x while taking minutes (not weeks) to derive per topology.

Method

+------------------------------------------------------------------+
| Inputs                                                           |
|   - Topology graph G (nodes = GPUs, edges = links + bandwidths) |
|   - Collective spec: (pre, post) relations from Table 2          |
|     e.g. AllGather: pre=Scattered, post=All                      |
|   - k = relaxation slack (rounds R <= steps S + k)               |
+------------------------------------------------------------------+
                            |
                            v
+------------------------------------------------------------------+
| 1. Pareto-Synthesize loop (Algorithm 1)                          |
|    - Lower bound a_l = Diameter(G); b_l = 1 / BisectionBW(G)     |
|    - For S = a_l, a_l+1, ...:                                    |
|        For (R, C) with S <= R <= S+k and R/C >= b_l:             |
|           Build SMT formula; query Z3                            |
|           If SAT: emit (S, R, C) algorithm; widen R/C            |
|           Stop when R/C == b_l (bandwidth-optimal hit)           |
+------------------------------------------------------------------+
                            |
                            v
+------------------------------------------------------------------+
| 2. SMT encoding (quantifier-free linear-int arithmetic)          |
|    Variables:                                                    |
|     time_{c,n}: int   (step at which chunk c reaches node n)    |
|     snd_{n,c,n'}: bool (does n send chunk c to n'?)             |
|    Constraints C1-C6: pre/post, validity, causality, bandwidth   |
+------------------------------------------------------------------+
                            |
                            v
+------------------------------------------------------------------+
| 3. Code generation                                               |
|    - SPMD CUDA/HIP code, top-level switch on rank                |
|    - PUSH model over CUDA IPC memory handles                     |
|    - Single fused kernel for all steps                           |
|    - Fine-grained signal/wait via shared flags +                 |
|      __threadfence_system() (no global barrier)                  |
+------------------------------------------------------------------+

The cost model is the standard (alpha, beta) model: cost = a*alpha + b*L*beta, where a is the number of synchronization steps S and b is the bandwidth coefficient R/C (rounds per chunk). The Pareto-Synthesize algorithm walks this 2D space corner-by-corner.

Experimental Setup

Component	NVIDIA testbed	AMD testbed
Server	Nvidia DGX-1	Gigabyte Z52
GPUs	8 x V100	8 x MI50
GPU links	6 x NVLink @ 25 GB/s per GPU	xGMI @ 33 GB/s + PCIe 4.0 @ 27 GB/s
CPU	2 x Xeon E5-2698 v4	2 x AMD EPYC 7002
OS	Ubuntu 20.04	Ubuntu 20.04
Comm runtime	CUDA 10.2	ROCm 3.5.0
SMT solver	Z3 v4.8.8	Z3 v4.8.8
Baselines	NCCL 2.7.8-1, OpenMPI/UCX	RCCL, OpenMPI/UCX
Collectives	AllGather, AllReduce, AllToAll, Broadcast, Reduce	AllGather, AllReduce, AllToAll
Buffer-size sweep	small (KB) -> large (~GB)	small -> large
Metric	Avg latency (ms); speedup vs. vendor lib	same

Headline Quantitative Results

Speedup over NCCL on DGX-1 (Figures 4-6):

Collective	Small-buffer speedup	Large-buffer speedup
AllGather	2.2x	1.14x
AllReduce	1.8x	1.06x
AllToAll	1.4x	6.8x

Synthesis-time + Pareto-frontier on DGX-1 (Table 4):

Collective	(C, S, R)	Optimality	SMT time
AllGather	(1, 2, 2)	Latency	0.3 s
AllGather	(6, 7, 7)	Bandwidth	4.6 s
AllGather	(6, 3, 7)	Bandwidth (lower S)	6.6 s
AllReduce	(8, 4, 4)	Latency	0.3 s
AllReduce	(48, 14, 14)	Bandwidth	12.8 s
AllReduce	(16, 4, 6)	Latency-tight	0.8 s
Broadcast	(2, 2, 2)	Latency	0.1 s
AllToAll	(8, 2, 3)	Latency	3.0 s
AllToAll	(24, 8, 8)	Bandwidth	133.7 s
AllToAll	(24, 2, 8)	Both	24.3 s

NCCL baselines (Table 3, DGX-1):

AllGather/ReduceScatter: 6 chunks, 7 steps, 7 rounds (cost 7*alpha + (7/6)*L*beta)
AllReduce: 48 chunks, 14 steps, 14 rounds
SCCL synthesizes a (6, 3, 7) AllGather at the same bandwidth coefficient (7/6) but only 3 steps instead of 7 — the latency win dominates at small buffers.

AMD MI50 / Z52 (Table 5):

AllGather (1, 4, 4): latency-optimal, 0.5 s synthesis.
AllGather (2, 7, 7): bandwidth-optimal, 1.3 s synthesis.
AllToAll (8, 4, 8): combined optimal, 8.2 s synthesis.
SCCL beats RCCL on large AllGather; trails on small/medium where RCCL is hand-tuned.

Synthesis scalability:

Most algorithms synthesize in < 10 seconds.
Worst case: 24-chunk AllToAll = 133.7 s.
A naive Boolean encoding failed to terminate within 60 minutes on the same problem — the integer/bool hybrid encoding is what makes synthesis tractable.

Implementation knobs:

DMA copies are ~10% faster than CUDA-kernel copies on NVLink (DMA emits maximum-sized packets vs. 128-byte cache lines).
PUSH model is ~10% faster than PULL (avoids request-round-trip).
Single-fused-kernel + __threadfence_system() flags eliminate the per- step kernel-launch overhead that defeats multi-cudaMemcpy lowering.

Limitations

AMD topology modeling ignores the dotted xGMI cross-island links because SCCL "could not utilize both xGMI and PCIe at the same time" — the bisection-bandwidth bound is therefore loose for some Z52 configurations.
Pareto-Synthesize may not terminate on certain topologies that admit unbounded Pareto-optimal algorithms (Section 3.7).
Only intra-node, single-server settings are evaluated; multi-node hierarchical synthesis is left to follow-up work (MSCCL).
Synthesis is offline and topology-specific: the resulting algorithm is baked into a single fused kernel — no run-time selection across (S, R, C) variants is implemented in SCCL itself; the user picks per buffer size.
The set of collectives is restricted to those expressible by the (pre, post) relations in Table 2 — combining collectives with custom reductions beyond Reduce/AllReduce/ReduceScatter need additional inversion.
Hardware: only V100 + NVLink and MI50 + xGMI; A100/H100 + NVSwitch and multi-rail IB networks are unevaluated.

Open Problems Called Out

Run-time algorithm switching across the Pareto frontier. SCCL produces multiple algorithms per collective but does not pick between them online — this is left to the caller. An adaptive selector conditioned on buffer size and current load is implied future work.
Multi-node and hierarchical synthesis. The k-synchronous class handles intra-node well but does not scale to inter-node DAGs at fleet scale. Cited as the natural follow-on (later realized as MSCCL).
Hardware-software co-design. Because SCCL reports infeasibility of (bandwidth, steps) corners, it can identify topology weaknesses and guide interconnect design.
Custom reduction operators / non-associative ops. SCCL's "combining collective" inversion (Reduce = Broadcast inverse) requires associativity and commutativity — extending to general reductions is open.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads, chunkSize) to minimize collective completion time on HPC GPU clusters. SCCL is both a complementary technology and a natural research neighbor.

Action-space prior — algorithm taxonomy is richer than Ring/Tree. SCCL's synthesized (C, S, R) tuples for DGX-1 demonstrate that bandwidth-optimal at the same R/C can come at very different step counts (e.g. 7 steps vs. 3 steps for AllGather). DynamICCL should treat the NCCL algorithm axis (Ring / Tree / CollNet / NVLS) as a coarse proxy for these synthesized variants — when SCCL/MSCCL plugins are loaded, the action space must expand to include them.
State feature — message-size regime selects across the Pareto frontier. SCCL's headline finding is that latency-optimal algorithms dominate at small buffers and bandwidth-optimal at large buffers, with an explicit cross-over per topology. DynamICCL's log-binned message-size feature is exactly the right primary axis; the agent's job is precisely to learn the topology-specific cross-over point that SCCL leaves to the caller.
State feature — topology fingerprint must capture link-graph asymmetry. SCCL synthesizes different schedules for DGX-1 (NVLink mesh with two Hamiltonian cycles) vs. Z52 (xGMI islands + PCIe bridges). DynamICCL's topology feature must distinguish NVLink-only vs. NVLink+PCIe vs. PCIe+IB vs. Ethernet — confirms the existing 4-bin topology fingerprint at minimum, and motivates extending it to a bisection-bandwidth bin.
Reward shaping — collective wall-clock is the right signal. SCCL reports speedups in absolute latency, not in algBw/busBw proxies. This confirms DynamICCL's reward r = -collective_wall_clock_us is the correct end-to-end metric.
Exploration budget — bandwidth-optimal vs. latency-optimal is a discrete branch. Within a single algorithm family, SCCL shows the step/round trade-off is non-trivial (3 vs. 7 steps at the same bandwidth). For DynamICCL's nChannels + chunkSize actions, the exploration policy should treat (nChannels=1, chunkSize=large) as a distinct branch from (nChannels=8, chunkSize=small) and avoid linear epsilon-greedy that mixes the two.
Research positioning — SCCL/MSCCL fix the algorithm; DynamICCL fixes the configuration. SCCL synthesizes the best schedule for a topology; DynamICCL chooses, online, which protocol / chunk size / channel count the existing NCCL kernel uses. The two layers are complementary: a future stack would load SCCL-synthesized algorithms into NCCL via the tuner-plugin API and let DynamICCL select among them at run time.
Open-problem alignment — runtime Pareto selection is exactly the gap. SCCL's Limitation #1 ("user picks per buffer size") is precisely the problem DynamICCL solves: a learned policy that, given message size + topology + load, picks the right algorithm from the Pareto frontier without manual heuristics.