SCCL: Synthesizing Optimal Collective Algorithms
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi | Australian National University / University of Utah / Microsoft Research | PPoPP '21 | DOI: 10.1145/3437801.3441620
Problem
Collective communication (AllReduce, AllGather, AllToAll, Broadcast, Reduce) is the Amdahl's-Law bottleneck of data-parallel deep learning: for an 8.3B parameter Megatron model, ~30% of training time is spent in AllReduce on medium (10-100 MB) buffers. Modern accelerator servers (DGX-1, AMD MI50 boxes, IPU pods, TPU pods) ship with irregular, asymmetric interconnects (NVLink mesh + PCIe islands + xGMI bridges) that no longer fit the regular ring/tree topologies that classic algorithms target. Vendor libraries (NCCL, RCCL) hand-tune one or two algorithms per collective; for novel topologies, new collectives, or unusual buffer sizes, the result is leaving large amounts of bandwidth and latency on the table. Manually re-deriving optimal schedules for every (topology, collective, size) triple is intractable.
Core Insight
Collective-algorithm discovery can be cast as a program-synthesis
problem over a tractable class of k-synchronous schedules
and discharged to an SMT solver, producing the entire Pareto-frontier
between latency-optimal and bandwidth-optimal algorithms automatically —
and the resulting synthesized kernels match or beat hand-tuned NCCL/RCCL
kernels at speeds up to 6.8x while taking minutes (not weeks) to derive
per topology.
Method
+------------------------------------------------------------------+
| Inputs |
| - Topology graph G (nodes = GPUs, edges = links + bandwidths) |
| - Collective spec: (pre, post) relations from Table 2 |
| e.g. AllGather: pre=Scattered, post=All |
| - k = relaxation slack (rounds R <= steps S + k) |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 1. Pareto-Synthesize loop (Algorithm 1) |
| - Lower bound a_l = Diameter(G); b_l = 1 / BisectionBW(G) |
| - For S = a_l, a_l+1, ...: |
| For (R, C) with S <= R <= S+k and R/C >= b_l: |
| Build SMT formula; query Z3 |
| If SAT: emit (S, R, C) algorithm; widen R/C |
| Stop when R/C == b_l (bandwidth-optimal hit) |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 2. SMT encoding (quantifier-free linear-int arithmetic) |
| Variables: |
| time_{c,n}: int (step at which chunk c reaches node n) |
| snd_{n,c,n'}: bool (does n send chunk c to n'?) |
| Constraints C1-C6: pre/post, validity, causality, bandwidth |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 3. Code generation |
| - SPMD CUDA/HIP code, top-level switch on rank |
| - PUSH model over CUDA IPC memory handles |
| - Single fused kernel for all steps |
| - Fine-grained signal/wait via shared flags + |
| __threadfence_system() (no global barrier) |
+------------------------------------------------------------------+
The cost model is the standard (alpha, beta) model: cost
= a*alpha + b*L*beta, where a is the number of
synchronization steps S and b is the bandwidth
coefficient R/C (rounds per chunk). The Pareto-Synthesize
algorithm walks this 2D space corner-by-corner.
Experimental Setup
| Component | NVIDIA testbed | AMD testbed |
|---|---|---|
| Server | Nvidia DGX-1 | Gigabyte Z52 |
| GPUs | 8 x V100 | 8 x MI50 |
| GPU links | 6 x NVLink @ 25 GB/s per GPU | xGMI @ 33 GB/s + PCIe 4.0 @ 27 GB/s |
| CPU | 2 x Xeon E5-2698 v4 | 2 x AMD EPYC 7002 |
| OS | Ubuntu 20.04 | Ubuntu 20.04 |
| Comm runtime | CUDA 10.2 | ROCm 3.5.0 |
| SMT solver | Z3 v4.8.8 | Z3 v4.8.8 |
| Baselines | NCCL 2.7.8-1, OpenMPI/UCX | RCCL, OpenMPI/UCX |
| Collectives | AllGather, AllReduce, AllToAll, Broadcast, Reduce | AllGather, AllReduce, AllToAll |
| Buffer-size sweep | small (KB) -> large (~GB) | small -> large |
| Metric | Avg latency (ms); speedup vs. vendor lib | same |
Headline Quantitative Results
Speedup over NCCL on DGX-1 (Figures 4-6):
| Collective | Small-buffer speedup | Large-buffer speedup |
|---|---|---|
| AllGather | 2.2x | 1.14x |
| AllReduce | 1.8x | 1.06x |
| AllToAll | 1.4x | 6.8x |
Synthesis-time + Pareto-frontier on DGX-1 (Table 4):
| Collective | (C, S, R) | Optimality | SMT time |
|---|---|---|---|
| AllGather | (1, 2, 2) | Latency | 0.3 s |
| AllGather | (6, 7, 7) | Bandwidth | 4.6 s |
| AllGather | (6, 3, 7) | Bandwidth (lower S) | 6.6 s |
| AllReduce | (8, 4, 4) | Latency | 0.3 s |
| AllReduce | (48, 14, 14) | Bandwidth | 12.8 s |
| AllReduce | (16, 4, 6) | Latency-tight | 0.8 s |
| Broadcast | (2, 2, 2) | Latency | 0.1 s |
| AllToAll | (8, 2, 3) | Latency | 3.0 s |
| AllToAll | (24, 8, 8) | Bandwidth | 133.7 s |
| AllToAll | (24, 2, 8) | Both | 24.3 s |
NCCL baselines (Table 3, DGX-1):
- AllGather/ReduceScatter: 6 chunks, 7 steps, 7 rounds (cost
7*alpha + (7/6)*L*beta) - AllReduce: 48 chunks, 14 steps, 14 rounds
- SCCL synthesizes a (6, 3, 7) AllGather at the same bandwidth coefficient (7/6) but only 3 steps instead of 7 — the latency win dominates at small buffers.
AMD MI50 / Z52 (Table 5):
- AllGather (1, 4, 4): latency-optimal, 0.5 s synthesis.
- AllGather (2, 7, 7): bandwidth-optimal, 1.3 s synthesis.
- AllToAll (8, 4, 8): combined optimal, 8.2 s synthesis.
- SCCL beats RCCL on large AllGather; trails on small/medium where RCCL is hand-tuned.
Synthesis scalability:
- Most algorithms synthesize in < 10 seconds.
- Worst case: 24-chunk AllToAll = 133.7 s.
- A naive Boolean encoding failed to terminate within 60 minutes on the same problem — the integer/bool hybrid encoding is what makes synthesis tractable.
Implementation knobs:
- DMA copies are ~10% faster than CUDA-kernel copies on NVLink (DMA emits maximum-sized packets vs. 128-byte cache lines).
- PUSH model is ~10% faster than PULL (avoids request-round-trip).
- Single-fused-kernel +
__threadfence_system()flags eliminate the per- step kernel-launch overhead that defeats multi-cudaMemcpylowering.
Limitations
- AMD topology modeling ignores the dotted xGMI cross-island links because SCCL "could not utilize both xGMI and PCIe at the same time" — the bisection-bandwidth bound is therefore loose for some Z52 configurations.
- Pareto-Synthesize may not terminate on certain topologies that admit unbounded Pareto-optimal algorithms (Section 3.7).
- Only intra-node, single-server settings are evaluated; multi-node hierarchical synthesis is left to follow-up work (MSCCL).
- Synthesis is offline and topology-specific: the resulting algorithm is baked into a single fused kernel — no run-time selection across (S, R, C) variants is implemented in SCCL itself; the user picks per buffer size.
- The set of collectives is restricted to those expressible by the (pre, post) relations in Table 2 — combining collectives with custom reductions beyond Reduce/AllReduce/ReduceScatter need additional inversion.
- Hardware: only V100 + NVLink and MI50 + xGMI; A100/H100 + NVSwitch and multi-rail IB networks are unevaluated.
Open Problems Called Out
- Run-time algorithm switching across the Pareto frontier. SCCL produces multiple algorithms per collective but does not pick between them online — this is left to the caller. An adaptive selector conditioned on buffer size and current load is implied future work.
- Multi-node and hierarchical synthesis. The k-synchronous class handles intra-node well but does not scale to inter-node DAGs at fleet scale. Cited as the natural follow-on (later realized as MSCCL).
- Hardware-software co-design. Because SCCL reports infeasibility of (bandwidth, steps) corners, it can identify topology weaknesses and guide interconnect design.
- Custom reduction operators / non-associative ops. SCCL's "combining collective" inversion (Reduce = Broadcast inverse) requires associativity and commutativity — extending to general reductions is open.
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads, chunkSize) to minimize collective completion time on HPC GPU clusters. SCCL is both a complementary technology and a natural research neighbor.
Action-space prior — algorithm taxonomy is richer than Ring/Tree. SCCL's synthesized (C, S, R) tuples for DGX-1 demonstrate that bandwidth-optimal at the same
R/Ccan come at very different step counts (e.g. 7 steps vs. 3 steps for AllGather). DynamICCL should treat the NCCLalgorithmaxis (Ring / Tree / CollNet / NVLS) as a coarse proxy for these synthesized variants — when SCCL/MSCCL plugins are loaded, the action space must expand to include them.State feature — message-size regime selects across the Pareto frontier. SCCL's headline finding is that latency-optimal algorithms dominate at small buffers and bandwidth-optimal at large buffers, with an explicit cross-over per topology. DynamICCL's log-binned message-size feature is exactly the right primary axis; the agent's job is precisely to learn the topology-specific cross-over point that SCCL leaves to the caller.
State feature — topology fingerprint must capture link-graph asymmetry. SCCL synthesizes different schedules for DGX-1 (NVLink mesh with two Hamiltonian cycles) vs. Z52 (xGMI islands + PCIe bridges). DynamICCL's topology feature must distinguish NVLink-only vs. NVLink+PCIe vs. PCIe+IB vs. Ethernet — confirms the existing 4-bin topology fingerprint at minimum, and motivates extending it to a bisection-bandwidth bin.
Reward shaping — collective wall-clock is the right signal. SCCL reports speedups in absolute latency, not in
algBw/busBwproxies. This confirms DynamICCL's rewardr = -collective_wall_clock_usis the correct end-to-end metric.Exploration budget — bandwidth-optimal vs. latency-optimal is a discrete branch. Within a single algorithm family, SCCL shows the step/round trade-off is non-trivial (3 vs. 7 steps at the same bandwidth). For DynamICCL's
nChannels+chunkSizeactions, the exploration policy should treat (nChannels=1, chunkSize=large) as a distinct branch from (nChannels=8, chunkSize=small) and avoid linear epsilon-greedy that mixes the two.Research positioning — SCCL/MSCCL fix the algorithm; DynamICCL fixes the configuration. SCCL synthesizes the best schedule for a topology; DynamICCL chooses, online, which protocol / chunk size / channel count the existing NCCL kernel uses. The two layers are complementary: a future stack would load SCCL-synthesized algorithms into NCCL via the tuner-plugin API and let DynamICCL select among them at run time.
Open-problem alignment — runtime Pareto selection is exactly the gap. SCCL's Limitation #1 ("user picks per buffer size") is precisely the problem DynamICCL solves: a learned policy that, given message size + topology + load, picks the right algorithm from the Pareto frontier without manual heuristics.