SCCL: Synthesizing Optimal Collective Algorithms

Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi | Australian National University / University of Utah / Microsoft Research | PPoPP '21 | DOI: 10.1145/3437801.3441620


Problem

Collective communication (AllReduce, AllGather, AllToAll, Broadcast, Reduce) is the Amdahl's-Law bottleneck of data-parallel deep learning: for an 8.3B parameter Megatron model, ~30% of training time is spent in AllReduce on medium (10-100 MB) buffers. Modern accelerator servers (DGX-1, AMD MI50 boxes, IPU pods, TPU pods) ship with irregular, asymmetric interconnects (NVLink mesh + PCIe islands + xGMI bridges) that no longer fit the regular ring/tree topologies that classic algorithms target. Vendor libraries (NCCL, RCCL) hand-tune one or two algorithms per collective; for novel topologies, new collectives, or unusual buffer sizes, the result is leaving large amounts of bandwidth and latency on the table. Manually re-deriving optimal schedules for every (topology, collective, size) triple is intractable.


Core Insight

Collective-algorithm discovery can be cast as a program-synthesis problem over a tractable class of k-synchronous schedules and discharged to an SMT solver, producing the entire Pareto-frontier between latency-optimal and bandwidth-optimal algorithms automatically — and the resulting synthesized kernels match or beat hand-tuned NCCL/RCCL kernels at speeds up to 6.8x while taking minutes (not weeks) to derive per topology.


Method

+------------------------------------------------------------------+
| Inputs                                                           |
|   - Topology graph G (nodes = GPUs, edges = links + bandwidths) |
|   - Collective spec: (pre, post) relations from Table 2          |
|     e.g. AllGather: pre=Scattered, post=All                      |
|   - k = relaxation slack (rounds R <= steps S + k)               |
+------------------------------------------------------------------+
                            |
                            v
+------------------------------------------------------------------+
| 1. Pareto-Synthesize loop (Algorithm 1)                          |
|    - Lower bound a_l = Diameter(G); b_l = 1 / BisectionBW(G)     |
|    - For S = a_l, a_l+1, ...:                                    |
|        For (R, C) with S <= R <= S+k and R/C >= b_l:             |
|           Build SMT formula; query Z3                            |
|           If SAT: emit (S, R, C) algorithm; widen R/C            |
|           Stop when R/C == b_l (bandwidth-optimal hit)           |
+------------------------------------------------------------------+
                            |
                            v
+------------------------------------------------------------------+
| 2. SMT encoding (quantifier-free linear-int arithmetic)          |
|    Variables:                                                    |
|     time_{c,n}: int   (step at which chunk c reaches node n)    |
|     snd_{n,c,n'}: bool (does n send chunk c to n'?)             |
|    Constraints C1-C6: pre/post, validity, causality, bandwidth   |
+------------------------------------------------------------------+
                            |
                            v
+------------------------------------------------------------------+
| 3. Code generation                                               |
|    - SPMD CUDA/HIP code, top-level switch on rank                |
|    - PUSH model over CUDA IPC memory handles                     |
|    - Single fused kernel for all steps                           |
|    - Fine-grained signal/wait via shared flags +                 |
|      __threadfence_system() (no global barrier)                  |
+------------------------------------------------------------------+

The cost model is the standard (alpha, beta) model: cost = a*alpha + b*L*beta, where a is the number of synchronization steps S and b is the bandwidth coefficient R/C (rounds per chunk). The Pareto-Synthesize algorithm walks this 2D space corner-by-corner.


Experimental Setup

Component NVIDIA testbed AMD testbed
Server Nvidia DGX-1 Gigabyte Z52
GPUs 8 x V100 8 x MI50
GPU links 6 x NVLink @ 25 GB/s per GPU xGMI @ 33 GB/s + PCIe 4.0 @ 27 GB/s
CPU 2 x Xeon E5-2698 v4 2 x AMD EPYC 7002
OS Ubuntu 20.04 Ubuntu 20.04
Comm runtime CUDA 10.2 ROCm 3.5.0
SMT solver Z3 v4.8.8 Z3 v4.8.8
Baselines NCCL 2.7.8-1, OpenMPI/UCX RCCL, OpenMPI/UCX
Collectives AllGather, AllReduce, AllToAll, Broadcast, Reduce AllGather, AllReduce, AllToAll
Buffer-size sweep small (KB) -> large (~GB) small -> large
Metric Avg latency (ms); speedup vs. vendor lib same

Headline Quantitative Results

Speedup over NCCL on DGX-1 (Figures 4-6):

Collective Small-buffer speedup Large-buffer speedup
AllGather 2.2x 1.14x
AllReduce 1.8x 1.06x
AllToAll 1.4x 6.8x

Synthesis-time + Pareto-frontier on DGX-1 (Table 4):

Collective (C, S, R) Optimality SMT time
AllGather (1, 2, 2) Latency 0.3 s
AllGather (6, 7, 7) Bandwidth 4.6 s
AllGather (6, 3, 7) Bandwidth (lower S) 6.6 s
AllReduce (8, 4, 4) Latency 0.3 s
AllReduce (48, 14, 14) Bandwidth 12.8 s
AllReduce (16, 4, 6) Latency-tight 0.8 s
Broadcast (2, 2, 2) Latency 0.1 s
AllToAll (8, 2, 3) Latency 3.0 s
AllToAll (24, 8, 8) Bandwidth 133.7 s
AllToAll (24, 2, 8) Both 24.3 s

NCCL baselines (Table 3, DGX-1):

AMD MI50 / Z52 (Table 5):

Synthesis scalability:

Implementation knobs:


Limitations


Open Problems Called Out

  1. Run-time algorithm switching across the Pareto frontier. SCCL produces multiple algorithms per collective but does not pick between them online — this is left to the caller. An adaptive selector conditioned on buffer size and current load is implied future work.
  2. Multi-node and hierarchical synthesis. The k-synchronous class handles intra-node well but does not scale to inter-node DAGs at fleet scale. Cited as the natural follow-on (later realized as MSCCL).
  3. Hardware-software co-design. Because SCCL reports infeasibility of (bandwidth, steps) corners, it can identify topology weaknesses and guide interconnect design.
  4. Custom reduction operators / non-associative ops. SCCL's "combining collective" inversion (Reduce = Broadcast inverse) requires associativity and commutativity — extending to general reductions is open.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads, chunkSize) to minimize collective completion time on HPC GPU clusters. SCCL is both a complementary technology and a natural research neighbor.

  1. Action-space prior — algorithm taxonomy is richer than Ring/Tree. SCCL's synthesized (C, S, R) tuples for DGX-1 demonstrate that bandwidth-optimal at the same R/C can come at very different step counts (e.g. 7 steps vs. 3 steps for AllGather). DynamICCL should treat the NCCL algorithm axis (Ring / Tree / CollNet / NVLS) as a coarse proxy for these synthesized variants — when SCCL/MSCCL plugins are loaded, the action space must expand to include them.

  2. State feature — message-size regime selects across the Pareto frontier. SCCL's headline finding is that latency-optimal algorithms dominate at small buffers and bandwidth-optimal at large buffers, with an explicit cross-over per topology. DynamICCL's log-binned message-size feature is exactly the right primary axis; the agent's job is precisely to learn the topology-specific cross-over point that SCCL leaves to the caller.

  3. State feature — topology fingerprint must capture link-graph asymmetry. SCCL synthesizes different schedules for DGX-1 (NVLink mesh with two Hamiltonian cycles) vs. Z52 (xGMI islands + PCIe bridges). DynamICCL's topology feature must distinguish NVLink-only vs. NVLink+PCIe vs. PCIe+IB vs. Ethernet — confirms the existing 4-bin topology fingerprint at minimum, and motivates extending it to a bisection-bandwidth bin.

  4. Reward shaping — collective wall-clock is the right signal. SCCL reports speedups in absolute latency, not in algBw/busBw proxies. This confirms DynamICCL's reward r = -collective_wall_clock_us is the correct end-to-end metric.

  5. Exploration budget — bandwidth-optimal vs. latency-optimal is a discrete branch. Within a single algorithm family, SCCL shows the step/round trade-off is non-trivial (3 vs. 7 steps at the same bandwidth). For DynamICCL's nChannels + chunkSize actions, the exploration policy should treat (nChannels=1, chunkSize=large) as a distinct branch from (nChannels=8, chunkSize=small) and avoid linear epsilon-greedy that mixes the two.

  6. Research positioning — SCCL/MSCCL fix the algorithm; DynamICCL fixes the configuration. SCCL synthesizes the best schedule for a topology; DynamICCL chooses, online, which protocol / chunk size / channel count the existing NCCL kernel uses. The two layers are complementary: a future stack would load SCCL-synthesized algorithms into NCCL via the tuner-plugin API and let DynamICCL select among them at run time.

  7. Open-problem alignment — runtime Pareto selection is exactly the gap. SCCL's Limitation #1 ("user picks per buffer size") is precisely the problem DynamICCL solves: a learned policy that, given message size + topology + load, picks the right algorithm from the Pareto frontier without manual heuristics.