SCCL: Synthesizing Optimal Collective Algorithms — Detailed Summary

Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi | Australian National University / University of Utah / Microsoft Research | PPoPP '21 | DOI: 10.1145/3437801.3441620

Per-section summary mirroring the paper's headings. Each section gives paragraph-level bullet points, all named methods, all equations, and exact quantitative results where the paper provides them.


Abstract


1. Introduction

Motivation:

Why hand-tuned libraries fall short:

Contributions:

  1. A formal class of k-synchronous algorithms that captures common collective schedules.
  2. An efficient SMT encoding combining Boolean (snd_{n,c,n'}) and integer (time_{c,n}) variables.
  3. A Pareto-Synthesize algorithm that walks the (steps, rounds/chunk) frontier corner-by-corner.
  4. A code generator that lowers synthesized schedules to fused single- kernel SPMD CUDA / HIP code.
  5. Empirical evidence of speedups up to 6.8x over NCCL on DGX-1.

Figure 1 caption: "NVLink topology of an NVIDIA DGX-1."


2. Overview

2.1 Running example: DGX-1

2.2 Topology

2.3 Cost model

2.4 Bandwidth-optimal AllGather on DGX-1

2.5 Latency-optimal AllGather on DGX-1


3. Algorithm Synthesis

3.1 k-synchronous algorithms

3.2 Non-combining collective instance

A collective is parameterized by (G, S, R, P, B, pre, post):

3.3 Common pre/post relations (Table 1)

Relation Definition
All [G] x [P] (every chunk at every node)
Root [G] x {n_root} (every chunk at one root)
Scattered {(c, n) : n = c mod P} (chunk c only at node c mod P)
Transpose {(c, n) : n = floor(c/P) mod P} (used for AllToAll)

3.4 Collective specifications (Table 2)

Collective pre post
Gather Scattered Root
AllGather Scattered All
AllToAll Scattered Transpose
Broadcast Root All
Scatter Root Scattered

3.5 SMT encoding (quantifier-free linear integer arithmetic)

Variables:

Constraints:

ID Statement English meaning
C1 forall (c, n) in pre : time_{c,n} = 0 Pre-condition
C2 forall (c, n) in post : time_{c,n} <= S Post-condition
C3 forall (c, n) not in pre, time_{c,n} <= S => sum_{(n',n) in E} snd_{n',c,n} = 1 Validity (received exactly once)
C4 snd_{n,c,n'} => time_{c,n} < time_{c,n'} Causality
C5 sum_{(c,(n,n'))} (snd_{n,c,n'} AND time_{c,n'} = s) <= b * r_s Bandwidth
C6 sum_{1 <= s <= S} r_s = R Total rounds

The hybrid (Boolean + integer) encoding is the key novelty: a pure-Boolean encoding failed to terminate within 60 minutes on the same problem where the hybrid finishes in 10-100 seconds.

3.6 Combining collectives

3.7 Pareto-Synthesize (Algorithm 1)

input:  k (slack), Coll, P, B
1:  a_l = Diameter(P, B)
2:  b_l = 1 / BisectionBW(P, B)
3:  (pre, post) = Lookup(Coll)
4:  for S = a_l, a_l + 1, ...:
5:      A = { (R, C) : S <= R <= S+k AND R/C >= b_l }
6:      for (R, C) in A in ascending order of R/C:
7:          G = ToGlobal(Coll, C)
8:          if SMT(G, S, R, P, B, pre, post) == SAT:
9:              report (S, R, C) algorithm
10:             if R/C == b_l: return       // bandwidth-optimal hit
11:             break                       // increment S, restart

Termination caveat (Section 3.7): "It is possible for this procedure to never terminate as there can sometimes be unbounded number of Pareto- optimal algorithms for certain topologies." In practice, the authors set a budget and stop.


4. Code Generation

The synthesized schedule is a list of snd_{n,c,n'} tuples annotated with time_{c,n} step indices. Lowering produces SPMD C++ + CUDA / HIP code.

Design choices:


5. Evaluation

5.1 Experimental platforms

5.1.1 NVIDIA DGX-1

5.1.2 Gigabyte Z52 (AMD)

Software

5.2 NCCL baselines (Table 3, DGX-1)

Collective C S R Cost
AllGather / ReduceScatter 6 7 7 7*alpha + (7/6)*L*beta
AllReduce 48 14 14 14*alpha + (7/24)*L*beta
Broadcast / Reduce 6m 6+m 6+m derived

5.3 Synthesized algorithms — DGX-1 (Table 4)

Collective C S R Optimality Synth time
AllGather 1 2 2 Latency 0.3 s
AllGather 6 7 7 Bandwidth 4.6 s
AllGather 6 3 7 Bandwidth (low S) 6.6 s
AllReduce 8 4 4 Latency 0.3 s
AllReduce 48 14 14 Bandwidth 12.8 s
AllReduce 16 4 6 Latency-tight 0.8 s
Broadcast 2 2 2 Latency 0.1 s
Broadcast/Reduce 18 5 5 -- 8.5 s
AllToAll 8 2 3 Latency 3.0 s
AllToAll 24 8 8 Bandwidth 133.7 s
AllToAll 24 2 8 Both 24.3 s

5.4 Synthesized algorithms — AMD Z52 (Table 5)

Collective C S R Optimality Synth time
AllGather 1 4 4 Latency 0.5 s
AllGather 2 7 7 Bandwidth 1.3 s
AllGather 2 4 7 Both 1.7 s
AllReduce 8 8 8 Latency 0.4 s
AllReduce 16 14 14 Bandwidth 0.9 s
AllToAll 8 4 8 Both 8.2 s

5.4.3 Synthesis time

5.5 End-to-end performance (Figures 4-7)

5.5.1 AllGather on DGX-1

5.5.2 AllReduce on DGX-1

5.5.3 AllToAll on DGX-1

5.5.4 AMD Z52 (Figure 7)

5.5.5 OpenMPI / UCX

End-to-end ML context


System / paper Technique Relation to SCCL
MPI [9] Standard collective primitives (mesh, hypercube, fat-tree) Basis; SCCL targets MPI's interface
NCCL [18] Hand-tuned ring + double-binary-tree Primary baseline
RCCL [2] AMD's NCCL fork AMD baseline
Blink [29] Spanning-tree packing for Broadcast/Reduce Closest related synthesizer; NOT bandwidth-optimal, no Pareto frontier
Horovod [23] Multi-node ring AllReduce wrapper Complementary (intra-node vs. inter-node)
BlueConnect [7] Hierarchical ring AllReduce Complementary
PLink [17] Multi-node ring with adaptive grouping Complementary
DistFuse etc. [13, 15, 19, 30] Compute-comm overlap Orthogonal — operates above the kernel

The key contrast with Blink: SCCL guarantees both bandwidth and latency optimality and produces the entire Pareto frontier; Blink does neither.


7. Conclusion

"This paper introduces SCCL: a systematic method to synthesize algorithms in the Pareto-frontier spanning from the latency-optimal algorithm to the bandwidth-optimal algorithm for a given collective on an input topology. We characterize a class of algorithms that captures a broad set of known algorithms and prove Pareto-optimality of both known algorithms and synthesized new algorithms. We automatically generate an implementation of these algorithms that is competitive with manually hand-tuned communication kernels in use today."


Limitations (author-stated and inferred)

  1. AMD topology incomplete. "We were unable to utilize both xGMI and PCIe at the same time so our model of the bandwidth ignores the dotted xGMI connections in Figure 3." (Section 5.2.2)
  2. Bisection bandwidth bound is loose on the AMD platform because PCIe bridges dominate.
  3. Synthesis may not terminate for topologies with unbounded Pareto- optimal algorithms (Section 3.7); a budget cutoff is needed.
  4. Single-server only. No multi-node or hierarchical synthesis.
  5. Offline synthesis. SCCL produces multiple algorithms per collective but does not select between them online — the user picks per buffer size.
  6. Limited collectives. Only those expressible by the (pre, post) relations of Table 2; custom non-associative reductions are out of scope.
  7. Hardware: only V100 + NVLink and MI50 + xGMI; A100/H100, NVSwitch, and multi-rail IB are unevaluated.

Open Problems Explicitly or Implicitly Identified

  1. Adaptive runtime selection across the Pareto frontier as a function of buffer size and current load.
  2. Multi-node hierarchical synthesis — combining intra-node SCCL schedules with inter-node patterns. (Realized later as MSCCL.)
  3. Hardware co-design. SCCL's infeasibility reports point at topology weaknesses; could drive interconnect design decisions.
  4. General reductions. Extending Reduce/ReduceScatter inversion to non-associative or non-commutative ops.

NCCL Specifics & Integration


Equations Summary

Eq Form Meaning
Cost model cost = a*alpha + b*L*beta (alpha, beta) model
Latency LB a_l = Diameter(G) min steps
Bandwidth LB b_l = 1/BisectionBW(G) min rounds/chunk
C1 (c,n) in pre => time_{c,n}=0 initial placement
C2 (c,n) in post => time_{c,n}<=S final placement
C3 (c,n) not in pre, time<=S => sum snd_{n',c,n}=1 exactly one delivery
C4 snd_{n,c,n'} => time_{c,n} < time_{c,n'} causality
C5 sum (snd AND time'=s) <= b*r_s bandwidth
C6 sum r_s = R total rounds

Cross-Cutting Empirical Take-Aways

Take-away Derived from
Pareto frontier is real and non-trivial Table 4: same R/C at S=3 vs. S=7
Latency-optimal dominates at small buffers AllGather small buffers: 2.2x speedup
Bandwidth-optimal dominates at large buffers AllReduce 48-chunk variant at 1 GB
Single-kernel fusion is decisive Multi-cudaMemcpy lowering loses ~10-20%
Push model > pull model ~10% gain
Boolean-only SMT encoding is intractable Naive encoding > 60 min vs. 24 s hybrid
Vendor libraries can be beaten by 1.06x-6.8x with no extra hardware All speedup tables
AllToAll is where vendor libraries are weakest 6.8x speedup (NCCL fakes it via N-pt-to-pt)

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads, chunkSize) to minimize collective completion time on HPC GPU clusters. Its policy conditions on state features: log-binned message size, model intensity I = C/D, local batch size, topology fingerprint (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet), and an LSTM-encoded window of recent collective timings. Reward is -collective_wall_clock_us.

SCCL is both a complementary technology and a research neighbor that shapes DynamICCL's design priors:

SCCL finding DynamICCL design implication
Pareto frontier exists per (topology, collective) Action-space prior: per-topology bias toward latency-optimal at small buffers, bandwidth-optimal at large
Buffer size is the primary regime selector Confirms log-binned message-size is the dominant state feature
Topology graph asymmetry changes the schedule (DGX-1 vs. Z52) Topology fingerprint must encode link-graph identity, not just bandwidth
Vendor AllToAll is weak (6.8x gap) DynamICCL exploration budget should over-sample AllToAll regimes — large gains are available
Single-kernel fusion vs. multi-launch is an implementation discontinuity numThreads + nChannels actions interact non-linearly — discrete branches in policy
Latency-optimal (S=2) vs. bandwidth-optimal (S=7) at same R/C Algorithm action is not 1D; Ring/Tree alone underspecifies — extend to SCCL/MSCCL plugin variants when available
Reward = absolute latency, not algBw Confirms reward r = -collective_wall_clock_us
User picks Pareto point per buffer size (Limitation #1) Exactly the gap DynamICCL fills with a learned policy

Specific design priors:

  1. State-vector features. SCCL's per-topology Pareto curves vindicate DynamICCL's topology-fingerprint feature. The 4-bin topology embedding should at minimum encode (NVLink-mesh, NVLink+PCIe-island, PCIe-only, IB-only, Ethernet); SCCL synthesizes substantially different schedules on DGX-1 vs. Z52, so the policy must condition on this. A bisection- bandwidth scalar feature is also useful as a continuous proxy.

  2. Action-space priors. SCCL exposes that within "AllReduce" there are at least three Pareto-optimal variants on DGX-1 ((8,4,4), (16,4,6), (48,14,14)). The NCCL algorithm axis (Ring / Tree / CollNet / NVLS) is a coarsening of this richer space. When MSCCL or SCCL plugins are loaded into NCCL, DynamICCL's action vocabulary must expand to include them; otherwise the policy is bounded by NCCL's 3-4 hard-coded variants.

  3. Exploration budget. SCCL's results show vendor libraries are weakest on AllToAll (6.8x gap). DynamICCL should bias exploration epsilon higher on AllToAll-heavy workloads (e.g., MoE training, sequence parallelism). On AllReduce-heavy workloads (data parallelism) exploitation can dominate sooner.

  4. Reward shaping. SCCL reports speedups in absolute latency and never normalizes by bus-bandwidth. This is direct evidence that collective wall-clock is the right reward signal — DynamICCL's r = -collective_wall_clock_us is correct, and any algBw/busBw proxy would obscure the latency-vs-bandwidth trade-off SCCL exposes.

  5. Discontinuity awareness. SCCL shows a 10% step from kernel-launch discontinuity (multi-cudaMemcpy vs. single fused kernel). The numThreads + nChannels interaction in NCCL has similar discontinuities (e.g., 1 channel 128 threads vs. 8 channels 512 threads). DynamICCL's exploration policy should not interpolate linearly; treat (low-channel, large-chunk) and (high-channel, small- chunk) as separate branches.

  6. Research positioning. SCCL synthesizes the best schedule for a given (topology, collective, buffer-size). DynamICCL chooses, at run time, which protocol / chunk size / channel count / algorithm a pre-existing NCCL kernel uses. The two layers are complementary: a future stack would (a) load SCCL/MSCCL-synthesized algorithms into NCCL via the algorithm plugin interface and (b) let DynamICCL select among them at run time. This positioning differentiates DynamICCL from SCCL/MSCCL in the funding/publication landscape: SCCL fixes the algorithm offline, DynamICCL chooses it online.

  7. Open-problem alignment. SCCL's stated Limitation #1 — "user picks the Pareto point per buffer size" — is the precise problem DynamICCL solves. A learned policy that observes (message size, topology, recent timings, model intensity) and picks the Pareto-optimal variant automatically is the natural successor to SCCL's offline synthesis. This makes DynamICCL the runtime selector for the static catalog that SCCL/MSCCL produces.

  8. Methodological lesson. SCCL's hybrid SMT encoding succeeded where pure Boolean failed (>60 min). For DynamICCL, the analogous lesson is that the policy's input representation matters: a discrete one-hot over (algorithm, protocol) may scale poorly compared to an embedded action representation that captures latency-vs-bandwidth duality directly. Worth exploring: factorized action heads (one for latency-favoring vs. bandwidth-favoring branch, one for granularity).