SCCL: Synthesizing Optimal Collective Algorithms — Detailed Summary
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi | Australian National University / University of Utah / Microsoft Research | PPoPP '21 | DOI: 10.1145/3437801.3441620
Per-section summary mirroring the paper's headings. Each section gives paragraph-level bullet points, all named methods, all equations, and exact quantitative results where the paper provides them.
Abstract
- Collective communication is the Amdahl-bottleneck of data-parallel deep learning training.
- The paper introduces SCCL (Synthesized Collective Communication Library): a systematic approach to synthesizing collective algorithms tailored to a particular hardware topology.
- SCCL spans the entire Pareto frontier from latency-optimal to bandwidth-optimal implementations of a given collective.
- The synthesis problem is encoded as a quantifier-free SMT formula discharged to a theorem prover (Z3). The encoding is the key to scaling.
- SCCL synthesizes novel latency- and bandwidth-optimal algorithms not found in the literature on two popular topologies (DGX-1 and a Gigabyte AMD MI50 server).
- It lowers algorithms to implementations on both NVIDIA (CUDA) and AMD (HIP) hardware and is competitive with hand-tuned NCCL/RCCL.
1. Introduction
Motivation:
- Collective communication primitives (AllReduce, AllGather, Broadcast, Reduce, AllToAll, ReduceScatter) underpin distributed compute, particularly data-parallel deep-learning training.
- For an 8.3-billion parameter Megatron model with model parallelism, ~30% of training time is spent in AllReduce on medium (10-100 MB) buffers.
- Communication buffers span 6+ orders of magnitude — single-layer KB- sized gradients to multi-GB full-model exchanges — and no single algorithm is optimal across this range.
Why hand-tuned libraries fall short:
- NCCL hard-codes Ring and double-binary-Tree implementations for
AllReduce
- a small set of variants for other collectives.
- Modern AI servers (DGX-1, AMD MI50 boxes, IPUs, TPUs) have irregular, asymmetric interconnects (mesh of NVLinks, xGMI islands bridged by PCIe, etc.) that the regular ring/tree algorithms do not exploit.
- The author's claim: an automatic synthesizer can produce algorithms that are superior to hand-written ones by tailoring to the exact graph and bandwidth budget of the interconnect.
Contributions:
- A formal class of
k-synchronous algorithms that captures common collective schedules. - An efficient SMT encoding combining Boolean
(
snd_{n,c,n'}) and integer (time_{c,n}) variables. - A
Pareto-Synthesizealgorithm that walks the (steps, rounds/chunk) frontier corner-by-corner. - A code generator that lowers synthesized schedules to fused single- kernel SPMD CUDA / HIP code.
- Empirical evidence of speedups up to 6.8x over NCCL on DGX-1.
Figure 1 caption: "NVLink topology of an NVIDIA DGX-1."
2. Overview
2.1 Running example: DGX-1
- 8 V100 GPUs.
- Interconnect: each GPU has 6 NVLink ports at 25 GB/s.
- Topology contains 2 non-overlapping Hamiltonian cycles: one cycle uses two NVLinks per edge, the other uses one. Total: 6 logical rings can be carved from the graph.
- Diameter = 2 (any GPU can reach any other in <= 2 hops).
2.2 Topology
- Modeled as a directed graph G with bandwidth labels per edge.
- Bandwidth-bisection lower bound
b_l = 1 / BisectionBW(G). - Latency lower bound
a_l = Diameter(G).
2.3 Cost model
- Standard
(alpha, beta)model:cost = a*alpha + b*L*beta, where:alpha= per-message latencybeta= inverse bandwidthL= total bytes transferred per GPUa= number of synchronization stepsSb= bandwidth coefficient =R/C(rounds per chunk)
2.4 Bandwidth-optimal AllGather on DGX-1
- NCCL Ring AllGather:
- Cost =
7*alpha + (7/6)*L*beta - Uses 6 chunks, 7 steps, 7 rounds.
- Cost =
- SCCL synthesized AllGather:
- Cost =
3*alpha + (7/6)*L*beta - Same bandwidth coefficient (
7/6), but only 3 steps instead of 7. - Achieved by pipelining across both Hamiltonian cycles simultaneously.
- Cost =
2.5 Latency-optimal AllGather on DGX-1
- SCCL synthesized:
- Cost =
2*alpha + (3/2)*L*beta - 1 chunk, 2 steps, 2 rounds.
- 2 steps matches the diameter — provably latency-optimal.
- Cost =
3. Algorithm Synthesis
3.1 k-synchronous algorithms
- Algorithms where total rounds
Ris bounded above byS + k(steps + slack). - This class is rich enough to express NCCL's ring, double-binary-tree, recursive doubling, and Bruck's algorithm.
- Bounding
R - S <= kkeeps the SMT formula's variable count tractable.
3.2 Non-combining collective instance
A collective is parameterized by
(G, S, R, P, B, pre, post):
G: total number of global chunks (after splitting per-GPU buffer intoCper-GPU chunks:G = P * Cfor AllGather;G = Cfor Broadcast).S: synchronization steps.R: total rounds.P: number of nodes (GPUs).B: bandwidth relation (sum of chunks sent along each link set must not exceed link capacity per round).pre: which (chunk, node) pairs hold the chunk initially.post: which (chunk, node) pairs must hold the chunk at end.
3.3 Common pre/post relations (Table 1)
| Relation | Definition |
|---|---|
All |
[G] x [P] (every chunk at every node) |
Root |
[G] x {n_root} (every chunk at one root) |
Scattered |
{(c, n) : n = c mod P} (chunk c only at node c mod
P) |
Transpose |
{(c, n) : n = floor(c/P) mod P} (used for
AllToAll) |
3.4 Collective specifications (Table 2)
| Collective | pre | post |
|---|---|---|
| Gather | Scattered | Root |
| AllGather | Scattered | All |
| AllToAll | Scattered | Transpose |
| Broadcast | Root | All |
| Scatter | Root | Scattered |
3.5 SMT encoding (quantifier-free linear integer arithmetic)
Variables:
time_{c,n}: integer, the step at which chunkcfirst arrives at noden. Range[0, S].snd_{n,c,n'}: Boolean, true iffnsendscton'along edge(n, n').
Constraints:
| ID | Statement | English meaning |
|---|---|---|
| C1 | forall (c, n) in pre : time_{c,n} = 0 |
Pre-condition |
| C2 | forall (c, n) in post : time_{c,n} <= S |
Post-condition |
| C3 | forall (c, n) not in pre, time_{c,n} <= S => sum_{(n',n) in E} snd_{n',c,n} = 1 |
Validity (received exactly once) |
| C4 | snd_{n,c,n'} => time_{c,n} < time_{c,n'} |
Causality |
| C5 | sum_{(c,(n,n'))} (snd_{n,c,n'} AND time_{c,n'} = s) <= b * r_s |
Bandwidth |
| C6 | sum_{1 <= s <= S} r_s = R |
Total rounds |
The hybrid (Boolean + integer) encoding is the key novelty: a pure-Boolean encoding failed to terminate within 60 minutes on the same problem where the hybrid finishes in 10-100 seconds.
3.6 Combining collectives
- Reduce is the inverse of Broadcast (run the broadcast schedule in reverse, replacing sends with reductions).
- ReduceScatter is the inverse of AllGather.
- AllReduce = ReduceScatter -> AllGather.
- This duality means SCCL only needs to synthesize the non-combining half; combining variants come for free.
3.7 Pareto-Synthesize (Algorithm 1)
input: k (slack), Coll, P, B
1: a_l = Diameter(P, B)
2: b_l = 1 / BisectionBW(P, B)
3: (pre, post) = Lookup(Coll)
4: for S = a_l, a_l + 1, ...:
5: A = { (R, C) : S <= R <= S+k AND R/C >= b_l }
6: for (R, C) in A in ascending order of R/C:
7: G = ToGlobal(Coll, C)
8: if SMT(G, S, R, P, B, pre, post) == SAT:
9: report (S, R, C) algorithm
10: if R/C == b_l: return // bandwidth-optimal hit
11: break // increment S, restart
Termination caveat (Section 3.7): "It is possible for this procedure to never terminate as there can sometimes be unbounded number of Pareto- optimal algorithms for certain topologies." In practice, the authors set a budget and stop.
4. Code Generation
The synthesized schedule is a list of snd_{n,c,n'}
tuples annotated with time_{c,n} step indices. Lowering
produces SPMD C++ + CUDA / HIP code.
Design choices:
- DMA vs. kernel copies (Sec 4.1): DMA (cudaMemcpy) emits maximum-size packets on NVLink; kernel copies are limited to 128-byte cache lines. DMA is ~10% faster on NVLink. SCCL emits both and selects per benchmark.
- Push vs. Pull (Sec 4.2): the push model (sender writes to receiver's buffer) avoids the round-trip request/response of pull and is ~10% faster.
- Single-kernel fusion (Sec 4.3): all steps of the
synthesized schedule are fused into one CUDA kernel. Per-step
synchronization is done by shared flags +
__threadfence_system()rather than via separate kernel launches. This eliminates the multi-launch overhead that cripples a naivecudaMemcpy-per-step lowering. - Memory transport: uses CUDA IPC memory handles for direct cross-process peer pointer access.
- Thread blocks: SCCL dedicates a fixed number of thread blocks per link and empirically searches over (block count, chunk size) at install time for each generated kernel.
5. Evaluation
5.1 Experimental platforms
5.1.1 NVIDIA DGX-1
- 8 x V100 GPUs.
- 6 x NVLink ports per GPU at 25 GB/s.
- 2 x Intel Xeon E5-2698 v4 CPUs.
5.1.2 Gigabyte Z52 (AMD)
- 8 x AMD MI50 GPUs.
- xGMI links @ 33 GB/s + PCIe 4.0 @ 27 GB/s.
- Two xGMI islands bridged by PCIe; GPUs 1 and 5 act as the bridge.
- 2 x AMD EPYC 7002 CPUs.
Software
- Ubuntu 20.04, CUDA 10.2, ROCm 3.5.0, Z3 v4.8.8, NCCL 2.7.8-1, RCCL.
- OpenMPI w/ UCX as a third baseline.
5.2 NCCL baselines (Table 3, DGX-1)
| Collective | C | S | R | Cost |
|---|---|---|---|---|
| AllGather / ReduceScatter | 6 | 7 | 7 | 7*alpha + (7/6)*L*beta |
| AllReduce | 48 | 14 | 14 | 14*alpha + (7/24)*L*beta |
| Broadcast / Reduce | 6m | 6+m | 6+m | derived |
5.3 Synthesized algorithms — DGX-1 (Table 4)
| Collective | C | S | R | Optimality | Synth time |
|---|---|---|---|---|---|
| AllGather | 1 | 2 | 2 | Latency | 0.3 s |
| AllGather | 6 | 7 | 7 | Bandwidth | 4.6 s |
| AllGather | 6 | 3 | 7 | Bandwidth (low S) | 6.6 s |
| AllReduce | 8 | 4 | 4 | Latency | 0.3 s |
| AllReduce | 48 | 14 | 14 | Bandwidth | 12.8 s |
| AllReduce | 16 | 4 | 6 | Latency-tight | 0.8 s |
| Broadcast | 2 | 2 | 2 | Latency | 0.1 s |
| Broadcast/Reduce | 18 | 5 | 5 | -- | 8.5 s |
| AllToAll | 8 | 2 | 3 | Latency | 3.0 s |
| AllToAll | 24 | 8 | 8 | Bandwidth | 133.7 s |
| AllToAll | 24 | 2 | 8 | Both | 24.3 s |
5.4 Synthesized algorithms — AMD Z52 (Table 5)
| Collective | C | S | R | Optimality | Synth time |
|---|---|---|---|---|---|
| AllGather | 1 | 4 | 4 | Latency | 0.5 s |
| AllGather | 2 | 7 | 7 | Bandwidth | 1.3 s |
| AllGather | 2 | 4 | 7 | Both | 1.7 s |
| AllReduce | 8 | 8 | 8 | Latency | 0.4 s |
| AllReduce | 16 | 14 | 14 | Bandwidth | 0.9 s |
| AllToAll | 8 | 4 | 8 | Both | 8.2 s |
5.4.3 Synthesis time
- Most algorithms finish in < 10 seconds.
- Worst case: 24-chunk AllToAll = 133.7 seconds.
- The naive Boolean-only encoding failed to terminate within 60 minutes.
5.5 End-to-end performance (Figures 4-7)
5.5.1 AllGather on DGX-1
- Small buffers: SCCL up to 2.2x faster than NCCL.
- Large buffers: 1.14x faster.
- Lowering matters: a (6, 7, 7) bandwidth-optimal algorithm using multi- cudaMemcpy is significantly slower than the same algorithm fused into a single push-copy kernel.
5.5.2 AllReduce on DGX-1
- Small buffers (8-chunk variant): 1.8x faster than NCCL.
- Large buffers (48-chunk variant): 1.06x faster.
- NCCL faster in 4 mid-range buffer sizes — not a complete sweep.
5.5.3 AllToAll on DGX-1
- Small buffers: 1.4x speedup.
- Large buffers: 6.8x speedup.
- NCCL has no native AllToAll; implements it as
Npoint-to-point exchanges, which is neither bandwidth- nor latency-optimal. SCCL closes this gap automatically.
5.5.4 AMD Z52 (Figure 7)
- AllGather: SCCL faster than RCCL on large buffers; slower on small/medium where RCCL is hand-tuned.
- Latency-optimal (1, 4, 4) wins at small inputs; bandwidth-optimal (2, 7, 7) wins at large inputs.
- Bisection bandwidth limited by PCIe bridges between xGMI islands; the authors note SCCL does not utilize both xGMI and PCIe simultaneously in the modeled topology.
5.5.5 OpenMPI / UCX
- "Subpar performance compared with our NCCL baselines" — included for completeness; not a serious competitor on DGX-1.
End-to-end ML context
- 8.3B Megatron model: 30% of training time in AllReduce on 10-100 MB buffers (motivation only; SCCL is not integrated end-to-end with training in this paper).
6. Related Work
| System / paper | Technique | Relation to SCCL |
|---|---|---|
| MPI [9] | Standard collective primitives (mesh, hypercube, fat-tree) | Basis; SCCL targets MPI's interface |
| NCCL [18] | Hand-tuned ring + double-binary-tree | Primary baseline |
| RCCL [2] | AMD's NCCL fork | AMD baseline |
| Blink [29] | Spanning-tree packing for Broadcast/Reduce | Closest related synthesizer; NOT bandwidth-optimal, no Pareto frontier |
| Horovod [23] | Multi-node ring AllReduce wrapper | Complementary (intra-node vs. inter-node) |
| BlueConnect [7] | Hierarchical ring AllReduce | Complementary |
| PLink [17] | Multi-node ring with adaptive grouping | Complementary |
| DistFuse etc. [13, 15, 19, 30] | Compute-comm overlap | Orthogonal — operates above the kernel |
The key contrast with Blink: SCCL guarantees both bandwidth and latency optimality and produces the entire Pareto frontier; Blink does neither.
7. Conclusion
"This paper introduces SCCL: a systematic method to synthesize algorithms in the Pareto-frontier spanning from the latency-optimal algorithm to the bandwidth-optimal algorithm for a given collective on an input topology. We characterize a class of algorithms that captures a broad set of known algorithms and prove Pareto-optimality of both known algorithms and synthesized new algorithms. We automatically generate an implementation of these algorithms that is competitive with manually hand-tuned communication kernels in use today."
Limitations (author-stated and inferred)
- AMD topology incomplete. "We were unable to utilize both xGMI and PCIe at the same time so our model of the bandwidth ignores the dotted xGMI connections in Figure 3." (Section 5.2.2)
- Bisection bandwidth bound is loose on the AMD platform because PCIe bridges dominate.
- Synthesis may not terminate for topologies with unbounded Pareto- optimal algorithms (Section 3.7); a budget cutoff is needed.
- Single-server only. No multi-node or hierarchical synthesis.
- Offline synthesis. SCCL produces multiple algorithms per collective but does not select between them online — the user picks per buffer size.
- Limited collectives. Only those expressible by the (pre, post) relations of Table 2; custom non-associative reductions are out of scope.
- Hardware: only V100 + NVLink and MI50 + xGMI; A100/H100, NVSwitch, and multi-rail IB are unevaluated.
Open Problems Explicitly or Implicitly Identified
- Adaptive runtime selection across the Pareto frontier as a function of buffer size and current load.
- Multi-node hierarchical synthesis — combining intra-node SCCL schedules with inter-node patterns. (Realized later as MSCCL.)
- Hardware co-design. SCCL's infeasibility reports point at topology weaknesses; could drive interconnect design decisions.
- General reductions. Extending Reduce/ReduceScatter inversion to non-associative or non-commutative ops.
NCCL Specifics & Integration
- SCCL does not replace NCCL — it is a separate synthesized library that emits SPMD CUDA/HIP code. (The follow-up MSCCL integrates synthesized algorithms into NCCL via a plugin interface.)
- Protocol mapping: SCCL uses "a protocol similar to
the simple protocol (i.e., NCCL_PROTO=Simple)" as its baseline — direct
stores with
__threadfence_system()-based completion. No LL/LL128 protocol emulation. - Wire-format: uses CUDA IPC memory handles; push model.
- Synchronization: fine-grained shared flags between
thread blocks +
__threadfence_system(). No global kernel barrier per step. - Empirical knob search: SCCL searches at install
time over (thread- block count per link, chunk size) for each generated
kernel — analogous to NCCL's
nChannelsandchunkSizetuner knobs. - No mention of LL/LL128 protocol or NCCL's nChannels / numThreads configuration knobs by name; these live in the layer SCCL replaces.
Equations Summary
| Eq | Form | Meaning |
|---|---|---|
| Cost model | cost = a*alpha + b*L*beta |
(alpha, beta) model |
| Latency LB | a_l = Diameter(G) |
min steps |
| Bandwidth LB | b_l = 1/BisectionBW(G) |
min rounds/chunk |
| C1 | (c,n) in pre => time_{c,n}=0 |
initial placement |
| C2 | (c,n) in post => time_{c,n}<=S |
final placement |
| C3 | (c,n) not in pre, time<=S => sum snd_{n',c,n}=1 |
exactly one delivery |
| C4 | snd_{n,c,n'} => time_{c,n} < time_{c,n'} |
causality |
| C5 | sum (snd AND time'=s) <= b*r_s |
bandwidth |
| C6 | sum r_s = R |
total rounds |
Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| Pareto frontier is real and non-trivial | Table 4: same R/C at S=3 vs. S=7 |
| Latency-optimal dominates at small buffers | AllGather small buffers: 2.2x speedup |
| Bandwidth-optimal dominates at large buffers | AllReduce 48-chunk variant at 1 GB |
| Single-kernel fusion is decisive | Multi-cudaMemcpy lowering loses ~10-20% |
| Push model > pull model | ~10% gain |
| Boolean-only SMT encoding is intractable | Naive encoding > 60 min vs. 24 s hybrid |
| Vendor libraries can be beaten by 1.06x-6.8x with no extra hardware | All speedup tables |
| AllToAll is where vendor libraries are weakest | 6.8x speedup (NCCL fakes it via N-pt-to-pt) |
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects
per-collective parameters (algorithm,
protocol, nChannels, numThreads,
chunkSize) to minimize collective completion time on HPC
GPU clusters. Its policy conditions on state features: log-binned
message size, model intensity I = C/D, local batch size,
topology fingerprint (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet),
and an LSTM-encoded window of recent collective timings. Reward is
-collective_wall_clock_us.
SCCL is both a complementary technology and a research neighbor that shapes DynamICCL's design priors:
| SCCL finding | DynamICCL design implication |
|---|---|
| Pareto frontier exists per (topology, collective) | Action-space prior: per-topology bias toward latency-optimal at small buffers, bandwidth-optimal at large |
| Buffer size is the primary regime selector | Confirms log-binned message-size is the dominant state feature |
| Topology graph asymmetry changes the schedule (DGX-1 vs. Z52) | Topology fingerprint must encode link-graph identity, not just bandwidth |
| Vendor AllToAll is weak (6.8x gap) | DynamICCL exploration budget should over-sample AllToAll regimes — large gains are available |
| Single-kernel fusion vs. multi-launch is an implementation discontinuity | numThreads + nChannels actions interact non-linearly — discrete branches in policy |
Latency-optimal (S=2) vs. bandwidth-optimal (S=7) at same
R/C |
Algorithm action is not 1D; Ring/Tree alone underspecifies — extend to SCCL/MSCCL plugin variants when available |
| Reward = absolute latency, not algBw | Confirms reward r = -collective_wall_clock_us |
| User picks Pareto point per buffer size (Limitation #1) | Exactly the gap DynamICCL fills with a learned policy |
Specific design priors:
State-vector features. SCCL's per-topology Pareto curves vindicate DynamICCL's topology-fingerprint feature. The 4-bin topology embedding should at minimum encode (NVLink-mesh, NVLink+PCIe-island, PCIe-only, IB-only, Ethernet); SCCL synthesizes substantially different schedules on DGX-1 vs. Z52, so the policy must condition on this. A bisection- bandwidth scalar feature is also useful as a continuous proxy.
Action-space priors. SCCL exposes that within "AllReduce" there are at least three Pareto-optimal variants on DGX-1 ((8,4,4), (16,4,6), (48,14,14)). The NCCL
algorithmaxis (Ring / Tree / CollNet / NVLS) is a coarsening of this richer space. When MSCCL or SCCL plugins are loaded into NCCL, DynamICCL's action vocabulary must expand to include them; otherwise the policy is bounded by NCCL's 3-4 hard-coded variants.Exploration budget. SCCL's results show vendor libraries are weakest on AllToAll (6.8x gap). DynamICCL should bias exploration epsilon higher on AllToAll-heavy workloads (e.g., MoE training, sequence parallelism). On AllReduce-heavy workloads (data parallelism) exploitation can dominate sooner.
Reward shaping. SCCL reports speedups in absolute latency and never normalizes by bus-bandwidth. This is direct evidence that collective wall-clock is the right reward signal — DynamICCL's
r = -collective_wall_clock_usis correct, and any algBw/busBw proxy would obscure the latency-vs-bandwidth trade-off SCCL exposes.Discontinuity awareness. SCCL shows a 10% step from kernel-launch discontinuity (multi-cudaMemcpy vs. single fused kernel). The
numThreads+nChannelsinteraction in NCCL has similar discontinuities (e.g., 1 channel 128 threads vs. 8 channels 512 threads). DynamICCL's exploration policy should not interpolate linearly; treat (low-channel, large-chunk) and (high-channel, small- chunk) as separate branches.Research positioning. SCCL synthesizes the best schedule for a given (topology, collective, buffer-size). DynamICCL chooses, at run time, which protocol / chunk size / channel count / algorithm a pre-existing NCCL kernel uses. The two layers are complementary: a future stack would (a) load SCCL/MSCCL-synthesized algorithms into NCCL via the algorithm plugin interface and (b) let DynamICCL select among them at run time. This positioning differentiates DynamICCL from SCCL/MSCCL in the funding/publication landscape: SCCL fixes the algorithm offline, DynamICCL chooses it online.
Open-problem alignment. SCCL's stated Limitation #1 — "user picks the Pareto point per buffer size" — is the precise problem DynamICCL solves. A learned policy that observes (message size, topology, recent timings, model intensity) and picks the Pareto-optimal variant automatically is the natural successor to SCCL's offline synthesis. This makes DynamICCL the runtime selector for the static catalog that SCCL/MSCCL produces.
Methodological lesson. SCCL's hybrid SMT encoding succeeded where pure Boolean failed (>60 min). For DynamICCL, the analogous lesson is that the policy's input representation matters: a discrete one-hot over (algorithm, protocol) may scale poorly compared to an embedded action representation that captures latency-vs-bandwidth duality directly. Worth exploring: factorized action heads (one for latency-favoring vs. bandwidth-favoring branch, one for granularity).