SCCL: Synthesizing Optimal Collective Algorithms — Detailed Summary

Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi | Australian National University / University of Utah / Microsoft Research | PPoPP '21 | DOI: 10.1145/3437801.3441620

Per-section summary mirroring the paper's headings. Each section gives paragraph-level bullet points, all named methods, all equations, and exact quantitative results where the paper provides them.

Abstract

Collective communication is the Amdahl-bottleneck of data-parallel deep learning training.
The paper introduces SCCL (Synthesized Collective Communication Library): a systematic approach to synthesizing collective algorithms tailored to a particular hardware topology.
SCCL spans the entire Pareto frontier from latency-optimal to bandwidth-optimal implementations of a given collective.
The synthesis problem is encoded as a quantifier-free SMT formula discharged to a theorem prover (Z3). The encoding is the key to scaling.
SCCL synthesizes novel latency- and bandwidth-optimal algorithms not found in the literature on two popular topologies (DGX-1 and a Gigabyte AMD MI50 server).
It lowers algorithms to implementations on both NVIDIA (CUDA) and AMD (HIP) hardware and is competitive with hand-tuned NCCL/RCCL.

1. Introduction

Motivation:

Collective communication primitives (AllReduce, AllGather, Broadcast, Reduce, AllToAll, ReduceScatter) underpin distributed compute, particularly data-parallel deep-learning training.
For an 8.3-billion parameter Megatron model with model parallelism, ~30% of training time is spent in AllReduce on medium (10-100 MB) buffers.
Communication buffers span 6+ orders of magnitude — single-layer KB- sized gradients to multi-GB full-model exchanges — and no single algorithm is optimal across this range.

Why hand-tuned libraries fall short:

NCCL hard-codes Ring and double-binary-Tree implementations for AllReduce
- a small set of variants for other collectives.
Modern AI servers (DGX-1, AMD MI50 boxes, IPUs, TPUs) have irregular, asymmetric interconnects (mesh of NVLinks, xGMI islands bridged by PCIe, etc.) that the regular ring/tree algorithms do not exploit.
The author's claim: an automatic synthesizer can produce algorithms that are superior to hand-written ones by tailoring to the exact graph and bandwidth budget of the interconnect.

Contributions:

A formal class of k-synchronous algorithms that captures common collective schedules.
An efficient SMT encoding combining Boolean (snd_{n,c,n'}) and integer (time_{c,n}) variables.
A Pareto-Synthesize algorithm that walks the (steps, rounds/chunk) frontier corner-by-corner.
A code generator that lowers synthesized schedules to fused single- kernel SPMD CUDA / HIP code.
Empirical evidence of speedups up to 6.8x over NCCL on DGX-1.

Figure 1 caption: "NVLink topology of an NVIDIA DGX-1."

2. Overview

2.1 Running example: DGX-1

8 V100 GPUs.
Interconnect: each GPU has 6 NVLink ports at 25 GB/s.
Topology contains 2 non-overlapping Hamiltonian cycles: one cycle uses two NVLinks per edge, the other uses one. Total: 6 logical rings can be carved from the graph.
Diameter = 2 (any GPU can reach any other in <= 2 hops).

2.2 Topology

Modeled as a directed graph G with bandwidth labels per edge.
Bandwidth-bisection lower bound b_l = 1 / BisectionBW(G).
Latency lower bound a_l = Diameter(G).

2.3 Cost model

Standard (alpha, beta) model: cost = a*alpha + b*L*beta, where:
- alpha = per-message latency
- beta = inverse bandwidth
- L = total bytes transferred per GPU
- a = number of synchronization steps S
- b = bandwidth coefficient = R/C (rounds per chunk)

2.4 Bandwidth-optimal AllGather on DGX-1

NCCL Ring AllGather:
- Cost = 7*alpha + (7/6)*L*beta
- Uses 6 chunks, 7 steps, 7 rounds.
SCCL synthesized AllGather:
- Cost = 3*alpha + (7/6)*L*beta
- Same bandwidth coefficient (7/6), but only 3 steps instead of 7.
- Achieved by pipelining across both Hamiltonian cycles simultaneously.

2.5 Latency-optimal AllGather on DGX-1

SCCL synthesized:
- Cost = 2*alpha + (3/2)*L*beta
- 1 chunk, 2 steps, 2 rounds.
- 2 steps matches the diameter — provably latency-optimal.

3. Algorithm Synthesis

3.1 k-synchronous algorithms

Algorithms where total rounds R is bounded above by S + k (steps + slack).
This class is rich enough to express NCCL's ring, double-binary-tree, recursive doubling, and Bruck's algorithm.
Bounding R - S <= k keeps the SMT formula's variable count tractable.

3.2 Non-combining collective instance

A collective is parameterized by (G, S, R, P, B, pre, post):

G: total number of global chunks (after splitting per-GPU buffer into C per-GPU chunks: G = P * C for AllGather; G = C for Broadcast).
S: synchronization steps.
R: total rounds.
P: number of nodes (GPUs).
B: bandwidth relation (sum of chunks sent along each link set must not exceed link capacity per round).
pre: which (chunk, node) pairs hold the chunk initially.
post: which (chunk, node) pairs must hold the chunk at end.

3.3 Common pre/post relations (Table 1)

Relation	Definition
`All`	`[G] x [P]` (every chunk at every node)
`Root`	`[G] x {n_root}` (every chunk at one root)
`Scattered`	`{(c, n) : n = c mod P}` (chunk c only at node c mod P)
`Transpose`	`{(c, n) : n = floor(c/P) mod P}` (used for AllToAll)

3.4 Collective specifications (Table 2)

Collective	pre	post
Gather	Scattered	Root
AllGather	Scattered	All
AllToAll	Scattered	Transpose
Broadcast	Root	All
Scatter	Root	Scattered

3.5 SMT encoding (quantifier-free linear integer arithmetic)

Variables:

time_{c,n} : integer, the step at which chunk c first arrives at node n. Range [0, S].
snd_{n,c,n'} : Boolean, true iff n sends c to n' along edge (n, n').

Constraints:

ID	Statement	English meaning
C1	`forall (c, n) in pre : time_{c,n} = 0`	Pre-condition
C2	`forall (c, n) in post : time_{c,n} <= S`	Post-condition
C3	`forall (c, n) not in pre, time_{c,n} <= S => sum_{(n',n) in E} snd_{n',c,n} = 1`	Validity (received exactly once)
C4	`snd_{n,c,n'} => time_{c,n} < time_{c,n'}`	Causality
C5	`sum_{(c,(n,n'))} (snd_{n,c,n'} AND time_{c,n'} = s) <= b * r_s`	Bandwidth
C6	`sum_{1 <= s <= S} r_s = R`	Total rounds

The hybrid (Boolean + integer) encoding is the key novelty: a pure-Boolean encoding failed to terminate within 60 minutes on the same problem where the hybrid finishes in 10-100 seconds.

3.6 Combining collectives

Reduce is the inverse of Broadcast (run the broadcast schedule in reverse, replacing sends with reductions).
ReduceScatter is the inverse of AllGather.
AllReduce = ReduceScatter -> AllGather.
This duality means SCCL only needs to synthesize the non-combining half; combining variants come for free.

3.7 Pareto-Synthesize (Algorithm 1)

input:  k (slack), Coll, P, B
1:  a_l = Diameter(P, B)
2:  b_l = 1 / BisectionBW(P, B)
3:  (pre, post) = Lookup(Coll)
4:  for S = a_l, a_l + 1, ...:
5:      A = { (R, C) : S <= R <= S+k AND R/C >= b_l }
6:      for (R, C) in A in ascending order of R/C:
7:          G = ToGlobal(Coll, C)
8:          if SMT(G, S, R, P, B, pre, post) == SAT:
9:              report (S, R, C) algorithm
10:             if R/C == b_l: return       // bandwidth-optimal hit
11:             break                       // increment S, restart

Termination caveat (Section 3.7): "It is possible for this procedure to never terminate as there can sometimes be unbounded number of Pareto- optimal algorithms for certain topologies." In practice, the authors set a budget and stop.

4. Code Generation

The synthesized schedule is a list of snd_{n,c,n'} tuples annotated with time_{c,n} step indices. Lowering produces SPMD C++ + CUDA / HIP code.

Design choices:

DMA vs. kernel copies (Sec 4.1): DMA (cudaMemcpy) emits maximum-size packets on NVLink; kernel copies are limited to 128-byte cache lines. DMA is ~10% faster on NVLink. SCCL emits both and selects per benchmark.
Push vs. Pull (Sec 4.2): the push model (sender writes to receiver's buffer) avoids the round-trip request/response of pull and is ~10% faster.
Single-kernel fusion (Sec 4.3): all steps of the synthesized schedule are fused into one CUDA kernel. Per-step synchronization is done by shared flags + __threadfence_system() rather than via separate kernel launches. This eliminates the multi-launch overhead that cripples a naive cudaMemcpy-per-step lowering.
Memory transport: uses CUDA IPC memory handles for direct cross-process peer pointer access.
Thread blocks: SCCL dedicates a fixed number of thread blocks per link and empirically searches over (block count, chunk size) at install time for each generated kernel.

5. Evaluation

5.1 Experimental platforms

5.1.1 NVIDIA DGX-1

8 x V100 GPUs.
6 x NVLink ports per GPU at 25 GB/s.
2 x Intel Xeon E5-2698 v4 CPUs.

5.1.2 Gigabyte Z52 (AMD)

8 x AMD MI50 GPUs.
xGMI links @ 33 GB/s + PCIe 4.0 @ 27 GB/s.
Two xGMI islands bridged by PCIe; GPUs 1 and 5 act as the bridge.
2 x AMD EPYC 7002 CPUs.

Software

Ubuntu 20.04, CUDA 10.2, ROCm 3.5.0, Z3 v4.8.8, NCCL 2.7.8-1, RCCL.
OpenMPI w/ UCX as a third baseline.

5.2 NCCL baselines (Table 3, DGX-1)

Collective	C	S	R	Cost
AllGather / ReduceScatter	6	7	7	`7alpha + (7/6)L*beta`
AllReduce	48	14	14	`14alpha + (7/24)L*beta`
Broadcast / Reduce	6m	6+m	6+m	derived

5.3 Synthesized algorithms — DGX-1 (Table 4)

Collective	C	S	R	Optimality	Synth time
AllGather	1	2	2	Latency	0.3 s
AllGather	6	7	7	Bandwidth	4.6 s
AllGather	6	3	7	Bandwidth (low S)	6.6 s
AllReduce	8	4	4	Latency	0.3 s
AllReduce	48	14	14	Bandwidth	12.8 s
AllReduce	16	4	6	Latency-tight	0.8 s
Broadcast	2	2	2	Latency	0.1 s
Broadcast/Reduce	18	5	5	--	8.5 s
AllToAll	8	2	3	Latency	3.0 s
AllToAll	24	8	8	Bandwidth	133.7 s
AllToAll	24	2	8	Both	24.3 s

5.4 Synthesized algorithms — AMD Z52 (Table 5)

Collective	C	S	R	Optimality	Synth time
AllGather	1	4	4	Latency	0.5 s
AllGather	2	7	7	Bandwidth	1.3 s
AllGather	2	4	7	Both	1.7 s
AllReduce	8	8	8	Latency	0.4 s
AllReduce	16	14	14	Bandwidth	0.9 s
AllToAll	8	4	8	Both	8.2 s

5.4.3 Synthesis time

Most algorithms finish in < 10 seconds.
Worst case: 24-chunk AllToAll = 133.7 seconds.
The naive Boolean-only encoding failed to terminate within 60 minutes.

5.5 End-to-end performance (Figures 4-7)

5.5.1 AllGather on DGX-1

Small buffers: SCCL up to 2.2x faster than NCCL.
Large buffers: 1.14x faster.
Lowering matters: a (6, 7, 7) bandwidth-optimal algorithm using multi- cudaMemcpy is significantly slower than the same algorithm fused into a single push-copy kernel.

5.5.2 AllReduce on DGX-1

Small buffers (8-chunk variant): 1.8x faster than NCCL.
Large buffers (48-chunk variant): 1.06x faster.
NCCL faster in 4 mid-range buffer sizes — not a complete sweep.

5.5.3 AllToAll on DGX-1

Small buffers: 1.4x speedup.
Large buffers: 6.8x speedup.
NCCL has no native AllToAll; implements it as N point-to-point exchanges, which is neither bandwidth- nor latency-optimal. SCCL closes this gap automatically.

5.5.4 AMD Z52 (Figure 7)

AllGather: SCCL faster than RCCL on large buffers; slower on small/medium where RCCL is hand-tuned.
Latency-optimal (1, 4, 4) wins at small inputs; bandwidth-optimal (2, 7, 7) wins at large inputs.
Bisection bandwidth limited by PCIe bridges between xGMI islands; the authors note SCCL does not utilize both xGMI and PCIe simultaneously in the modeled topology.

5.5.5 OpenMPI / UCX

"Subpar performance compared with our NCCL baselines" — included for completeness; not a serious competitor on DGX-1.

End-to-end ML context

8.3B Megatron model: 30% of training time in AllReduce on 10-100 MB buffers (motivation only; SCCL is not integrated end-to-end with training in this paper).

System / paper	Technique	Relation to SCCL
MPI [9]	Standard collective primitives (mesh, hypercube, fat-tree)	Basis; SCCL targets MPI's interface
NCCL [18]	Hand-tuned ring + double-binary-tree	Primary baseline
RCCL [2]	AMD's NCCL fork	AMD baseline
Blink [29]	Spanning-tree packing for Broadcast/Reduce	Closest related synthesizer; NOT bandwidth-optimal, no Pareto frontier
Horovod [23]	Multi-node ring AllReduce wrapper	Complementary (intra-node vs. inter-node)
BlueConnect [7]	Hierarchical ring AllReduce	Complementary
PLink [17]	Multi-node ring with adaptive grouping	Complementary
DistFuse etc. [13, 15, 19, 30]	Compute-comm overlap	Orthogonal — operates above the kernel

The key contrast with Blink: SCCL guarantees both bandwidth and latency optimality and produces the entire Pareto frontier; Blink does neither.

7. Conclusion

"This paper introduces SCCL: a systematic method to synthesize algorithms in the Pareto-frontier spanning from the latency-optimal algorithm to the bandwidth-optimal algorithm for a given collective on an input topology. We characterize a class of algorithms that captures a broad set of known algorithms and prove Pareto-optimality of both known algorithms and synthesized new algorithms. We automatically generate an implementation of these algorithms that is competitive with manually hand-tuned communication kernels in use today."

Limitations (author-stated and inferred)

AMD topology incomplete. "We were unable to utilize both xGMI and PCIe at the same time so our model of the bandwidth ignores the dotted xGMI connections in Figure 3." (Section 5.2.2)
Bisection bandwidth bound is loose on the AMD platform because PCIe bridges dominate.
Synthesis may not terminate for topologies with unbounded Pareto- optimal algorithms (Section 3.7); a budget cutoff is needed.
Single-server only. No multi-node or hierarchical synthesis.
Offline synthesis. SCCL produces multiple algorithms per collective but does not select between them online — the user picks per buffer size.
Limited collectives. Only those expressible by the (pre, post) relations of Table 2; custom non-associative reductions are out of scope.
Hardware: only V100 + NVLink and MI50 + xGMI; A100/H100, NVSwitch, and multi-rail IB are unevaluated.

Open Problems Explicitly or Implicitly Identified

Adaptive runtime selection across the Pareto frontier as a function of buffer size and current load.
Multi-node hierarchical synthesis — combining intra-node SCCL schedules with inter-node patterns. (Realized later as MSCCL.)
Hardware co-design. SCCL's infeasibility reports point at topology weaknesses; could drive interconnect design decisions.
General reductions. Extending Reduce/ReduceScatter inversion to non-associative or non-commutative ops.

NCCL Specifics & Integration

SCCL does not replace NCCL — it is a separate synthesized library that emits SPMD CUDA/HIP code. (The follow-up MSCCL integrates synthesized algorithms into NCCL via a plugin interface.)
Protocol mapping: SCCL uses "a protocol similar to the simple protocol (i.e., NCCL_PROTO=Simple)" as its baseline — direct stores with __threadfence_system()-based completion. No LL/LL128 protocol emulation.
Wire-format: uses CUDA IPC memory handles; push model.
Synchronization: fine-grained shared flags between thread blocks + __threadfence_system(). No global kernel barrier per step.
Empirical knob search: SCCL searches at install time over (thread- block count per link, chunk size) for each generated kernel — analogous to NCCL's nChannels and chunkSize tuner knobs.
No mention of LL/LL128 protocol or NCCL's nChannels / numThreads configuration knobs by name; these live in the layer SCCL replaces.

Equations Summary

Eq	Form	Meaning
Cost model	`cost = aalpha + bL*beta`	(alpha, beta) model
Latency LB	`a_l = Diameter(G)`	min steps
Bandwidth LB	`b_l = 1/BisectionBW(G)`	min rounds/chunk
C1	`(c,n) in pre => time_{c,n}=0`	initial placement
C2	`(c,n) in post => time_{c,n}<=S`	final placement
C3	`(c,n) not in pre, time<=S => sum snd_{n',c,n}=1`	exactly one delivery
C4	`snd_{n,c,n'} => time_{c,n} < time_{c,n'}`	causality
C5	`sum (snd AND time'=s) <= b*r_s`	bandwidth
C6	`sum r_s = R`	total rounds

Cross-Cutting Empirical Take-Aways

Take-away	Derived from
Pareto frontier is real and non-trivial	Table 4: same `R/C` at S=3 vs. S=7
Latency-optimal dominates at small buffers	AllGather small buffers: 2.2x speedup
Bandwidth-optimal dominates at large buffers	AllReduce 48-chunk variant at 1 GB
Single-kernel fusion is decisive	Multi-cudaMemcpy lowering loses ~10-20%
Push model > pull model	~10% gain
Boolean-only SMT encoding is intractable	Naive encoding > 60 min vs. 24 s hybrid
Vendor libraries can be beaten by 1.06x-6.8x with no extra hardware	All speedup tables
AllToAll is where vendor libraries are weakest	6.8x speedup (NCCL fakes it via N-pt-to-pt)

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads, chunkSize) to minimize collective completion time on HPC GPU clusters. Its policy conditions on state features: log-binned message size, model intensity I = C/D, local batch size, topology fingerprint (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet), and an LSTM-encoded window of recent collective timings. Reward is -collective_wall_clock_us.

SCCL is both a complementary technology and a research neighbor that shapes DynamICCL's design priors:

SCCL finding	DynamICCL design implication
Pareto frontier exists per (topology, collective)	Action-space prior: per-topology bias toward latency-optimal at small buffers, bandwidth-optimal at large
Buffer size is the primary regime selector	Confirms log-binned message-size is the dominant state feature
Topology graph asymmetry changes the schedule (DGX-1 vs. Z52)	Topology fingerprint must encode link-graph identity, not just bandwidth
Vendor AllToAll is weak (6.8x gap)	DynamICCL exploration budget should over-sample AllToAll regimes — large gains are available
Single-kernel fusion vs. multi-launch is an implementation discontinuity	numThreads + nChannels actions interact non-linearly — discrete branches in policy
Latency-optimal (S=2) vs. bandwidth-optimal (S=7) at same `R/C`	Algorithm action is not 1D; Ring/Tree alone underspecifies — extend to SCCL/MSCCL plugin variants when available
Reward = absolute latency, not algBw	Confirms reward `r = -collective_wall_clock_us`
User picks Pareto point per buffer size (Limitation #1)	Exactly the gap DynamICCL fills with a learned policy

Specific design priors:

State-vector features. SCCL's per-topology Pareto curves vindicate DynamICCL's topology-fingerprint feature. The 4-bin topology embedding should at minimum encode (NVLink-mesh, NVLink+PCIe-island, PCIe-only, IB-only, Ethernet); SCCL synthesizes substantially different schedules on DGX-1 vs. Z52, so the policy must condition on this. A bisection- bandwidth scalar feature is also useful as a continuous proxy.
Action-space priors. SCCL exposes that within "AllReduce" there are at least three Pareto-optimal variants on DGX-1 ((8,4,4), (16,4,6), (48,14,14)). The NCCL algorithm axis (Ring / Tree / CollNet / NVLS) is a coarsening of this richer space. When MSCCL or SCCL plugins are loaded into NCCL, DynamICCL's action vocabulary must expand to include them; otherwise the policy is bounded by NCCL's 3-4 hard-coded variants.
Exploration budget. SCCL's results show vendor libraries are weakest on AllToAll (6.8x gap). DynamICCL should bias exploration epsilon higher on AllToAll-heavy workloads (e.g., MoE training, sequence parallelism). On AllReduce-heavy workloads (data parallelism) exploitation can dominate sooner.
Reward shaping. SCCL reports speedups in absolute latency and never normalizes by bus-bandwidth. This is direct evidence that collective wall-clock is the right reward signal — DynamICCL's r = -collective_wall_clock_us is correct, and any algBw/busBw proxy would obscure the latency-vs-bandwidth trade-off SCCL exposes.
Discontinuity awareness. SCCL shows a 10% step from kernel-launch discontinuity (multi-cudaMemcpy vs. single fused kernel). The numThreads + nChannels interaction in NCCL has similar discontinuities (e.g., 1 channel 128 threads vs. 8 channels 512 threads). DynamICCL's exploration policy should not interpolate linearly; treat (low-channel, large-chunk) and (high-channel, small- chunk) as separate branches.
Research positioning. SCCL synthesizes the best schedule for a given (topology, collective, buffer-size). DynamICCL chooses, at run time, which protocol / chunk size / channel count / algorithm a pre-existing NCCL kernel uses. The two layers are complementary: a future stack would (a) load SCCL/MSCCL-synthesized algorithms into NCCL via the algorithm plugin interface and (b) let DynamICCL select among them at run time. This positioning differentiates DynamICCL from SCCL/MSCCL in the funding/publication landscape: SCCL fixes the algorithm offline, DynamICCL chooses it online.
Open-problem alignment. SCCL's stated Limitation #1 — "user picks the Pareto point per buffer size" — is the precise problem DynamICCL solves. A learned policy that observes (message size, topology, recent timings, model intensity) and picks the Pareto-optimal variant automatically is the natural successor to SCCL's offline synthesis. This makes DynamICCL the runtime selector for the static catalog that SCCL/MSCCL produces.
Methodological lesson. SCCL's hybrid SMT encoding succeeded where pure Boolean failed (>60 min). For DynamICCL, the analogous lesson is that the policy's input representation matters: a discrete one-hot over (algorithm, protocol) may scale poorly compared to an embedded action representation that captures latency-vs-bandwidth duality directly. Worth exploring: factorized action heads (one for latency-favoring vs. bandwidth-favoring branch, one for granularity).