SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling — Detailed Summary
Jiamin Cao, Shangfeng Shi, Jiaqi Gao, Weisen Liu, Yifan Yang, Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian, Ying Liu, Mingwei Xu, Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud + Tsinghua University | ACM SIGCOMM 2025, Coimbra, Portugal | September 8-11 2025 | DOI: 10.1145/3718958.3750499
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- Collective-communication schedule quality is decisive for ML training efficiency and GPU-cluster utilization.
- Open-source CCLs (NCCL, RCCL) ship fixed schedules (Ring, double binary tree) that cannot adjust to varying topology or model requirements.
- State-of-the-art synthesizers (TE-CCL, TACCL) use Mixed Integer Linear Programming (MILP) but encounter search-space explosion and scalability challenges on production fabrics.
- SyCCL is proposed as a scalable collective schedule synthesizer that produces near-optimal schedules in tens of minutes for production- scale ML jobs.
- It exploits collective symmetry and topology symmetry to decompose collective demands into smaller sub-demands within smaller topology subsets, then synthesizes and integrates the corresponding sub-schedules.
- Experiments on a 32-A100 testbed and a production-scale H800 simulation report up to 127% performance improvement while reducing synthesis time by 2-4 orders of magnitude vs. prior art.
1. Introduction
Communication wall in ML training:
- For models like GPT-22B and LLaMa-7B, communication accounts for
30% of iteration time.
- Production GPU clusters have grown to hundreds of GPUs across many servers; communication scheduling is now a first-order performance problem.
Two camps of CCLs and their gaps:
- Fixed-schedule libraries (NCCL, RCCL): low overhead but template- bound; cannot adapt to new bandwidth ratios or topologies.
- Synthesis-based libraries (SCCL/TACCL/TE-CCL): produce topology- aware schedules but solver time blows up at production scale.
SyCCL's contribution thesis:
- A new abstraction (the Sketch) and a search-and-synthesis pipeline that exploits symmetry to deliver near-optimal schedules in tens of minutes for production-scale topologies.
2. Background and Motivation
2.1 Collective Communication
- Collectives like AllGather, ReduceScatter, AllToAll move large volumes of data between every GPU pair following well-defined semantics.
- Schedules differ in path (which links each chunk traverses) and timing (which epoch each chunk is sent).
2.2 Search Space for Collective Schedules
- For a topology with N GPUs and L links, the number of possible schedules is combinatorial in (chunks x epochs x paths).
- Prior synthesizers attack this directly with SAT (SCCL) or MILP (TACCL, TE-CCL); both have unfavorable scaling.
2.3 Limitations of Existing Synthesizers
- SCCL (SAT-based): >24 hours for a 16-GPU AllGather. Cannot scale beyond a single chassis.
- TACCL (sketch + MILP, human-in-the-loop): fails to synthesize a 128-GPU AllGather within 8 hours; sketch quality is operator- dependent.
- TE-CCL (epoch-based MILP, automatic): faster than SCCL/TACCL, but loses up to 20% schedule performance and struggles to pick epoch duration tau on multi-dimensional networks.
- None of the three solves the regime SyCCL targets: 128-512 GPUs on multi-rail fabrics with mismatched NVLink/network bandwidth.
3. Insight and Design Overview
3.1 Insight: Topology and Collective Symmetry
- Real cluster fabrics are highly symmetric: NVLink islands within a server are isomorphic; rails across servers are isomorphic; collective semantics treat every rank identically.
- Therefore optimal schedules contain repeating sub-schedules across isomorphic groups — solving the same sub-problem N times is wasted work.
3.2 SyCCL Sketch (Concept)
- A Sketch is a structured decomposition of a collective demand into sub-demands, each scoped to a small group of GPUs in the topology.
- Each sub-demand is solved independently by MILP; results are merged into a global schedule.
- Sketches are derived by enumerating dimensions D_k (NVLink, rail, inter-server), groups G_{d,k}, and source/destination GPU subsets V^s -> V^r.
3.3 SyCCL Overview (Workflow)
- Stage 1: Sketch Exploration — enumerate, prune, combine.
- Stage 2: Schedule Synthesis — per-sketch MILP, isomorphism reuse, parallel solves.
- Stage 3: Global Selection — simulator-driven selection of the best merged schedule.
- Output: MSCCL-executor XML, runnable on NCCL/RCCL.
4. Sketch Exploration
4.1 Searching for Sketches
Enumeration-based search:
- Enumerate dimensions D_k of the topology (NVLink island, rail, inter-server level).
- Enumerate isomorphic groups G_{d,k} within each dimension.
- Enumerate source GPU set V^s and destination GPU set V^r per sub-demand.
Three pruning strategies:
- P1 — Isomorphism dedup. Drop sketches isomorphic to ones already enumerated.
- P2 — Consistency. Enforce uniform destination-to-source ratios across symmetric groups so the merged schedule maintains symmetry.
- P3 — Relay limits (Scatter). Bound how many relay hops a chunk can traverse to prevent combinatorial blowup.
4.2 Generating Sketch Combinations (Bandwidth Balancing)
- Mapping-based replication mechanism. When a Sketch leaves some groups idle (e.g., rails underused), a replica Sketch is generated by mapping the original GPU/group identities onto the idle ones. This balances workload across the asymmetric NVLink:rail capacity ratio.
- Chunk allocation mechanism. Solve a small linear problem with |D| variables — fractions t_i of total chunks routed via dimension d_i — under the constraint that workload proportion equals capacity proportion: w_d = u_d. This produces a per-dimension chunk split matched to the actual bandwidth ratio.
4.3 Extending to All-To-All Collectives
- AllToAll has dense source-destination demand and can produce far more sketches than AllGather/ReduceScatter.
- SyCCL caps the number of stages at <= 3, removing the long tail of high-stage sketches whose marginal benefit does not justify their solve time.
- Empirically, capping stages saves 95%-97% of synthesis time (Fig. 17b) with negligible schedule-quality loss.
5. Schedule Synthesis
5.1 Synthesizing Sub-Schedules (MILP)
Model:
- alpha-beta link model: t = alpha + beta * s, where alpha is link latency, beta is reciprocal bandwidth, s is chunk size.
- Bandwidth constraint: each link transmits volume <= tau / beta within an epoch of duration tau.
- Workload equations:
- Broadcast: w_{d,g} = sum_{k in K} |V^r_{k,d,g}|
- Scatter: w_{d,g} = sum_{k in K, v in V^r_{k,d,g}} (f(v) + 1), where f(v) = number of descendants of v in the relay tree.
- Bandwidth-proportion constraint: w_d = u_d (workload share equals capacity share per dimension).
- Epoch-duration helper: g(r) = ceil(f(r)) - f(r), where f(r) = (alpha + beta * s) / (r * beta * s). Used to pick tau values that minimize epoch wastage.
5.2 Synthesizing the Optimal Schedule (Simulator-based Selection)
- Each Sketch yields multiple candidate sub-schedules from MILP.
- Sub-schedules are merged into candidate global schedules.
- A fine-grained simulator based on ASTRA-sim evaluates each candidate's completion time and the best-scoring schedule wins.
5.3 Accelerating Synthesis
Two-step synthesis:
- Coarse-grained pass: large tau, fast filter to discard obviously worse sketches.
- Fine-grained pass: small tau on the surviving candidates for accurate selection.
Isomorphism reuse:
- If two sub-demands are isomorphic, solve once and reuse.
Parallelism:
- Independent sub-demands solve concurrently on the 192-core synthesis host.
6. Implementation
- Codebase: ~7K lines of C++.
- Components: profiler (measures alpha, beta per link), synthesizer (Stages 1-3), executor wrapper that emits MSCCL-XML.
- Synthesis host: 192-core Intel Xeon Platinum 8469C.
- Executor: MSCCL-executor (NCCL-compatible), supports tunable number of communication channels.
7. Evaluation
7.1 Setup
| Component | Value |
|---|---|
| A100 testbed | 4 servers, 8x A800 GPUs/server (NVLink 180 GB/s), 4x 200 Gb/s RDMA NICs/server, 2-layer Clos |
| H800 fabric (simulated) | 64 servers, 8x H800 GPUs/server (NVLink), 8x 400 Gb/s RDMA NICs, multi-rail (HPN-style) |
| Synthesis host | 192-core Intel Xeon Platinum 8469C |
| Codebase | SyCCL ~7K lines C++ |
| MILP solver | Standard MILP backend; sub-MILPs solved in parallel |
| Simulator | ASTRA-sim-based fine-grained simulator |
| Executor | MSCCL-executor |
| Baselines | NCCL, TE-CCL, TACCL, SCCL |
| Workloads | AllGather, AllToAll, ReduceScatter |
| Data sizes | 1 KB to 4 GB |
| End-to-end models | GPT-3 6.7B (DP=16), GPT-22B, LLaMa-7B |
| Metric | Bus bandwidth (busbw); synthesis wall-clock; iteration time |
7.2 Schedule Performance
Bus-bandwidth gains (Section 7.2):
- 32-A100 testbed: SyCCL up to 108% faster than NCCL.
- 32-A100 testbed: SyCCL up to 91% faster than TE-CCL.
- 512-GPU H800 simulated: SyCCL up to 127% faster than NCCL on AllGather.
- TE-CCL on 32 A100 is sometimes worse than NCCL; SyCCL is always >= NCCL (up to 1.04x even on small sizes where NCCL is highly optimized).
Why SyCCL wins:
- NCCL Ring is implicitly tuned for a 7:1 NVLink:network bandwidth ratio. H800 ships 14:1. NCCL flows therefore underuse NVLink by half.
- SyCCL's chunk-allocation linear program solves for the actual 14:1 ratio and matches workload to capacity per dimension.
- At 512 GPUs, NCCL's Ring is a 511-hop chain — latency dominates. SyCCL synthesizes a 2D schedule that reduces effective hop count.
Where SyCCL is weakest:
- Very small message sizes where NCCL's hand-tuned reduction and coordination kernels outperform the MSCCL-executor's runtime overhead. Sec. 7.2 explicitly flags this as an MSCCL-executor limitation, not a SyCCL limitation.
7.3 Synthesis Time
| Topology / Collective | SyCCL (mean) | TE-CCL | Speedup |
|---|---|---|---|
| 16 A100 AllGather | 0.8 s | 1193 s | 1554x |
| 16 A100 AllToAll | 3.6 s | 15759 s | 4321x |
| 64 H800 AllGather | 1.6 s | 28200 s | 17286x |
| 512 H800 AllGather | 37 min (85.5 s min, 14146 s max) | TIMED OUT | n/a |
SCCL / TACCL on the same scales:
- SCCL: > 24 hours on a 16-GPU AllGather. Infeasible.
- TACCL: fails to complete a 128-GPU AllGather within 8 hours.
7.4 Impact of Varying Synthesizing Policy
Pruning ablations (Fig. 17):
- Redundancy + consistency pruning: synthesis time reduced 20.8% to 48.1% (Fig. 17a).
- Capping AllToAll stages at <=3: 95%-97% synthesis-time savings (Fig. 17b) with negligible schedule-quality loss.
Coarse-vs-fine tau:
- Two-step synthesis (coarse filter + fine resolve) is decisive for the 512-GPU scale; without it the worst-case solve time exceeds the experiment timeout.
7.5 End-to-End Performance
- GPT-3 6.7B (DP=16): SyCCL delivers 6.3% iteration-time speedup vs. NCCL.
- Larger models (GPT-22B, LLaMa-7B) used as motivating workloads — communication is >30% of iteration time, so even single-digit collective-time wins translate to meaningful end-to-end gains.
8. Discussion and Limitations
- No optimality guarantee. Sketch decomposition + per-sketch MILP yields near-optimal but not provably global-optimal schedules.
- Asymmetric collectives unsupported. AllToAll(v) for MoE training, where per-rank data sizes differ, breaks the symmetry abstraction.
- Heterogeneity / irregular fabrics. If a cluster has unequal rails or bespoke wiring, isomorphic-group enumeration fails and Stage 1 cannot decompose efficiently.
- Static schedules only. Pre-computed XML cannot react to multi-tenant interference, link drops, or workload shifts.
- Coordination cost at small sizes. NCCL's tightly-tuned coordination/reduction kernels can outperform the MSCCL-executor at very small messages — flagged as a runtime gap, not a schedule-quality gap.
9. Related Work
| System | Approach | Headline failure mode |
|---|---|---|
| NCCL | Fixed Ring / double-binary-tree | Implicit 7:1 bandwidth assumption; 511-hop ring at 512 GPUs |
| RCCL | NCCL port for AMD | Same template constraints |
| SCCL | SAT-based synthesis | >24h on 16-GPU AllGather |
| TACCL | Human-sketched MILP | Fails 128-GPU AllGather in 8h |
| TE-CCL | Epoch-based MILP, multi-commodity flow | Up to 20% schedule quality loss; tau selection unstable on multi-dim networks |
| MSCCL / MSCCLang / GC3 | DSLs / IRs for collectives (cited as related, not directly benchmarked here) | Codegen-side; orthogonal to schedule synthesis |
| HPN | Alibaba's multi-rail high-perf network design | Cited as the topology family that motivates SyCCL |
10. Conclusion
- SyCCL is the first synthesizer that scales to 512-GPU multi-rail fabrics while delivering >NCCL bus-bandwidth on AllGather, AllToAll, and ReduceScatter.
- The win comes from exploiting collective + topology symmetry to decompose monolithic MILP into small parallel sub-problems.
- 1554x-17286x synthesis speedup vs. TE-CCL, 127% busbw improvement vs. NCCL, 6.3% end-to-end iteration-time speedup on GPT-3 6.7B.
Appendix Highlights
- A. MILP Modeling and Solving. Full MILP formulation with variables, constraints, and the alpha-beta epoch model.
- B. Other GPU cluster topology examples. Demonstrates the symmetry decomposition on additional production fabrics.
- C. Comparison to expert-optimized schedules. SyCCL's automated schedules approach hand-tuned expert performance on the topologies examined.
11. Cross-Cutting Quantitative Take-Aways
| Take-away | Derived from |
|---|---|
| 1554x-17286x synthesis speedup vs. TE-CCL | Table 5 (synthesis-time matrix) |
| 127% busbw vs. NCCL on 512-GPU AllGather | Sec. 7.2, H800 simulation |
| 108% busbw vs. NCCL on 32-A100 testbed | Sec. 7.2 |
| 91% busbw vs. TE-CCL on 32-A100 testbed | Sec. 7.2 |
| 6.3% iteration-time speedup, GPT-3 6.7B | Sec. 7.5 |
| Pruning saves 20.8-48.1% synth time | Fig. 17a |
| Stage-cap saves 95-97% AllToAll synth time | Fig. 17b |
| 7:1 vs. 14:1 NVLink:network ratio is the latent gap NCCL leaves on table | Sec. 7.2 narrative |
| 511-hop Ring at 512 GPUs is the latency wall NCCL hits | Sec. 7.2 narrative |
12. Named Methods, Acronyms, and Concepts
- SyCCL. Symmetry-exploiting CCL synthesizer (this paper).
- Sketch. Core abstraction: decomposition of a collective demand into sub-demands over isomorphic topology subsets.
- MILP. Mixed Integer Linear Program (per-sketch solve).
- alpha-beta model. t = alpha + beta * s; standard latency- bandwidth link model.
- tau. Epoch duration in the MILP.
- busbw. Bus bandwidth, the standard NCCL benchmarking metric.
- MSCCL / MSCCL-executor. Microsoft Collective Communication Library; SyCCL's runtime backend that executes XML schedules on NCCL/RCCL.
- HPN. High-Performance Network — Alibaba's multi-rail fabric family that motivates SyCCL.
- ASTRA-sim. The fine-grained network simulator used by SyCCL's Stage 3 for candidate-schedule selection.
- Two-step synthesis. Coarse-tau filter pass followed by fine-tau resolve pass.
- Mapping-based replication. Generates Sketch replicas to balance load across asymmetric NVLink/rail capacities.
13. Discussion of NCCL Specifically
- NCCL is the dominant baseline. Its Ring algorithm assumes a 7:1 NVLink:network bandwidth ratio (implicit in flow allocation).
- Modern fabrics (H800 with 8x 400 Gb/s NICs) ship 14:1, leaving roughly half of NVLink unused under NCCL.
- At 512 GPUs, NCCL's Ring becomes a 511-hop chain — latency-bound.
- NCCL's small-message coordination/reduction kernels are tightly tuned and remain a baseline that synthesized schedules struggle to beat at very small sizes (a runtime gap on the MSCCL-executor side, flagged in Sec. 7.2).
- SyCCL targets the regime where NCCL leaves performance on the table: medium-to-large messages on multi-dimensional, multi-rail fabrics at 32-512 GPUs.
14. Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects
per-collective algorithm (Ring/Tree/CollNet/NVLS),
protocol (LL/LL128/Simple), nChannels,
numThreads, and chunkSize to minimize
collective wall-clock time on HPC GPU clusters. State features include
log-binned message size, model intensity I = C/D, local batch size,
topology fingerprint, and an LSTM-encoded recent- collective timing
window. Reward is -collective_wall_clock_us. SyCCL provides
DynamICCL with both action-space priors and
state-feature design evidence, plus a clear
research positioning.
Direct mappings to DynamICCL design:
| SyCCL finding | DynamICCL design implication |
|---|---|
| NCCL Ring assumes 7:1 NVLink:network ratio; H800 is 14:1; gap = 50% NVLink unused | Topology fingerprint must encode the actual per-cluster bandwidth ratio, not just a categorical fabric label. Add a continuous "NVLink:NIC ratio" feature. |
| 511-hop Ring at 512 GPUs is latency-bound | At rank > ~256 the action-space prior must penalize Ring vs. Tree / 2D / hierarchical algorithms. |
| Sketch decomposition exploits symmetry | DynamICCL's policy can amortize learning across isomorphic clusters by using a per-fabric-class embedding rather than a single global policy. |
| Chunk-allocation w_d = u_d balances load across dimensions | chunkSize is the right tuner-level analog of SyCCL's
t_i fractions; the RL agent should learn cluster-specific chunkSize
priors that match capacity ratios. |
| Two-step (coarse-then-fine) synthesis is decisive at scale | RL exploration budget should be similarly hierarchical: cheap coarse exploration of (algorithm, protocol) then fine-grained search of (chunkSize, nChannels, numThreads). |
| ASTRA-sim used to score candidates by completion time | DynamICCL's reward -collective_wall_clock_us matches;
busbw is for plotting only. |
| Up to 20% perf left on table by TE-CCL's coarse tau | Demonstrates that careful chunk/epoch granularity matters; DynamICCL's chunkSize action axis is real. |
| SyCCL pruning saves 20-97% synthesis time | Aggressive action-space pruning is fair — collapse {nChannels, numThreads} at small messages, reserve exploration budget for big-message regimes. |
Specific design priors for the RL agent:
Topology fingerprint extension. Add a continuous feature for intra-server NVLink bandwidth and a continuous feature for per-NIC network bandwidth. Their ratio is the latent quantity SyCCL exploits. DynamICCL's policy can then differentiate H800 (14:1) from A100 (7:1) automatically.
Algorithm prior at scale. Above ~256 ranks, bias action sampling away from Ring and toward Tree, CollNet, or NVLS. SyCCL's 2D schedule supersedes Ring at 512 GPUs — DynamICCL should inherit this prior.
chunkSize as load-balancer. SyCCL's chunk-allocation linear program shows chunkSize is a real bandwidth-balancing knob. The RL agent should learn that on multi-rail / multi-island fabrics, the right chunkSize matches the dimension-wise capacity proportions.
Reward design (validation). SyCCL evaluates candidates in ASTRA-sim using completion time — exactly what DynamICCL's reward
-collective_wall_clock_usmeasures. This is an independent confirmation that bandwidth proxies are the wrong reward signal.Hierarchical exploration. Mirror SyCCL's two-step synthesis: first coarse exploration of (algorithm, protocol) at 1-2 candidate chunkSize values; then fine exploration over chunkSize / nChannels / numThreads. This bounds total RL training cost.
Action-space pruning at small messages. SyCCL prunes 95-97% of AllToAll stages with negligible quality loss. DynamICCL should similarly fix nChannels=1 and numThreads=128 at message sizes <=64 KiB — exploration there is wasteful.
Research positioning. SyCCL operates at the schedule- synthesis layer (offline, static XML). DynamICCL operates at the runtime-tuner layer (online, per-collective parameter selection). Both consume the same NCCL stack; they compose. An integrated story: SyCCL synthesizes the best static schedule for a fabric; DynamICCL chooses among synthesized schedules and tunes parameters per collective at run time. SyCCL's open problem #4 ("multi-tenant / dynamic environments") is exactly DynamICCL's value proposition.
Open-problem alignment. SyCCL Open Problem #1 (faster intra- group solvers) does not concern DynamICCL — that is a synthesis- side problem. SyCCL Open Problem #2 (asymmetric collectives / MoE AllToAll(v)) concerns both: DynamICCL needs an action-space extension for asymmetric workloads. SyCCL Open Problem #3 (heterogeneous / irregular fabrics) maps cleanly onto DynamICCL's topology-conditioned policy heads. SyCCL Open Problem #4 (multi- tenant dynamic) is DynamICCL's central use case — the online-RL tuner observing recent-collective timing windows is the answer SyCCL gestures at but cannot implement statically.
Cross-paper context (with 0031-0035 and 0030). Together with SCCL (search), TACCL (sketch-MILP), MSCCLang (DSL), GC3 (compiler), and TE-CCL (multi-commodity flow), SyCCL is the most production- ready synthesizer to date. The family-wide pattern: every paper pays a synthesis-time cost up front to amortize over many collective invocations. DynamICCL completes the picture by being the online layer that selects among these synthesized schedules and tunes the residual parameters — closing the loop the synthesis family deliberately leaves open.