SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling — Detailed Summary

Jiamin Cao, Shangfeng Shi, Jiaqi Gao, Weisen Liu, Yifan Yang, Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian, Ying Liu, Mingwei Xu, Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud + Tsinghua University | ACM SIGCOMM 2025, Coimbra, Portugal | September 8-11, 2025 | DOI: 10.1145/3718958.3750499

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction

Communication wall in ML training:

Two camps of CCLs and their gaps:

SyCCL's contribution thesis:


2. Background and Motivation

2.1 Collective Communication

2.2 Search Space for Collective Schedules

2.3 Limitations of Existing Synthesizers


3. Insight and Design Overview

3.1 Insight: Topology and Collective Symmetry

3.2 SyCCL Sketch (Concept)

3.3 SyCCL Overview (Workflow)


4. Sketch Exploration

4.1 Searching for Sketches

Enumeration-based search:

  1. Enumerate dimensions D_k of the topology (NVLink island, rail, inter-server level).
  2. Enumerate isomorphic groups G_{d,k} within each dimension.
  3. Enumerate source GPU set V^s and destination GPU set V^r per sub-demand.
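The three enumeration levels above can be sketched as nested loops. This is a minimal illustration, not SyCCL's implementation: the toy topology (2 servers x 4 GPUs), the helper names, the restriction to single-source sub-demands, and the omission of the inter-server level are all assumptions made to keep the example small.

```python
from itertools import combinations

# Hypothetical toy topology: GPU id = server * GPUS_PER_SERVER + local index.
NUM_SERVERS, GPUS_PER_SERVER = 2, 4

def groups_by_dimension():
    """Map each dimension to its groups of GPU ids (all groups in a dimension have the same shape)."""
    nvlink = [[s * GPUS_PER_SERVER + g for g in range(GPUS_PER_SERVER)]
              for s in range(NUM_SERVERS)]
    rail = [[s * GPUS_PER_SERVER + g for s in range(NUM_SERVERS)]
            for g in range(GPUS_PER_SERVER)]
    return {"nvlink": nvlink, "rail": rail}

def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)

def enumerate_sub_demands():
    """Yield (dimension, group, sources, destinations) candidates."""
    for dim, groups in groups_by_dimension().items():      # step 1: dimensions
        for group in groups:                                # step 2: groups in the dimension
            for src in group:                               # step 3: source set (single source here)
                rest = [g for g in group if g != src]
                for dsts in nonempty_subsets(rest):         # step 3: destination set
                    yield dim, tuple(group), (src,), dsts

candidates = list(enumerate_sub_demands())
print(len(candidates), "candidate sub-demands before pruning")
```

Even this toy case shows why pruning matters: the destination-set count grows exponentially with group size.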

Three pruning strategies:

4.2 Generating Sketch Combinations (Bandwidth Balancing)

4.3 Extending to All-To-All Collectives


5. Schedule Synthesis

5.1 Synthesizing Sub-Schedules (MILP)

Model:

5.2 Synthesizing the Optimal Schedule (Simulator-based Selection)

5.3 Accelerating Synthesis

Two-step synthesis:

Isomorphism reuse:

Parallelism:


6. Implementation


7. Evaluation

7.1 Setup

Component | Value
A100 testbed | 4 servers, 8x A800 GPUs/server (NVLink 180 GB/s), 4x 200 Gb/s RDMA NICs/server, 2-layer Clos
H800 fabric (simulated) | 64 servers, 8x H800 GPUs/server (NVLink), 8x 400 Gb/s RDMA NICs, multi-rail (HPN-style)
Synthesis host | 192-core Intel Xeon Platinum 8469C
Codebase | SyCCL ~7K lines C++
MILP solver | Standard MILP backend; sub-MILPs solved in parallel
Simulator | ASTRA-sim-based fine-grained simulator
Executor | MSCCL-executor
Baselines | NCCL, TE-CCL, TACCL, SCCL
Workloads | AllGather, AllToAll, ReduceScatter
Data sizes | 1 KB to 4 GB
End-to-end models | GPT-3 6.7B (DP=16), GPT-22B, LLaMa-7B
Metric | Bus bandwidth (busbw); synthesis wall-clock; iteration time
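For readers unfamiliar with the busbw metric, the sketch below follows the convention documented by nccl-tests: algorithm bandwidth (bytes over time) is scaled by a per-collective factor so that results are comparable across rank counts. Whether the paper uses exactly this definition is an assumption; the function name and dictionary keys are illustrative.

```python
def bus_bandwidth(collective, size_bytes, seconds, n_ranks):
    """Bus bandwidth in bytes/s, using the nccl-tests correction factors.

    size_bytes is the total data size as nccl-tests reports it
    (for AllGather, the full gathered buffer).
    """
    algbw = size_bytes / seconds
    n = n_ranks
    factors = {
        "allgather": (n - 1) / n,
        "reducescatter": (n - 1) / n,
        "alltoall": (n - 1) / n,
        "allreduce": 2 * (n - 1) / n,
    }
    return algbw * factors[collective]

# Example: an 8 GB AllGather over 8 ranks that completes in 1 s.
print(bus_bandwidth("allgather", 8e9, 1.0, 8) / 1e9, "GB/s")
```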

7.2 Schedule Performance

Bus-bandwidth gains (Section 7.2):

Why SyCCL wins:

Where SyCCL is weakest:

7.3 Synthesis Time

Topology / Collective | SyCCL (mean) | TE-CCL | Speedup
16 A100 AllGather | 0.8 s | 1193 s | 1554x
16 A100 AllToAll | 3.6 s | 15759 s | 4321x
64 H800 AllGather | 1.6 s | 28200 s | 17286x
512 H800 AllGather | 37 min (85.5 s min, 14146 s max) | Timed out | n/a
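As a sanity check on the speedup column, the ratios can be recomputed from the rounded times listed in the table. The recomputed values land within a few percent of the reported ones; the small differences are presumably because the reported speedups were computed from unrounded means, which is an assumption of this note rather than something the summary states.

```python
def speedup(baseline_s, syccl_s):
    """Ratio of baseline synthesis time to SyCCL synthesis time."""
    return baseline_s / syccl_s

# (SyCCL mean seconds, TE-CCL seconds, reported speedup) copied from the table.
rows = {
    "16 A100 AllGather": (0.8, 1193.0, 1554),
    "16 A100 AllToAll": (3.6, 15759.0, 4321),
    "64 H800 AllGather": (1.6, 28200.0, 17286),
}

for name, (syccl_s, teccl_s, reported) in rows.items():
    ratio = speedup(teccl_s, syccl_s)
    gap_pct = 100.0 * abs(ratio - reported) / reported
    print(f"{name}: {ratio:.0f}x from rounded times, reported {reported}x ({gap_pct:.1f}% apart)")
```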

SCCL / TACCL on the same scales:

7.4 Impact of Varying Synthesizing Policy

Pruning ablations (Fig. 17):

Coarse-vs-fine tau:

7.5 End-to-End Performance


8. Discussion and Limitations


System | Approach | Headline failure mode
NCCL | Fixed Ring / double-binary-tree | Implicit 7:1 bandwidth assumption; 511-hop ring at 512 GPUs
RCCL | NCCL port for AMD | Same template constraints
SCCL | SAT-based synthesis | >24h on 16-GPU AllGather
TACCL | Human-sketched MILP | Fails 128-GPU AllGather in 8h
TE-CCL | Epoch-based MILP, multi-commodity flow | Up to 20% schedule quality loss; tau selection unstable on multi-dim networks
MSCCL / MSCCLang / GC3 | DSLs / IRs for collectives (cited as related, not directly benchmarked here) | Codegen-side; orthogonal to schedule synthesis
HPN | Alibaba's multi-rail high-perf network design | Cited as the topology family that motivates SyCCL

10. Conclusion


Appendix Highlights


11. Cross-Cutting Quantitative Take-Aways

Take-away | Derived from
1554x-17286x synthesis speedup vs. TE-CCL | Table 5 (synthesis-time matrix)
127% busbw vs. NCCL on 512-GPU AllGather | Sec. 7.2, H800 simulation
108% busbw vs. NCCL on 32-A100 testbed | Sec. 7.2
91% busbw vs. TE-CCL on 32-A100 testbed | Sec. 7.2
6.3% iteration-time speedup, GPT-3 6.7B | Sec. 7.5
Pruning saves 20.8-48.1% synth time | Fig. 17a
Stage-cap saves 95-97% AllToAll synth time | Fig. 17b
7:1 vs. 14:1 NVLink:network ratio is the latent gap NCCL leaves on the table | Sec. 7.2 narrative
511-hop Ring at 512 GPUs is the latency wall NCCL hits | Sec. 7.2 narrative
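The 511-hop figure follows from the fact that a ring AllGather over n ranks needs n - 1 sequential steps. The sketch below counts steps for a flat ring and, for contrast, for a hypothetical two-level schedule that runs one ring per dimension. The two-level variant and the per-step latency value are illustrative assumptions, not SyCCL's synthesized schedule or a measured number.

```python
def ring_steps(n_ranks):
    """Sequential steps of a flat ring AllGather."""
    return n_ranks - 1

def two_level_steps(n_servers, gpus_per_server):
    """Steps when a ring runs across servers, then a ring runs inside each server."""
    return (n_servers - 1) + (gpus_per_server - 1)

def latency_floor_us(steps, alpha_us):
    """Lower bound on completion time from per-step latency alone."""
    return steps * alpha_us

ALPHA_US = 5.0  # assumed per-step latency in microseconds (illustrative only)

flat = ring_steps(512)
hier = two_level_steps(64, 8)
print(f"flat ring: {flat} steps, latency floor {latency_floor_us(flat, ALPHA_US):.0f} us")
print(f"two-level: {hier} steps, latency floor {latency_floor_us(hier, ALPHA_US):.0f} us")
```

For small messages the latency term dominates, which is why the step count matters more than bandwidth at 512 GPUs.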

12. Named Methods, Acronyms, and Concepts


13. Discussion of NCCL Specifically


14. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective algorithm (Ring/Tree/CollNet/NVLS), protocol (LL/LL128/Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time on HPC GPU clusters. State features include log-binned message size, model intensity I = C/D, local batch size, topology fingerprint, and an LSTM-encoded recent-collective timing window. Reward is -collective_wall_clock_us. SyCCL provides DynamICCL with both action-space priors and state-feature design evidence, plus a clear research positioning.
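A minimal sketch of the scalar state features and the reward described above. The class and function names are hypothetical, base-2 binning is an assumption (the summary only says "log-binned"), C and D are taken as given inputs because they are not defined here, and the topology fingerprint and LSTM window are omitted.

```python
from dataclasses import dataclass

@dataclass
class CollectiveState:
    msg_size_bin: int       # log-binned message size
    intensity: float        # model intensity I = C / D
    local_batch_size: int

def log_bin(message_bytes):
    """Floor of log2(message_bytes); assumes base-2 bins and a positive integer size."""
    return message_bytes.bit_length() - 1

def make_state(message_bytes, c, d, local_batch_size):
    """Build the scalar part of the state from raw observations."""
    return CollectiveState(log_bin(message_bytes), c / d, local_batch_size)

def reward(collective_wall_clock_us):
    """Negative wall-clock time, so faster collectives earn higher reward."""
    return -collective_wall_clock_us

state = make_state(message_bytes=64 * 1024, c=3.0, d=2.0, local_batch_size=8)
print(state, reward(250.0))
```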

Direct mappings to DynamICCL design:

SyCCL finding | DynamICCL design implication
NCCL Ring assumes 7:1 NVLink:network ratio; H800 is 14:1; gap = 50% NVLink unused | Topology fingerprint must encode the actual per-cluster bandwidth ratio, not just a categorical fabric label. Add a continuous "NVLink:NIC ratio" feature.
511-hop Ring at 512 GPUs is latency-bound | Above ~256 ranks, the action-space prior must penalize Ring relative to Tree / 2D / hierarchical algorithms.
Sketch decomposition exploits symmetry | DynamICCL's policy can amortize learning across isomorphic clusters by using a per-fabric-class embedding rather than a single global policy.
Chunk-allocation w_d = u_d balances load across dimensions | chunkSize is the right tuner-level analog of SyCCL's t_i fractions; the RL agent should learn cluster-specific chunkSize priors that match capacity ratios.
Two-step (coarse-then-fine) synthesis is decisive at scale | RL exploration budget should be similarly hierarchical: cheap coarse exploration of (algorithm, protocol), then fine-grained search of (chunkSize, nChannels, numThreads).
ASTRA-sim used to score candidates by completion time | DynamICCL's reward -collective_wall_clock_us matches; busbw is for plotting only.
Up to 20% perf left on the table by TE-CCL's coarse tau | Demonstrates that chunk/epoch granularity matters; DynamICCL's chunkSize action axis has a real effect.
SyCCL pruning saves 20-97% synthesis time | Aggressive action-space pruning is justified: collapse {nChannels, numThreads} at small messages and reserve exploration budget for big-message regimes.
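The "50% NVLink unused" figure in the first row can be read through a simple proportional model: if a schedule sizes intra-server traffic for a 7:1 ratio but the fabric offers 14:1, NVLink is busy for 7/14 of the time. The sketch below computes the continuous ratio feature and that utilization. The proportional model and the function names are assumptions of this note, not the paper's analysis.

```python
def bandwidth_ratio(nvlink_bw, nic_bw):
    """Continuous NVLink:NIC feature; both arguments must use the same unit."""
    return nvlink_bw / nic_bw

def nvlink_utilization(assumed_ratio, actual_ratio):
    """Fraction of NVLink capacity used when the schedule assumes a lower ratio than the fabric has."""
    return min(1.0, assumed_ratio / actual_ratio)

# A schedule built for 7:1 running on a 14:1 fabric.
used = nvlink_utilization(assumed_ratio=7.0, actual_ratio=14.0)
print(f"NVLink used: {used:.0%}, unused: {1 - used:.0%}")
```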

Specific design priors for the RL agent:

  1. Topology fingerprint extension. Add a continuous feature for intra-server NVLink bandwidth and a continuous feature for per-NIC network bandwidth. Their ratio is the latent quantity SyCCL exploits. DynamICCL's policy can then differentiate H800 (14:1) from A100 (7:1) automatically.

  2. Algorithm prior at scale. Above ~256 ranks, bias action sampling away from Ring and toward Tree, CollNet, or NVLS. SyCCL's 2D schedule supersedes Ring at 512 GPUs — DynamICCL should inherit this prior.

  3. chunkSize as load-balancer. SyCCL's chunk-allocation linear program shows chunkSize is a real bandwidth-balancing knob. The RL agent should learn that on multi-rail / multi-island fabrics, the right chunkSize matches the dimension-wise capacity proportions.

  4. Reward design (validation). SyCCL evaluates candidates in ASTRA-sim using completion time, which is exactly what DynamICCL's reward -collective_wall_clock_us measures. This independently supports using completion time, rather than a bandwidth proxy, as the reward signal.

  5. Hierarchical exploration. Mirror SyCCL's two-step synthesis: first coarse exploration of (algorithm, protocol) at 1-2 candidate chunkSize values; then fine exploration over chunkSize / nChannels / numThreads. This bounds total RL training cost.

  6. Action-space pruning at small messages. SyCCL's stage cap cuts AllToAll synthesis time by 95-97% (Fig. 17b) with negligible quality loss. DynamICCL should similarly fix nChannels=1 and numThreads=128 at message sizes <=64 KiB, where exploration is wasteful.

  7. Research positioning. SyCCL operates at the schedule-synthesis layer (offline, static XML). DynamICCL operates at the runtime-tuner layer (online, per-collective parameter selection). Both consume the same NCCL stack; they compose. An integrated story: SyCCL synthesizes the best static schedule for a fabric; DynamICCL chooses among synthesized schedules and tunes parameters per collective at run time. SyCCL's open problem #4 ("multi-tenant / dynamic environments") is exactly DynamICCL's value proposition.

  8. Open-problem alignment. SyCCL Open Problem #1 (faster intra-group solvers) is a synthesis-side problem and does not concern DynamICCL. SyCCL Open Problem #2 (asymmetric collectives / MoE AllToAll(v)) concerns both: DynamICCL needs an action-space extension for asymmetric workloads. SyCCL Open Problem #3 (heterogeneous / irregular fabrics) maps cleanly onto DynamICCL's topology-conditioned policy heads. SyCCL Open Problem #4 (multi-tenant dynamic) is DynamICCL's central use case: the online-RL tuner observing recent-collective timing windows is the answer SyCCL gestures at but cannot implement statically.

  9. Cross-paper context (with 0031-0035 and 0030). Together with SCCL (search), TACCL (sketch-MILP), MSCCLang (DSL), GC3 (compiler), and TE-CCL (multi-commodity flow), SyCCL is the most production-ready synthesizer to date. The family-wide pattern: every paper pays a synthesis-time cost up front to amortize over many collective invocations. DynamICCL completes the picture by being the online layer that selects among these synthesized schedules and tunes the residual parameters, closing the loop the synthesis family deliberately leaves open.
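Priors 5 and 6 in the list above (hierarchical exploration and small-message pruning) can be illustrated with a deterministic two-phase search. This is a sketch only: exhaustive search stands in for an RL exploration schedule, the candidate value lists and default settings are hypothetical, and measure_us is a caller-supplied function returning wall-clock microseconds for a configuration.

```python
from itertools import product

ALGORITHMS = ["Ring", "Tree", "CollNet", "NVLS"]
PROTOCOLS = ["LL", "LL128", "Simple"]
CHUNK_SIZES = [65536, 262144, 1048576]  # hypothetical candidates, in bytes
N_CHANNELS = [1, 2, 4, 8]               # hypothetical candidates
N_THREADS = [128, 256, 512]             # hypothetical candidates
SMALL_MESSAGE_BYTES = 64 * 1024

def coarse_then_fine(measure_us, message_bytes):
    """Return (best_config, number_of_evaluations) from a two-phase search."""
    # Phase 1 (coarse): every (algorithm, protocol) pair at fixed default knobs.
    coarse = [dict(algorithm=a, protocol=p, chunk_size=CHUNK_SIZES[0],
                   n_channels=1, n_threads=128)
              for a, p in product(ALGORITHMS, PROTOCOLS)]
    best = min(coarse, key=measure_us)

    # Small messages: pin nChannels and numThreads instead of exploring them.
    small = message_bytes <= SMALL_MESSAGE_BYTES
    channels = [1] if small else N_CHANNELS
    threads = [128] if small else N_THREADS

    # Phase 2 (fine): tune the remaining knobs for the winning pair only.
    fine = [dict(best, chunk_size=c, n_channels=ch, n_threads=t)
            for c, ch, t in product(CHUNK_SIZES, channels, threads)]
    return min(fine, key=measure_us), len(coarse) + len(fine)
```

With these lists the two-phase search evaluates 48 configurations for a large message and 15 for a small one, against 432 for the full grid. The trade-off is that the coarse phase can discard a pair that would have won after fine tuning.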