SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling
Jiamin Cao, Shangfeng Shi, Jiaqi Gao, Weisen Liu, Yifan Yang, Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian, Ying Liu, Mingwei Xu, Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud + Tsinghua University | ACM SIGCOMM 2025, Coimbra, Portugal | DOI: 10.1145/3718958.3750499
Problem
Modern ML training clusters at production scale (Alibaba's H800 fabric: 64 servers, 512 GPUs, 8x 400 Gbps RDMA NICs per server, multi-rail) suffer two simultaneous schedule-quality problems that no existing CCL can resolve. Fixed-template libraries like NCCL/RCCL ship Ring and double-binary-tree algorithms whose flow patterns are tuned for old NVLink/network bandwidth ratios; on H800 with a 14:1 NVLink-to-network ratio (vs. NCCL's implicit 7:1 assumption), the Ring schedule under-uses NVLink and the 511-hop ring is latency-bound. Synthesis-based alternatives — SCCL (SAT), TACCL (sketch-MILP), and TE-CCL (epoch-MILP) — do produce topology-aware schedules, but their solver complexity explodes long before production scale. SCCL needs >24 hours for a 16-GPU AllGather. TACCL fails to synthesize a 128-GPU AllGather within 8 hours. TE-CCL accelerates the search but loses up to 20% performance because of coarse epoch-duration tau and struggles on multi-dimensional networks. The result: at 128-512 GPUs, on real production fabrics, practitioners had no synthesizer that was simultaneously fast enough to deploy, accurate enough to beat NCCL, and scalable across collectives (AllGather / AllToAll / ReduceScatter).
Core Insight
Optimal collective schedules on regular cluster fabrics are internally symmetric: identical sub-schedules repeat across isomorphic GPU groups, so a 512-GPU schedule can be decomposed into a small number of "Sketch" sub-demands inside small topology subsets, solved as separate small MILPs in parallel, and re-stitched. By exploiting collective symmetry (every rank plays the same role per collective semantics) and topology symmetry (NVLink islands and rails are interchangeable), SyCCL transforms one intractable monolithic synthesis problem into many tractable sub-problems — yielding 1554x to 17286x faster synthesis vs. TE-CCL while delivering up to 127% bus- bandwidth gains over NCCL at 512 GPUs.
Method
SyCCL is a three-stage offline synthesizer that consumes a topology graph and a target collective and emits MSCCL-executor XML schedules.
+---------------------------------------------------------------+
| INPUT |
| Topology graph (servers, NVLink islands, rails, bandwidth) |
| Collective type (AllGather / AllToAll / ReduceScatter) |
| alpha-beta link parameters |
+----------------------------+----------------------------------+
v
+---------------------------------------------------------------+
| STAGE 1: SKETCH EXPLORATION |
| - Enumerate dimensions D_k (NVLink, rail, inter-server) |
| - Enumerate isomorphic groups G_{d,k} |
| - Enumerate (V^s -> V^r) source/destination subsets |
| - Pruning: |
| [P1] Isomorphism dedup |
| [P2] Consistency: uniform dst:src ratios |
| [P3] Relay limits (Scatter) |
| - Combine sketches via mapping-based replication |
| to balance NVLink vs. rail workload |
| - Chunk allocation: solve linear t_i to satisfy |
| w_d = u_d (workload = capacity proportion per dim) |
+----------------------------+----------------------------------+
v
+---------------------------------------------------------------+
| STAGE 2: SCHEDULE SYNTHESIS (per Sketch) |
| - alpha-beta MILP: t = alpha + beta * s |
| bandwidth: link sends <= tau / beta per epoch |
| - Two-step solve: |
| (a) coarse tau -> filter candidates |
| (b) fine tau -> accurate selection |
| - Reuse solutions across isomorphic sub-demands |
| - Solve independent sub-demands in parallel |
| - epoch-duration heuristic g(r) = ceil(f(r)) - f(r) |
+----------------------------+----------------------------------+
v
+---------------------------------------------------------------+
| STAGE 3: GLOBAL SELECTION |
| - Merge sub-schedules into candidate global schedules |
| - Score via fine-grained ASTRA-sim-based simulator |
| - Pick winner; emit MSCCL-XML |
+----------------------------+----------------------------------+
v
+---------------------------------------------------------------+
| OUTPUT: MSCCL XML schedule, runs on NCCL/RCCL via |
| MSCCL-executor (with tunable nChannels etc.) |
+---------------------------------------------------------------+
A dedicated AllToAll extension caps the stage count at <=3 to bound combinatorial blowup, and a mapping-based replication mechanism balances workload across asymmetric NVLink/rail capacity ratios.
Experimental Setup
| Component | Value |
|---|---|
| Real testbed | 4-server A100 cluster, 8x A800/server (NVLink 180 GB/s), 4x 200 Gb/s RDMA NICs/server, 2-layer Clos |
| Simulated fabric | 64-server H800, 8x H800/server (NVLink), 8x 400 Gb/s RDMA NICs, multi-rail (HPN-style) |
| Synthesis host | 192-core Intel Xeon Platinum 8469C |
| Codebase | SyCCL ~7K lines of C++ |
| MILP solver | (Standard MILP backend; sub-MILPs solved in parallel) |
| Simulator | ASTRA-sim-based fine-grained network simulator |
| Executor | MSCCL-executor (NCCL-compatible XML) |
| Baselines | NCCL, TE-CCL, TACCL, SCCL |
| Workloads | AllGather, AllToAll, ReduceScatter |
| Data sizes | 1 KB to 4 GB |
| End-to-end models | GPT-3 6.7B (DP=16), GPT-22B, LLaMa-7B |
| Metric | Bus bandwidth (busbw); synthesis wall-clock; iteration time |
Headline Quantitative Results
Schedule quality (busbw) vs. baselines:
- Up to 127% over NCCL for 512-GPU H800 AllGather
- Up to 108% over NCCL on 32-A100 testbed
- Up to 91% over TE-CCL on 32-A100 testbed
- TE-CCL on 32 GPUs is worse than NCCL in some sizes; SyCCL is always >= NCCL (up to 1.04x at 32 GPUs)
End-to-end training:
- 6.3% iteration-time speedup on GPT-3 6.7B (DP16) over NCCL
Synthesis time (Table 5) — SyCCL vs. TE-CCL:
| Topology / Collective | SyCCL (mean) | TE-CCL | Speedup |
|---|---|---|---|
| 16 A100 AllGather | 0.8 s | 1193 s | 1554x |
| 16 A100 AllToAll | 3.6 s | 15759 s | 4321x |
| 64 H800 AllGather | 1.6 s | 28200 s | 17286x |
| 512 H800 AllGather | 37 min (85.5 s min, 14146 s max) | TIMED OUT | n/a (TE-CCL fails) |
SCCL / TACCL on the same scales: infeasible — SCCL > 24h on 16 GPU AllGather; TACCL fails on 128-GPU AllGather within 8h.
Pruning ablation:
- Redundancy + consistency pruning saves 20.8%-48.1% of synthesis time (Fig. 17a)
- Capping AllToAll stages at <=3 saves 95%-97% of synthesis time (Fig. 17b)
Topology-ratio insight (Sec. 7.2):
- NCCL's Ring assumes a 7:1 NVLink:network bandwidth ratio
- H800 ships 14:1 — NCCL's flows leave half of NVLink unused
- SyCCL's chunk-allocation matches the 14:1 ratio and reclaims the gap
Limitations
- Optimality is not guaranteed. Sketch decomposition + per-sketch MILP yields near-optimal but not provably global-optimal schedules.
- Asymmetric collectives unsupported. AllToAll(v) for MoE training, where per-rank data sizes differ, breaks the symmetry assumption.
- Heterogeneous / irregular fabrics break the abstraction. If rails are unbalanced or the cluster has bespoke wiring, isomorphic- group enumeration fails.
- Static schedules only. Pre-computed XML cannot adapt to multi-tenant interference, link drops, or workload shifts at run time.
- Compute cost ignored. Per-GPU reduction kernel cost, threadblock budget, and register pressure are outside the alpha-beta model.
- Coordination at small sizes. NCCL's hand-tuned coordination and reduction kernels can beat synthesized schedules at very small message sizes when the MSCCL-executor runtime is not similarly optimized — flagged in Sec. 7.2.
Open Problems Called Out
- Faster intra-group MILP solvers to shave the residual cost on extreme-scale topologies (the 14146 s worst-case at 512 GPUs).
- Heuristic extensions for asymmetric collectives (MoE AllToAll(v)) where today's symmetry approach is inapplicable.
- Heterogeneous / irregular cluster support — generalize beyond strictly isomorphic NVLink+rail fabrics.
- Multi-tenant / dynamic environments — schedules that adapt when topology or load shifts; pre-computed XML cannot handle this.
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects
per-collective algorithm (Ring/Tree/CollNet/NVLS),
protocol (LL / LL128 / Simple), nChannels,
numThreads, and chunkSize to minimize
collective wall-clock time on HPC GPU clusters. State features include
log-binned message size, model intensity I = C/D, local batch size,
topology fingerprint, and an LSTM-encoded recent- collective timing
window. Reward is -collective_wall_clock_us. SyCCL
contributes the following directly to DynamICCL's design:
SyCCL is a synthesizer; DynamICCL is a runtime selector. They compose, not compete. SyCCL produces MSCCL-XML schedules that plug into NCCL via the MSCCL-executor. DynamICCL's action space should treat SyCCL's output as one entry in the "MSCCL/CollNet schedule slot" — DynamICCL picks among NCCL-Ring, NCCL-Tree, SyCCL-XML, TE-CCL-XML, and TACCL-XML at run time, conditioned on message size and topology.
NVLink-to-network ratio is a first-class topology feature. SyCCL's headline win comes from observing that NCCL is hard-coded for a 7:1 ratio while H800 is 14:1. DynamICCL's topology fingerprint must encode the per-cluster NVLink:NIC capacity ratio (not just the categorical "NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet" label) so the policy can specialize per generation.
Symmetry as a state-feature shortcut. Modern HPC fabrics are highly symmetric within an island and asymmetric across islands. DynamICCL can carry a coarse "fabric-class" embedding (single NVLink island / multi-island / multi-rail) and let the policy learn that per-class action priors are appropriate — mirroring SyCCL's per-sub-demand decomposition.
Hop count matters at scale. SyCCL's 512-GPU win over NCCL is driven by replacing a 511-hop ring with a 2D schedule. DynamICCL's exploration prior on
algorithmshould bias away from Ring at ranks > ~256 unless message size is very large; at small sizes, Tree (logarithmic depth) should dominate. This refines the simple "msg-size -> protocol" prior we already had.chunkSize is a bandwidth-balancing knob. SyCCL's chunk- allocation linear program solves t_i fractions to match w_d = u_d (workload proportion = capacity proportion per dimension). DynamICCL's
chunkSizeaction plays the same role at the tuner- plugin layer — the agent should learn that on multi-rail / multi- island fabrics, chunk granularity directly controls cross-dimension load balance.Reward = completion time, not bandwidth. SyCCL evaluates candidates with an ASTRA-sim-based simulator and selects on simulated completion time, not algBw/busBw. DynamICCL's reward
-collective_wall_clock_usis the correct micro-metric; busbw reports are useful for plotting but not for policy training.Solver-vs-RL exploration budget analogy. SyCCL spends 37 minutes (worst case 4 hours) synthesizing one 512-GPU schedule. DynamICCL spends comparable wall-clock training a per-cluster policy. SyCCL's experience that pruning is decisive (20-97% savings) suggests DynamICCL should aggressively prune the action space at small messages (collapse {nChannels, numThreads} to a single canonical setting) and reserve exploration budget for the regimes where it matters.
Open-problem alignment with DynamICCL's value proposition. SyCCL's open problem #4 ("multi-tenant / dynamic environments") and open problem #2 ("asymmetric collectives") are exactly DynamICCL's domain. SyCCL's static XML cannot adapt to congestion, but DynamICCL's online tuner-plugin can. The cleanest research positioning: SyCCL synthesizes the best static schedule per fabric-class; DynamICCL chooses among schedules and adapts parameters per collective.