Architecture & Synthesizer-Design Analysis
SyCCL — Exploiting Symmetry for Efficient Collective Communication Scheduling
Source: Cao, J.; Shi, S.; Gao, J.; Liu, W.; Yang,
Y.; Xu, Y.; Zheng, Z.; Guan, Y.; Qian, K.; Liu, Y.; Xu, M.; Wang, T.;
Wang, N.; Dong, J.; Fu, B.; Cai, D.; Zhai, E. SyCCL: Exploiting
Symmetry for Efficient Collective Communication Scheduling. In
ACM SIGCOMM '25, September 8–11, 2025, Coimbra,
Portugal. 18 pages, 22 figures (incl. appendix), 6 tables.
DOI: 10.1145/3718925.3750499 Code: https://github.com/aliyun/symccl
Authors: Alibaba Cloud + Tsinghua University (J. Cao
and S. Shi contributed equally). Reader: Direct PDF
read (gemini-reader free-tier RESOURCE_EXHAUSTED on
gemini-2.0-flash; codex-reader rejected
gpt-5.1-codex-mini for ChatGPT free-tier accounts; pages
1-18 read directly via the Read tool with pages parameter,
including the full Appendix A-C). Analyst: Vishwakarma
Date: 2026-05-04
Table of Contents
- Lineage Note — Where SyCCL Sits in the Synthesis Family (SCCL → TACCL → MSCCLang/GC3 → TE-CCL → TECCL → SyCCL)
- System / Synthesizer Architecture (the "instrument" — two-phase symmetry decomposition)
- Target-Hardware / Topology Specimens (A100 Clos, H800 Multi-rail, generalized multi-rail)
- Design-Space Diagram (axes swept by the evaluation, axes held fixed)
- The SyCCL Sketch IR — Variables, Sub-demands, and Combinations
- Algorithm / Control Flow — Sketch Search, Sketch Combination, Schedule Synthesis
- The Two Symmetry Insights (topology symmetry, collective symmetry) and Three Pruning Rules
- Quantitative Results — Empirical Findings by Regime
- Configuration-Regime Trade-off Tables
- Bottlenecks & Insights Surfaced by the Measurements
- Limitations of the Methodology
- What to Borrow for DynamICCL — Composability vs Tension with Run-Time Knob Selection
- Analogy
1. Lineage Note — Where SyCCL Sits in the Synthesis Family
The five immediately preceding papers in this corpus form a clean genealogy of automatic collective synthesis. SyCCL is the sixth generation:
+------------------------------------------------------------------+
| SCCL (PPoPP 2021, paper 0031) |
| SMT solver synthesizes optimal schedules |
| Pareto-Synthesize sweeps (S, R, C); QF_LIA encoding |
| Single-node only, takes >24 h to synthesize 16-GPU AllGather |
| | |
| v |
| TACCL (NSDI 2023, paper 0032) |
| MILP guided by user-supplied "communication sketch" |
| Three-stage: routing -> heuristic ordering -> contiguity |
| Logical topology + switch-hyperedge + symmetry hints |
| Fails to create AllGather for 128-GPU topology in 8 hours |
| | |
| v |
| MSCCLang / GC3 (ASPLOS 2023 / arXiv 2022, papers 0033, 0034) |
| DSL programmability — user *writes* algorithm, compiler lowers|
| Three IRs (Chunk DAG -> Instruction DAG -> GC3-IR), runtime |
| interpreter; synthesis is OUT OF SCOPE |
| | |
| v |
| TE-CCL (SIGCOMM 2024, paper 0035) |
| Treat CCL as a multi-commodity-flow problem from network TE |
| General MILP + scalable LP form + A*-style temporal partition |
| No sketch, no DSL. Scales to 256-GPU AllToAll in 2.8 h. |
| | |
| v |
| TECCL (NSDI 2025, the SyCCL paper's main baseline) |
| Variant of TE-CCL with epoch-duration tuning + greedy |
| interval decomposition. Suffers ~20% performance loss in |
| exchange for scalability. SyCCL cites this as ref [28]. |
| | |
| v |
| SyCCL (SIGCOMM 2025, this paper, 0036) |
| "Forget holistic encoding. Decompose the collective by |
| symmetry, solve each smaller sub-problem, recombine." |
| Two-phase: Sketch Exploration -> Schedule Synthesis. |
| Sketch IR + isomorphism pruning + parallel MILP per sketch. |
| Up to 127% better performance than NCCL, 2-4 orders of |
| magnitude faster synthesis than TECCL. |
+------------------------------------------------------------------+
^ Fig 0: Lineage of automatic CCL synthesis. SyCCL inherits the
problem statement (topology + demand -> schedule) from SCCL/TACCL
and the MILP back-end from TECCL/TE-CCL, but rejects the holistic
encoding paradigm of the prior generation. Its central claim is
that *symmetry is the missing constraint* — present in real
topologies and collectives but invisible to the solver.
The intellectual move is to decompose first, solve later. Earlier synthesizers (SCCL, TACCL, TECCL) each took the entire collective and the entire topology and handed the resulting MILP/SMT to a solver, hoping it would discover the symmetric structure on its own. SyCCL's authors observe that solvers do not exploit symmetry well: they brute-force isomorphic regions of the search space and waste hours re-solving sub-problems whose answers are already determined by earlier ones. SyCCL extracts the symmetry up front, reduces the problem to a small set of canonical sub-demands, solves those in parallel, and copies the solutions across isomorphic groups.
The paper's own framing in Section 2.3 makes this explicit:
"We identify that the culprit of the scalability issue in previous solutions is encoding the entire collective and topology into a holistic formula. While in reality, symmetries commonly exist in the collective and topology... The challenge lies in effectively leveraging these symmetries. MILP solvers may not efficiently detect complex symmetric structures. Conversely, directly encoding them as MILP constraints may further complicate the problem and degrade solver performance."
That last sentence is the key: even if the user tells the solver the symmetries (as TACCL does with rotational-symmetry hints), the extra constraints make the formula harder. So SyCCL's sketch IR replaces the symmetry constraints with structural decomposition — the MILP never sees the symmetry, only the canonical fragments.
For DynamICCL the relevant lineage observation is that SyCCL still produces a static MSCCL XML schedule per (topology, collective, message-size) cell. The runtime has no adaptation: SyCCL targets the offline catalog. DynamICCL's RL knob selector is the layer above any such catalog, choosing among pre-synthesized schedules at call time, plus the in-NCCL knobs (algo, proto, nChannels, numThreads, chunkSize) that no synthesizer covers.
2. System / Synthesizer Architecture (the "instrument" — two-phase symmetry decomposition)
SyCCL is not a monolithic solver. It is a two-phase synthesizer: a Sketch Exploration phase (Section 4) that decomposes the global demand into canonical sub-demands using symmetry, and a Schedule Synthesis phase (Section 5) that solves each sub-demand with a small MILP and stitches results back together.
+--------------------------------------------------------------------+
| SyCCL Pipeline |
| |
| +--------------------+ +-----------------------------------+ |
| | Workload Spec | | Topology Spec | |
| | - Collective name | | - V (GPUs), V_0 (GPUs+NICs+sw) | |
| | {AG, AtoA, AR, | | - E directed edges | |
| | RS, Bcast, Sct, | | - alpha_e latency per edge (us) | |
| | Reduce, Gather} | | - 1/beta_e bandwidth per edge | |
| | - Demand matrix | | - D dimensions = {0,1,2,...} | |
| | F_s, F_d mappings| | - G_d groups in dimension d | |
| | - Chunk size s | | - V_{d,g} GPUs in group g, dim d | |
| +---------+----------+ +-----------------+-----------------+ |
| | | |
| +-----------------+-----------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Topology Profiler / Dimension Extractor (Sec 3.1, 6) | |
| | - Network profiler tests chunk sizes per link type | |
| | - Auto-extracts dimensions and groups from connectivity | |
| | and connection performance | |
| | - Produces V_0, E, alpha_e, beta_e, D, G_d, V_{d,g} | |
| +-------------------------+-------------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | DECOMPOSITION (Sec 4.3) | |
| | Choose collective family: | |
| | one-to-one -> direct sketch search (special case of A2A) | |
| | all-to-one -> tree sketch (Reduce/Gather are inverses) | |
| | one-to-all -> tree sketch (Broadcast/Scatter) | |
| | all-to-all -> N independent one-to-all decompositions | |
| | (AllGather, AlltoAll, ReduceScatter) | |
| | AllReduce -> ReduceScatter ; concat ; AllGather | |
| +-------------------------+-------------------------------------+ |
| | |
| v |
| +================== PHASE 1: SKETCH EXPLORATION ===============+ |
| | (Sec 4 — pure combinatorics, no MILP yet) | |
| | | |
| | +-----------------------------+ | |
| | | (4.1) Searching for Sketches| | |
| | | Enumerate D_k, G_{d,k}, | | |
| | | V_{k,d,g}^s, V_{k,d,g}^r | per stage k = 0..K-1 | |
| | | Produce tree-shaped sketch | | |
| | | for one-to-all chunk | | |
| | | Apply pruning #1, #2, #3 | | |
| | +-------------+---------------+ | |
| | | | |
| | v | |
| | +-----------------------------+ | |
| | | (4.2) Generating Sketch | | |
| | | Combinations | | |
| | | Step 1: replicate sketches | -- balance across GROUPS | |
| | | via mapping H:V->V | (mapping replication) | |
| | | and H_d:G_d -> G_d | | |
| | | Step 2: integrate combos | -- balance across DIMS | |
| | | chunk-allocation t | (chunk allocation t_i) | |
| | +-------------+---------------+ | |
| | | | |
| | v | |
| | +-----------------------------+ | |
| | | (4.3) AllToAll Extension | | |
| | | Replicate single one-to-all| | |
| | | sketch to N copies via | | |
| | | topology symmetry | | |
| | +-------------+---------------+ | |
| | | | |
| | v | |
| | Output: M = { <S_i, t_i> } sketch combinations | |
| +-----------------------------+--------------------------------+ |
| | |
| v |
| +================== PHASE 2: SCHEDULE SYNTHESIS ===============+ |
| | (Sec 5 — MILP + simulator, parallelized) | |
| | | |
| | +-----------------------------+ | |
| | | (5.1) Synthesizing | | |
| | | Sub-Schedules | | |
| | | For each sketch combination| | |
| | | For each merged sub-demand| | |
| | | Build MILP (alpha-beta | | |
| | | + epoch tau + integer F)| | |
| | | Solve via Gurobi | | |
| | | Use isomorphism: solve | | |
| | | 1 / class, copy others | | |
| | | Run instances in PARALLEL| | |
| | +-------------+---------------+ | |
| | | | |
| | v | |
| | +-----------------------------+ | |
| | | (5.2) Synthesizing the | | |
| | | Optimal Schedule | | |
| | | Concatenate sub-schedules | | |
| | | for each combination | | |
| | | Evaluate via fine-grained | | |
| | | simulator (ASTRA-sim base)| | |
| | | Select best across combos | | |
| | +-------------+---------------+ | |
| | | | |
| | v | |
| | +-----------------------------+ | |
| | | (5.3) Acceleration | | |
| | | Two-step solving: | | |
| | | coarse (large tau) then | | |
| | | fine (small tau, top-R2) | | |
| | | R1 = 20% (filter ratio) | | |
| | | R2 = 8 (top retained) | | |
| | | E = 3.0 / 0.5 (tau auto) | | |
| | +-----------------------------+ | |
| +------------------------------------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Schedule Executor | |
| | - Convert synthesized schedule to MSCCL XML | |
| | - Configure runtime: CCL transport, channel count | |
| | - Inject XML into MSCCL-executor [10] via lightweight parser| |
| | - MSCCL-executor runs XML directly (no CUDA kernel changes) | |
| +-----------------------------+---------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Hardware Path (deployment) | |
| | - 32-GPU A100 testbed (4 servers, 8x A100, 4x200Gbps RDMA) | |
| | - Production H800 simulation (64 servers, 8x H800, | |
| | 8x400Gbps RDMA, multi-rail) | |
| | - Synthesizer host: 192-core Intel Xeon Platinum 8469C, 1 TB |
| +---------------------------------------------------------------+ |
+--------------------------------------------------------------------+
^ Fig 1: SyCCL pipeline. The two phases are pure-combinatorial
(Sketch Exploration) and MILP+simulation (Schedule Synthesis). The
hardware path reuses MSCCL-executor — SyCCL contributes the
schedule, not the runtime. The 7K lines of C++ in Section 6 implement
the core SyCCL algorithm; everything else is reused.
The deliberate factoring is what differentiates SyCCL from prior CCL solvers. SCCL had one SMT formulation; TACCL had one MILP plus a heuristic ordering hack; TECCL had three formulation choices (full MILP, LP, A* multi-round); SyCCL is the first synthesizer where the solver never sees the full collective. The MILP is invoked hundreds or thousands of times — once per (sketch combination, sub-demand, isomorphism class) cell — and each invocation is small enough to solve in under a second.
For DynamICCL this is the most important architectural lesson: when the search space explodes, the right move is not always "make the solver smarter"; sometimes it is "tell the solver less, and hand it the canonical fragment instead." DynamICCL's analog is state abstraction: never present the agent with the raw 32-dim feature vector if a 4-dim canonical projection (msg_size_bin, I, M, topology fingerprint) captures the same equivalence class.
3. Target-Hardware / Topology Specimens
3.1 The A100 Clos testbed (32 GPUs, real hardware)
+--------------------- A100 Cluster: 4 servers x 8 A100 = 32 -------+
| |
| +-------- Two-tier Clos network (oversubscribed) ----------+ |
| | 8x Spine switches | |
| | ===2x200Gbps=== | |
| | 2x Leaf switches ====8x200Gbps==== per leaf | |
| | ===4x200Gbps=== | |
| | 4x Servers (each leaf serves 2 servers) | |
| +-------------------------------------------------------------+ |
| |
| Server 0..3 each: |
| 8x A100 GPUs |
| intra-server NVSwitch (NVLink fabric) |
| 4x 200 Gbps RDMA NICs (one NIC per 2 GPUs) |
| |
| Connectivity dimensions (Sec 3.1): |
| Dim 0: intra-server NVLink (NVSwitch) |
| Dim 1: leaf-switch (intra-pod) |
| Dim 2: spine-switch (cross-pod) |
+-------------------------------------------------------------------+
^ Fig 2a: A100 testbed — 32 GPUs across 4 servers, two-tier Clos
RDMA network, on-server NVSwitch. Used for real-hardware
evaluation in Sec 7.2 (Fig 14) and Appendix C (Fig 21,22).
3.2 The H800 Multi-rail production cluster (64 servers, simulated)
+--------------- H800 Cluster: 64 servers x 8 H800 = 512 -----------+
| |
| +-------- Multi-rail topology (rail-aligned) -------------+ |
| | 8x Leaf switches ====8x400Gbps==== per server | |
| | ===8x400Gbps=== | |
| | 64x Servers, "Same-rail" connection pattern | |
| | (GPUs with same intra-server index -> same leaf) | |
| +-----------------------------------------------------------+ |
| |
| Server 0..63 each: |
| 8x Nvidia H800 GPUs |
| intra-server NVSwitch (NVLink, 180 GB/s) |
| 8x 400 Gbps RDMA NICs (one NIC per GPU) |
| |
| Bandwidth ratio observation (Sec 2.1): |
| NVLink: 180 GBps |
| Inter-server: 8x400Gbps = 400 GB/s aggregate |
| Actual ratio 3.6:1 (NVLink:network), NOT NCCL's assumed 7:1 |
| -> 48.5% / (3.6+1) = 10.6% bandwidth wastage with fixed ring |
| |
| Connectivity dimensions: |
| Dim 0: intra-server NVSwitch |
| Dim 1: leaf-switch (rail-bound) |
+-------------------------------------------------------------------+
^ Fig 2b: H800 production cluster — 512 GPUs, multi-rail topology
where GPUs with the same intra-server ID share a leaf switch.
Used for simulation evaluation in Sec 7.2 (Fig 15). The 3.6:1
bandwidth ratio is the central motivating measurement.
3.3 Generalized multi-rail and Clos examples (Appendix B)
+------- Fig 19 (Appendix B): 7-server x 4-GPU multi-rail -----+
| Spine switch -- 4x core link |
| 4x Leaf switches each serving 2 servers |
| 7 servers x 4 GPUs = 28 GPUs |
| |
| Dimension extraction (auto): |
| Dim 3 = {0,1,...,27} (1 group, all GPUs) |
| Dim 2 = {0,1,2,3}, {4,...,27} (4 groups via leaf) |
| Dim 1 = {0,4,8,12}, {1,5,9,13},... (4 groups via rail) |
| Dim 0 = {0,1,2,3}, {4,5,6,7}, ... (7 groups intra-srv) |
+--------------------------------------------------------------+
+------- Fig 20 (Appendix B): Clos topology, 8 servers x 4 GPUs +
| Core switch -- 2 spine links |
| 2x Spine switches |
| 4x Leaf switches each serving 2 servers |
| |
| Dimension extraction (auto): |
| Dim 3 = all 32 GPUs |
| Dim 2 = 2 groups via spine |
| Dim 1 = 4 groups via leaf |
| Dim 0 = 8 groups intra-server |
+----------------------------------------------------------------+
^ Fig 2c: SyCCL's dimension extraction is automatic for any
symmetric topology. The set of dimensions D and groups G_d
are derived from connectivity and link bandwidth — NOT
user-supplied as in TACCL's "logical topology" sketch.
The key topology assumption SyCCL exploits is that real production GPU clusters are built layer-by-layer with structural repetition: every server has the same number of GPUs, every leaf serves the same number of servers, every rail follows the same pattern. This is the repetition that creates the isomorphic groups SyCCL slices on.
4. Design-Space Diagram (axes swept x axes held fixed)
DESIGN SPACE
+----------------------------------------------------------------+
| |
| Axis 1: COLLECTIVE TYPE (4 evaluated) |
| [AllGather] primary focus, real-hardware + simulation |
| [ReduceScatter] real-hardware A100 |
| [AlltoAll] real-hardware A100 + simulation H800 |
| [AllReduce] end-to-end (decomposed into RS+AG) |
| (Broadcast / Scatter / Reduce / Gather: tree case in §4.1) |
| |
| Axis 2: TOPOLOGY / SCALE (3 levels per topology) |
| A100 testbed: 16 GPUs / 32 GPUs |
| H800 simulated: 64 GPUs / 512 GPUs |
| |
| Axis 3: DATA SIZE (12 powers, 1 KB to 4 GB) |
| 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G |
| |
| Axis 4: SYNTHESIZER (3 systems compared) |
| [TECCL] prior SOTA MILP synthesizer (10 h timeout) |
| [NCCL] fixed schedule baseline (default ring/tree) |
| [SyCCL] this paper |
| |
| Axis 5 (ablation, Sec 7.4): SyCCL POLICY KNOBS |
| Pruning {#1: redundancy/consistency, #2: relay limit, #3: |
| AlltoAll stage cap} |
| AlltoAll max stages X = {3, 5, 10} |
| Epoch-balance E_2 = {0.1, 0.2, 1.0} |
| R_1 = 20%, R_2 = 8 (two-step synthesis filter ratios) |
| Parallelism = {1, 2, 4, 8, 16, 32, 64, 128, 192} threads |
| |
| Held FIXED: |
| - Workload model: GPT-6.7B (DP/TP) and Llama3-8B (DP/TP) |
| - Optimizer: Adam-class, distributed |
| - MILP solver: Gurobi (TECCL [28] back-end) |
| - Runtime: MSCCL-executor [10] (Microsoft fork of NCCL) |
| - Network profiler: chunk-size sweep per link type |
| - Synthesis host: 192-core Xeon 8469C, 1 TB RAM |
+----------------------------------------------------------------+
^ Fig 3: 5-axis design space — 4 collectives x 4 scales x 12
data sizes x 3 synthesizers, with 5 ablation knobs in Sec 7.4.
The "held fixed" line is critical: SyCCL is benchmarked on
CCL-only metrics (busbw and synthesis time), not on application-
layer end-to-end except in the brief Sec 7.5 / Table 6.
Three absences define the paper's measurement scope. First, no in-NCCL knob sweep — algo, proto, nChannels, numThreads, chunkSize are fixed at MSCCL-executor's runtime defaults; SyCCL only varies the schedule (the XML chunk-route plan). Second, no adaptive runtime — every reported number is for a single synthesized schedule executed verbatim. Third, no asymmetric or heterogeneous workloads — Section 8 explicitly notes that Mixture-of-Experts AlltoAll(v) is out of scope. These absences are what leaves room for DynamICCL to compose with SyCCL: the schedules SyCCL produces are inputs to DynamICCL's runtime selector.
5. The SyCCL Sketch IR — Variables, Sub-demands, and Combinations
5.1 Notation (Tables 1, 2, 3 in the paper)
| Variable | Description |
|---|---|
| V | Set of GPUs |
| V_0 | Set of GPUs, NICs, and switches (V ⊆ V_0) |
| C | Set of chunks |
| s | Chunk size (bytes) |
| F_s : C → V | Mapping each chunk c to the GPU it starts on |
| F_d : C → P(V) | Mapping each chunk c to the set of destination GPUs |
| r ∈ {0,1} | Whether chunks are reduced on destinations |
| E | Set of directed links E ⊆ {(v_1,v_2) : v_1, v_2 ∈ V_0} |
| α_e | Latency of link e (μs) |
| 1/β_e | Bandwidth of link e |
| D | Set of dimensions D = {0, 1, ...} |
| G_d | Set of groups in dimension d ∈ D |
| V_{d,g} | Set of GPUs in group g ∈ G_d in dimension d |
| K | Number of stages K ∈ ℤ⁺ |
| V^s_{k,d,g} | Set of GPUs in group g, dimension d acting as sources at stage k (V^s_{k,d,g} ⊆ V_{d,g}) |
| V^r_{k,d,g} | Set of GPUs in group g, dimension d acting as destinations at stage k (V^r_{k,d,g} ⊆ V_{d,g}) |
| R_{k,d,g} | Communication sub-demand: V^s_{k,d,g} → V^r_{k,d,g} |
A sketch is a tuple of such variables across all stages:
Sketch = ⟨ K, { V^s_{k,d,g}, V^r_{k,d,g} }_{k,d,g} ⟩
Constraint: ∪_{d,g} V^s_{0,d,g} ∪ ∪_{k,d,g} V^r_{k,d,g} = V
(every GPU is either a source at stage 0 or a
destination exactly once at some later stage)
5.2 Sketch Combination
A single sketch handles one indivisible chunk; a sketch combination M = {(S_1, t_1), (S_2, t_2), ...} prescribes that sketch S_i carries fraction t_i of the data, with Σ t_i = 1.
+--------- Example: 16-GPU Broadcast (Fig 5 in paper) ------------+
| |
| Sketch (1): |
| Stage 0: R_{0,1,0} = {0} -> {1,2,3} |
| R_{0,1,1} = {0} -> {4,8,12} |
| Stage 1: R_{1,0,1} = {4} -> {5,6,7} |
| R_{1,0,2} = {8} -> {9,10,11} |
| R_{1,0,3} = {12} -> {13,14,15} |
| |
| Tree shape: GPU 0 broadcasts in dim 1 to {4,8,12}, then |
| each forwards in dim 0 to its 3 server-mates. |
+------------------------------------------------------------------+
^ Fig 4: A single sketch is a tree where every GPU has at most one
predecessor (Broadcast/Scatter case). Pruning #3 caps the total
number of relay hops at |D| - 1 (≤ 3 for a 4-dimensional topology).
The sketch IR is the central abstraction. Compared to other synthesis IRs:
| IR | Granularity | Solver-visible? | User-supplied? |
|---|---|---|---|
| SCCL: SMT formula | per-link, per-time | Yes (entire) | No |
| TACCL: communication sketch | logical topology + path hints | Partial | Yes |
| MSCCLang: Chunk DAG → Inst DAG → GC3-IR | per-chunk → per-instruction | Compiled | User writes algo |
| TE-CCL / TECCL: F_{s,d,i,j,k,c} | per-link per-epoch per-chunk | Yes (entire) | No |
| SyCCL: Sketch + Combination | per-stage per-group sub-demand | Sub-demand only | No (auto-enumerated) |
SyCCL's IR is the only one that is automatic and decomposed. The solver sees one R_{k,d,g} at a time, never the full collective.
6. Algorithm / Control Flow
6.1 Overall control flow
START (Topology + Collective + Demand + Chunk size)
|
v
(1) Decompose collective
|
+-- one-to-one ----------------> goto (3) with single chunk
+-- one-to-all ----------------> tree case, goto (2)
+-- all-to-one ----------------> tree case (inverse), (2)
+-- all-to-all --> N x one-to-all decomposition --> (2)
+-- AllReduce --> RS ; AG (treat separately) ----> (2)
|
v
(2) PHASE 1: Sketch Exploration (Sec 4)
(4.1) Enumerate sketches per (D_k, G_{d,k})
Apply pruning #1: remove redundant isomorphic sketches
Apply pruning #2: enforce consistency across iso groups
Apply pruning #3: cap relay count at X = |D|-1
-> Set of unique sketches {S_1, ..., S_n}
(4.2) Generate sketch combinations
Step 1: Replicate sketches via H:V->V mapping
(balance workload across groups)
Step 2: Integrate combinations across dimensions
(chunk-allocation t_i; solve linear |D|-var
problem for bandwidth ratio match)
-> Set of valid combinations M = {⟨S_i, t_i⟩}
(4.3) AllToAll: replicate the one-to-all sketch N times
with topology symmetry; produce N-sketch combination
|
v
(3) PHASE 2: Schedule Synthesis (Sec 5)
(5.1) For each combination, for each merged sub-demand:
Build alpha-beta MILP (TECCL Appendix A.1 modeling)
Use isomorphism: solve 1 representative / class
Copy solution to other isomorphic sub-demands
Run independent sub-problems in PARALLEL (192 threads)
-> Sub-schedules
(5.2) Concatenate sub-schedules per combination
Run fine-grained simulator (ASTRA-sim base, alpha-beta)
Select schedule with minimum predicted completion time
(5.3) Acceleration:
Coarse pass: large tau, all combinations
Filter: keep top R2=8 within R1=20% of best
Fine pass: small tau, only retained candidates
Auto-select tau via E_1=3.0, E_2=0.5
|
v
(4) Materialize: emit MSCCL XML schedule
Feed to MSCCL-executor [10]
END
^ Fig 5: SyCCL control flow. The MILP solver is called inside
step (5.1) only — never on the full collective. The isomorphism
optimization (last bullet of 5.1) is what cuts solver invocations
by 2-4 orders of magnitude vs TECCL on the same problem.
6.2 Sketch search algorithm (Section 4.1)
The enumeration over sketches is governed by three loops at each stage k:
for each stage k = 0 .. K-1:
for each subset D_k ⊆ D of dimensions used at stage k:
for each dimension d ∈ D_k:
for each group g ∈ G_d:
enumerate V^s_{k,d,g} ⊆ V_{d,g}
enumerate V^r_{k,d,g} ⊆ V_{d,g}
constraint: V^s_{k,d,g} ∩ V^r_{k,d,g} = ∅
V^r_{k,d,g} not seen as r at earlier k' < k
For Broadcast: |V^s| = 1 (single source)
For Scatter: |V^s| ≤ 1 (with descendants
counted via f(v) for relay
chunk-count cap)
^ Fig 6: Sketch search inner loops. Without pruning, this is
super-exponential in |V|. The three pruning rules below cut it
to polynomial-in-|G_d| per stage.
6.3 The three pruning rules (Section 4.1)
Pruning #1 — Redundant isomorphic sketches. Define a sketch S_a isomorphic to S_b iff a one-to-one mapping H : V → V exists carrying all sources/destinations of S_a onto those of S_b. SyCCL keeps only one representative per isomorphism class.
"Sketches 2 and 3 in Figure 8 are isomorphic to each other, while sketch 1 is not."
Pruning #2 — Consistency across isomorphic groups. Within a single sketch, sub-demands assigned to isomorphic groups must have identical structure. Define per-group ratio ρ_{d,g,k} = |V^r_{d,k,g}| / |V^s_{d,k,g}|. A sketch is consistent if ρ is uniform across the isomorphic groups in any given dimension (or zero, meaning the group is excluded).
"Sketches 4 and 5 display consistent sub-demands across groups in both dimensions 0 and 1. In contrast, sketch 6 shows varying ratios in dimension 1 (i.e., GPUs 1, 6, and 16 correspond to 1, 3, and 2 destination GPUs, respectively)."
Pruning #3 — Relay-hop cap for Scatter. In a Scatter tree, each relay GPU forwards n+1 chunks (its own plus n descendants'), creating bandwidth overhead. SyCCL caps the maximum hop count from root to leaf at X = |D| − 1. In practice this means each dimension is traversed at most once.
6.4 Sketch combination algorithm (Section 4.2)
Step 1: replicate sketches to balance across GROUPS
For each base sketch S generated in (4.1):
Compute per-group workload w_{d,g}
Broadcast: w_{d,g} = Σ_k |V^r_{k,d,g}|
Scatter: w_{d,g} = Σ_k Σ_{v∈V^r} (f(v)+1)
where f(v) = number of descendants of v
If groups in some dim d are imbalanced:
Replicate S as S^0, S^1, ..., S^{|C|-1}, each transmitting
fraction t = 1/|C| of the chunk, with mapping H_d : G_d -> G_d
rotating which group each replica favors
=> Output: combination C with |C| equal-weighted sketches
Step 2: integrate combinations to balance across DIMENSIONS
For each |D|-tuple of combinations C_d (one per dim d):
For each sketch S_i in the merged combination:
Compute w_{i,d} (workload of S_i in dim d)
Demand for dim d: u_d ∝ topology bandwidth share
Solve: t_i values such that w_d / u_d = Σ_i t_i * w_{i,d} / u_d
constant across d
If no solution exists -> reject this combination
Else -> output ⟨S_i, t_i⟩ as a valid combination
6.5 AllToAll extension (Section 4.3)
For all-to-all collectives (AllGather, AlltoAll, ReduceScatter), an N-GPU AllToAll = N independent one-to-all collectives. SyCCL:
- Searches sketches for one decomposed one-to-all (Sec 4.1).
- Replicates the chosen sketch to the other N − 1 GPUs via topology symmetry, ensuring even per-dim workload distribution.
- Integrates the N-sketch combination as in Sec 4.2 Step 2.
For AllReduce, SyCCL splits it as ReduceScatter ; AllGather and synthesizes each separately — the same decomposition used by NCCL's ring-allreduce internally.
6.6 Schedule synthesis MILP (Appendix A.1)
The per-sub-demand MILP is the standard alpha-beta epoch model inherited from TECCL:
Time = K × τ (K epochs of duration τ)
Per-link transmission: t = α + β·s for one chunk of size s
Bandwidth constraint: total volume per τ ≤ τ/β
Latency constraint: ⌈t/τ⌉ epochs before next chunk on same link
Variables:
F_{s,d,i,j,k,c} ∈ {0,1} chunk c on link (i,j) at epoch k
B_{v,k,c} ∈ {0,1} GPU v holds chunk c at epoch k
Objective: minimize K subject to demand satisfaction,
flow conservation, and bandwidth/latency caps
Solved via Gurobi (TECCL [28] back-end)
The auxiliary variable E (Appendix A.3) bridges bandwidth and latency constraints automatically: τ must be a multiple of both α + β·s (latency) and β·s (bandwidth). Setting ⌈f(r)⌉ = E lets SyCCL pick τ without manual tuning per link type.
7. The Two Symmetry Insights and Three Pruning Rules
7.1 Observation 1 — Topology symmetry (Section 3.1)
Modern GPU clusters are built with multi-dimensional symmetric networks. Every server has the same NVSwitch, every leaf has the same fan-out, every spine connects the same number of pods. SyCCL formalizes this by extracting dimensions D and groups G_d from the topology graph automatically: GPUs reachable through "the same type of connection" form a group, and groups in the same dimension are isomorphic.
Topology dimension extraction (auto, Sec 3.1):
Dim 0 = intra-server connections (NVSwitch)
Dim 1 = leaf-switch connections
Dim 2 = spine-switch connections
Dim 3 = core-switch connections
Each dim d defines |G_d| groups; groups in the same dim are
isomorphic by construction.
7.2 Observation 2 — Collective symmetry (Section 3.1)
one-to-all collectives (Broadcast, Scatter):
-> all destinations have identical demand structure
all-to-all collectives (AllGather, AlltoAll):
-> N independent one-to-all collectives
-> each chunk size identical
-> rank-0 schedule = rank-i schedule with relabeling
7.3 The combined insight (Section 3.1)
"Optimal schedules consist of consistent sub-schedules across isomorphic GPU groups."
This is the load-balancing argument: if isomorphic groups have identical connectivity but the schedule asymmetrically loads them, the heavier group becomes the bottleneck while the lighter group sits idle — strictly worse than a balanced schedule.
+--- Fig 4 in the paper: 3 schedules for 16-GPU Broadcast ------+
| Schedule 1: ring across all 4 groups |
| Schedule 2: tree across all 4 groups |
| Schedule 3: ring for groups 1,3 + tree for groups 0,2 |
| |
| Schedules 1 and 2: consistent across isomorphic groups -- OK |
| Schedule 3: inconsistent -- one group bottlenecks |
| |
| -> SyCCL keeps {1, 2}; filters {3} |
+----------------------------------------------------------------+
^ Fig 7: Insight visualization. The pruning is NOT removing
schedules with worse predicted bandwidth — it is removing
schedules that violate the consistency invariant. They cannot
be optimal under symmetric topology + symmetric demand.
7.4 The three pruning rules in operation (Fig 17 ablation)
| Pruning rule | Section | Synthesis-time saving | Performance impact |
|---|---|---|---|
| #1: Redundancy + consistency | §4.1, Fig 17(a) | 20.8% – 48.1% reduction | Minimal |
| #2: Consistency across iso-groups | §4.1, Fig 17(a) | (combined with #1 above) | Minimal |
| #3: Relay cap X = | D | − 1 (AlltoAll ≤ 3 stages) | §4.1, Fig 17(b) |
For DynamICCL the lesson is that symmetry-respecting policies should beat symmetry-blind ones, and the search-space reduction is free if the invariant is enforced upfront. DynamICCL's analog: tie weights of the policy network across symmetric topology positions (e.g., share parameters across NVLink-equivalent ranks).
8. Quantitative Results — Empirical Findings by Regime
8.1 Headline performance vs NCCL and TECCL
"SyCCL achieves up to 91% improvement in schedule performance compared to TECCL and 108% improvement compared to NCCL." "Our production-scale H800 simulation experiments further show that SyCCL significantly improves schedule performance by up to 127% while decreasing synthesis time by 2 to 4 orders of magnitudes compared to TECCL." "SyCCL improves end-to-end model training performance by up to 6.3% on the 32-A100 testbed compared to NCCL."
8.2 AllGather + ReduceScatter on A100 testbed (Fig 14 a/b/c)
| Scale | Data range | Best vs NCCL (busbw) | Best vs TECCL (busbw) | Notes |
|---|---|---|---|---|
| 16 GPUs, AG, ≤ 1 MB | 1K-1M | 0.43x to 0.81x | 0.37x to 0.86x | Worse than NCCL (latency dominated) |
| 16 GPUs, AG, > 1 MB | 4M-4G | 0.17x to 1.08x improvement over NCCL; up to 0.36x over TECCL | -- | Bandwidth ratio 14:1 vs NCCL's 7:1 |
| 32 GPUs, AG | full range | up to 1.04x over NCCL | 0.20x to 0.88x over TECCL | TECCL sacrifices accuracy at scale |
| 16 GPUs, RS | full range | smaller gain than AG | -- | NCCL has tailored static schedules for RS+reduction |
8.3 AlltoAll on A100 testbed (Fig 14d)
"Compared to NCCL, SyCCL increases busbw by up to 0.71x."
The smaller margin reflects that AlltoAll on a small, non-oversubscribed topology already has limited room — NCCL's PXN AlltoAll is near-optimal. TECCL's LP formulation is also near-optimal here.
8.4 AllGather on H800 simulation (Fig 15a/b)
"TECCL timed out for 512 GPUs and produced no valid output. As the number of GPUs increases, the data in AllGather becomes more fragmented and the link latency dominates. For example, with a total size of 1GB distributed across 512 GPUs, each GPU only handles 2MB. Thus the 511 hops in NCCL's Ring severely limit the performance. Instead, SyCCL synthesized a two-dimension schedule, by allowing each source GPU to first broadcast along one dimension (e.g., NVLink) then letting each GPU broadcast all their data received from the first dimension to other dimensions (e.g., rail between servers)."
| Scale | Data range | Best vs NCCL | Best vs TECCL |
|---|---|---|---|
| 64 H800 GPUs, AG | full range | up to 1.27x | up to 1.27x |
| 512 H800 GPUs, AG | full range | significant | TECCL TIMED OUT |
8.5 AlltoAll on H800 simulation (Fig 15c)
"SyCCL improves performance by up to 0.69x compared to TECCL."
NCCL's PXN AlltoAll is near-optimal in this case. SyCCL beats TECCL because TECCL's LP precision issues at 64 GPUs cost it a few percent.
8.6 Synthesis-time table (Table 5, the headline acceleration result)
| Scenario | TECCL Min/Max/Mean (s) | SyCCL Min/Max/Mean (s) | Speedup |
|---|---|---|---|
| 16 A100, AllGather | 266 / 2972 / 1193 | 0.3 / 4.3 / 0.8 | 1554x |
| 16 A100, AlltoAll | 1042 / 26206 / 15759 | 1.4 / 8.3 / 3.6 | 4321x |
| 32 A100, AllGather | 226 / 31293 / 8200.2 | 6.2 / 19.1 / 9.0 | 915x |
| 64 H800, AllGather | 1225 / 67615 / 28200 | 0.4 / 10.8 / 1.6 | 17286x |
| 64 H800, AlltoAll | 3698 / 59044 / 29371 | 1.3 / 29.8 / 5.7 | 5135x |
| 512 H800, AllGather | TIMED OUT (10 h cap) | 85.5 / 14146 / 2246 | N/A (TECCL fails) |
The 17286x speedup on 64 H800 AllGather is the paper's most striking number. It comes from three multipliers stacked: (a) the search-space reduction from sketch IR + pruning, (b) the isomorphism amortization that solves one MILP per equivalence class instead of all instances, and (c) the parallelism (192 threads) that runs independent sub-problems concurrently.
8.7 Synthesis-time breakdown (Fig 16b, 32 GPU)
Stage Time share at 32 GPU
+---------------+ +-----------------------------+
| search | | small (steady; data-size |
| | | independent) |
+---------------+ +-----------------------------+
| combine | | small (steady) |
+---------------+ +-----------------------------+
| solve1 (AG) | | grows w/ data size |
+---------------+ +-----------------------------+
| solve2 (A2A) | | grows w/ data size |
+---------------+ +-----------------------------+
^ Fig 8: Time breakdown. Sketch-exploration phase is sub-second
and constant in data size; MILP solving (per-sub-demand)
dominates and grows with data size as more chunks need finer
scheduling. The sub-second sketch phase confirms that the
symmetry-driven pre-processing is essentially free.
8.8 Effect of parallelism (Fig 16c)
Threads 1 2 4 8 16 32 64 128 192
AG (s) ~10^4 - - - - - - - ~10^1
AtoA (s) ~10^4 - - - - - - - ~10^2
1M (s) ~10^3 - - - - - - - ~10^1
^ Fig 9: Synthesis time vs number of parallel MILP threads.
Strong scaling out to 192 threads. TECCL is monolithic and
cannot scale across threads in the same way -- reflected as
a flat horizontal "TECCL" line in Fig 16c.
8.9 Pruning ablation (Fig 17)
| Configuration | Synthesis-time savings | Schedule perf |
|---|---|---|
| Pruning #1 + #2 disabled | baseline (1x) | baseline |
| Pruning #1 + #2 enabled | 20.8% to 48.1% reduction | minimal degradation |
| AlltoAll stage cap = 10 | baseline | similar |
| AlltoAll stage cap = 5 | small reduction | similar |
| AlltoAll stage cap = 3 | 95%-97% reduction | similar |
8.10 Effect of E_2 (epoch-balance auto-knob, Fig 17c)
| E_2 | Synthesis time | Schedule perf |
|---|---|---|
| 0.1 | longest (high-precision tau) | no improvement |
| 0.2 | balanced (chosen default) | strong |
| 1.0 | fastest | degrades |
E_2 = 0.2 is SyCCL's default; E_1 = 3.0 is the coarse-pass default. For 512 H800 specifically, E_2 = 1.0 was needed to fit within 4 hours.
8.11 End-to-end training (Table 6, the application-layer summary)
| Model | NCCL | TECCL | SyCCL | vs NCCL | vs TECCL |
|---|---|---|---|---|---|
| GPT3-6.7B, DP16 | 672.4 ms | 653.0 | 630.0 | 6.3% | 3.5% |
| GPT3-6.7B, TP16 | 200.0 ms | 197.7 | 192.5 | 3.8% | 2.6% |
| GPT3-6.7B, TP32 | 219.4 ms | 216.5 | 209.7 | 4.4% | 3.1% |
| Llama3-8B, DP16 | 1195.4 ms | 1153.8 | 1135.4 | 5.0% | 1.6% |
| Llama3-8B, TP16 | 433.9 ms | 422.2 | 412.6 | 4.9% | 2.3% |
| Llama3-8B, TP32 | 854.9 ms | 887.4 | 851.5 | 0.4% | 4.0% |
Note the gap between schedule-level (up to 127%) and end-to-end (up to 6.3%). The bulk of training time is GPU compute; communication optimizations move the needle on the ~10-30% slice that is comm-bound.
8.12 Comparison with hand-crafted schedules (Appendix C, Fig 21-22)
"Figure 21(a) shows that SyCCL's performance is similar to that of hand-crafted schedules on the 16-A100 testbed. ... SyCCL uses the same hierarchical schedule as the hand-crafted schedules for large data sizes, and the same direct schedule for smaller data sizes."
"Figure 21(b) shows that SyCCL outperforms hand-crafted schedules on the 64-H800 testbed for larger data sizes. This improvement is attribute to the bandwidth ratio between inter-machine and inter-machine links in the 64-H800 testbed not aligning well with the hand-crafted hierarchical schedule. In contrast, SyCCL utilizes sketch combination and alternative sketches to almost perfectly match the bandwidth ratio."
"After incorporating this hand-crafted schedule, we achieve comparable performance to SyCCL for large sizes. ... We expect such workflow of optimizing the best sketch combinations being the best practice of using SyCCL."
This is the most honest reported finding: SyCCL matches expert hand-crafting and sometimes wins through better bandwidth-ratio matching, but a knowledgeable user can hand-craft a sketch that catches up. The synthesizer's value is correctness automation, not absolute optimality.
9. Configuration-Regime Trade-off Tables
9.1 Synthesizer choice (SCCL vs TACCL vs TECCL vs SyCCL)
| Dimension | SCCL | TACCL | TECCL | SyCCL | Winner (DynamICCL upstream) |
|---|---|---|---|---|---|
| Encoding | SMT | MILP + sketch | MILP (epoch+greedy) | Sketch IR + per-sub-demand MILP | SyCCL |
| User-supplied sketch needed | No | Yes | No | No | tie (SCCL/TECCL/SyCCL) |
| Scale ceiling (AG) | ~30 GPUs | ~64 GPUs (128 fails 8h) | ~64-256 (slow) | 512 GPUs in 4 h | SyCCL |
| Synthesis time @ 16-GPU AG | >24 h | minutes-hours | 20-50 min | 0.3-4.3 s | SyCCL |
| Symmetry exploitation | None (relies on solver) | Manual hint | None | Automatic decomposition | SyCCL |
| Optimality guarantee | Optimal (single-node) | Heuristic ordering | Near-optimal w/ time limit | Optimal per sub-demand | tie SCCL/SyCCL |
| Production-cluster fit | Single chassis only | Multi-node | Multi-node | Multi-rail Clos | SyCCL |
| Output format | XML/MSCCL | XML/MSCCL | MSCCL XML | MSCCL XML | -- |
| Run-time adaptation | None | None | None | None | -- (all need DynamICCL above) |
For DynamICCL, prefer SyCCL as the upstream catalog generator because it is the only synthesizer that produces optimal schedules at production cluster scale (512 GPUs) within minutes — a regime that no prior synthesizer reaches. DynamICCL operates on the catalog SyCCL emits.
9.2 Pruning aggressiveness vs schedule quality (Sec 7.4)
| Pruning level | Synth-time reduction | Quality impact | When to use |
|---|---|---|---|
| All disabled (exhaustive) | 1x baseline | Provably optimal sketch | Research / small N |
| Pruning #1 + #2 only | 20.8%-48.1% | Minimal | Default mid-scale |
| Pruning #1+#2+#3 (X≤3) | 95%-97% | Equivalent in practice | Production / large N |
For DynamICCL, prefer aggressive pruning during catalog generation and conservative pruning during periodic re-synthesis triggered by the agent's high-regret signal. The two-tier approach matches the two-tier exploration that RL agents use in practice.
9.3 Epoch-balance E_2 vs synthesis time (Sec 7.4, Fig 17c)
| E_2 | Synthesis time | Schedule perf | Best for |
|---|---|---|---|
| 0.1 | longest | no improvement | -- |
| 0.2 (default) | balanced | strong | most cells |
| 1.0 | fastest | degrades | very large scale (512 GPUs) |
9.4 Two-step synthesis filter ratios (Sec 5.3)
| Knob | Default | Effect |
|---|---|---|
| R_1 | 20% | Filter window: candidates within 20% of best-coarse-pass advance |
| R_2 | 8 | Top-8 candidates retained for fine pass |
| E_1 | 3.0 | Coarse-pass tau auto-selection target |
| E_2 | 0.5 (general) / 0.2 (production) / 1.0 (512-GPU) | Fine-pass tau target |
For DynamICCL, the analog of R_1/R_2 is the explore/exploit budget in beam-search-style action selection. Keep top-k actions and explore around them; the "k" is tunable per regime, exactly as R_2 is.
9.5 Workload-driven scaling regime
| Regime | Topology | Scale | Best speedup over NCCL | Mechanism |
|---|---|---|---|---|
| Small msg / small N | A100 16-GPU | ≤ 1 MB | NCCL wins (0.43x-0.81x) | NCCL ring is latency-optimized |
| Medium msg / small N | A100 16-GPU | 4M-4G | up to 1.08x (bandwidth) | SyCCL: 14:1 BW ratio match vs NCCL's 7:1 |
| Any size / mid N | A100 32-GPU | full | up to 1.04x | TECCL collapses; SyCCL scales |
| Any size / large N | H800 64-GPU | full | up to 1.27x | Two-dim broadcast (NVLink → rail) |
| Large msg / huge N | H800 512-GPU | full | huge (TECCL TIMED OUT) | NCCL Ring's 511 hops dominate latency |
For DynamICCL, prefer SyCCL-synthesized schedules in {medium-msg ∪ large-N ∪ multi-rail} and prefer NCCL defaults in small-msg / small-N regimes. The agent's state vector should encode both axes to make this regime-conditional choice.
10. Bottlenecks & Insights Surfaced by the Measurements
10.1 The NCCL bandwidth-ratio mismatch (Section 2.1, the motivating bug)
NCCL fixed ring schedule for AllGather:
Intra-server : Inter-server = 7 : 1
(assumed by NCCL, hard-coded)
Production H800 cluster actual:
NVLink BW = 180 GB/s
Inter-server NIC = 8 x 400 Gbps = 400 GB/s aggregate
Actual ratio = 3.6 : 1
Bandwidth wastage:
(7 - 3.6) / (3.6 + 1) = 48.5% / 4.6 = 10.6%
Latency wastage at small sizes:
NCCL Ring requires |V|-1 hops
For 512 GPUs: 511 hops, each ~1us => 511us alone in latency
-> 4x worse than synthesized at small sizes
This is the headline empirical finding: NCCL's hard-coded ring ratio is wrong on every modern cluster, and SyCCL's bandwidth-ratio-aware sketch combination recovers the 10.6%.
10.2 The 511-hop Ring latency wall (Sec 7.2 H800 results)
For 512-GPU AllGather of 1 GB, each GPU only handles 2 MB. NCCL's Ring traversal of 511 hops at ~1 us per hop is 0.5 ms of pure latency overhead — comparable to the actual transfer time. SyCCL's two-dimensional broadcast avoids this by parallelizing across NVLink and rail dimensions: depth becomes O(|D|) ≈ 2 hops instead of 511.
10.3 The MILP scaling wall and the TECCL timeout (Sec 7.3, 7.5)
"TECCL timed out for 512 GPUs and produced no valid output."
TECCL's monolithic MILP at 512 GPUs is computationally hopeless: variables grow as O(|V|² × K × |C|), and at K = 50 epochs and |C| = many chunks, the variable count blows past 10^9. SyCCL's per-sub-demand factoring keeps each MILP under 10^4-10^5 variables.
10.4 The MILP-precision wall (Sec 7.2 AlltoAll H800)
"Because of uncertainties of the solver and precision issues in the LP model, SyCCL perform slightly worse than NCCL [for AlltoAll on 64 H800]."
Floating-point precision in Gurobi's LP relaxation introduces small suboptimalities on the unique-source-destination AlltoAll demand that SyCCL cannot recover. NCCL's hand-tuned PXN routine is exactly optimal in this case.
10.5 The runtime-bottleneck residue (Sec 7.2 ReduceScatter)
"MSCCL's runtime, designed to support flexible schedules, does not incorporate such tailored optimizations for a given schedule. Therefore, although SyCCL's schedules benefit from higher bandwidth utilization and lower latency, these advantages can be diminished by a sub-optimal runtime implementation, especially for smaller sizes."
SyCCL's schedule may be bandwidth-optimal but the MSCCL-executor is generic. For ReduceScatter the gain is smaller than for AllGather because NCCL's hand-coded reduction kernel is faster than MSCCL's generic interpreter.
10.6 The intra-group residual MILP cost (Sec 7.3 H800 512-GPU)
"While SyCCL significantly reduces schedule generation time by leveraging symmetry, extremely large-scale (e.g., 512 H800 GPUs) can still incur notable synthesis time, i.e., 37min. Profilings show that this residual cost stems almost entirely from invoking TECCL to solve schedules within each symmetric GPU group; the sketch-exploration phase itself scales sub-second."
The residual is inside the per-sub-demand MILP, not in SyCCL's sketch logic. Replacing TECCL with a faster intra-group solver would improve SyCCL further.
11. Limitations of the Methodology
| Limitation | Implication for DynamICCL |
|---|---|
| Symmetry-based decomposition fails on irregular networks | DynamICCL must fall back to a direct policy in heterogeneous regimes; SyCCL is unusable there |
| Asymmetric collectives (AlltoAll(v), AllGather(v)) out of scope | RL policy must handle MoE workloads natively (no synthesizer fallback) |
| Heterogeneous clusters (A100+H100 mixed) only partially supported | Need richer topology fingerprint feature in state vector |
| Dynamic networks (link failures, congestion) require regeneration | DynamICCL's run-time selector becomes essential |
| MILP precision limits on AlltoAll | Synthesizer can be sub-NCCL on regime; DynamICCL must detect and fall back |
| 10-hour solver timeout for TECCL baseline | Comparison vs TECCL has timeout artifacts at 512 GPUs |
| Sketch enumeration may miss unseen optimal sketches | "Pruning #1 may miss schedules that marginally outperform" (Sec 8) |
| MSCCL-executor runtime is itself sub-optimal for ReduceScatter | Schedule-level gain doesn't always translate to busbw |
| End-to-end gain capped at 6.3% (Table 6) | Real production benefit is small slice of total time |
| Two production topologies (A100 Clos, H800 multi-rail) only | Limited generalization evidence; no DGX-1, DGX-2, NDv2, AMD |
| Adam optimizer only; no MoE | Specialized comm patterns of GShard/Expert-Parallel untested |
| No streaming overlap with computation considered | DynamICCL must add overlap-aware reward shaping |
| ASTRA-sim simulator for H800 results, not real hardware | Some gain may be sim-artifact |
| Schedules are static once emitted | Cannot adapt to message size sweep within a job |
| Authors' own Section 8: "introduction of sketch significantly enhances the speed and accuracy of SyCCL compared to previous efforts, SyCCL does not ensure that the synthesized schedules are optimal" | Optimality guarantee is per-sub-demand, not global |
The most consequential limitation for DynamICCL is the static-schedule output. SyCCL emits one MSCCL XML per (topology, collective, size) cell. A real workload with mixed message sizes from the same job must either use one schedule for all (sub-optimal) or switch schedules per-call. The latter is exactly DynamICCL's job.
12. What to Borrow for DynamICCL — Composability vs Tension
SyCCL and DynamICCL are complementary, not competitive. SyCCL generates a static catalog of high-quality schedules per (topology, collective, message-size) cell. DynamICCL is the runtime layer that picks among them — and below the schedule selection, picks the in-NCCL knobs (algo, proto, nChannels, numThreads, chunkSize) that no synthesizer addresses. There are two composition modes and three tensions worth thinking through.
12.1 Composition Mode A — SyCCL Catalog as DynamICCL's Action Set
+-------------------- Composition Mode A ------------------------+
| |
| Offline (once per cluster): |
| SyCCL --> catalog of schedules |
| XML_1 = AllGather, A100-32, 1MB |
| XML_2 = AllGather, A100-32, 64MB |
| XML_3 = AlltoAll, H800-64, 4MB |
| ... (one per (collective, scale, msg-bin)) |
| |
| Online (per call): |
| DynamICCL -> read state vector s_t |
| -> select XML schedule from catalog AND |
| (algo*, proto*, nChannels*, numThreads*) tuple |
| -> dispatch to MSCCL-executor |
| -> observe wall-clock latency, update Q-fn |
+----------------------------------------------------------------+
^ Fig 10: Mode A — SyCCL is the catalog generator, DynamICCL is
the catalog selector. Action space includes one categorical
"which XML" plus the in-NCCL knobs.
This is the cleanest composition. DynamICCL never re-synthesizes; it only picks among pre-computed schedules. The catalog is regenerated periodically (e.g., monthly, or on topology change) using SyCCL.
12.2 Composition Mode B — DynamICCL Triggers SyCCL Re-synthesis
+-------------------- Composition Mode B ------------------------+
| |
| DynamICCL maintains regret estimate per (topology, coll, msg) |
| When regret exceeds threshold: |
| Trigger SyCCL re-synthesis |
| Emit new XML for the regretful cell |
| Add to catalog |
| Steady-state: catalog grows as workload distribution shifts |
+----------------------------------------------------------------+
^ Fig 11: Mode B — RL agent decides when to expand the catalog.
More flexible but harder to control: SyCCL takes minutes to
hours per cell, so the trigger must amortize that cost.
Mode B gives DynamICCL agency over the catalog itself. It is effectively the same pattern as ALaSCA's online expansion of NCCL's auto-tuner cache, but driven by symmetry-aware synthesis instead of profiling.
12.3 State-vector features SyCCL validates as predictive
Add to DynamICCL Agent state vector s_t:
+-----------------------------------------------------------+
| topology_dim_count : int (|D| from SyCCL extractor)|
| topology_group_sizes : tuple (|G_d| per dimension) |
| intra_inter_bw_ratio : float (e.g., 3.6, 7.0, 14.0) |
| msg_size_bin : enum (12 powers, 1K..4G) |
| collective_class : enum ({1to1, 1toA, AtoA}) |
| expected_chunk_per_gpu: float (= total / |V|) |
| expected_relay_hops : int (NCCL Ring: |V|-1) |
| schedule_id : int (which catalog entry) |
| schedule_predicted_time: float (from SyCCL simulator) |
| isomorphism_class_id : int (if multiple iso schedules)|
+-----------------------------------------------------------+
^ Fig 12: Borrowed state features. The intra/inter bandwidth ratio
is the single most predictive scalar - it is exactly what NCCL
hard-codes (incorrectly) and what SyCCL solves correctly.
The intra_inter_bw_ratio feature is the most actionable add. NCCL hard-codes 7:1; the actual ratio on H800 is 3.6:1, on A100 Clos may be 14:1. DynamICCL must observe this from telemetry and let the policy condition on it.
12.4 Action priors SyCCL provides
The two crossover boundaries from Section 8:
Boundary 1 — Small msg / small N: NCCL > SyCCL. SyCCL's busbw is 0.43x-0.81x of NCCL for 16-GPU AG ≤ 1 MB. DynamICCL prior: in this regime, prefer NCCL ring with LL/LL128 protocol. The synthesizer offers no value; the policy should not explore there.
Boundary 2 — Large msg / large N + multi-rail: SyCCL >> NCCL. SyCCL achieves 1.27x at 64 H800 AG and infinite-vs-timeout at 512. DynamICCL prior: in this regime, prefer SyCCL XML with Simple protocol and high nChannels.
Boundary 3 — AlltoAll on small N: NCCL ≈ SyCCL. The PXN routine is near-optimal. DynamICCL prior: small exploration budget; both options are within noise.
PRIOR: DynamICCL Agent should be CONSERVATIVE in:
+--------------------------------------------------------------+
| Regime Reason |
|--------------------------------------------------------------|
| 16-GPU AG / msg < 1 MB NCCL Ring + LL protocol|
| (small N, small msg) is latency-optimal |
| |
| 16-GPU AlltoAll / any size NCCL PXN is near- |
| optimal, MILP solver |
| precision can hurt |
| |
| Symmetric topology + large msg SyCCL's deterministic |
| already at SyCCL's predicted opt catalog wins; no need |
| to explore around it |
+--------------------------------------------------------------+
PRIOR: DynamICCL Agent should be AGGRESSIVE in:
+--------------------------------------------------------------+
| Regime Reason |
|--------------------------------------------------------------|
| Multi-rail H800 / N >= 64 / msg >= 1M Up to 1.27x gap from |
| catalog selection |
| |
| Mixed-size workload from one job Catalog has multiple |
| crossing 1 MB threshold entries; pick best |
| per-call |
| |
| Topology with non-7:1 BW ratio Default NCCL will be |
| (any non-DGX cluster) mismatched; SyCCL |
| catalog wins big |
| |
| Mixture-of-Experts AlltoAll(v) SyCCL out of scope -- |
| agent is the only |
| layer with adaptive |
| response |
+--------------------------------------------------------------+
^ Fig 13: Where to allocate exploration budget. The regret-rich
regimes (bottom box) are exactly where SyCCL's catalog has
multiple high-quality candidates that vary by msg size, so the
agent's runtime selection has the most leverage.
12.5 Reward shaping
SyCCL minimizes K (number of epochs of duration τ); the simulator reports wall-clock time. DynamICCL's reward is −collective_wall_clock_us. Two shaping ideas:
Use the SyCCL simulator's predicted time as a baseline. Reward = −(actual − predicted) / predicted. This isolates the runtime selection from the schedule-level inefficiency that SyCCL has already optimized away. The agent's residual gain measures DynamICCL's contribution.
Use the bandwidth-ratio feature as a piecewise reward. When ratio matches the schedule's design ratio, the reward is straight wall-clock. When mismatched, add a small bonus for the agent picking a different schedule from the catalog. This incentivizes the agent to re-route around stale catalog entries.
12.6 Exploration budget allocation (the lesson from pruning ablation)
SyCCL's three pruning rules cut search space by 95%-97% with "minimal performance impact." DynamICCL's analog is symmetry-aware action sharing: tie the policy network's weights across symmetric topology positions (e.g., share parameters across all 8 NVLink-equivalent ranks within a server). This cuts the effective action space by 8x with no degradation, the same multiplier SyCCL achieves on its sketch space.
12.7 Three tensions between SyCCL and DynamICCL
Tension 1 — Static catalog vs adaptive selection. SyCCL produces one schedule per cell; DynamICCL adapts at every call. If the catalog is too narrow, the agent has nothing to choose from. Resolution: generate the catalog with multiple promising sketch combinations per cell (SyCCL already keeps R_2 = 8 by default). DynamICCL picks among these 8.
Tension 2 — Schedule-level optimality vs runtime knob mismatch. SyCCL computes the optimal schedule assuming default channel count. If DynamICCL changes nChannels, the schedule may no longer be optimal for the new channel count. Resolution: synthesize multiple schedule variants per cell, one per channel count; DynamICCL picks both schedule and channels jointly.
Tension 3 — Synthesis cost vs deployment latency. SyCCL takes 0.3s-37 min per cell. DynamICCL adapts at sub-second cadence. If a new (topology, msg, scale) cell appears that the catalog doesn't cover, the agent cannot wait 37 min. Resolution: fallback hierarchy: (1) try catalog; (2) try a similar cell's schedule with adjusted parameters; (3) trigger SyCCL re-synthesis in the background while serving NCCL defaults online; (4) update catalog when re-synthesis completes.
12.8 Architectural placement
+----------------------------------------------+
| Application: PyTorch / Megatron / DeepSpeed | L4
+----------------------------------------------+
| Comm framework: BytePS / Horovod / DDP | L3
+----------------------------------------------+
| DynamICCL Tuner Plugin | L2.5 <-- our layer
| - State: msg_bin, I, M, topology fp, |
| intra_inter_ratio, recent timing |
| - Action: schedule_id, algo, proto, |
| nChannels, numThreads, chunkSize|
| - Reward: -wall_clock_us |
| - Catalog source: SyCCL (offline) |
+----------------------------------------------+
| SyCCL XML schedule (one per cell) | L2.4 <-- catalog
+----------------------------------------------+
| MSCCL-executor / NCCL Core | L2
+----------------------------------------------+
| Transport: RDMA / NVLink / PCIe / SHM | L1
+----------------------------------------------+
^ Fig 14: Layered placement. SyCCL sits offline at L2.4 producing
the catalog. DynamICCL sits at L2.5 selecting from that catalog
AND tuning the in-NCCL knobs. The two layers are orthogonal and
composable.
12.9 Methodological patterns to reuse
| Pattern from Cao et al. (2025) | DynamICCL adoption |
|---|---|
| Symmetry-driven decomposition | Tie policy weights across symmetric topology ranks |
| Two-phase synthesis (sketch + solve) | Two-stage RL: discrete schedule_id selection, then continuous knob tuning |
| Coarse-to-fine refinement (R_1=20%, R_2=8) | Beam search with top-8 retained candidates |
| Per-sub-demand parallelism | Distribute training rollouts across topology-canonical actor groups |
| Auto-extracted dimensions/groups (Sec 3.1) | Auto-discover topology fingerprint from NCCL graph at init |
| Bandwidth-ratio feature (3.6:1 vs 7:1) | First-class state feature in the policy input |
| ASTRA-sim simulator at decision time | Use ASTRA-sim as the offline rollout environment for DynamICCL |
| MSCCL-executor as runtime | DynamICCL emits to MSCCL-executor, same fork as NCCL |
| Open source (github.com/aliyun/symccl) | DynamICCL must be open source for reproducibility |
| 6.3% end-to-end gain on real training | Match or exceed; use Table 6 as the reference |
12.10 Borrowed insight as the "no single winner" principle
The paper's evaluation explicitly shows three crossovers: (a) NCCL > SyCCL for small-msg/small-N, (b) SyCCL > NCCL for medium-large msg, (c) NCCL ≈ SyCCL for AlltoAll on small N. No single static synthesizer wins everywhere. This is the no-free-lunch property that justifies an RL selector. DynamICCL's existence is not "fixing" SyCCL; it is the complement of any synthesizer — the layer that exploits the regime structure SyCCL itself surfaces.
13. Analogy
SyCCL is a CAD tool that exploits a building's floorplan symmetry. The architect (synthesizer) looks at a 64-floor skyscraper and notices that floors 3-62 are nearly identical (symmetric groups). Instead of designing all 64 floors from scratch (TECCL's holistic encoding, which would take a month), the CAD tool designs one representative floor, two boundary floors, and the lobby, then replicates the representative across the symmetric groups with small adjustments for utilities and access (sketch combination). The MILP solver is the structural engineer who certifies one floor at a time; the CAD tool runs 192 such certifications in parallel.
DynamICCL is the building's HVAC and elevator-dispatch control system. The CAD blueprint is fixed once construction is done, but how the elevators route at 9 AM on a rainy Monday vs 3 PM on a sunny Friday is different — the runtime selector adapts to the state of the building (msg_size, model intensity I, batch size M, topology fingerprint, recent latency window) by choosing among pre-computed schedule plans (SyCCL's XML catalog) and tuning the elevator parameters (algo, proto, nChannels) in real time. The synthesizer cannot replace the dispatch system because it has no state observability; the dispatch system cannot replace the synthesizer because it has no time to redesign floors at call time. The two layers are physically separate and intellectually orthogonal.
The paper's key intellectual contribution — that symmetry is a free resource the previous generation left on the table — translates directly to RL: parameter-tied policies across symmetric ranks should be the default, not the optimization. The 17286x speedup SyCCL achieves in synthesis time is the same multiplier DynamICCL stands to gain in sample efficiency if it shares actor-network weights across NVLink-equivalent rank positions. Symmetry is the cheapest constraint a learning system can be told.
Summary of Borrowed Patterns
| Pattern from Cao et al. (2025) | DynamICCL application |
|---|---|
| Sketch IR (per-stage, per-group sub-demand) | Action-space factoring: separate "schedule_id" from "in-NCCL knob tuple" |
| Pruning #1 (isomorphism removal) | Tie policy network weights across symmetric topology positions |
| Pruning #2 (consistency invariant) | Constrain action space to symmetry-respecting tuples |
| Pruning #3 (relay-cap X = | D |
| Two-phase synthesis (4 + 5) | Two-stage RL: schedule selection then knob tuning |
| Coarse-to-fine refinement (R_1, R_2) | Beam search with top-R_2 retained between stages |
| 192-thread MILP parallelism | Distributed actor pool indexed by topology canonical group |
| Auto-dimension extraction (Sec 3.1) | Topology fingerprint feature derived from NCCL graph init |
| Bandwidth ratio observation (3.6:1 vs hard-coded 7:1) | First-class state feature for policy conditioning |
| Sketch combination (chunk-allocation t_i) | Convex action mixture: probability-weighted sub-action selection |
| ASTRA-sim simulator for evaluation | Offline rollout environment for DynamICCL training |
| MSCCL-executor as runtime back-end | DynamICCL emits via the same XML format SyCCL uses |
| Up to 127% gain over NCCL on H800 | Reference upper bound for any DynamICCL evaluation |
| Up to 6.3% end-to-end gain on training | End-to-end metric SyCCL surpasses; DynamICCL must match |
| Synthesis time 0.3s-37min per cell | Catalog regeneration cadence: minutes-hours, not call-time |
| Symmetry-blind solvers waste 95-97% of search | Symmetry-blind policies waste similar effort; tie weights |
| The "no single winner" insight (NCCL vs SyCCL crossovers) | RL justification: regime-conditional best is the only winning strategy |
| 17286x synthesis speedup | Reference data point for state-abstraction reductions in policy learning |
| Hand-crafted schedules tie SyCCL on 16-A100 | Expert knowledge is a valid baseline; DynamICCL must beat both |
| 511-hop NCCL Ring latency wall at 512 GPUs | Latency-dominated regime where catalog selection alone gives huge gain |
| End-to-end gain capped at 6.3% on training | Communication slice is small; over-emphasize comm regret only when warranted |
| Limited applicability to asymmetric MoE workloads | DynamICCL is the only layer with response in MoE regimes (no synthesizer fallback) |
| Open-source github.com/aliyun/symccl | DynamICCL must be open-source for reproducibility |