Architecture & Synthesizer-Design Analysis

SyCCL — Exploiting Symmetry for Efficient Collective Communication Scheduling

Source: Cao, J.; Shi, S.; Gao, J.; Liu, W.; Yang, Y.; Xu, Y.; Zheng, Z.; Guan, Y.; Qian, K.; Liu, Y.; Xu, M.; Wang, T.; Wang, N.; Dong, J.; Fu, B.; Cai, D.; Zhai, E. SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling. In ACM SIGCOMM '25, September 8–11, 2025, Coimbra, Portugal. 18 pages, 22 figures (incl. appendix), 6 tables. DOI: 10.1145/3718925.3750499 Code: https://github.com/aliyun/symccl Authors: Alibaba Cloud + Tsinghua University (J. Cao and S. Shi contributed equally). Reader: Direct PDF read (gemini-reader free-tier RESOURCE_EXHAUSTED on gemini-2.0-flash; codex-reader rejected gpt-5.1-codex-mini for ChatGPT free-tier accounts; pages 1-18 read directly via the Read tool with pages parameter, including the full Appendix A-C). Analyst: Vishwakarma Date: 2026-05-04


Table of Contents

  1. Lineage Note — Where SyCCL Sits in the Synthesis Family (SCCL → TACCL → MSCCLang/GC3 → TE-CCL → TECCL → SyCCL)
  2. System / Synthesizer Architecture (the "instrument" — two-phase symmetry decomposition)
  3. Target-Hardware / Topology Specimens (A100 Clos, H800 Multi-rail, generalized multi-rail)
  4. Design-Space Diagram (axes swept by the evaluation, axes held fixed)
  5. The SyCCL Sketch IR — Variables, Sub-demands, and Combinations
  6. Algorithm / Control Flow — Sketch Search, Sketch Combination, Schedule Synthesis
  7. The Two Symmetry Insights (topology symmetry, collective symmetry) and Three Pruning Rules
  8. Quantitative Results — Empirical Findings by Regime
  9. Configuration-Regime Trade-off Tables
  10. Bottlenecks & Insights Surfaced by the Measurements
  11. Limitations of the Methodology
  12. What to Borrow for DynamICCL — Composability vs Tension with Run-Time Knob Selection
  13. Analogy

1. Lineage Note — Where SyCCL Sits in the Synthesis Family

The five immediately preceding papers in this corpus form a clean genealogy of automatic collective synthesis. SyCCL is the sixth generation:

+------------------------------------------------------------------+
|  SCCL (PPoPP 2021, paper 0031)                                   |
|    SMT solver synthesizes optimal schedules                      |
|    Pareto-Synthesize sweeps (S, R, C); QF_LIA encoding           |
|    Single-node only, takes >24 h to synthesize 16-GPU AllGather  |
|                              |                                   |
|                              v                                   |
|  TACCL (NSDI 2023, paper 0032)                                   |
|    MILP guided by user-supplied "communication sketch"           |
|    Three-stage: routing -> heuristic ordering -> contiguity      |
|    Logical topology + switch-hyperedge + symmetry hints          |
|    Fails to create AllGather for 128-GPU topology in 8 hours     |
|                              |                                   |
|                              v                                   |
|  MSCCLang / GC3 (ASPLOS 2023 / arXiv 2022, papers 0033, 0034)    |
|    DSL programmability — user *writes* algorithm, compiler lowers|
|    Three IRs (Chunk DAG -> Instruction DAG -> GC3-IR), runtime   |
|    interpreter; synthesis is OUT OF SCOPE                        |
|                              |                                   |
|                              v                                   |
|  TE-CCL  (SIGCOMM 2024, paper 0035)                              |
|    Treat CCL as a multi-commodity-flow problem from network TE   |
|    General MILP + scalable LP form + A*-style temporal partition |
|    No sketch, no DSL. Scales to 256-GPU AllToAll in 2.8 h.       |
|                              |                                   |
|                              v                                   |
|  TECCL (NSDI 2025, the SyCCL paper's main baseline)              |
|    Variant of TE-CCL with epoch-duration tuning + greedy         |
|    interval decomposition. Suffers ~20% performance loss in      |
|    exchange for scalability. SyCCL cites this as ref [28].       |
|                              |                                   |
|                              v                                   |
|  SyCCL  (SIGCOMM 2025, this paper, 0036)                         |
|    "Forget holistic encoding. Decompose the collective by        |
|    symmetry, solve each smaller sub-problem, recombine."         |
|    Two-phase: Sketch Exploration -> Schedule Synthesis.          |
|    Sketch IR + isomorphism pruning + parallel MILP per sketch.   |
|    Up to 127% better performance than NCCL, 2-4 orders of        |
|    magnitude faster synthesis than TECCL.                         |
+------------------------------------------------------------------+
^ Fig 0: Lineage of automatic CCL synthesis. SyCCL inherits the
  problem statement (topology + demand -> schedule) from SCCL/TACCL
  and the MILP back-end from TECCL/TE-CCL, but rejects the holistic
  encoding paradigm of the prior generation. Its central claim is
  that *symmetry is the missing constraint* — present in real
  topologies and collectives but invisible to the solver.

The intellectual move is to decompose first, solve later. Earlier synthesizers (SCCL, TACCL, TECCL) each took the entire collective and the entire topology and handed the resulting MILP/SMT to a solver, hoping it would discover the symmetric structure on its own. SyCCL's authors observe that solvers do not exploit symmetry well: they brute-force isomorphic regions of the search space and waste hours re-solving sub-problems whose answers are already determined by earlier ones. SyCCL extracts the symmetry up front, reduces the problem to a small set of canonical sub-demands, solves those in parallel, and copies the solutions across isomorphic groups.

The paper's own framing in Section 2.3 makes this explicit:

"We identify that the culprit of the scalability issue in previous solutions is encoding the entire collective and topology into a holistic formula. While in reality, symmetries commonly exist in the collective and topology... The challenge lies in effectively leveraging these symmetries. MILP solvers may not efficiently detect complex symmetric structures. Conversely, directly encoding them as MILP constraints may further complicate the problem and degrade solver performance."

That last sentence is the key: even if the user tells the solver the symmetries (as TACCL does with rotational-symmetry hints), the extra constraints make the formula harder. So SyCCL's sketch IR replaces the symmetry constraints with structural decomposition — the MILP never sees the symmetry, only the canonical fragments.

For DynamICCL the relevant lineage observation is that SyCCL still produces a static MSCCL XML schedule per (topology, collective, message-size) cell. The runtime has no adaptation: SyCCL targets the offline catalog. DynamICCL's RL knob selector is the layer above any such catalog, choosing among pre-synthesized schedules at call time, plus the in-NCCL knobs (algo, proto, nChannels, numThreads, chunkSize) that no synthesizer covers.


2. System / Synthesizer Architecture (the "instrument" — two-phase symmetry decomposition)

SyCCL is not a monolithic solver. It is a two-phase synthesizer: a Sketch Exploration phase (Section 4) that decomposes the global demand into canonical sub-demands using symmetry, and a Schedule Synthesis phase (Section 5) that solves each sub-demand with a small MILP and stitches results back together.

+--------------------------------------------------------------------+
|                          SyCCL Pipeline                            |
|                                                                    |
|  +--------------------+      +-----------------------------------+ |
|  | Workload Spec      |      | Topology Spec                     | |
|  | - Collective name  |      | - V (GPUs), V_0 (GPUs+NICs+sw)    | |
|  |   {AG, AtoA, AR,   |      | - E directed edges                | |
|  |    RS, Bcast, Sct, |      | - alpha_e latency per edge (us)   | |
|  |    Reduce, Gather} |      | - 1/beta_e bandwidth per edge      | |
|  | - Demand matrix    |      | - D dimensions = {0,1,2,...}      | |
|  |   F_s, F_d mappings|      | - G_d groups in dimension d        | |
|  | - Chunk size s     |      | - V_{d,g} GPUs in group g, dim d  | |
|  +---------+----------+      +-----------------+-----------------+ |
|            |                                   |                   |
|            +-----------------+-----------------+                   |
|                              |                                     |
|                              v                                     |
|  +---------------------------------------------------------------+ |
|  | Topology Profiler / Dimension Extractor (Sec 3.1, 6)          | |
|  |   - Network profiler tests chunk sizes per link type          | |
|  |   - Auto-extracts dimensions and groups from connectivity     | |
|  |     and connection performance                                | |
|  |   - Produces V_0, E, alpha_e, beta_e, D, G_d, V_{d,g}         | |
|  +-------------------------+-------------------------------------+ |
|                            |                                       |
|                            v                                       |
|  +---------------------------------------------------------------+ |
|  | DECOMPOSITION (Sec 4.3)                                       | |
|  |  Choose collective family:                                    | |
|  |    one-to-one  -> direct sketch search (special case of A2A)  | |
|  |    all-to-one  -> tree sketch (Reduce/Gather are inverses)    | |
|  |    one-to-all  -> tree sketch (Broadcast/Scatter)             | |
|  |    all-to-all  -> N independent one-to-all decompositions     | |
|  |                   (AllGather, AlltoAll, ReduceScatter)        | |
|  |    AllReduce    -> ReduceScatter ; concat ; AllGather         | |
|  +-------------------------+-------------------------------------+ |
|                            |                                       |
|                            v                                       |
|  +================== PHASE 1: SKETCH EXPLORATION ===============+   |
|  | (Sec 4 — pure combinatorics, no MILP yet)                    |   |
|  |                                                              |   |
|  |  +-----------------------------+                             |   |
|  |  | (4.1) Searching for Sketches|                             |   |
|  |  |  Enumerate D_k, G_{d,k},     |                             |   |
|  |  |   V_{k,d,g}^s, V_{k,d,g}^r   |  per stage k = 0..K-1       |   |
|  |  |  Produce tree-shaped sketch |                             |   |
|  |  |   for one-to-all chunk      |                             |   |
|  |  |  Apply pruning #1, #2, #3   |                             |   |
|  |  +-------------+---------------+                             |   |
|  |                |                                             |   |
|  |                v                                             |   |
|  |  +-----------------------------+                             |   |
|  |  | (4.2) Generating Sketch     |                             |   |
|  |  |       Combinations           |                             |   |
|  |  |  Step 1: replicate sketches |  -- balance across GROUPS   |   |
|  |  |          via mapping H:V->V |     (mapping replication)   |   |
|  |  |          and H_d:G_d -> G_d |                             |   |
|  |  |  Step 2: integrate combos   |  -- balance across DIMS     |   |
|  |  |          chunk-allocation t |     (chunk allocation t_i)  |   |
|  |  +-------------+---------------+                             |   |
|  |                |                                             |   |
|  |                v                                             |   |
|  |  +-----------------------------+                             |   |
|  |  | (4.3) AllToAll Extension    |                             |   |
|  |  |  Replicate single one-to-all|                             |   |
|  |  |   sketch to N copies via    |                             |   |
|  |  |   topology symmetry         |                             |   |
|  |  +-------------+---------------+                             |   |
|  |                |                                             |   |
|  |                v                                             |   |
|  |  Output: M = { <S_i, t_i> } sketch combinations              |   |
|  +-----------------------------+--------------------------------+   |
|                                |                                    |
|                                v                                    |
|  +================== PHASE 2: SCHEDULE SYNTHESIS ===============+   |
|  | (Sec 5 — MILP + simulator, parallelized)                     |   |
|  |                                                              |   |
|  |  +-----------------------------+                             |   |
|  |  | (5.1) Synthesizing          |                             |   |
|  |  |       Sub-Schedules          |                             |   |
|  |  |  For each sketch combination|                             |   |
|  |  |   For each merged sub-demand|                             |   |
|  |  |    Build MILP (alpha-beta   |                             |   |
|  |  |     + epoch tau + integer F)|                             |   |
|  |  |    Solve via Gurobi          |                             |   |
|  |  |    Use isomorphism: solve   |                             |   |
|  |  |     1 / class, copy others  |                             |   |
|  |  |    Run instances in PARALLEL|                             |   |
|  |  +-------------+---------------+                             |   |
|  |                |                                             |   |
|  |                v                                             |   |
|  |  +-----------------------------+                             |   |
|  |  | (5.2) Synthesizing the      |                             |   |
|  |  |       Optimal Schedule       |                             |   |
|  |  |  Concatenate sub-schedules  |                             |   |
|  |  |   for each combination      |                             |   |
|  |  |  Evaluate via fine-grained  |                             |   |
|  |  |   simulator (ASTRA-sim base)|                             |   |
|  |  |  Select best across combos  |                             |   |
|  |  +-------------+---------------+                             |   |
|  |                |                                             |   |
|  |                v                                             |   |
|  |  +-----------------------------+                             |   |
|  |  | (5.3) Acceleration           |                             |   |
|  |  |  Two-step solving:           |                             |   |
|  |  |    coarse (large tau) then  |                             |   |
|  |  |    fine (small tau, top-R2) |                             |   |
|  |  |  R1 = 20%  (filter ratio)   |                             |   |
|  |  |  R2 = 8    (top retained)   |                             |   |
|  |  |  E = 3.0 / 0.5 (tau auto)   |                             |   |
|  |  +-----------------------------+                             |   |
|  +------------------------------------------------------------+   |
|                                |                                    |
|                                v                                    |
|  +---------------------------------------------------------------+ |
|  | Schedule Executor                                             | |
|  |   - Convert synthesized schedule to MSCCL XML                 | |
|  |   - Configure runtime: CCL transport, channel count           | |
|  |   - Inject XML into MSCCL-executor [10] via lightweight parser| |
|  |   - MSCCL-executor runs XML directly (no CUDA kernel changes) | |
|  +-----------------------------+---------------------------------+ |
|                                |                                   |
|                                v                                   |
|  +---------------------------------------------------------------+ |
|  | Hardware Path (deployment)                                    | |
|  |   - 32-GPU A100 testbed (4 servers, 8x A100, 4x200Gbps RDMA)  | |
|  |   - Production H800 simulation (64 servers, 8x H800,          | |
|  |     8x400Gbps RDMA, multi-rail)                                | |
|  |   - Synthesizer host: 192-core Intel Xeon Platinum 8469C, 1 TB |
|  +---------------------------------------------------------------+ |
+--------------------------------------------------------------------+
^ Fig 1: SyCCL pipeline. The two phases are pure-combinatorial
  (Sketch Exploration) and MILP+simulation (Schedule Synthesis). The
  hardware path reuses MSCCL-executor — SyCCL contributes the
  schedule, not the runtime. The 7K lines of C++ in Section 6 implement
  the core SyCCL algorithm; everything else is reused.

The deliberate factoring is what differentiates SyCCL from prior CCL solvers. SCCL had one SMT formulation; TACCL had one MILP plus a heuristic ordering hack; TECCL had three formulation choices (full MILP, LP, A* multi-round); SyCCL is the first synthesizer where the solver never sees the full collective. The MILP is invoked hundreds or thousands of times — once per (sketch combination, sub-demand, isomorphism class) cell — and each invocation is small enough to solve in under a second.

For DynamICCL this is the most important architectural lesson: when the search space explodes, the right move is not always "make the solver smarter"; sometimes it is "tell the solver less, and hand it the canonical fragment instead." DynamICCL's analog is state abstraction: never present the agent with the raw 32-dim feature vector if a 4-dim canonical projection (msg_size_bin, I, M, topology fingerprint) captures the same equivalence class.


3. Target-Hardware / Topology Specimens

3.1 The A100 Clos testbed (32 GPUs, real hardware)

+--------------------- A100 Cluster: 4 servers x 8 A100 = 32 -------+
|                                                                   |
|     +-------- Two-tier Clos network (oversubscribed) ----------+   |
|     |  8x Spine switches                                         |   |
|     |     ===2x200Gbps===                                          |   |
|     |  2x Leaf switches  ====8x200Gbps====  per leaf              |   |
|     |     ===4x200Gbps===                                          |   |
|     |  4x Servers (each leaf serves 2 servers)                    |   |
|     +-------------------------------------------------------------+   |
|                                                                   |
|  Server 0..3   each:                                              |
|    8x A100 GPUs                                                   |
|    intra-server NVSwitch (NVLink fabric)                          |
|    4x 200 Gbps RDMA NICs (one NIC per 2 GPUs)                     |
|                                                                   |
|  Connectivity dimensions (Sec 3.1):                               |
|    Dim 0: intra-server NVLink (NVSwitch)                          |
|    Dim 1: leaf-switch (intra-pod)                                 |
|    Dim 2: spine-switch (cross-pod)                                |
+-------------------------------------------------------------------+
^ Fig 2a: A100 testbed — 32 GPUs across 4 servers, two-tier Clos
  RDMA network, on-server NVSwitch. Used for real-hardware
  evaluation in Sec 7.2 (Fig 14) and Appendix C (Fig 21,22).

3.2 The H800 Multi-rail production cluster (64 servers, simulated)

+--------------- H800 Cluster: 64 servers x 8 H800 = 512 -----------+
|                                                                   |
|     +-------- Multi-rail topology (rail-aligned) -------------+    |
|     |  8x Leaf switches  ====8x400Gbps==== per server         |    |
|     |     ===8x400Gbps===                                       |    |
|     |  64x Servers, "Same-rail" connection pattern            |    |
|     |     (GPUs with same intra-server index -> same leaf)    |    |
|     +-----------------------------------------------------------+    |
|                                                                   |
|  Server 0..63  each:                                              |
|    8x Nvidia H800 GPUs                                            |
|    intra-server NVSwitch (NVLink, 180 GB/s)                       |
|    8x 400 Gbps RDMA NICs (one NIC per GPU)                        |
|                                                                   |
|  Bandwidth ratio observation (Sec 2.1):                           |
|    NVLink: 180 GBps                                               |
|    Inter-server: 8x400Gbps = 400 GB/s aggregate                   |
|    Actual ratio 3.6:1 (NVLink:network), NOT NCCL's assumed 7:1   |
|    -> 48.5% / (3.6+1) = 10.6% bandwidth wastage with fixed ring   |
|                                                                   |
|  Connectivity dimensions:                                         |
|    Dim 0: intra-server NVSwitch                                   |
|    Dim 1: leaf-switch (rail-bound)                                |
+-------------------------------------------------------------------+
^ Fig 2b: H800 production cluster — 512 GPUs, multi-rail topology
  where GPUs with the same intra-server ID share a leaf switch.
  Used for simulation evaluation in Sec 7.2 (Fig 15). The 3.6:1
  bandwidth ratio is the central motivating measurement.

3.3 Generalized multi-rail and Clos examples (Appendix B)

+------- Fig 19 (Appendix B): 7-server x 4-GPU multi-rail -----+
|   Spine switch  -- 4x core link                              |
|   4x Leaf switches each serving 2 servers                    |
|   7 servers x 4 GPUs = 28 GPUs                               |
|                                                              |
|   Dimension extraction (auto):                               |
|     Dim 3 = {0,1,...,27}            (1 group, all GPUs)      |
|     Dim 2 = {0,1,2,3}, {4,...,27}  (4 groups via leaf)       |
|     Dim 1 = {0,4,8,12}, {1,5,9,13},...  (4 groups via rail) |
|     Dim 0 = {0,1,2,3}, {4,5,6,7}, ...   (7 groups intra-srv) |
+--------------------------------------------------------------+

+------- Fig 20 (Appendix B): Clos topology, 8 servers x 4 GPUs +
|   Core switch -- 2 spine links                                |
|   2x Spine switches                                           |
|   4x Leaf switches each serving 2 servers                     |
|                                                               |
|   Dimension extraction (auto):                                |
|     Dim 3 = all 32 GPUs                                       |
|     Dim 2 = 2 groups via spine                                 |
|     Dim 1 = 4 groups via leaf                                  |
|     Dim 0 = 8 groups intra-server                              |
+----------------------------------------------------------------+
^ Fig 2c: SyCCL's dimension extraction is automatic for any
  symmetric topology. The set of dimensions D and groups G_d
  are derived from connectivity and link bandwidth — NOT
  user-supplied as in TACCL's "logical topology" sketch.

The key topology assumption SyCCL exploits is that real production GPU clusters are built layer-by-layer with structural repetition: every server has the same number of GPUs, every leaf serves the same number of servers, every rail follows the same pattern. This is the repetition that creates the isomorphic groups SyCCL slices on.


4. Design-Space Diagram (axes swept x axes held fixed)

                       DESIGN SPACE
+----------------------------------------------------------------+
|                                                                |
|  Axis 1: COLLECTIVE TYPE (4 evaluated)                         |
|    [AllGather]   primary focus, real-hardware + simulation     |
|    [ReduceScatter] real-hardware A100                          |
|    [AlltoAll]    real-hardware A100 + simulation H800          |
|    [AllReduce]   end-to-end (decomposed into RS+AG)            |
|    (Broadcast / Scatter / Reduce / Gather: tree case in §4.1)  |
|                                                                |
|  Axis 2: TOPOLOGY / SCALE (3 levels per topology)              |
|    A100 testbed: 16 GPUs / 32 GPUs                             |
|    H800 simulated: 64 GPUs / 512 GPUs                          |
|                                                                |
|  Axis 3: DATA SIZE (12 powers, 1 KB to 4 GB)                   |
|    1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G      |
|                                                                |
|  Axis 4: SYNTHESIZER (3 systems compared)                      |
|    [TECCL]   prior SOTA MILP synthesizer (10 h timeout)        |
|    [NCCL]    fixed schedule baseline (default ring/tree)       |
|    [SyCCL]   this paper                                         |
|                                                                |
|  Axis 5 (ablation, Sec 7.4): SyCCL POLICY KNOBS                |
|    Pruning {#1: redundancy/consistency, #2: relay limit, #3:   |
|              AlltoAll stage cap}                                |
|    AlltoAll max stages X = {3, 5, 10}                          |
|    Epoch-balance E_2 = {0.1, 0.2, 1.0}                         |
|    R_1 = 20%, R_2 = 8 (two-step synthesis filter ratios)       |
|    Parallelism = {1, 2, 4, 8, 16, 32, 64, 128, 192} threads    |
|                                                                |
|  Held FIXED:                                                   |
|    - Workload model: GPT-6.7B (DP/TP) and Llama3-8B (DP/TP)    |
|    - Optimizer: Adam-class, distributed                         |
|    - MILP solver: Gurobi (TECCL [28] back-end)                 |
|    - Runtime: MSCCL-executor [10] (Microsoft fork of NCCL)     |
|    - Network profiler: chunk-size sweep per link type          |
|    - Synthesis host: 192-core Xeon 8469C, 1 TB RAM             |
+----------------------------------------------------------------+
^ Fig 3: 5-axis design space — 4 collectives x 4 scales x 12
  data sizes x 3 synthesizers, with 5 ablation knobs in Sec 7.4.
  The "held fixed" line is critical: SyCCL is benchmarked on
  CCL-only metrics (busbw and synthesis time), not on application-
  layer end-to-end except in the brief Sec 7.5 / Table 6.

Three absences define the paper's measurement scope. First, no in-NCCL knob sweep — algo, proto, nChannels, numThreads, chunkSize are fixed at MSCCL-executor's runtime defaults; SyCCL only varies the schedule (the XML chunk-route plan). Second, no adaptive runtime — every reported number is for a single synthesized schedule executed verbatim. Third, no asymmetric or heterogeneous workloads — Section 8 explicitly notes that Mixture-of-Experts AlltoAll(v) is out of scope. These absences are what leaves room for DynamICCL to compose with SyCCL: the schedules SyCCL produces are inputs to DynamICCL's runtime selector.


5. The SyCCL Sketch IR — Variables, Sub-demands, and Combinations

5.1 Notation (Tables 1, 2, 3 in the paper)

Variable Description
V Set of GPUs
V_0 Set of GPUs, NICs, and switches (V ⊆ V_0)
C Set of chunks
s Chunk size (bytes)
F_s : C → V Mapping each chunk c to the GPU it starts on
F_d : C → P(V) Mapping each chunk c to the set of destination GPUs
r ∈ {0,1} Whether chunks are reduced on destinations
E Set of directed links E ⊆ {(v_1,v_2) : v_1, v_2 ∈ V_0}
α_e Latency of link e (μs)
1/β_e Bandwidth of link e
D Set of dimensions D = {0, 1, ...}
G_d Set of groups in dimension d ∈ D
V_{d,g} Set of GPUs in group g ∈ G_d in dimension d
K Number of stages K ∈ ℤ⁺
V^s_{k,d,g} Set of GPUs in group g, dimension d acting as sources at stage k (V^s_{k,d,g} ⊆ V_{d,g})
V^r_{k,d,g} Set of GPUs in group g, dimension d acting as destinations at stage k (V^r_{k,d,g} ⊆ V_{d,g})
R_{k,d,g} Communication sub-demand: V^s_{k,d,g} → V^r_{k,d,g}

A sketch is a tuple of such variables across all stages:

   Sketch = ⟨ K,  { V^s_{k,d,g}, V^r_{k,d,g} }_{k,d,g} ⟩
   Constraint:  ∪_{d,g} V^s_{0,d,g} ∪ ∪_{k,d,g} V^r_{k,d,g} = V
                (every GPU is either a source at stage 0 or a
                 destination exactly once at some later stage)

5.2 Sketch Combination

A single sketch handles one indivisible chunk; a sketch combination M = {(S_1, t_1), (S_2, t_2), ...} prescribes that sketch S_i carries fraction t_i of the data, with Σ t_i = 1.

+--------- Example: 16-GPU Broadcast (Fig 5 in paper) ------------+
|                                                                  |
|  Sketch (1):                                                     |
|    Stage 0: R_{0,1,0} = {0} -> {1,2,3}                          |
|             R_{0,1,1} = {0} -> {4,8,12}                         |
|    Stage 1: R_{1,0,1} = {4} -> {5,6,7}                          |
|             R_{1,0,2} = {8} -> {9,10,11}                        |
|             R_{1,0,3} = {12} -> {13,14,15}                      |
|                                                                  |
|  Tree shape: GPU 0 broadcasts in dim 1 to {4,8,12}, then         |
|  each forwards in dim 0 to its 3 server-mates.                   |
+------------------------------------------------------------------+
^ Fig 4: A single sketch is a tree where every GPU has at most one
  predecessor (Broadcast/Scatter case). Pruning #3 caps the total
  number of relay hops at |D| - 1 (≤ 3 for a 4-dimensional topology).

The sketch IR is the central abstraction. Compared to other synthesis IRs:

IR Granularity Solver-visible? User-supplied?
SCCL: SMT formula per-link, per-time Yes (entire) No
TACCL: communication sketch logical topology + path hints Partial Yes
MSCCLang: Chunk DAG → Inst DAG → GC3-IR per-chunk → per-instruction Compiled User writes algo
TE-CCL / TECCL: F_{s,d,i,j,k,c} per-link per-epoch per-chunk Yes (entire) No
SyCCL: Sketch + Combination per-stage per-group sub-demand Sub-demand only No (auto-enumerated)

SyCCL's IR is the only one that is automatic and decomposed. The solver sees one R_{k,d,g} at a time, never the full collective.


6. Algorithm / Control Flow

6.1 Overall control flow

   START (Topology + Collective + Demand + Chunk size)
            |
            v
   (1) Decompose collective
        |
        +-- one-to-one  ----------------> goto (3) with single chunk
        +-- one-to-all  ----------------> tree case, goto (2)
        +-- all-to-one  ----------------> tree case (inverse), (2)
        +-- all-to-all  --> N x one-to-all decomposition --> (2)
        +-- AllReduce   --> RS ; AG  (treat separately) ----> (2)
            |
            v
   (2) PHASE 1: Sketch Exploration (Sec 4)
        (4.1) Enumerate sketches per (D_k, G_{d,k})
              Apply pruning #1: remove redundant isomorphic sketches
              Apply pruning #2: enforce consistency across iso groups
              Apply pruning #3: cap relay count at X = |D|-1
              -> Set of unique sketches {S_1, ..., S_n}
        (4.2) Generate sketch combinations
              Step 1: Replicate sketches via H:V->V mapping
                       (balance workload across groups)
              Step 2: Integrate combinations across dimensions
                       (chunk-allocation t_i; solve linear |D|-var
                        problem for bandwidth ratio match)
              -> Set of valid combinations M = {⟨S_i, t_i⟩}
        (4.3) AllToAll: replicate the one-to-all sketch N times
              with topology symmetry; produce N-sketch combination
        |
        v
   (3) PHASE 2: Schedule Synthesis (Sec 5)
        (5.1) For each combination, for each merged sub-demand:
              Build alpha-beta MILP (TECCL Appendix A.1 modeling)
              Use isomorphism: solve 1 representative / class
              Copy solution to other isomorphic sub-demands
              Run independent sub-problems in PARALLEL (192 threads)
              -> Sub-schedules
        (5.2) Concatenate sub-schedules per combination
              Run fine-grained simulator (ASTRA-sim base, alpha-beta)
              Select schedule with minimum predicted completion time
        (5.3) Acceleration:
              Coarse pass: large tau, all combinations
              Filter: keep top R2=8 within R1=20% of best
              Fine pass: small tau, only retained candidates
              Auto-select tau via E_1=3.0, E_2=0.5
        |
        v
   (4) Materialize: emit MSCCL XML schedule
       Feed to MSCCL-executor [10]
       END

^ Fig 5: SyCCL control flow. The MILP solver is called inside
  step (5.1) only — never on the full collective. The isomorphism
  optimization (last bullet of 5.1) is what cuts solver invocations
  by 2-4 orders of magnitude vs TECCL on the same problem.

6.2 Sketch search algorithm (Section 4.1)

The enumeration over sketches is governed by three loops at each stage k:

   for each stage k = 0 .. K-1:
     for each subset D_k ⊆ D of dimensions used at stage k:
       for each dimension d ∈ D_k:
         for each group g ∈ G_d:
           enumerate V^s_{k,d,g} ⊆ V_{d,g}
           enumerate V^r_{k,d,g} ⊆ V_{d,g}
           constraint: V^s_{k,d,g} ∩ V^r_{k,d,g} = ∅
                       V^r_{k,d,g} not seen as r at earlier k' < k
                       For Broadcast: |V^s| = 1 (single source)
                       For Scatter:   |V^s| ≤ 1 (with descendants
                                       counted via f(v) for relay
                                       chunk-count cap)
^ Fig 6: Sketch search inner loops. Without pruning, this is
  super-exponential in |V|. The three pruning rules below cut it
  to polynomial-in-|G_d| per stage.

6.3 The three pruning rules (Section 4.1)

Pruning #1 — Redundant isomorphic sketches. Define a sketch S_a isomorphic to S_b iff a one-to-one mapping H : V → V exists carrying all sources/destinations of S_a onto those of S_b. SyCCL keeps only one representative per isomorphism class.

"Sketches 2 and 3 in Figure 8 are isomorphic to each other, while sketch 1 is not."

Pruning #2 — Consistency across isomorphic groups. Within a single sketch, sub-demands assigned to isomorphic groups must have identical structure. Define per-group ratio ρ_{d,g,k} = |V^r_{d,k,g}| / |V^s_{d,k,g}|. A sketch is consistent if ρ is uniform across the isomorphic groups in any given dimension (or zero, meaning the group is excluded).

"Sketches 4 and 5 display consistent sub-demands across groups in both dimensions 0 and 1. In contrast, sketch 6 shows varying ratios in dimension 1 (i.e., GPUs 1, 6, and 16 correspond to 1, 3, and 2 destination GPUs, respectively)."

Pruning #3 — Relay-hop cap for Scatter. In a Scatter tree, each relay GPU forwards n+1 chunks (its own plus n descendants'), creating bandwidth overhead. SyCCL caps the maximum hop count from root to leaf at X = |D| − 1. In practice this means each dimension is traversed at most once.

6.4 Sketch combination algorithm (Section 4.2)

  Step 1: replicate sketches to balance across GROUPS
    For each base sketch S generated in (4.1):
      Compute per-group workload w_{d,g}
        Broadcast: w_{d,g} = Σ_k |V^r_{k,d,g}|
        Scatter:    w_{d,g} = Σ_k Σ_{v∈V^r} (f(v)+1)
                    where f(v) = number of descendants of v
      If groups in some dim d are imbalanced:
        Replicate S as S^0, S^1, ..., S^{|C|-1}, each transmitting
        fraction t = 1/|C| of the chunk, with mapping H_d : G_d -> G_d
        rotating which group each replica favors
        => Output: combination C with |C| equal-weighted sketches

  Step 2: integrate combinations to balance across DIMENSIONS
    For each |D|-tuple of combinations C_d (one per dim d):
      For each sketch S_i in the merged combination:
        Compute w_{i,d}  (workload of S_i in dim d)
      Demand for dim d: u_d ∝ topology bandwidth share
      Solve: t_i values such that w_d / u_d = Σ_i t_i * w_{i,d} / u_d
             constant across d
      If no solution exists -> reject this combination
      Else -> output ⟨S_i, t_i⟩ as a valid combination

6.5 AllToAll extension (Section 4.3)

For all-to-all collectives (AllGather, AlltoAll, ReduceScatter), an N-GPU AllToAll = N independent one-to-all collectives. SyCCL:

  1. Searches sketches for one decomposed one-to-all (Sec 4.1).
  2. Replicates the chosen sketch to the other N − 1 GPUs via topology symmetry, ensuring even per-dim workload distribution.
  3. Integrates the N-sketch combination as in Sec 4.2 Step 2.

For AllReduce, SyCCL splits it as ReduceScatter ; AllGather and synthesizes each separately — the same decomposition used by NCCL's ring-allreduce internally.

6.6 Schedule synthesis MILP (Appendix A.1)

The per-sub-demand MILP is the standard alpha-beta epoch model inherited from TECCL:

   Time = K × τ        (K epochs of duration τ)
   Per-link transmission: t = α + β·s   for one chunk of size s
   Bandwidth constraint:  total volume per τ ≤ τ/β
   Latency constraint:    ⌈t/τ⌉ epochs before next chunk on same link

   Variables:
     F_{s,d,i,j,k,c} ∈ {0,1}   chunk c on link (i,j) at epoch k
     B_{v,k,c} ∈ {0,1}         GPU v holds chunk c at epoch k

   Objective: minimize K subject to demand satisfaction,
              flow conservation, and bandwidth/latency caps

   Solved via Gurobi (TECCL [28] back-end)

The auxiliary variable E (Appendix A.3) bridges bandwidth and latency constraints automatically: τ must be a multiple of both α + β·s (latency) and β·s (bandwidth). Setting ⌈f(r)⌉ = E lets SyCCL pick τ without manual tuning per link type.


7. The Two Symmetry Insights and Three Pruning Rules

7.1 Observation 1 — Topology symmetry (Section 3.1)

Modern GPU clusters are built with multi-dimensional symmetric networks. Every server has the same NVSwitch, every leaf has the same fan-out, every spine connects the same number of pods. SyCCL formalizes this by extracting dimensions D and groups G_d from the topology graph automatically: GPUs reachable through "the same type of connection" form a group, and groups in the same dimension are isomorphic.

   Topology dimension extraction (auto, Sec 3.1):
     Dim 0 = intra-server connections (NVSwitch)
     Dim 1 = leaf-switch connections
     Dim 2 = spine-switch connections
     Dim 3 = core-switch connections
   Each dim d defines |G_d| groups; groups in the same dim are
   isomorphic by construction.

7.2 Observation 2 — Collective symmetry (Section 3.1)

   one-to-all collectives (Broadcast, Scatter):
     -> all destinations have identical demand structure
   all-to-all collectives (AllGather, AlltoAll):
     -> N independent one-to-all collectives
     -> each chunk size identical
     -> rank-0 schedule = rank-i schedule with relabeling

7.3 The combined insight (Section 3.1)

"Optimal schedules consist of consistent sub-schedules across isomorphic GPU groups."

This is the load-balancing argument: if isomorphic groups have identical connectivity but the schedule asymmetrically loads them, the heavier group becomes the bottleneck while the lighter group sits idle — strictly worse than a balanced schedule.

+--- Fig 4 in the paper: 3 schedules for 16-GPU Broadcast ------+
|  Schedule 1: ring across all 4 groups                          |
|  Schedule 2: tree across all 4 groups                          |
|  Schedule 3: ring for groups 1,3 + tree for groups 0,2         |
|                                                                |
|  Schedules 1 and 2: consistent across isomorphic groups -- OK  |
|  Schedule 3:        inconsistent -- one group bottlenecks      |
|                                                                |
|  -> SyCCL keeps {1, 2}; filters {3}                           |
+----------------------------------------------------------------+
^ Fig 7: Insight visualization. The pruning is NOT removing
  schedules with worse predicted bandwidth — it is removing
  schedules that violate the consistency invariant. They cannot
  be optimal under symmetric topology + symmetric demand.

7.4 The three pruning rules in operation (Fig 17 ablation)

Pruning rule Section Synthesis-time saving Performance impact
#1: Redundancy + consistency §4.1, Fig 17(a) 20.8% – 48.1% reduction Minimal
#2: Consistency across iso-groups §4.1, Fig 17(a) (combined with #1 above) Minimal
#3: Relay cap X = D − 1 (AlltoAll ≤ 3 stages) §4.1, Fig 17(b)

For DynamICCL the lesson is that symmetry-respecting policies should beat symmetry-blind ones, and the search-space reduction is free if the invariant is enforced upfront. DynamICCL's analog: tie weights of the policy network across symmetric topology positions (e.g., share parameters across NVLink-equivalent ranks).


8. Quantitative Results — Empirical Findings by Regime

8.1 Headline performance vs NCCL and TECCL

"SyCCL achieves up to 91% improvement in schedule performance compared to TECCL and 108% improvement compared to NCCL." "Our production-scale H800 simulation experiments further show that SyCCL significantly improves schedule performance by up to 127% while decreasing synthesis time by 2 to 4 orders of magnitudes compared to TECCL." "SyCCL improves end-to-end model training performance by up to 6.3% on the 32-A100 testbed compared to NCCL."

8.2 AllGather + ReduceScatter on A100 testbed (Fig 14 a/b/c)

Scale Data range Best vs NCCL (busbw) Best vs TECCL (busbw) Notes
16 GPUs, AG, ≤ 1 MB 1K-1M 0.43x to 0.81x 0.37x to 0.86x Worse than NCCL (latency dominated)
16 GPUs, AG, > 1 MB 4M-4G 0.17x to 1.08x improvement over NCCL; up to 0.36x over TECCL -- Bandwidth ratio 14:1 vs NCCL's 7:1
32 GPUs, AG full range up to 1.04x over NCCL 0.20x to 0.88x over TECCL TECCL sacrifices accuracy at scale
16 GPUs, RS full range smaller gain than AG -- NCCL has tailored static schedules for RS+reduction

8.3 AlltoAll on A100 testbed (Fig 14d)

"Compared to NCCL, SyCCL increases busbw by up to 0.71x."

The smaller margin reflects that AlltoAll on a small, non-oversubscribed topology already has limited room — NCCL's PXN AlltoAll is near-optimal. TECCL's LP formulation is also near-optimal here.

8.4 AllGather on H800 simulation (Fig 15a/b)

"TECCL timed out for 512 GPUs and produced no valid output. As the number of GPUs increases, the data in AllGather becomes more fragmented and the link latency dominates. For example, with a total size of 1GB distributed across 512 GPUs, each GPU only handles 2MB. Thus the 511 hops in NCCL's Ring severely limit the performance. Instead, SyCCL synthesized a two-dimension schedule, by allowing each source GPU to first broadcast along one dimension (e.g., NVLink) then letting each GPU broadcast all their data received from the first dimension to other dimensions (e.g., rail between servers)."

Scale Data range Best vs NCCL Best vs TECCL
64 H800 GPUs, AG full range up to 1.27x up to 1.27x
512 H800 GPUs, AG full range significant TECCL TIMED OUT

8.5 AlltoAll on H800 simulation (Fig 15c)

"SyCCL improves performance by up to 0.69x compared to TECCL."

NCCL's PXN AlltoAll is near-optimal in this case. SyCCL beats TECCL because TECCL's LP precision issues at 64 GPUs cost it a few percent.

8.6 Synthesis-time table (Table 5, the headline acceleration result)

Scenario TECCL Min/Max/Mean (s) SyCCL Min/Max/Mean (s) Speedup
16 A100, AllGather 266 / 2972 / 1193 0.3 / 4.3 / 0.8 1554x
16 A100, AlltoAll 1042 / 26206 / 15759 1.4 / 8.3 / 3.6 4321x
32 A100, AllGather 226 / 31293 / 8200.2 6.2 / 19.1 / 9.0 915x
64 H800, AllGather 1225 / 67615 / 28200 0.4 / 10.8 / 1.6 17286x
64 H800, AlltoAll 3698 / 59044 / 29371 1.3 / 29.8 / 5.7 5135x
512 H800, AllGather TIMED OUT (10 h cap) 85.5 / 14146 / 2246 N/A (TECCL fails)

The 17286x speedup on 64 H800 AllGather is the paper's most striking number. It comes from three multipliers stacked: (a) the search-space reduction from sketch IR + pruning, (b) the isomorphism amortization that solves one MILP per equivalence class instead of all instances, and (c) the parallelism (192 threads) that runs independent sub-problems concurrently.

8.7 Synthesis-time breakdown (Fig 16b, 32 GPU)

     Stage                Time share at 32 GPU
   +---------------+ +-----------------------------+
   | search        | | small  (steady; data-size  |
   |               | |        independent)         |
   +---------------+ +-----------------------------+
   | combine       | | small  (steady)             |
   +---------------+ +-----------------------------+
   | solve1 (AG)   | | grows w/ data size          |
   +---------------+ +-----------------------------+
   | solve2 (A2A)  | | grows w/ data size          |
   +---------------+ +-----------------------------+
^ Fig 8: Time breakdown. Sketch-exploration phase is sub-second
  and constant in data size; MILP solving (per-sub-demand)
  dominates and grows with data size as more chunks need finer
  scheduling. The sub-second sketch phase confirms that the
  symmetry-driven pre-processing is essentially free.

8.8 Effect of parallelism (Fig 16c)

   Threads   1     2     4     8    16    32    64   128   192
   AG (s)   ~10^4 -    -    -    -    -    -    -    ~10^1
   AtoA (s) ~10^4 -    -    -    -    -    -    -    ~10^2
   1M (s)   ~10^3 -    -    -    -    -    -    -    ~10^1
^ Fig 9: Synthesis time vs number of parallel MILP threads.
  Strong scaling out to 192 threads. TECCL is monolithic and
  cannot scale across threads in the same way -- reflected as
  a flat horizontal "TECCL" line in Fig 16c.

8.9 Pruning ablation (Fig 17)

Configuration Synthesis-time savings Schedule perf
Pruning #1 + #2 disabled baseline (1x) baseline
Pruning #1 + #2 enabled 20.8% to 48.1% reduction minimal degradation
AlltoAll stage cap = 10 baseline similar
AlltoAll stage cap = 5 small reduction similar
AlltoAll stage cap = 3 95%-97% reduction similar

8.10 Effect of E_2 (epoch-balance auto-knob, Fig 17c)

E_2 Synthesis time Schedule perf
0.1 longest (high-precision tau) no improvement
0.2 balanced (chosen default) strong
1.0 fastest degrades

E_2 = 0.2 is SyCCL's default; E_1 = 3.0 is the coarse-pass default. For 512 H800 specifically, E_2 = 1.0 was needed to fit within 4 hours.

8.11 End-to-end training (Table 6, the application-layer summary)

Model NCCL TECCL SyCCL vs NCCL vs TECCL
GPT3-6.7B, DP16 672.4 ms 653.0 630.0 6.3% 3.5%
GPT3-6.7B, TP16 200.0 ms 197.7 192.5 3.8% 2.6%
GPT3-6.7B, TP32 219.4 ms 216.5 209.7 4.4% 3.1%
Llama3-8B, DP16 1195.4 ms 1153.8 1135.4 5.0% 1.6%
Llama3-8B, TP16 433.9 ms 422.2 412.6 4.9% 2.3%
Llama3-8B, TP32 854.9 ms 887.4 851.5 0.4% 4.0%

Note the gap between schedule-level (up to 127%) and end-to-end (up to 6.3%). The bulk of training time is GPU compute; communication optimizations move the needle on the ~10-30% slice that is comm-bound.

8.12 Comparison with hand-crafted schedules (Appendix C, Fig 21-22)

"Figure 21(a) shows that SyCCL's performance is similar to that of hand-crafted schedules on the 16-A100 testbed. ... SyCCL uses the same hierarchical schedule as the hand-crafted schedules for large data sizes, and the same direct schedule for smaller data sizes."

"Figure 21(b) shows that SyCCL outperforms hand-crafted schedules on the 64-H800 testbed for larger data sizes. This improvement is attribute to the bandwidth ratio between inter-machine and inter-machine links in the 64-H800 testbed not aligning well with the hand-crafted hierarchical schedule. In contrast, SyCCL utilizes sketch combination and alternative sketches to almost perfectly match the bandwidth ratio."

"After incorporating this hand-crafted schedule, we achieve comparable performance to SyCCL for large sizes. ... We expect such workflow of optimizing the best sketch combinations being the best practice of using SyCCL."

This is the most honest reported finding: SyCCL matches expert hand-crafting and sometimes wins through better bandwidth-ratio matching, but a knowledgeable user can hand-craft a sketch that catches up. The synthesizer's value is correctness automation, not absolute optimality.


9. Configuration-Regime Trade-off Tables

9.1 Synthesizer choice (SCCL vs TACCL vs TECCL vs SyCCL)

Dimension SCCL TACCL TECCL SyCCL Winner (DynamICCL upstream)
Encoding SMT MILP + sketch MILP (epoch+greedy) Sketch IR + per-sub-demand MILP SyCCL
User-supplied sketch needed No Yes No No tie (SCCL/TECCL/SyCCL)
Scale ceiling (AG) ~30 GPUs ~64 GPUs (128 fails 8h) ~64-256 (slow) 512 GPUs in 4 h SyCCL
Synthesis time @ 16-GPU AG >24 h minutes-hours 20-50 min 0.3-4.3 s SyCCL
Symmetry exploitation None (relies on solver) Manual hint None Automatic decomposition SyCCL
Optimality guarantee Optimal (single-node) Heuristic ordering Near-optimal w/ time limit Optimal per sub-demand tie SCCL/SyCCL
Production-cluster fit Single chassis only Multi-node Multi-node Multi-rail Clos SyCCL
Output format XML/MSCCL XML/MSCCL MSCCL XML MSCCL XML --
Run-time adaptation None None None None -- (all need DynamICCL above)

For DynamICCL, prefer SyCCL as the upstream catalog generator because it is the only synthesizer that produces optimal schedules at production cluster scale (512 GPUs) within minutes — a regime that no prior synthesizer reaches. DynamICCL operates on the catalog SyCCL emits.

9.2 Pruning aggressiveness vs schedule quality (Sec 7.4)

Pruning level Synth-time reduction Quality impact When to use
All disabled (exhaustive) 1x baseline Provably optimal sketch Research / small N
Pruning #1 + #2 only 20.8%-48.1% Minimal Default mid-scale
Pruning #1+#2+#3 (X≤3) 95%-97% Equivalent in practice Production / large N

For DynamICCL, prefer aggressive pruning during catalog generation and conservative pruning during periodic re-synthesis triggered by the agent's high-regret signal. The two-tier approach matches the two-tier exploration that RL agents use in practice.

9.3 Epoch-balance E_2 vs synthesis time (Sec 7.4, Fig 17c)

E_2 Synthesis time Schedule perf Best for
0.1 longest no improvement --
0.2 (default) balanced strong most cells
1.0 fastest degrades very large scale (512 GPUs)

9.4 Two-step synthesis filter ratios (Sec 5.3)

Knob Default Effect
R_1 20% Filter window: candidates within 20% of best-coarse-pass advance
R_2 8 Top-8 candidates retained for fine pass
E_1 3.0 Coarse-pass tau auto-selection target
E_2 0.5 (general) / 0.2 (production) / 1.0 (512-GPU) Fine-pass tau target

For DynamICCL, the analog of R_1/R_2 is the explore/exploit budget in beam-search-style action selection. Keep top-k actions and explore around them; the "k" is tunable per regime, exactly as R_2 is.

9.5 Workload-driven scaling regime

Regime Topology Scale Best speedup over NCCL Mechanism
Small msg / small N A100 16-GPU ≤ 1 MB NCCL wins (0.43x-0.81x) NCCL ring is latency-optimized
Medium msg / small N A100 16-GPU 4M-4G up to 1.08x (bandwidth) SyCCL: 14:1 BW ratio match vs NCCL's 7:1
Any size / mid N A100 32-GPU full up to 1.04x TECCL collapses; SyCCL scales
Any size / large N H800 64-GPU full up to 1.27x Two-dim broadcast (NVLink → rail)
Large msg / huge N H800 512-GPU full huge (TECCL TIMED OUT) NCCL Ring's 511 hops dominate latency

For DynamICCL, prefer SyCCL-synthesized schedules in {medium-msg ∪ large-N ∪ multi-rail} and prefer NCCL defaults in small-msg / small-N regimes. The agent's state vector should encode both axes to make this regime-conditional choice.


10. Bottlenecks & Insights Surfaced by the Measurements

10.1 The NCCL bandwidth-ratio mismatch (Section 2.1, the motivating bug)

   NCCL fixed ring schedule for AllGather:
     Intra-server : Inter-server  =  7 : 1
     (assumed by NCCL, hard-coded)

   Production H800 cluster actual:
     NVLink BW         = 180 GB/s
     Inter-server NIC  = 8 x 400 Gbps = 400 GB/s aggregate
     Actual ratio      = 3.6 : 1

   Bandwidth wastage:
     (7 - 3.6) / (3.6 + 1) = 48.5% / 4.6 = 10.6%

   Latency wastage at small sizes:
     NCCL Ring requires |V|-1 hops
     For 512 GPUs: 511 hops, each ~1us => 511us alone in latency
     -> 4x worse than synthesized at small sizes

This is the headline empirical finding: NCCL's hard-coded ring ratio is wrong on every modern cluster, and SyCCL's bandwidth-ratio-aware sketch combination recovers the 10.6%.

10.2 The 511-hop Ring latency wall (Sec 7.2 H800 results)

For 512-GPU AllGather of 1 GB, each GPU only handles 2 MB. NCCL's Ring traversal of 511 hops at ~1 us per hop is 0.5 ms of pure latency overhead — comparable to the actual transfer time. SyCCL's two-dimensional broadcast avoids this by parallelizing across NVLink and rail dimensions: depth becomes O(|D|) ≈ 2 hops instead of 511.

10.3 The MILP scaling wall and the TECCL timeout (Sec 7.3, 7.5)

"TECCL timed out for 512 GPUs and produced no valid output."

TECCL's monolithic MILP at 512 GPUs is computationally hopeless: variables grow as O(|V|² × K × |C|), and at K = 50 epochs and |C| = many chunks, the variable count blows past 10^9. SyCCL's per-sub-demand factoring keeps each MILP under 10^4-10^5 variables.

10.4 The MILP-precision wall (Sec 7.2 AlltoAll H800)

"Because of uncertainties of the solver and precision issues in the LP model, SyCCL perform slightly worse than NCCL [for AlltoAll on 64 H800]."

Floating-point precision in Gurobi's LP relaxation introduces small suboptimalities on the unique-source-destination AlltoAll demand that SyCCL cannot recover. NCCL's hand-tuned PXN routine is exactly optimal in this case.

10.5 The runtime-bottleneck residue (Sec 7.2 ReduceScatter)

"MSCCL's runtime, designed to support flexible schedules, does not incorporate such tailored optimizations for a given schedule. Therefore, although SyCCL's schedules benefit from higher bandwidth utilization and lower latency, these advantages can be diminished by a sub-optimal runtime implementation, especially for smaller sizes."

SyCCL's schedule may be bandwidth-optimal but the MSCCL-executor is generic. For ReduceScatter the gain is smaller than for AllGather because NCCL's hand-coded reduction kernel is faster than MSCCL's generic interpreter.

10.6 The intra-group residual MILP cost (Sec 7.3 H800 512-GPU)

"While SyCCL significantly reduces schedule generation time by leveraging symmetry, extremely large-scale (e.g., 512 H800 GPUs) can still incur notable synthesis time, i.e., 37min. Profilings show that this residual cost stems almost entirely from invoking TECCL to solve schedules within each symmetric GPU group; the sketch-exploration phase itself scales sub-second."

The residual is inside the per-sub-demand MILP, not in SyCCL's sketch logic. Replacing TECCL with a faster intra-group solver would improve SyCCL further.


11. Limitations of the Methodology

Limitation Implication for DynamICCL
Symmetry-based decomposition fails on irregular networks DynamICCL must fall back to a direct policy in heterogeneous regimes; SyCCL is unusable there
Asymmetric collectives (AlltoAll(v), AllGather(v)) out of scope RL policy must handle MoE workloads natively (no synthesizer fallback)
Heterogeneous clusters (A100+H100 mixed) only partially supported Need richer topology fingerprint feature in state vector
Dynamic networks (link failures, congestion) require regeneration DynamICCL's run-time selector becomes essential
MILP precision limits on AlltoAll Synthesizer can be sub-NCCL on regime; DynamICCL must detect and fall back
10-hour solver timeout for TECCL baseline Comparison vs TECCL has timeout artifacts at 512 GPUs
Sketch enumeration may miss unseen optimal sketches "Pruning #1 may miss schedules that marginally outperform" (Sec 8)
MSCCL-executor runtime is itself sub-optimal for ReduceScatter Schedule-level gain doesn't always translate to busbw
End-to-end gain capped at 6.3% (Table 6) Real production benefit is small slice of total time
Two production topologies (A100 Clos, H800 multi-rail) only Limited generalization evidence; no DGX-1, DGX-2, NDv2, AMD
Adam optimizer only; no MoE Specialized comm patterns of GShard/Expert-Parallel untested
No streaming overlap with computation considered DynamICCL must add overlap-aware reward shaping
ASTRA-sim simulator for H800 results, not real hardware Some gain may be sim-artifact
Schedules are static once emitted Cannot adapt to message size sweep within a job
Authors' own Section 8: "introduction of sketch significantly enhances the speed and accuracy of SyCCL compared to previous efforts, SyCCL does not ensure that the synthesized schedules are optimal" Optimality guarantee is per-sub-demand, not global

The most consequential limitation for DynamICCL is the static-schedule output. SyCCL emits one MSCCL XML per (topology, collective, size) cell. A real workload with mixed message sizes from the same job must either use one schedule for all (sub-optimal) or switch schedules per-call. The latter is exactly DynamICCL's job.


12. What to Borrow for DynamICCL — Composability vs Tension

SyCCL and DynamICCL are complementary, not competitive. SyCCL generates a static catalog of high-quality schedules per (topology, collective, message-size) cell. DynamICCL is the runtime layer that picks among them — and below the schedule selection, picks the in-NCCL knobs (algo, proto, nChannels, numThreads, chunkSize) that no synthesizer addresses. There are two composition modes and three tensions worth thinking through.

12.1 Composition Mode A — SyCCL Catalog as DynamICCL's Action Set

+-------------------- Composition Mode A ------------------------+
|                                                                |
|  Offline (once per cluster):                                   |
|    SyCCL  -->  catalog of schedules                            |
|                  XML_1 = AllGather, A100-32, 1MB               |
|                  XML_2 = AllGather, A100-32, 64MB              |
|                  XML_3 = AlltoAll, H800-64, 4MB                |
|                  ... (one per (collective, scale, msg-bin))    |
|                                                                |
|  Online (per call):                                            |
|    DynamICCL -> read state vector s_t                          |
|              -> select XML schedule from catalog AND           |
|                 (algo*, proto*, nChannels*, numThreads*) tuple |
|              -> dispatch to MSCCL-executor                     |
|              -> observe wall-clock latency, update Q-fn        |
+----------------------------------------------------------------+
^ Fig 10: Mode A — SyCCL is the catalog generator, DynamICCL is
  the catalog selector. Action space includes one categorical
  "which XML" plus the in-NCCL knobs.

This is the cleanest composition. DynamICCL never re-synthesizes; it only picks among pre-computed schedules. The catalog is regenerated periodically (e.g., monthly, or on topology change) using SyCCL.

12.2 Composition Mode B — DynamICCL Triggers SyCCL Re-synthesis

+-------------------- Composition Mode B ------------------------+
|                                                                |
|  DynamICCL maintains regret estimate per (topology, coll, msg) |
|  When regret exceeds threshold:                                |
|    Trigger SyCCL re-synthesis                                  |
|    Emit new XML for the regretful cell                         |
|    Add to catalog                                              |
|  Steady-state: catalog grows as workload distribution shifts   |
+----------------------------------------------------------------+
^ Fig 11: Mode B — RL agent decides when to expand the catalog.
  More flexible but harder to control: SyCCL takes minutes to
  hours per cell, so the trigger must amortize that cost.

Mode B gives DynamICCL agency over the catalog itself. It is effectively the same pattern as ALaSCA's online expansion of NCCL's auto-tuner cache, but driven by symmetry-aware synthesis instead of profiling.

12.3 State-vector features SyCCL validates as predictive

   Add to DynamICCL Agent state vector s_t:
   +-----------------------------------------------------------+
   |  topology_dim_count    : int    (|D| from SyCCL extractor)|
   |  topology_group_sizes  : tuple  (|G_d| per dimension)     |
   |  intra_inter_bw_ratio  : float  (e.g., 3.6, 7.0, 14.0)    |
   |  msg_size_bin          : enum   (12 powers, 1K..4G)        |
   |  collective_class      : enum   ({1to1, 1toA, AtoA})       |
   |  expected_chunk_per_gpu: float  (= total / |V|)            |
   |  expected_relay_hops   : int    (NCCL Ring: |V|-1)         |
   |  schedule_id           : int    (which catalog entry)      |
   |  schedule_predicted_time: float (from SyCCL simulator)     |
   |  isomorphism_class_id  : int    (if multiple iso schedules)|
   +-----------------------------------------------------------+
^ Fig 12: Borrowed state features. The intra/inter bandwidth ratio
  is the single most predictive scalar - it is exactly what NCCL
  hard-codes (incorrectly) and what SyCCL solves correctly.

The intra_inter_bw_ratio feature is the most actionable add. NCCL hard-codes 7:1; the actual ratio on H800 is 3.6:1, on A100 Clos may be 14:1. DynamICCL must observe this from telemetry and let the policy condition on it.

12.4 Action priors SyCCL provides

The two crossover boundaries from Section 8:

Boundary 1 — Small msg / small N: NCCL > SyCCL. SyCCL's busbw is 0.43x-0.81x of NCCL for 16-GPU AG ≤ 1 MB. DynamICCL prior: in this regime, prefer NCCL ring with LL/LL128 protocol. The synthesizer offers no value; the policy should not explore there.

Boundary 2 — Large msg / large N + multi-rail: SyCCL >> NCCL. SyCCL achieves 1.27x at 64 H800 AG and infinite-vs-timeout at 512. DynamICCL prior: in this regime, prefer SyCCL XML with Simple protocol and high nChannels.

Boundary 3 — AlltoAll on small N: NCCL ≈ SyCCL. The PXN routine is near-optimal. DynamICCL prior: small exploration budget; both options are within noise.

   PRIOR: DynamICCL Agent should be CONSERVATIVE in:
   +--------------------------------------------------------------+
   |  Regime                                Reason                |
   |--------------------------------------------------------------|
   |  16-GPU AG / msg < 1 MB                NCCL Ring + LL protocol|
   |   (small N, small msg)                  is latency-optimal    |
   |                                                              |
   |  16-GPU AlltoAll / any size            NCCL PXN is near-     |
   |                                         optimal, MILP solver  |
   |                                         precision can hurt    |
   |                                                              |
   |  Symmetric topology + large msg        SyCCL's deterministic |
   |   already at SyCCL's predicted opt      catalog wins; no need |
   |                                         to explore around it  |
   +--------------------------------------------------------------+

   PRIOR: DynamICCL Agent should be AGGRESSIVE in:
   +--------------------------------------------------------------+
   |  Regime                                Reason                |
   |--------------------------------------------------------------|
   |  Multi-rail H800 / N >= 64 / msg >= 1M Up to 1.27x gap from  |
   |                                         catalog selection     |
   |                                                              |
   |  Mixed-size workload from one job      Catalog has multiple  |
   |   crossing 1 MB threshold              entries; pick best    |
   |                                         per-call             |
   |                                                              |
   |  Topology with non-7:1 BW ratio        Default NCCL will be  |
   |   (any non-DGX cluster)                 mismatched; SyCCL     |
   |                                         catalog wins big      |
   |                                                              |
   |  Mixture-of-Experts AlltoAll(v)        SyCCL out of scope --  |
   |                                         agent is the only    |
   |                                         layer with adaptive   |
   |                                         response             |
   +--------------------------------------------------------------+
^ Fig 13: Where to allocate exploration budget. The regret-rich
  regimes (bottom box) are exactly where SyCCL's catalog has
  multiple high-quality candidates that vary by msg size, so the
  agent's runtime selection has the most leverage.

12.5 Reward shaping

SyCCL minimizes K (number of epochs of duration τ); the simulator reports wall-clock time. DynamICCL's reward is −collective_wall_clock_us. Two shaping ideas:

  1. Use the SyCCL simulator's predicted time as a baseline. Reward = −(actual − predicted) / predicted. This isolates the runtime selection from the schedule-level inefficiency that SyCCL has already optimized away. The agent's residual gain measures DynamICCL's contribution.

  2. Use the bandwidth-ratio feature as a piecewise reward. When ratio matches the schedule's design ratio, the reward is straight wall-clock. When mismatched, add a small bonus for the agent picking a different schedule from the catalog. This incentivizes the agent to re-route around stale catalog entries.

12.6 Exploration budget allocation (the lesson from pruning ablation)

SyCCL's three pruning rules cut search space by 95%-97% with "minimal performance impact." DynamICCL's analog is symmetry-aware action sharing: tie the policy network's weights across symmetric topology positions (e.g., share parameters across all 8 NVLink-equivalent ranks within a server). This cuts the effective action space by 8x with no degradation, the same multiplier SyCCL achieves on its sketch space.

12.7 Three tensions between SyCCL and DynamICCL

Tension 1 — Static catalog vs adaptive selection. SyCCL produces one schedule per cell; DynamICCL adapts at every call. If the catalog is too narrow, the agent has nothing to choose from. Resolution: generate the catalog with multiple promising sketch combinations per cell (SyCCL already keeps R_2 = 8 by default). DynamICCL picks among these 8.

Tension 2 — Schedule-level optimality vs runtime knob mismatch. SyCCL computes the optimal schedule assuming default channel count. If DynamICCL changes nChannels, the schedule may no longer be optimal for the new channel count. Resolution: synthesize multiple schedule variants per cell, one per channel count; DynamICCL picks both schedule and channels jointly.

Tension 3 — Synthesis cost vs deployment latency. SyCCL takes 0.3s-37 min per cell. DynamICCL adapts at sub-second cadence. If a new (topology, msg, scale) cell appears that the catalog doesn't cover, the agent cannot wait 37 min. Resolution: fallback hierarchy: (1) try catalog; (2) try a similar cell's schedule with adjusted parameters; (3) trigger SyCCL re-synthesis in the background while serving NCCL defaults online; (4) update catalog when re-synthesis completes.

12.8 Architectural placement

+----------------------------------------------+
|  Application: PyTorch / Megatron / DeepSpeed  |  L4
+----------------------------------------------+
|  Comm framework: BytePS / Horovod / DDP      |  L3
+----------------------------------------------+
|  DynamICCL Tuner Plugin                      |  L2.5  <-- our layer
|    - State: msg_bin, I, M, topology fp,      |
|             intra_inter_ratio, recent timing |
|    - Action: schedule_id, algo, proto,       |
|              nChannels, numThreads, chunkSize|
|    - Reward: -wall_clock_us                  |
|    - Catalog source: SyCCL (offline)         |
+----------------------------------------------+
|  SyCCL XML schedule  (one per cell)          |  L2.4  <-- catalog
+----------------------------------------------+
|  MSCCL-executor / NCCL Core                  |  L2
+----------------------------------------------+
|  Transport: RDMA / NVLink / PCIe / SHM       |  L1
+----------------------------------------------+
^ Fig 14: Layered placement. SyCCL sits offline at L2.4 producing
  the catalog. DynamICCL sits at L2.5 selecting from that catalog
  AND tuning the in-NCCL knobs. The two layers are orthogonal and
  composable.

12.9 Methodological patterns to reuse

Pattern from Cao et al. (2025) DynamICCL adoption
Symmetry-driven decomposition Tie policy weights across symmetric topology ranks
Two-phase synthesis (sketch + solve) Two-stage RL: discrete schedule_id selection, then continuous knob tuning
Coarse-to-fine refinement (R_1=20%, R_2=8) Beam search with top-8 retained candidates
Per-sub-demand parallelism Distribute training rollouts across topology-canonical actor groups
Auto-extracted dimensions/groups (Sec 3.1) Auto-discover topology fingerprint from NCCL graph at init
Bandwidth-ratio feature (3.6:1 vs 7:1) First-class state feature in the policy input
ASTRA-sim simulator at decision time Use ASTRA-sim as the offline rollout environment for DynamICCL
MSCCL-executor as runtime DynamICCL emits to MSCCL-executor, same fork as NCCL
Open source (github.com/aliyun/symccl) DynamICCL must be open source for reproducibility
6.3% end-to-end gain on real training Match or exceed; use Table 6 as the reference

12.10 Borrowed insight as the "no single winner" principle

The paper's evaluation explicitly shows three crossovers: (a) NCCL > SyCCL for small-msg/small-N, (b) SyCCL > NCCL for medium-large msg, (c) NCCL ≈ SyCCL for AlltoAll on small N. No single static synthesizer wins everywhere. This is the no-free-lunch property that justifies an RL selector. DynamICCL's existence is not "fixing" SyCCL; it is the complement of any synthesizer — the layer that exploits the regime structure SyCCL itself surfaces.


13. Analogy

SyCCL is a CAD tool that exploits a building's floorplan symmetry. The architect (synthesizer) looks at a 64-floor skyscraper and notices that floors 3-62 are nearly identical (symmetric groups). Instead of designing all 64 floors from scratch (TECCL's holistic encoding, which would take a month), the CAD tool designs one representative floor, two boundary floors, and the lobby, then replicates the representative across the symmetric groups with small adjustments for utilities and access (sketch combination). The MILP solver is the structural engineer who certifies one floor at a time; the CAD tool runs 192 such certifications in parallel.

DynamICCL is the building's HVAC and elevator-dispatch control system. The CAD blueprint is fixed once construction is done, but how the elevators route at 9 AM on a rainy Monday vs 3 PM on a sunny Friday is different — the runtime selector adapts to the state of the building (msg_size, model intensity I, batch size M, topology fingerprint, recent latency window) by choosing among pre-computed schedule plans (SyCCL's XML catalog) and tuning the elevator parameters (algo, proto, nChannels) in real time. The synthesizer cannot replace the dispatch system because it has no state observability; the dispatch system cannot replace the synthesizer because it has no time to redesign floors at call time. The two layers are physically separate and intellectually orthogonal.

The paper's key intellectual contribution — that symmetry is a free resource the previous generation left on the table — translates directly to RL: parameter-tied policies across symmetric ranks should be the default, not the optimization. The 17286x speedup SyCCL achieves in synthesis time is the same multiplier DynamICCL stands to gain in sample efficiency if it shares actor-network weights across NVLink-equivalent rank positions. Symmetry is the cheapest constraint a learning system can be told.


Summary of Borrowed Patterns

Pattern from Cao et al. (2025) DynamICCL application
Sketch IR (per-stage, per-group sub-demand) Action-space factoring: separate "schedule_id" from "in-NCCL knob tuple"
Pruning #1 (isomorphism removal) Tie policy network weights across symmetric topology positions
Pruning #2 (consistency invariant) Constrain action space to symmetry-respecting tuples
Pruning #3 (relay-cap X = D
Two-phase synthesis (4 + 5) Two-stage RL: schedule selection then knob tuning
Coarse-to-fine refinement (R_1, R_2) Beam search with top-R_2 retained between stages
192-thread MILP parallelism Distributed actor pool indexed by topology canonical group
Auto-dimension extraction (Sec 3.1) Topology fingerprint feature derived from NCCL graph init
Bandwidth ratio observation (3.6:1 vs hard-coded 7:1) First-class state feature for policy conditioning
Sketch combination (chunk-allocation t_i) Convex action mixture: probability-weighted sub-action selection
ASTRA-sim simulator for evaluation Offline rollout environment for DynamICCL training
MSCCL-executor as runtime back-end DynamICCL emits via the same XML format SyCCL uses
Up to 127% gain over NCCL on H800 Reference upper bound for any DynamICCL evaluation
Up to 6.3% end-to-end gain on training End-to-end metric SyCCL surpasses; DynamICCL must match
Synthesis time 0.3s-37min per cell Catalog regeneration cadence: minutes-hours, not call-time
Symmetry-blind solvers waste 95-97% of search Symmetry-blind policies waste similar effort; tie weights
The "no single winner" insight (NCCL vs SyCCL crossovers) RL justification: regime-conditional best is the only winning strategy
17286x synthesis speedup Reference data point for state-abstraction reductions in policy learning
Hand-crafted schedules tie SyCCL on 16-A100 Expert knowledge is a valid baseline; DynamICCL must beat both
511-hop NCCL Ring latency wall at 512 GPUs Latency-dominated regime where catalog selection alone gives huge gain
End-to-end gain capped at 6.3% on training Communication slice is small; over-emphasize comm regret only when warranted
Limited applicability to asymmetric MoE workloads DynamICCL is the only layer with response in MoE regimes (no synthesizer fallback)
Open-source github.com/aliyun/symccl DynamICCL must be open-source for reproducibility