Architecture & Solver-Design Analysis

TE-CCL — Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

Source: Liu, X.; Arzani, B.; Kakarla, S. K. R.; Zhao, L.; Liu, V.; Castro, M.; Kandula, S.; Marshall, L. Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem. In ACM SIGCOMM '24, August 4–8, 2024, Sydney, NSW, Australia. 22 pages, 15 figures, 8 tables. DOI: 10.1145/3651830.3672249 Code: https://github.com/microsoft/TE-CCL Authors: University of Pennsylvania + Microsoft Research + University of Washington + OpenAI + Microsoft. Reader: Direct PDF read (gemini-reader free-tier RESOURCE_EXHAUSTED on gemini-2.0-flash; codex-reader rejected gpt-5.1-codex-mini for ChatGPT free-tier accounts; pages 1-22 read directly via the Read tool with pages parameter, including the full Appendix A-H). Analyst: Vishwakarma Date: 2026-05-04


Table of Contents

  1. Lineage Note — Where TE-CCL Sits in the Synthesis Family (SCCL → TACCL → MSCCLang/GC3 → TE-CCL)
  2. System / Solver Architecture (the "instrument" — three nested formulations)
  3. Target-Hardware / Topology Specimens (DGX1, DGX2, NDv2, AMD, Internal-1, Internal-2)
  4. Design-Space Diagram (axes swept by the evaluation, axes held fixed)
  5. The TE-CCL MILP — Variables, Constraints, Objective in Detail
  6. Algorithm / Control Flow — MILP Path, LP Path, A* Path, Schedule-Synthesis Runtime
  7. The Three CCL-vs-TE Differences (discretized sends, in-network copies, latency+queuing)
  8. Quantitative Results — Empirical Findings by Regime
  9. Configuration-Regime Trade-off Tables
  10. Bottlenecks & Insights Surfaced by the Measurements
  11. Limitations of the Methodology
  12. What to Borrow for DynamICCL — Composability vs Tension with Run-Time Knob Selection
  13. Analogy

1. Lineage Note — Where TE-CCL Sits in the Synthesis Family

The four immediately preceding papers in this corpus form a clean genealogy of automatic collective synthesis:

+------------------------------------------------------------------+
|  SCCL (PPoPP 2021, paper 0031)                                   |
|    SMT solver synthesizes optimal schedules                      |
|    Pareto-Synthesize sweeps (S, R, C); QF_LIA encoding           |
|    Single-node only, intractable beyond 1 chassis (~30 GPUs)     |
|                              |                                   |
|                              v                                   |
|  TACCL (NSDI 2023, paper 0032)                                   |
|    MILP guided by user-supplied "communication sketch"           |
|    Three-stage: routing -> heuristic ordering -> contiguity      |
|    Logical topology + switch-hyperedge + symmetry hints          |
|    Scales to 128 GPUs in 11h, but heuristic ordering hurts qty   |
|                              |                                   |
|                              v                                   |
|  MSCCLang / GC3 (ASPLOS 2023 / arXiv 2022, papers 0033, 0034)    |
|    DSL programmability — user *writes* algorithm, compiler lowers|
|    Three IRs (Chunk DAG -> Instruction DAG -> GC3-IR), runtime   |
|    interpreter; synthesis is OUT OF SCOPE                        |
|                              |                                   |
|                              v                                   |
|  TE-CCL  (SIGCOMM 2024, this paper, 0035)                        |
|    "Forget per-link sketches; forget human DSLs."                |
|    Treat CCL as a multi-commodity-flow problem from network      |
|    traffic engineering (TE). General MILP + scalable LP form +   |
|    A*-style temporal partitioning. No sketch. No DSL.            |
|    Scales to 256-GPU AllToAll on Internal-2 in 2.8 h.            |
+------------------------------------------------------------------+
^ Fig 0: Lineage of automatic CCL synthesis. TE-CCL inherits the
  problem statement (topology + demand -> schedule) from SCCL/TACCL
  but borrows its solving machinery from production WAN traffic
  engineering, where multi-commodity flow has scaled to "thousands
  of nodes" (paper's Sec 1, citing Krishnaswamy 2022 [20]).

The intellectual move is to relocate the problem. SCCL/TACCL framed collective synthesis as combinatorial scheduling and ran into the SMT scalability wall. TE-CCL relabels it as multi-commodity flow with three CCL-specific modifications (discretized sends, in-network copies, latency+queuing) and inherits four decades of TE solver engineering. The paper's own framing in Section 2.2 makes this explicit:

"Our solution requires no human in the loop — the user only needs to specify the topology and the demand matrix — but the solver is slightly slower."

That trade — slower-but-no-human vs faster-but-needs-sketch — is the central design choice TE-CCL makes against TACCL. For DynamICCL the relevant lineage observation is that TE-CCL still produces a static XML/MSCCL schedule per (topology, collective, message-size) cell. The runtime has no adaptation: TE-CCL targets the offline catalog. DynamICCL's RL knob selector is the layer above any such catalog, choosing among pre-synthesized schedules at call time.


2. System / Solver Architecture (the "instrument" — three nested formulations)

TE-CCL is not a monolithic solver. It is a three-formulation toolkit: a general MILP that handles every CCL feature exactly, an LP that sacrifices in-network copy support to scale further, and an A*-style temporal partition that scales further still by approximating optimality. The user picks one formulation based on the (topology, collective, optimality-tolerance) cell.

+--------------------------------------------------------------------+
|                          TE-CCL Toolkit                            |
|                                                                    |
|  +--------------------+      +-----------------------------------+ |
|  | Workload Spec      |      | Topology Spec                     | |
|  | - Demand matrix D  |      | - N (nodes), S subset of N        | |
|  |   D_{s,c,d}={0,1}  |      |   (S = switches, no buffer)       | |
|  | - Collective name  |      | - E (directed edges)              | |
|  |   {AG, AtoA, AR}   |      | - Per-edge T_{ij} (chunks/sec)    | |
|  | - C+1 chunks/source|      | - Per-edge alpha_{ij} (us)        | |
|  +---------+----------+      +-----------------+-----------------+ |
|            |                                   |                   |
|            +-----------------+-----------------+                   |
|                              |                                     |
|                              v                                     |
|  +---------------------------------------------------------------+ |
|  | Scaling-Knob Layer (Sec 5)                                    | |
|  |  - Epoch duration tau (Sec 5: fastest-link or slowest-link)   | |
|  |  - K (number of epochs); auto-bounded by Algorithm 1          | |
|  |  - K' (epochs per round) for A* technique                     | |
|  |  - Switch model {SHARP-style copy / TACCL hyper-edge}         | |
|  |  - Optimality gap tolerance (e.g., 30% early-stop)            | |
|  +-------------------------+-------------------------------------+ |
|                            |                                       |
|                            v                                       |
|  +---------------------------------------------------------------+ |
|  | THREE FORMULATION OPTIONS (user picks one)                    | |
|  |                                                               | |
|  |   +-----------------------+   +-----------------------+       | |
|  |   | (a) General MILP      |   | (b) LP form (Sec 4.1) |       | |
|  |   | F,B integers           |  | F,B real-valued       |       | |
|  |   | Supports COPY (multi-cast)| NO copy (per-link rate|       | |
|  |   | Optimal with epoch-gap | allocation -> DFS path) |       | |
|  |   | Slowest single problem | Fast, polynomial         |       | |
|  |   | (only single time-      | Reverts to traditional   |       | |
|  |   |  partition)             | flow conservation        |       | |
|  |   +-----------+-----------+   +-----------+-----------+       | |
|  |               |                           |                   | |
|  |               +-------------+-------------+                   | |
|  |                             |                                 | |
|  |                             v                                 | |
|  |              +----------------------------+                   | |
|  |              | (c) A* technique (Sec 4.2) |                   | |
|  |              | Multi-round MILP partitions| <-- best for      | |
|  |              | each round greedy progress |     scaling AG    | |
|  |              | toward destinations using  |     beyond 64-128 | |
|  |              | Floyd-Warshall alpha-cost  |     GPUs          | |
|  |              | weights as logical edges   |                   | |
|  |              | NOT optimal but provably    |                   | |
|  |              | improves quality with each   |                   | |
|  |              | round                       |                   | |
|  |              +-------------+--------------+                   | |
|  +-----------------------------+---------------------------------+ |
|                                |                                   |
|                                v                                   |
|  +---------------------------------------------------------------+ |
|  | Solver Backend                                                | |
|  |   Gurobi 9.5.2 (Pedroso 2011 [30])                            | |
|  |   Optimality-gap early-stop (e.g., 30% for AG eval)           | |
|  |   Feasibility-first warm start (binary search via Alg 1)      | |
|  |   Linux Ubuntu 20.04 VM, 2x Xeon Platinum 8380 (80c/160t),    | |
|  |   512 GB RAM                                                   | |
|  +-----------------------------+---------------------------------+ |
|                                |                                   |
|                                v                                   |
|  +---------------------------------------------------------------+ |
|  | Schedule Materializer                                         | |
|  |   - Reverse-DFS over F_{s,d,i,j,k,c} solution                  | |
|  |   - Zero-out flows that don't satisfy any demand (Sec 3.1)    | |
|  |   - Convert per-link rate (LP form) into per-chunk paths      | |
|  |   - Emit MSCCL-XML schedule (consumed by MSCCL runtime [5])   | |
|  +-----------------------------+---------------------------------+ |
|                                |                                   |
|                                v                                   |
|  +---------------------------------------------------------------+ |
|  | Hardware Path (deployment)                                    | |
|  |   MSCCL XML  -> MSCCL runtime (Microsoft fork of NCCL)        | |
|  |   Two AMD ROCm6 nodes, 32 GPUs total (Sec 6, AMD evaluation)  | |
|  |   Hand-optimized to match RCCL/TACCL channel count for fairness| |
|  +---------------------------------------------------------------+ |
+--------------------------------------------------------------------+
^ Fig 1: TE-CCL toolkit. Three formulations share notation (Table 1),
  capacity constraints, and destination constraints; they differ in
  whether F/B are integers, whether the problem is solved as one
  global MILP or as a sequence of greedy rounds, and whether copy
  (in-network multicast) is supported. The hardware path reuses the
  MSCCL runtime — TE-CCL contributes the schedule, not the runtime.

The deliberate factoring is what differentiates TE-CCL from prior CCL solvers. SCCL had one SMT formulation; TACCL had one MILP plus a heuristic ordering hack; TE-CCL exposes the optimality-vs-scale knob to the user as a choice between three formulations rather than as an internal heuristic. The MILP is for cases where copy matters; the LP is for AllToAll where each demand has a unique destination segment; the A* is for AllGather at scale where copy matters but the MILP is too slow. The user is expected to know their workload well enough to pick.

For DynamICCL this is the most important architectural lesson: the right answer is rarely "one formulation." Agent-2 should be able to select among multiple pre-synthesized schedules — each cell of the TE-CCL output catalog corresponds to one (algorithm, msg_size) action the agent can choose at runtime. The catalog itself is offline; the choice is online.


3. Target-Hardware / Topology Specimens

TE-CCL evaluates on six distinct topologies (Table 2 in the paper), spanning standard public hardware and proprietary cloud topologies. The structural diversity is the point: the same solver must handle topologies that range from 4 GPUs/2 edges (Internal 2 chassis) to 17 GPUs/32 edges (DGX2) to 16 GPUs/56 edges (AMD).

+--------------------- TE-CCL Topology Specimens --------------------+
|                                                                   |
|  Topology  | GPUs/chassis | Edges/chassis | alpha (us) | use      |
|  ----------+--------------+---------------+------------+--------- |
|  Internal-1| 4            | 8             | 0.6 / 0.75 | proprietary
|  Internal-2| 2            | 2             | 0.6 / 0.75 | proprietary
|  DGX-1     | 8            | 32            | (NCCL std) | classic NV |
|  NDv2      | 8            | 32            | 0.7        | Azure       |
|  DGX-2     | 17 (16+swt)  | 32            | 0.35 / 2.6 | NVSwitch    |
|  AMD       | 16           | 56            | 0.6 / 0.75 | ROCm6       |
|                                                                   |
|  Notes from the paper:                                            |
|   - alpha is GPU<->GPU and GPU<->switch link latency in us         |
|   - DGX-1 is similar to a single chassis NDv2 (8 GPUs, no switch)  |
|   - DGX-2 has one NVSwitch per chassis (the "+1" GPU count is the  |
|     switch modeled as a node with zero buffer)                      |
|   - AMD: 200 / 100 / 50 GBps GPU-GPU + 25 GBps switch links         |
|     (Fig 15)                                                       |
|   - NDv2 (multi-chassis): 50 / 25 / 12.5 GBps with alpha=0.7us      |
|     intra-chassis and 1.3us inter-chassis-switch (Fig 14)           |
|   - DGX2 (2-chassis): 125 GBps direct + 12.5 GBps cross-chassis     |
|     dashed link with alpha=2.6us (Fig 13)                           |
|                                                                   |
+-------------------------------------------------------------------+
^ Fig 2: TE-CCL topology atlas. The structural span is intentional:
  Internal 2 has 2 GPUs/2 edges to test the corner case; DGX-2 with
  17 nodes/32 edges and a switch tests the SHARP-style copy support;
  AMD with 16 nodes/56 edges tests heterogeneous link bandwidth
  (200/100/50/25 GBps mix); NDv2 4-chassis (32 GPUs) is the headline
  TACCL comparison topology.

3.1 The DGX-2 two-chassis specimen (Fig 13 in paper)

+--------------+                              +--------------+
|              |   125 GBps, alpha=0.35us    |              |
|  Chassis A   |================================|  Chassis B   |
|  16 GPUs     |   (8 send-out, 8 recv-in)   |  16 GPUs     |
|              |                              |              |
|   +-------+  |   12.5 GBps, alpha=2.6us    |  +-------+   |
|   | NVSwt |--+----------- dashed ---+-------+--| NVSwt |   |
|   +-------+  |  (sparse cross-chassis)      |  +-------+   |
|              |                              |              |
+--------------+                              +--------------+
^ Fig 2a: DGX-2 2-chassis. The asymmetry — 8 GPUs send across, 8
  GPUs receive — is what makes this a hard scheduling problem.
  TACCL can synthesize for this; SCCL cannot (>1 chassis is out
  of scope). TE-CCL's headline "2x improvement on TACCL"
  measurement is on this exact topology (Fig 14 / Sec 1).

3.2 The NDv2 four-chassis specimen (Fig 14 in paper)

  +-------- Chassis 1 --------+     +-------- Chassis 2 --------+
  | 8 GPUs, intra-mesh        |     | 8 GPUs, intra-mesh        |
  | 50 GBps (solid),          |     |                            |
  | 25 GBps (dashed)          |     |                            |
  | alpha=0.7us               |     |                            |
  +-------------+-------------+     +-------------+-------------+
                |                                 |
                |  NVSwitch (12.5 GBps dotted, alpha=1.3us)
                |  GPU 0 of each chassis -> GPU 1 of all others
                |  (TACCL: hyperedge constraint, only one of 3 active)
                v                                 v
       +----------------------------------------------+
       |              Central NVSwitch                |
       +-----------------+----------------------------+
                         |
  +-------- Chassis 3 --------+     +-------- Chassis 4 --------+
  | 8 GPUs                    |     | 8 GPUs                    |
  +---------------------------+     +---------------------------+
^ Fig 2b: NDv2 4-chassis = 32 GPUs. TE-CCL's "scales significantly
  further than TACCL" claim (Sec 1) is partly grounded on this:
  TACCL ran out of memory on this topology for many sizes; TE-CCL
  produced schedules.

3.3 The AMD two-chassis specimen (Fig 15 in paper)

  +-- Chassis A (16 GPUs, ROCm6) --+    +-- Chassis B (16 GPUs) --+
  |  Pairs connected via small     |    |  Pairs connected via    |
  |  switches (200/100/50 GBps     |    |  small switches          |
  |  mix, alpha=0.6us)             |    |                          |
  |                                |    |                          |
  |        +----+    +----+         |    |        +----+   +----+   |
  |        |GPU |    |GPU |  ...    |    |        |GPU |   |GPU |   |
  |        +----+    +----+         |    |        +----+   +----+   |
  |          ^         ^             |    |                          |
  |          | small switch          |    |                          |
  +----------+----------------------+    +-------------------------+
                       |                          |
                       v                          v
                 +-----------------------------------+
                 |       Big inter-chassis switch     |
                 |       all switch links 25 GBps     |
                 +-----------------------------------+
^ Fig 2c: AMD 2-chassis = 32 GPUs. This is the only topology TE-CCL
  was deployed *on real hardware* (ROCm6, ROCm5.7 baseline). All
  other reported numbers are simulator/solver-derived bandwidths.
  AllGather measurement: 3x faster than RCCL at 1 MB; 1.5-2x faster
  for >=4 MB transfers (Fig 8).

The hardware-vs-solver split matters for DynamICCL. The bulk of the paper's "TE-CCL outperformed TACCL by 2.14x" headline is derived from the schedules' transfer-time computed under the alpha-beta cost model, not from running on hardware. The AMD ROCm6 numbers are the sole real-hardware confirmation. Agent-2 must therefore treat TE-CCL-style cost-model output as a prior on what is achievable, not a guarantee — DynamICCL's reward signal (collective wall-clock microseconds) is the reality check.


4. Design-Space Diagram (axes swept by the evaluation, axes held fixed)

Section 6 of the paper sweeps a five-axis space. Each evaluation figure (Fig 5-12) fixes a subset and varies the rest.

                   DESIGN SPACE (5 axes + held fixed)
  +---------------------------------------------------------------+
  |                                                               |
  |  Axis 1: TOPOLOGY                                              |
  |    [Internal 1, Internal 2, NDv2, DGX-2, DGX-1, AMD]           |
  |                                                               |
  |  Axis 2: COLLECTIVE                                            |
  |    [AllGather (AG), AllToAll (AtoA)]                           |
  |    AllReduce: derived as AtoA + AG (Sec 8)                     |
  |    Reduce / Broadcast / ReduceScatter: not evaluated           |
  |                                                               |
  |  Axis 3: TRANSFER SIZE / OUTPUT BUFFER                         |
  |    [1 KB, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB,                    |
  |     4 MB, 16 MB, 64 MB, 256 MB, 1 GB]                          |
  |    11 size points across 6 orders of magnitude                |
  |                                                               |
  |  Axis 4: TE-CCL VARIANT                                        |
  |    [General MILP (optimal AG + LP for AtoA)                    |
  |     A* form (AG only, multi-round)                              |
  |     Early-stop 30% (faster, sub-optimal)]                      |
  |                                                               |
  |  Axis 5: # CHASSIS / # GPUs                                     |
  |    [2 ch, 4 ch, 8 ch, 16 ch, 32 ch]                             |
  |    -> 16, 32, 64, 128, 256 GPUs                                 |
  |                                                               |
  |  Held FIXED (no sweep):                                        |
  |    - Chunk size: 25 KB (Table 3)                               |
  |    - alpha for 25 KB sweep: 0.7 us                             |
  |    - Solver: Gurobi 9.5.2                                      |
  |    - Hardware: 80c/160t Xeon 8380, 512 GB RAM                  |
  |    - Optimality gap (most evals): 30%                          |
  |    - Switch model: TE-CCL flow-conservation by default,        |
  |      TACCL hyper-edge for backwards-compat comparisons         |
  |    - alpha, beta: profiled (using TACCL's profiler), input only|
  |    - Multi-tenancy / heterogeneous demand: formulated but not  |
  |      benchmarked in the main eval                              |
  |    - Failures (link/GPU): ECMP at deployment, NOT in schedule  |
  |                                                               |
  +---------------------------------------------------------------+
^ Fig 3: The 5-axis design space. The "held fixed" line is the
  important one: chunk size 25 KB and alpha 0.7 us were chosen so
  that TE-CCL's epoch tau is ~1 us, fine enough to capture the
  delay/queue effects the paper argues prior work misses (Fig 3).
  Multi-tenancy is *formulated* (Sec 5, Sec 8) but not measured.

The most consequential held-fixed knob is chunk size = 25 KB. This is much smaller than NCCL's typical 1 MB chunk and is what enables the fine-grained pipelining TE-CCL claims as its advantage over TACCL:

"TE-CCL can better pipeline chunks and so pays less alpha cost with larger transfers." — Sec 6.1, justifying why TE-CCL's least-steps result beats MSCCL on cells with >1 chunk.

For DynamICCL, this is a knob that already exists in the action space: chunkSize. The paper validates that (chunk_size, alpha, topology) have a non-trivial joint optimum and that the choice of chunk size constrains everything downstream. Agent-2 must encode chunk_size as both a state feature (when caller selected a fixed chunk size) and an action (when DynamICCL is allowed to override).


5. The TE-CCL MILP — Variables, Constraints, Objective in Detail

Section 3.1 of the paper presents the general MILP. Reproducing it as a single block diagram with variables and constraints labeled:

+--------------------------------------------------------------------+
|                  TE-CCL General MILP (Sec 3.1)                     |
|                                                                    |
|  DECISION VARIABLES (Table 1):                                     |
|    F_{s,i,j,k,c}  in {0,1}  : chunk c of source s on link (i,j)    |
|                                in epoch k                           |
|    B_{s,i,k,c}    in {0,1}  : chunk c of source s in node i's       |
|                                buffer at start of epoch k           |
|    R_{s,d,k,c}    in {0,1}  : chunk c delivered to destination d    |
|                                by epoch k                           |
|                                                                    |
|  PARAMETERS:                                                       |
|    T_{ij}           : capacity of link (i,j) in chunks/epoch        |
|    alpha_{ij}       : fixed latency on link (i,j) in us             |
|    delta_{ij}       : ceil(alpha_{ij} / tau) -- epochs to traverse  |
|    tau              : epoch duration                                |
|    K                : max epoch index (auto-bounded by Algorithm 1) |
|    D_{s,c,d}        : 1 if destination d wants chunk c from s        |
|                                                                    |
|  CONSTRAINTS:                                                      |
|    (1) CAPACITY                                                    |
|        sum_{s in N} sum_{c in C} F_{s,i,j,k,c}  <=  T_{ij} * tau    |
|        for all (i,j) in E, all k in K                               |
|                                                                    |
|    (2) FLOW CONSERVATION (with delay, with copy)                   |
|        B_{s,n,k,c}                                                 |
|        + sum_{j: (j,n) in E} F_{s,j,n,k - delta_{jn},c}              |
|        >= max_{j: (n,j) in E} F_{s,n,j,k+1,c}                       |
|        for all (s,n,k,c): n is not source, n is not switch          |
|        --> "what node n has at end of epoch k must be >= what it    |
|             sends out next epoch on EACH outgoing link"             |
|             (the max-over-outgoing is what allows COPY)             |
|                                                                    |
|    (3) BUFFER UPDATE                                                |
|        B_{s,n,k,c} = B_{s,n,k-1,c}                                 |
|                     + sum_{j: (j,n) in E} F_{s,j,n,k - delta_{jn} -1,c}|
|        --> buffer accumulates incoming traffic                      |
|                                                                    |
|    (4) DESTINATION                                                  |
|        R_{s,d,k,c} = min(D_{s,d,c}, B_{s,d,k+1,c})                  |
|        R_{s,d,K,c} = D_{s,d,c}                                      |
|        --> demand satisfied by epoch K                              |
|                                                                    |
|    (5) SWITCH NODES (n in S)                                        |
|        Buffer = 0; same-link copy disallowed; alpha = 2*alpha_link  |
|        --> traffic conservation through the switch                  |
|                                                                    |
|  OBJECTIVE:                                                        |
|    maximize sum_{k in K, s, d in N: s != d} (1/(k+1)) * R_{s,d,k,c} |
|    --> earlier delivery yields exponentially-decaying-by-1/k reward |
|                                                                    |
|  POST-PROCESSING (Sec 3.1, last 3 paragraphs):                     |
|    - Multiple optima exist; some send extra "vampire" flows         |
|    - Reverse DFS from each destination tracks active flows backward  |
|      to source; zero out flows that don't appear in any DFS trace    |
|    - O(|V| + |E|) per source-destination pair                        |
|                                                                    |
+--------------------------------------------------------------------+
^ Fig 4: The general MILP, in full. The key insight that distinguishes
  it from a textbook TE multi-commodity flow LP is constraint (2)'s
  max-over-outgoing-edges: this is what allows COPY (multicast) — a
  node can forward the same chunk to multiple downstream nodes while
  only "consuming" one input. Traditional TE conserves flow exactly,
  forbidding copy.

5.1 The LP relaxation (Sec 4.1)

+--------------------------------------------------------------------+
|                      TE-CCL LP Form (Sec 4.1)                      |
|                                                                    |
|  Replace F, B integers -> real-valued in [0, T_{ij}]                |
|                                                                    |
|  REMOVE COPY SUPPORT: rewrite (2) without max-over-outgoing         |
|     B_{s,n,k+1} + R_{s,n,K} + sum_{j: (n,j) in E} F_{s,n,j,k+1}     |
|     == sum_{j: (j,n) in E} F_{s,j,n,k - delta_{jn}} + B_{s,n,k}     |
|     --> classic flow conservation: in == buffer + consumed + out    |
|                                                                    |
|  DESTINATION constraint becomes cumulative:                         |
|     R_{s,d,k} = sum_{r=0..k} R_{s,d,r}                              |
|     R_{s,d,K} = sum_{c} D_{s,d,c}                                   |
|                                                                    |
|  DROPS SUBSCRIPT c:                                                 |
|     We no longer track individual chunks (no copy -> no need)       |
|     This dramatically reduces variables                             |
|                                                                    |
|  OUTPUT:                                                            |
|     A "rate allocation" per (source, link, epoch)                   |
|     Convert to per-chunk paths via DFS (Sec 3, end paragraphs)      |
|                                                                    |
|  USE: AllToAll (each demand has unique source segment, no copy      |
|        benefit anyway). The LP is convex, polynomial, and scales    |
|        to 256-GPU AllToAll (Table 4: Internal-2 256 GPU AtoA in     |
|        2.8 hours)                                                   |
+--------------------------------------------------------------------+
^ Fig 4a: LP form. Sacrifices copy support to gain polynomial-time
  scalability. AllToAll is the canonical workload that doesn't need
  copy because every (source, destination) pair has a unique chunk.

5.2 The A* technique (Sec 4.2)

+--------------------------------------------------------------------+
|                  TE-CCL A* Technique (Sec 4.2)                     |
|                                                                    |
|  Inspired by Hart, Nilsson, Raphael 1968 [12] (A* search).          |
|  Partition time into ROUNDS; in each round solve a small MILP that  |
|  greedily moves chunks closer to destinations.                      |
|                                                                    |
|  Per-round MILP changes (vs general MILP):                         |
|    - Drop "satisfy all demand by epoch K" constraint                |
|      (otherwise infeasibility blocks any progress)                  |
|    - Add LOGICAL EDGES connecting every node n to every destination |
|      d, with weight equal to alpha-delay along shortest path        |
|      (computed once via Floyd-Warshall [14])                        |
|    - New objective term: maximize discounted "distance-to-dest"     |
|      progress for each chunk currently in transit (variables P,Q)   |
|                                                                    |
|  Algorithm 1 (paper's pseudocode):                                  |
|    Input: D, G(N,E), tau_opt, alpha, T, n_e (epoch count), C_T      |
|    For total_time in {4, 8, 12}:                                    |
|       For n_e in {fixed candidates}:                                |
|          tau <- total_time / n_e                                    |
|          Try general_form(D, tau, alpha, ...)                       |
|          If feasible -> n_e <- feasible_time / tau_opt; break       |
|    Returns: upper bound on epochs needed                            |
|                                                                    |
|  Look-ahead state Q_{s,n,c,k',r} carries chunks that arrive in       |
|  future rounds (after round r ends) so the next round's MILP can    |
|  account for them.                                                  |
|                                                                    |
|  USE: AllGather on large topologies (Internal-1 64-256 GPUs, Table 4)|
|        Internal-2 128 GPUs in 7 hours; 256 GPUs in 2.8 hours         |
|        Sub-optimal but provably improves with more rounds            |
+--------------------------------------------------------------------+
^ Fig 4b: A* technique. Same MILP machinery solved iteratively in
  smaller chunks, with logical edges encoding the "heuristic distance
  to goal" — exactly the h(n) function in classical A*. Trades
  global optimality for tractability at >64 GPUs.

The three formulations form a Pareto curve in (optimality, scale):

   optimality
       ^
       |  General MILP (with copy)  <-- 16-32 GPUs, optimal
       |       *
       |        \
       |         *  LP form (no copy, AtoA)  <-- 128-256 GPUs, optimal
       |          \
       |           *  A* (multi-round, all collectives) <-- 256+ GPUs,
       |            \                                       sub-optimal
       |             \
       +----------------------------------> scale (GPUs)
^ Fig 4c: The Pareto front of the three formulations. The user
  picks the rightmost feasible point given their compute budget.

For DynamICCL: this Pareto curve is the offline-synthesis curve. DynamICCL's runtime selects among schedules generated at different points on this curve. The curve also tells Agent-2's exploration budget where to invest — the LP/A* schedules are sub-optimal and have the most regret remaining for runtime knob tuning.


6. Algorithm / Control Flow — MILP Path, LP Path, A* Path, Schedule-Synthesis Runtime

START (user submits topology G(N,E), demand matrix D, collective name)
   |
   v
(1) Algorithm 1: bound the number of epochs K needed
       for total_time in {4, 8, 12}:
         for n_e in {fixed candidates}:
           tau <- total_time / n_e
           Try MILP feasibility (Gurobi)
           If feasible: n_e <- feasible_time / tau_opt; break
       Return n_e
   |
   v
(2) Choose formulation:
   |
   +-- Demand needs copy (e.g. AllGather)? ---+
   |                                          |
   |  YES                                     |  NO
   |                                          |
   v                                          v
(3a) General MILP                       (3b) LP form
   - Integer F, B                          - Real F, B
   - Max-over-outgoing flow conservation   - Strict flow conservation
   - Optimal in single solve               - Polynomial; scales further
   |                                          |
   |  Solver too slow / too memory?           |
   |                                          |
   v                                          v
(3c) A* technique                       (4) Solve to optimality (or
   - Partition time into rounds              30% gap early-stop) with
   - Per-round MILP w/ Floyd-Warshall        Gurobi 9.5.2
     logical-edge heuristic                  |
   - Sub-optimal but bounded                 v
   - Runs until all demands met        (5) Reverse-DFS post-process
   |                                       (zero out non-load-bearing
   |                                        flows; O(|V|+|E|))
   |                                          |
   +-------------+----------------------------+
                 |
                 v
(6) Materialize schedule:
       - Convert F/B/R variables -> per-chunk path list
       - Each chunk has (rank, src_buf, dst_buf, time_step) entries
       - Emit MSCCL-XML (compatible with MSCCL runtime [5])
   |
   v
(7) Deployment (optional, only the AMD evaluation):
       - Hand-tune channel count to match RCCL/TACCL fairness
       - MSCCL runtime executes XML on hardware
   |
   v
END  -> schedule consumed by NCCL/RCCL via MSCCL fork
^ Fig 5: End-to-end TE-CCL flow. Steps (1) and (2) are user choices;
  (3a/3b/3c) are the three formulation paths; (4-6) are the offline
  solve and serialize; (7) is the optional deployment. Steps (1-6)
  run on the user's workstation (or a cloud VM); step (7) runs on
  the production cluster.

6.1 The "discretized sends" modeling — Fig 4 (paper)

  Epoch 0 (one chunk as 2 halves)             Epoch 2
   +---+                                       +---+
   | s |   ---0.5--->  +---+ d1                | d3 |
   +---+               +---+
     | 0.5             +---+
     +---------------> | s2| (Epoch 1)
                       +---+

  Without integer F: solver thinks both halves in d3's buffer count
                     as a "full chunk delivered" -- WRONG.
  With integer F:     solver tracks individual chunks; it cannot
                      claim d3 received "the chunk" if it only got
                      one half from s and one half from s2 because
                      they are different copies.

^ Fig 6: Why TE-CCL needs integer variables in the MILP form.
  Allowing fractional chunks + copy creates phantom completions where
  the solver "delivers" a chunk by aggregating two unrelated halves.
  This is the bug Sec 3.1's Fig 4 illustrates and that motivates the
  general MILP formulation despite its higher solve cost.

6.2 The "in-network copies" modeling — Fig 2a (paper)

                       +-----+
                       |  h  |   (intermediate node, GPU)
                       +-----+
                      /   |   \
                     /    |    \   alpha = 1 us each
                    /     |     \
              +---+   +---+   +---+
              |d1 |   |d2 |   |d3 |
              +---+   +---+   +---+

  s sends 1 MB chunk to h. Without copy modeling:
     s sends to d1 directly        (1 link, 1 us alpha + transfer)
     s sends to d2 directly        (1 link, 1 us alpha + transfer)
     s sends to d3 directly        (1 link, 1 us alpha + transfer)
     -> 3 separate transmissions = 4 sec total

  With copy modeling:
     s sends to h once             (1 us alpha + 1 sec transfer)
     h forwards to d1, d2, d3      (1 us alpha each, parallel)
     -> 2 sec total (a 2x improvement)

^ Fig 7: Why TE-CCL needs in-network copy. Tree broadcast / reduce
  patterns rely on intermediate nodes duplicating data. Traditional
  TE forbids this because real WAN routers don't multicast at line
  rate; GPU collectives CAN multicast (the SM that holds the chunk
  can write to multiple outgoing buffers).

6.3 The "latency + queuing" modeling — Fig 2b (paper)

   s1 ----alpha---->  h1 ---alpha---> h2 ---alpha---> h3 ---alpha---> d
                                                        ^
                                                        | alpha
                                                        |
                                                        s2

  Two chunks (one from s1, one from s2) arrive at h3 at the same time.
  ONE chunk must queue while the other transmits.
  The queue delay is alpha_{2->3} (= 1 us).
  Total time = 3*alpha + 2 (without queue model)
             = 3*alpha + (1 us queue) + 2 (with queue model)

^ Fig 8: Why TE-CCL needs to model queue + propagation delay.
  Steady-state TE assumes large traffic bundles where alpha is
  amortized; CCL has small/medium chunks where alpha dominates.
  Fig 3 (paper) shows the relative error in average throughput
  estimate exceeds 100% when transfer size is below 0.1 MB if the
  solver doesn't model alpha properly.

These three figures (Fig 2 in the paper) are the conceptual core of the paper's "TE doesn't directly apply" argument. Each motivates one modeling addition to the standard multi-commodity flow LP.


7. The Three CCL-vs-TE Differences (the paper's central insight)

The paper's contribution can be reduced to three modeling additions that distinguish a CCL-aware MILP from a textbook TE multi-commodity flow LP:

TE assumption CCL reality TE-CCL fix
Sends are continuous bandwidth Sends are discrete chunks of fixed bits Integer F variables; epoch-discretized time
Flow conservation: in=out exactly In-network multicast (copy) is free max-over-outgoing flow conservation (Sec 3.1, eqn 2)
alpha is small vs sustained transfer alpha dominates for small chunks delta_{ij}=ceil(alpha/tau) added to flow constraints
Queues amortize over long flows Small flows can collide and queue Buffer constraint per node per epoch (Sec 3.1, eqn 3)
Demand is sustained / large-bundle Demand is finite, must complete Destination constraint w/ K=last epoch (Sec 3.1, eqn 4)
Switches are full-buffer routers GPU switches have ~zero buffer Special switch model (Sec 3.1, end of subsection)

This table is the most exportable artifact of the paper. For DynamICCL: each row above corresponds to a modeling assumption that must hold at runtime as well. If the runtime system violates any of them (e.g., congestion pushes effective alpha 10x higher), the synthesized schedule's optimality vanishes. Agent-2 should monitor runtime alpha and trigger schedule re-selection when alpha drifts.


8. Quantitative Results — Empirical Findings by Regime

8.1 Headline numbers (from Abstract + Sec 1)

Comparison Result
TE-CCL vs TACCL on NDv2 2-chassis 2x better algorithm bandwidth
TE-CCL vs TACCL on AMD GPU testbed 2.14x algorithm bandwidth
TE-CCL vs RCCL on AMD GPU testbed 3.18x algorithm bandwidth
TE-CCL solver time vs TACCL solver time "approximately the same time as TACCL's heuristic"
TE-CCL scaling vs TACCL "scales to larger topologies than TACCL"

8.2 AllGather AB improvement vs TACCL (Fig 5 in paper)

Topology Best AB improvement Conditions
NDv2 0.36% (mean) up to 970% (max) for AllGather
DGX-2 12% (mean) up to 471% (max)
Internal 1 -5% (mean) up to 689% (max), some cells worse
Internal 2 0.33% (mean) up to 5759% (max)

For AllToAll improvement vs TACCL:

Topology Best AB improvement Conditions
NDv2 0.18% (mean) up to 2919% (max)
DGX-2 9% (mean) up to 2979% (max)
Internal 1 20% (mean) up to 197% (max)
Internal 2 0.48% (mean) up to 12322% (max)

The wide max-vs-mean spread is the most interesting empirical pattern. Most cells are roughly equivalent to TACCL; a small number of cells are dramatically (2-100x) better. This implies TE-CCL's win is concentrated in regimes where TACCL is mis-tuned (sketch quality collapses for that topology) rather than uniform across the design space.

8.3 NDv2 2-chassis AllToAll (Table 8 — selected cells)

Output buffer | TE-CCL CT (us) | TACCL CT (us)  | Improvement %
   1 GB         320235.81        320049.40          -0.058% (TACCL wins)
   1 MB         325.28           359.00             +10.366%
   256 KB       85.52            115.72             +35.313%
   64 KB        23.30            50.34              +116.052%
   16 KB        7.27             35.76              +392.223%
   4 KB         4.64             32.16              +592.134%
   1 KB         4.24             36.80              +768.920%

The pattern is striking: at large buffers (1 GB, 256 MB), TACCL is narrowly better; at small buffers (1-16 KB), TE-CCL is 4-8x better. This is the small-message regime where alpha dominates and TACCL's heuristic ordering loses to TE-CCL's explicit pipeline modeling.

8.4 NDv2 2-chassis AllGather (Table 8 — selected cells)

Output buffer | TE-CCL AB (GB/s) | TACCL AB (GB/s) | Improvement %
   1 GB         22.86               18.60             +22.90%
   256 MB       22.86               20.49             +11.56%
   64 MB        22.86               20.43             +11.90%
   1 MB         20.51               16.09             +27.49%
   256 KB       17.39               10.13             +71.60%
   64 KB        10.53               4.89              +115.13%
   16 KB        3.60                1.26              +185.59%

Here the win is uniform — TE-CCL beats TACCL at every size from 1 GB down to 16 KB on AllGather — but the ratio grows sharply at small messages. This is the same shape as 8.3 but lifted off zero.

8.5 AMD ROCm6 hardware results (Fig 8 in paper)

Buffer size | RCCL AB | TACCL AB | TE-CCL AB | Speedup over RCCL
   16 KB      ~0.8     ~1.0       ~2.5         ~3.1x
   64 KB      ~1.3     ~2.0       ~4.0         ~3.0x
   256 KB     ~2.2     ~3.5       ~6.0         ~2.7x
   1 MB       ~3.5     ~5.5       ~10.5        ~3.0x   <-- "1.5-2x faster for >=4 MB"
   4 MB       ~6.0     ~9.0       ~14.0        ~2.3x
   32 MB      ~9.0     ~14.0      ~16.0        ~1.8x
   128 MB     ~10.5    ~15.0      ~16.5        ~1.6x
   512 MB     ~11.0    ~15.0      ~16.5        ~1.5x
   1 GB       ~11.5    ~15.5      ~16.5        ~1.4x

(Numbers are read off Fig 8 bar chart, paper does not publish a table.)

"TE-CCL outperformed RCCL on ROCm5.7 by a larger margin for small transfers, but RCCL improved their manually constructed schedules for small transfers in their recent update [ROCm6] and reduced the gap." — Sec 6.2

So the AMD measurement is itself a moving target: the very small message regime where TE-CCL won most has been partially closed by RCCL's hand tuning between ROCm5.7 and ROCm6. Synthesized schedules race against vendor hand-tuning; the gap closes as vendors invest.

8.6 Solver time vs TACCL (Fig 6 in paper)

"TE-CCL is faster than TACCL on 45% of AllToAll scenarios and 40% of AllGather scenarios (with early stop) on the NDv2 topology; 72% and 27% for DGX2; 72% and 83% for Internal 1; and 50% of AllGather for Internal 2."

Solver time is on par or favorable in most cells. The early-stop mechanism (30% optimality gap) is what makes the comparison fair — TACCL has its own analogous heuristic for ordering.

8.7 Scaling table (Table 4 in paper)

Topology Collective # GPUs EM Solver time
Internal 1 AG (A*) 64 1 3000 s
Internal 1 AG (A*) 128 1 7 h
Internal 2 AG (A*) 128 1 1300 s
Internal 2 AG (A*) 256 1 2.8 h
Internal 1 AtoA 16 1 66 s
Internal 1 AtoA 32 1 215 s
Internal 1 AtoA 64 1 500 s
Internal 1 AtoA 128 2 800 s
Internal 2 AtoA 128 1 2600 s
Internal 2 AtoA 256 4 1500 s

EM = epoch multiplier (factor by which epoch is coarsened to fit memory). TACCL "ran out of memory and did not produce a solution for almost all Internal 1 topologies (over 4 chassis)" (Sec 6.1).

8.8 Microbenchmarks — Copy benefit (Fig 9), Epoch size (Fig 10), Buffer (Fig 11), A* vs OPT

Microbenchmark Finding
Copy on/off (Fig 9) Copy reduces transfer time by 50% on DGX1 and 12.5% on Internal-2 at largest sizes; no help for small transfers (enough capacity for direct sends).
Epoch size (Fig 10) Smaller epochs improve solution quality on NDv2 / DGX2 (heterogeneous links) — slowest link dictates transfer time when epoch is large. Internal 1 (homogeneous links) shows no difference.
Buffers (Fig 11) Buffers do not improve solution quality but reduce solver time by 61% (Internal 1 alpha=0), 71% (Internal 1 alpha>0), -28.46% (Internal 2 — slower with buffers), 0.23% (DGX1).
A* vs OPT (Sec 6.4) At alpha=0: A* finishes in 86.5 s (263.3 s for 2 chunks); OPT finishes in 346 s (4392 s for 2 chunks). A* is 10% off optimal (6% in 2-chunk case). At alpha>0: A* in 137 s (901 s); OPT in 363 s (3047 s); A* is 20% off (8% in 2-chunk).

These four microbenchmarks tell a consistent story: each TE-CCL modeling addition (copy, epoch tuning, buffers, A) is empirically justified, but each operates only in a specific regime*. Copy helps at large transfers; epoch tuning matters on heterogeneous-link topologies; buffers speed solver time but don't change quality; A* trades 6-20% optimality for 4-10x solver speedup.

For DynamICCL: this regime-specificity is exactly the kind of state-conditional optimal action that motivates RL. None of these TE-CCL knobs is monotonically beneficial; each has a regime where it helps and a regime where it hurts.


9. Configuration-Regime Trade-off Tables

9.1 Three TE-CCL formulations

Dimension General MILP LP form A* technique Winner (DynamICCL)
Optimality Optimal Optimal* Sub-optimal (6-20%) MILP
Copy / multicast support Yes No Yes MILP, A*
Time complexity Worst (NP-hard) Polynomial Multi-round LP (when applicable)
Max GPUs (in eval) 17 (DGX-2) 256 (Internal-2) 256 (Internal-2) LP for AtoA, A* for AG
Schedule quality Best Best for AtoA Slightly worse MILP, LP per workload
User burden Pick K Pick K Pick K, K' (round size) LP (least burden, but no copy)
AllGather suitability Optimal at <=64 GPUs N/A (no copy) Best at >64 GPUs A* for big-AG cells
AllToAll suitability OK at <=32 GPUs Best at >32 OK LP

*LP is optimal only for the no-copy subset of demand patterns.

For DynamICCL, prefer LP form at synthesis time when caller is issuing AllToAll, MILP form when caller issues AllGather/Broadcast on small clusters, and A form* when AllGather/Broadcast at large scale. This mapping (collective × scale -> formulation) is exactly the offline-catalog axis Agent-2 selects from at runtime.

9.2 TE-CCL vs SCCL vs TACCL vs MSCCLang/GC3

Dimension SCCL (PPoPP'21) TACCL (NSDI'23) MSCCLang (ASPLOS'23) TE-CCL (SIGCOMM'24) Winner
Auto-synthesis from topo Yes (SMT) Yes (MILP+heur) No (DSL, manual) Yes (MILP/LP/A*) TE-CCL
User input Topology Topology + sketch Topology + Python Topology only TE-CCL
Solver type SMT (Z3) MILP (Gurobi) -- MILP (Gurobi) --
Cost model Hockney alpha-beta Hockney -- Hockney + delta TE-CCL
In-network copy Yes Yes (via switch hyperedge) Yes (DSL prim) Yes (max-over-out) tied
Buffer/store-and-forward Implicit No Yes (DSL) Explicit (B vars) TE-CCL
Queueing modeling No No -- Yes (delta_{ij}) TE-CCL
Multi-cast TE roots No No No Yes (Sec 7) TE-CCL
Max scale (eval) ~30 GPUs (1 chassis) 128 GPUs 256 GPUs 256 GPUs tied (TE-CCL/MSCCL)
Multi-tenant support No No No Yes (Sec 5) TE-CCL
Hardware deployment DGX-1 sim NDv2 + DGX-2 A100 + DGX-2 AMD ROCm6 tied
Programmability Low (synth-only) Med (sketch) High (DSL) Low (MILP only) MSCCLang
Adaptability at runtime None None None None NONE — DynamICCL gap

For DynamICCL, prefer TE-CCL as the synthesizer of record for multi-cell catalogs — it produces highest-quality schedules with least user input. But notice the bottom row: none of the four synthesizers adapts at runtime. They all produce static schedules. DynamICCL's runtime knob selection is orthogonal and complementary to all four — it composes with any of them.

9.3 Workload regime atlas

Regime Best TE-CCL formulation Best knob choice Winner (DynamICCL state)
AllGather, <=32 GPUs, large msg General MILP optimal epoch, buffers on MILP-large schedule
AllGather, <=32 GPUs, small msg General MILP optimal epoch, EM=1 MILP-small schedule
AllGather, 64-256 GPUs A* (epoch multiplier ~ 2-4) A* + early-stop 30% A* schedule
AllToAll, <=64 GPUs LP form optimal epoch LP-AtoA schedule
AllToAll, 128-256 GPUs LP form EM=2-4 LP-AtoA-scaled schedule
AllReduce (any) AtoA (LP) + AG (MILP/A*) composed schedule AR-composed schedule
Heterogeneous-link topology (NDv2) Smaller epoch optimal-epoch model small-epoch schedule
Homogeneous-link (Internal-1) Larger epoch (faster solver) max-epoch large-epoch schedule
Switch supports SHARP-style copy Switch flow-conservation enable copy through switch with-copy-switch
Legacy switch (no copy at switch) TACCL-style hyperedge model explicit hyperedge constraints hyperedge schedule

Each row is a candidate offline-synthesis cell. The number of cells is manageable (tens to low hundreds), so DynamICCL's runtime can carry all of them in a lookup table indexed by (collective, msg_size_bin, nGPU_bin, topology_fingerprint). Agent-2 selects the cell at call time; the cell selection is the action.

9.4 Epoch duration knob

Epoch tau Solver time Solution quality Use when
Large (max-epoch) Fast Lower Homogeneous links
Small (fastest-link) Slow Higher Heterogeneous links
Medium (5x scaling) Moderate Moderate alpha > 200*tau cases

Sec 6 of the paper explicitly recommends set tau = bandwidth of fastest link for most evaluations. When alpha is large relative to tau, increase tau by 5x to avoid making large models infeasible.


10. Bottlenecks & Insights Surfaced by the Measurements

10.1 Small messages are where alpha modeling pays off

The Fig 3 plot — relative error in algorithm bandwidth estimate vs transfer size — shows that without modeling alpha, the error exceeds 100% below 0.1 MB transfer size. This validates the entire TE-CCL premise: queueing and propagation delay dominate for small chunks, and a TE solver that ignores them produces wildly inaccurate schedules in the regime where most CCL traffic actually lives.

For DynamICCL, this bottleneck appears as a state-feature requirement. Agent-2 must know:

The third quantity is the closed-form "small-message tax" that TE-CCL encodes via delta_{ij}.

10.2 In-network copy is a large-transfer optimization, not a

universal one

Fig 9: copy reduces transfer time by 50% on DGX1 at largest sizes, 12.5% on Internal-2, and zero benefit at small sizes (because small transfers have enough direct-link capacity to multi-cast from source). This is the opposite of what naive intuition suggests — "multicast must always help."

For DynamICCL: copy/multicast schedules should be in the action space only for large messages. Below some threshold, direct point-to-point schedules win because alpha overhead of intermediate hops exceeds the multi-link savings. Agent-2 must encode msg_size_bin and learn this threshold.

10.3 Buffers reduce solver time, not solution quality

Fig 11: buffer presence improves solver time by 61-71% on Internal 1 and DGX1, but solution quality is essentially unchanged. The paper's hypothesis (Sec 6.4):

"Buffers do not impact the solution quality but only the solver time ... because each node needs the same amount of traffic as it has to forward, it can interleave consuming traffic with forwarding it to compensate for the lack of buffers."

So buffers don't unlock new optimal schedules; they just give the solver more search-space slack to find the existing optima faster. This is a counter-intuitive empirical fact that informs Agent-2's prior: do not over-explore buffer-related runtime knobs (which in NCCL would be numChannels and chunkSize — they affect buffer occupancy). Their leverage is small except as a search-time accelerator for the offline solver.

Fig 10: on heterogeneous-link topologies (NDv2, DGX2 with 4x bandwidth disparity between fast and slow links), using a large epoch wastes capacity on fast links. Smaller epochs let chunks traverse fast links multiple times per "epoch-equivalent" and capture the heterogeneity. On homogeneous topologies (Internal 1) this doesn't matter.

For DynamICCL: this argues for topology-fingerprint as a state feature that distinguishes homogeneous from heterogeneous link distributions. NCCL's algorithm choice (Ring vs Tree) is the runtime analog: Ring assumes homogeneous bandwidth; Tree handles heterogeneity better. Agent-2 should pick Ring on homogeneous fingerprints, Tree on heterogeneous ones.

10.5 The vendor-tuning arms race

Sec 6.2: "TE-CCL outperformed RCCL on ROCm5.7 by a larger margin for small transfers, but RCCL improved their manually constructed schedules for small transfers in their recent update [ROCm6] and reduced the gap." This is an honest acknowledgment that synthesized schedules race against vendor hand-tuning and the relative win depends on the vendor's investment cycle.

For DynamICCL: this is the "no static catalog wins forever" insight. The right design is a runtime tuner that continuously re-evaluates which schedule is best — exactly what Agent-2 does. Synthesized schedules become the action vocabulary for an adaptive policy, not the policy itself.

10.6 Multi-tenant support is formulated but not tested

Sec 5 ("Use in multi-tenant clusters") and Sec 8 (limitations) both acknowledge that multi-tenant scheduling is a single MILP at the sum-demand level and that hardware execution of such schedules "likely introduces additional technical challenges, but we leave these for future work." The paper formulates it; nobody has tested whether the synthesized multi-tenant schedules execute correctly.

For DynamICCL: this is a regime where Agent-2 must observe concurrent collective calls across tenants and adapt — a state feature like concurrent_tenant_count or cross_tenant_traffic_mix would be relevant here. Static synthesis doesn't capture this; runtime sees it directly.


11. Limitations of the Methodology

Limitation Implication for DynamICCL
Schedules are static (no runtime adaptation) DynamICCL's runtime tuning is strictly complementary
Most numbers are solver-derived, not hardware-measured Cost-model bias means real hardware may diverge from claimed AB
Hardware deployment only on AMD 32 GPUs NVIDIA, NDv2, large-scale numbers are simulated only
No congestion / failure modeling in solver Real clusters have transient hot links that solver assumes static
alpha, beta are inputs (profiled by user) Wrong alpha calibration silently produces wrong schedules
Multi-tenant only formulated, not benchmarked Cross-tenant interference is the regime DynamICCL must own
Chunk size held at 25 KB for most cells Other chunk sizes are not characterized; agent must explore
Optimality gap of 30% in many evals "Optimal" claim is approximate; remaining 30% is RL's territory
Solver time grows with K (number of epochs) Long-running schedules cannot be re-synthesized at runtime
Switch model is bimodal (SHARP vs hyperedge) Real switches sit in between; DynamICCL must measure not assume
Failure handling delegated to ECMP Solver assumes perfect link health; real topology drifts
AllReduce only via AtoA + AG decomposition Other AllReduce algorithms (Tree, CollNet, NVLS) not in catalog
Compute cost not accounted for in AllReduce Reduce-during-relay cost ignored
RAM ceiling at 350 GB for AllToAll on large topologies Synthesis itself becomes a resource-budget problem
TACCL comparison runs the public TACCL artifact Comparison fairness depends on TACCL's tuning state at submission
4 alpha test bins (4, 8, 12 sec) in Algorithm 1 Coarse epoch search; finer search could find smaller K

The most consequential limitation for DynamICCL is the second and third combined: almost all reported speedup numbers are derived from the solver's own cost model, not from hardware measurement. The solver could be uniformly off by a factor — its optimality conclusions are valid only relative to the same cost model. DynamICCL must validate any schedule it selects against a real reward signal (-collective_wall_clock_us) and not trust the synthesizer's predicted bandwidth.


12. What to Borrow for DynamICCL — Composability vs Tension with Run-Time Knob Selection

This section addresses the prompt's central question: TE-CCL produces optimized static schedules, while DynamICCL selects NCCL knobs at runtime per-collective. Are these complementary or in tension?

12.1 The two-layer composition picture

+--------------------------------------------------------------------+
|  Caller (PyTorch DDP)                                              |
|     issues collective(s, msg_size, model, batch, ...)              |
+----------------------------+---------------------------------------+
                             |
                             v
+--------------------------------------------------------------------+
|  DynamICCL Tuner Plugin  (RUNTIME, per-call)                       |
|     state_t = (msg_size_bin, model_intensity_I, batch_M,           |
|                topology_fingerprint, recent_timing_LSTM,           |
|                synthesizer_catalog_id)                             |
|     action_t = (algorithm, protocol, nChannels, numThreads,        |
|                 chunkSize, schedule_id_from_catalog)               |
|     reward_t = -wall_clock_us                                      |
+----------------------------+---------------------------------------+
                             |
                             | selects schedule_id from catalog
                             v
+--------------------------------------------------------------------+
|  Pre-synthesized Schedule Catalog  (OFFLINE, per-cluster)         |
|                                                                    |
|   +------------------+  +------------------+  +-----------------+  |
|   | TE-CCL MILP      |  | TE-CCL LP        |  | TE-CCL A*       |  |
|   | AG, <=32 GPUs    |  | AtoA, all sizes  |  | AG, >64 GPUs    |  |
|   +------------------+  +------------------+  +-----------------+  |
|   +------------------+  +------------------+  +-----------------+  |
|   | TACCL sketch     |  | MSCCLang DSL     |  | NCCL Ring       |  |
|   | DGX-2, AG        |  | Hierarchical AR  |  | (default)       |  |
|   +------------------+  +------------------+  +-----------------+  |
|   +------------------+  +------------------+                       |
|   | NCCL Tree         |  | SCCL synth       |                       |
|   | (default)         |  | DGX-1, AG        |                       |
|   +------------------+  +------------------+                       |
+--------------------------------------------------------------------+
                             |
                             v
+--------------------------------------------------------------------+
|  NCCL Core / RCCL / MSCCL Runtime                                  |
|     executes selected schedule with chosen knobs                   |
+--------------------------------------------------------------------+
^ Fig 9: Two-layer composition. TE-CCL contributes entries to the
  offline catalog (the bottom box). DynamICCL is the runtime selector
  (the middle box). The two layers do not compete; they are
  factorings of the same problem at different time scales.

12.2 What DynamICCL borrows directly

TE-CCL artifact / insight DynamICCL adoption
Three-formulation Pareto curve (MILP, LP, A*) Schedule catalog has 3 entries per (collective, topology) cell
25 KB chunk size as the alpha-vs-bw inflection point chunkSize_25K is a critical action-space anchor for small messages
delta_{ij}=ceil(alpha/tau) modeling Agent-2 state must include current_alpha_estimate_us per link
Switch flow-conservation vs hyperedge as binary choice Topology fingerprint encodes switch model; Agent-2 conditions on it
Reverse-DFS post-processing Pattern reuse: prune actions whose flow doesn't reach the destination
30% early-stop optimality gap Agent-2 should not over-train on cells where the catalog is already 30% sub-optimal — RL has at least 30% headroom there
Buffers improve solver time, not quality Reduce exploration of numChannels/chunkSize as quality knobs in homogeneous-link settings; treat them as latency knobs
Heterogeneous links benefit from small epochs Agent-2 picks Tree algo on heterogeneous topology fingerprints
AllReduce = AtoA + AG decomposition Agent-2's AR action space inherits this composition (implicitly via NCCL)
Multi-tenant multi-tenant demand-sum formulation Agent-2's state includes concurrent_tenant_demand_proxy
TACCL profiler used to get alpha, beta DynamICCL can reuse the same profiling harness for cluster bring-up
Floyd-Warshall as the heuristic distance function Pre-compute per-(src,dst) shortest-path distances and cache as state feat
Logical edges encode goal-distance reward shaping Agent-2 reward shaping: reward = -wallclock - lambda * remaining_distance

12.3 Where TE-CCL and DynamICCL are in tension

The composition picture above hides three real conflicts:

Tension A — Catalog completeness vs runtime regret. TE-CCL's catalog is finite. If Agent-2 selects a schedule from the catalog and the cluster state is outside any catalog cell's profile, the schedule will under-perform. DynamICCL's RL approach addresses this partially — by selecting the closest catalog cell — but fundamentally limits regret to the gap between best-catalog-cell and true-optimal-for-current-state. The mitigation: include "synthesize on-demand" as an action with a high cost, so Agent-2 can request a fresh schedule when no catalog entry is good enough. Cost should reflect the synthesizer's compile time (LP: minutes; MILP: hours).

Tension B — TE-CCL's cost model assumes a static (alpha, beta). In production, alpha and beta vary with congestion, NIC throttling, and thermal events. TE-CCL's optimality is conditional on the input profile. DynamICCL's runtime alpha estimate may drift away from the catalog's training alpha, silently invalidating optimality. The mitigation: Agent-2 monitors alpha_drift = |current_alpha - catalog_alpha| / catalog_alpha and triggers a "re-profile or fall-back-to-NCCL-default" when drift exceeds a threshold (e.g., 2x).

Tension C — TE-CCL's MILP horizon (K epochs) vs RL's per-call decision. TE-CCL plans the entire collective in one optimization. DynamICCL decides one config per collective call but the next call's collective is itself another K-epoch plan. If two consecutive calls overlap (pipelining), the catalog schedules are optimized in isolation while runtime sees concurrent execution. The mitigation: Agent-2 state includes recent_collective_in_flight_ratio and penalizes schedules with high concurrent-bandwidth peaks during pipelined regions. This is a tension that pure-static synthesis cannot resolve; only runtime adaptation can.

12.4 State-vector additions justified by TE-CCL

   Add to Agent-2 state vector s_t:
   +------------------------------------------------------------+
   | alpha_per_link_estimate     : float  (us, profiled)        |
   | beta_per_link_estimate      : float  (sec/byte, profiled)   |
   | alpha_drift_ratio           : float  (vs catalog baseline) |
   | topology_fingerprint        : enum                          |
   |   {homogeneous_links, heterogeneous_links_2x,              |
   |    heterogeneous_links_4x, switch_with_SHARP,              |
   |    switch_legacy_hyperedge}                                 |
   | switch_count                : int                           |
   | concurrent_tenant_demand    : float (relative to topology) |
   | catalog_optimality_gap_30   : float (last cell's gap)       |
   | small_msg_floor_active      : bool (msg_size < 256 KB)     |
   | copy_benefit_estimate       : float (large-transfer score) |
   | schedule_age_at_runtime     : int (epochs since synthesis) |
   +------------------------------------------------------------+
^ Fig 10: TE-CCL-justified state features. Most are simple scalar
  derivations of the topology and recent profiling data; alpha_drift
  is the runtime-adaptive feature that makes the static catalog
  trustworthy.

12.5 Action-space priors justified by TE-CCL

Action Prior bias direction Triggering state
Pick MSCCL/TE-CCL synthesized schedule Prefer for AG/AtoA, skip for small AR catalog_cell exists AND msg_size > 256 KB
Pick NCCL Ring (default) Default for homogeneous links topology = homogeneous_links
Pick NCCL Tree Default for small-msg latency msg_size_bin < 64 KB
Pick NVLS / CollNet Default for switch-supports-SHARP switch_with_SHARP
Increase nChannels Solver-time accelerator only catalog_optimality_gap_30 > 0.2
Decrease chunkSize Small-msg latency optimization msg_size_bin <= 16 KB AND alpha_drift_ratio < 1.5
Increase chunkSize to 25 KB TE-CCL's default chunk for solver offline-catalog re-synthesis bridge
Re-synthesize via TE-CCL on-demand High-cost action, last resort catalog_optimality_gap_30 > 0.5 OR alpha_drift_ratio > 2

12.6 Reward shaping justified by TE-CCL

The reward DynamICCL uses is the wall-clock metric. TE-CCL's objective is sum_{k} 1/(k+1) * R_{s,d,k} — i.e., earlier delivery is worth more. This is a discounted-completion-time reward, which maps cleanly onto Agent-2's per-call reward:

   r_t = -wallclock_us
       - lambda_chunk_progress * (1 - chunks_delivered_by_first_half / total_chunks)
       - lambda_drift * max(0, alpha_drift_ratio - 1)

   where:
     wallclock_us is the dominant term (TE-CCL's actual goal)
     lambda_chunk_progress shapes for "fast partial completion" (TE-CCL's discount)
     lambda_drift penalizes schedule-vs-runtime mismatch (the catalog-stale signal)

Lambda values can be set to lambda_chunk_progress = 0.1, lambda_drift = 0.5 based on a sweep — TE-CCL's own evaluation suggests a 30% optimality gap is acceptable, so the chunk-progress shaping coefficient should not exceed roughly that fraction of the wall-clock term.

12.7 Exploration-budget allocation justified by TE-CCL

TE-CCL's win-distribution is heavy-tailed (Sec 8.2: max-vs-mean spread is 100-12000x). Most cells are equivalent to TACCL; a few are massively better. Exploration budget should follow this same heavy tail:

Regime where TE-CCL beats TACCL by >10x Agent-2 exploration policy
AllToAll small messages (1-16 KB) High exploration; small-msg algo+proto regime
AllGather small-medium messages (16-256 KB) High exploration; chunk size + nChannels
Heterogeneous-link topology High exploration; algorithm selection
AllToAll on Internal-2 (max win 12000%) Highest exploration; this is where catalog matters most
Regime where TE-CCL is on par or loses Agent-2 exploration policy
AllToAll, 1 GB transfers Low exploration; bandwidth-saturated
AllGather, 256 MB+ on NDv2 Low exploration; near-optimal already
Homogeneous links + large messages Low exploration; few algos suffice

This atlas is the offline regret map. Agent-2 should be aggressive where the synthesized catalog yields large but variable wins (big absolute gain available) and conservative where the catalog is already close to the bandwidth ceiling (small absolute headroom).

12.8 The "no human in the loop" architectural lesson

TE-CCL's central architectural advance over TACCL is removing the sketch — the human-supplied hint about which links to use. This is the same architectural ambition DynamICCL has at the runtime layer: no human in the loop, RL discovers the policy from reward. The proof point: TE-CCL produces schedules of TACCL-quality with no sketch, so an automatic procedure can subsume a hand-tuned hint. DynamICCL's analog is producing per-collective configs of expert-tuned quality with no static profile.

The methodology to reuse: train Agent-2 on a corpus of TE-CCL- generated schedules with their solver-predicted bandwidths as a prior reward. Then fine-tune on real hardware with wallclock_us. This is policy distillation followed by online RL — exactly the pattern that worked for AlphaZero (offline self-play -> online fine-tuning) and that is justified here by the existence of a high-quality offline synthesizer.

12.9 The compile-time-vs-runtime composition principle

Both layers must obey one rule: the runtime layer can only refine, never invalidate, the compile-time layer's assumptions. TE-CCL's schedules are correct only if (alpha, beta, topology) hold. Agent-2 must not select a schedule whose preconditions are violated by current cluster state. This is the same discipline as a JIT compiler that may de-optimize when speculative assumptions fail — DynamICCL's "fall-back-to-NCCL-default" is the de-optimization step.

   IF all preconditions hold:
      pick best-catalog-cell schedule (TE-CCL, TACCL, MSCCLang, etc.)
   ELSE IF some preconditions violated:
      pick most robust action (NCCL Ring/Tree default + safe knobs)
   ELSE IF severe drift detected:
      trigger re-synthesis (high cost, async)

This three-tier policy is the cleanest way to compose static synthesis and runtime adaptation.


13. Analogy

TE-CCL is a city traffic-routing department that gets handed a new neighborhood layout each Monday and produces a static route plan for the week. The city has three planners on staff:

The department hands the published plan to the bus drivers (the MSCCL runtime), who execute it for the week. The plan never changes mid-week — even if a road gets congested or a bridge closes, drivers follow the original plan.

DynamICCL is the traffic operations center that watches the city in real time. It cannot redraw the plan (synthesizing is too slow), but it can:

The relationship is hierarchical adaptation: TE-CCL plans the streets; DynamICCL drives the cars. TE-CCL's contribution is the planning machinery — the multi-commodity-flow MILP that lets a small team produce city-wide plans without a sketch. DynamICCL's contribution is the operations machinery — the RL policy that chooses among plans and tunes execution at the second-by-second level. Neither replaces the other; together they realize the old WAN-engineering principle that traffic engineering and traffic operations are different jobs run on different timescales by different agents.

The closing intuition pump: when a synthesizer says "the optimal schedule for AllGather on NDv2-2 with 1 KB messages is X," it has committed to a static optimum. When a runtime says "I will execute X for the next 50 microseconds and watch what happens," it has committed to an adaptive trajectory that converges to the static optimum and to whatever the synthesizer didn't model. The synthesizer's static optimum is the runtime's prior. The runtime's adaptive trajectory is the synthesizer's reality check.


Summary of Borrowed Patterns

Pattern from Liu et al. (TE-CCL, SIGCOMM 2024) DynamICCL application
Multi-commodity-flow MILP as the right CCL abstraction Schedule catalog populated by TE-CCL; Agent-2 selects from it
Three formulations (MILP / LP / A*) on a Pareto curve Three-tier offline catalog: optimal-small, scalable-AtoA, A*-large-AG
25 KB chunk size as alpha-vs-bw inflection chunkSize_25K is the small-msg anchor in the action space
delta_{ij}=ceil(alpha/tau) for delay modeling alpha_estimate_us and alpha_drift_ratio in state vector
Reverse-DFS to prune non-load-bearing flows Action-space pruning: drop schedules that don't reach destinations
Floyd-Warshall logical edges as A* heuristic Pre-compute shortest-path-distance per (src,dst) for reward shaping
Buffers improve solver time, not quality Treat numChannels/chunkSize as latency knobs, not quality knobs
Heterogeneous-link topology benefits from small epochs Topology fingerprint state feature; Tree algo on heterogeneous, Ring on homogeneous
Switch model (SHARP-style copy vs TACCL hyperedge) Topology fingerprint distinguishes switch types
Multi-tenant via demand-sum MILP (formulated, untested) Concurrent-tenant state feature; runtime sees what synthesizer didn't test
30% optimality gap early-stop Acknowledged catalog headroom; Agent-2 reward has at least 30% recoverable margin
AllReduce = AllToAll + AllGather decomposition Agent-2's AR action space inherits this composition
RAM ceiling ~350 GB for AllToAll on Internal-1 large topologies Schedule re-synthesis is a high-cost action; reserve for severe drift
Schedules are static; no failure / congestion modeling DynamICCL's runtime adaptation is strictly complementary; same schedule + drift
2x improvement vs TACCL on NDv2 2-chassis Schedule-quality prior; Agent-2 should NOT explore equivalent NCCL defaults here
3.18x vs RCCL on AMD ROCm6 Vendor-tuning baseline updates over time; Agent-2 must re-evaluate periodically
TACCL profiler used for alpha, beta inputs Reuse TACCL profiler for cluster bring-up alpha/beta calibration
Heavy-tailed win distribution (max-vs-mean 100-12000x spread) Heavy-tailed exploration: invest most exploration where catalog wins are largest
MSCCL runtime as deployment target DynamICCL outputs map to MSCCL XML or NCCL-direct configs
"No human in the loop" architectural goal DynamICCL extends the same goal one layer down (runtime knob selection)
Microsoft GitHub release (microsoft/TE-CCL) DynamICCL benchmark harness must be open source for reproducibility
TE roots: PoP, NCFlow, BLASTSHIELD, etc. for solver scalability Borrow scalability tactics from TE literature, not just CCL literature