Architecture & Solver-Design Analysis
TE-CCL — Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
Source: Liu, X.; Arzani, B.; Kakarla, S. K. R.;
Zhao, L.; Liu, V.; Castro, M.; Kandula, S.; Marshall, L. Rethinking
Machine Learning Collective Communication as a Multi-Commodity Flow
Problem. In ACM SIGCOMM '24, August 4–8, 2024,
Sydney, NSW, Australia. 22 pages, 15 figures, 8 tables.
DOI: 10.1145/3651830.3672249 Code: https://github.com/microsoft/TE-CCL
Authors: University of Pennsylvania + Microsoft
Research + University of Washington + OpenAI + Microsoft.
Reader: Direct PDF read (gemini-reader free-tier
RESOURCE_EXHAUSTED on gemini-2.0-flash; codex-reader
rejected gpt-5.1-codex-mini for ChatGPT free-tier accounts;
pages 1-22 read directly via the Read tool with pages
parameter, including the full Appendix A-H). Analyst:
Vishwakarma Date: 2026-05-04
Table of Contents
- Lineage Note — Where TE-CCL Sits in the Synthesis Family (SCCL → TACCL → MSCCLang/GC3 → TE-CCL)
- System / Solver Architecture (the "instrument" — three nested formulations)
- Target-Hardware / Topology Specimens (DGX1, DGX2, NDv2, AMD, Internal-1, Internal-2)
- Design-Space Diagram (axes swept by the evaluation, axes held fixed)
- The TE-CCL MILP — Variables, Constraints, Objective in Detail
- Algorithm / Control Flow — MILP Path, LP Path, A* Path, Schedule-Synthesis Runtime
- The Three CCL-vs-TE Differences (discretized sends, in-network copies, latency+queuing)
- Quantitative Results — Empirical Findings by Regime
- Configuration-Regime Trade-off Tables
- Bottlenecks & Insights Surfaced by the Measurements
- Limitations of the Methodology
- What to Borrow for DynamICCL — Composability vs Tension with Run-Time Knob Selection
- Analogy
1. Lineage Note — Where TE-CCL Sits in the Synthesis Family
The four immediately preceding papers in this corpus form a clean genealogy of automatic collective synthesis:
+------------------------------------------------------------------+
| SCCL (PPoPP 2021, paper 0031) |
| SMT solver synthesizes optimal schedules |
| Pareto-Synthesize sweeps (S, R, C); QF_LIA encoding |
| Single-node only, intractable beyond 1 chassis (~30 GPUs) |
| | |
| v |
| TACCL (NSDI 2023, paper 0032) |
| MILP guided by user-supplied "communication sketch" |
| Three-stage: routing -> heuristic ordering -> contiguity |
| Logical topology + switch-hyperedge + symmetry hints |
| Scales to 128 GPUs in 11h, but heuristic ordering hurts qty |
| | |
| v |
| MSCCLang / GC3 (ASPLOS 2023 / arXiv 2022, papers 0033, 0034) |
| DSL programmability — user *writes* algorithm, compiler lowers|
| Three IRs (Chunk DAG -> Instruction DAG -> GC3-IR), runtime |
| interpreter; synthesis is OUT OF SCOPE |
| | |
| v |
| TE-CCL (SIGCOMM 2024, this paper, 0035) |
| "Forget per-link sketches; forget human DSLs." |
| Treat CCL as a multi-commodity-flow problem from network |
| traffic engineering (TE). General MILP + scalable LP form + |
| A*-style temporal partitioning. No sketch. No DSL. |
| Scales to 256-GPU AllToAll on Internal-2 in 2.8 h. |
+------------------------------------------------------------------+
^ Fig 0: Lineage of automatic CCL synthesis. TE-CCL inherits the
problem statement (topology + demand -> schedule) from SCCL/TACCL
but borrows its solving machinery from production WAN traffic
engineering, where multi-commodity flow has scaled to "thousands
of nodes" (paper's Sec 1, citing Krishnaswamy 2022 [20]).
The intellectual move is to relocate the problem. SCCL/TACCL framed collective synthesis as combinatorial scheduling and ran into the SMT scalability wall. TE-CCL relabels it as multi-commodity flow with three CCL-specific modifications (discretized sends, in-network copies, latency+queuing) and inherits four decades of TE solver engineering. The paper's own framing in Section 2.2 makes this explicit:
"Our solution requires no human in the loop — the user only needs to specify the topology and the demand matrix — but the solver is slightly slower."
That trade — slower-but-no-human vs faster-but-needs-sketch — is the central design choice TE-CCL makes against TACCL. For DynamICCL the relevant lineage observation is that TE-CCL still produces a static XML/MSCCL schedule per (topology, collective, message-size) cell. The runtime has no adaptation: TE-CCL targets the offline catalog. DynamICCL's RL knob selector is the layer above any such catalog, choosing among pre-synthesized schedules at call time.
2. System / Solver Architecture (the "instrument" — three nested formulations)
TE-CCL is not a monolithic solver. It is a three-formulation toolkit: a general MILP that handles every CCL feature exactly, an LP that sacrifices in-network copy support to scale further, and an A*-style temporal partition that scales further still by approximating optimality. The user picks one formulation based on the (topology, collective, optimality-tolerance) cell.
+--------------------------------------------------------------------+
| TE-CCL Toolkit |
| |
| +--------------------+ +-----------------------------------+ |
| | Workload Spec | | Topology Spec | |
| | - Demand matrix D | | - N (nodes), S subset of N | |
| | D_{s,c,d}={0,1} | | (S = switches, no buffer) | |
| | - Collective name | | - E (directed edges) | |
| | {AG, AtoA, AR} | | - Per-edge T_{ij} (chunks/sec) | |
| | - C+1 chunks/source| | - Per-edge alpha_{ij} (us) | |
| +---------+----------+ +-----------------+-----------------+ |
| | | |
| +-----------------+-----------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Scaling-Knob Layer (Sec 5) | |
| | - Epoch duration tau (Sec 5: fastest-link or slowest-link) | |
| | - K (number of epochs); auto-bounded by Algorithm 1 | |
| | - K' (epochs per round) for A* technique | |
| | - Switch model {SHARP-style copy / TACCL hyper-edge} | |
| | - Optimality gap tolerance (e.g., 30% early-stop) | |
| +-------------------------+-------------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | THREE FORMULATION OPTIONS (user picks one) | |
| | | |
| | +-----------------------+ +-----------------------+ | |
| | | (a) General MILP | | (b) LP form (Sec 4.1) | | |
| | | F,B integers | | F,B real-valued | | |
| | | Supports COPY (multi-cast)| NO copy (per-link rate| | |
| | | Optimal with epoch-gap | allocation -> DFS path) | | |
| | | Slowest single problem | Fast, polynomial | | |
| | | (only single time- | Reverts to traditional | | |
| | | partition) | flow conservation | | |
| | +-----------+-----------+ +-----------+-----------+ | |
| | | | | |
| | +-------------+-------------+ | |
| | | | |
| | v | |
| | +----------------------------+ | |
| | | (c) A* technique (Sec 4.2) | | |
| | | Multi-round MILP partitions| <-- best for | |
| | | each round greedy progress | scaling AG | |
| | | toward destinations using | beyond 64-128 | |
| | | Floyd-Warshall alpha-cost | GPUs | |
| | | weights as logical edges | | |
| | | NOT optimal but provably | | |
| | | improves quality with each | | |
| | | round | | |
| | +-------------+--------------+ | |
| +-----------------------------+---------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Solver Backend | |
| | Gurobi 9.5.2 (Pedroso 2011 [30]) | |
| | Optimality-gap early-stop (e.g., 30% for AG eval) | |
| | Feasibility-first warm start (binary search via Alg 1) | |
| | Linux Ubuntu 20.04 VM, 2x Xeon Platinum 8380 (80c/160t), | |
| | 512 GB RAM | |
| +-----------------------------+---------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Schedule Materializer | |
| | - Reverse-DFS over F_{s,d,i,j,k,c} solution | |
| | - Zero-out flows that don't satisfy any demand (Sec 3.1) | |
| | - Convert per-link rate (LP form) into per-chunk paths | |
| | - Emit MSCCL-XML schedule (consumed by MSCCL runtime [5]) | |
| +-----------------------------+---------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Hardware Path (deployment) | |
| | MSCCL XML -> MSCCL runtime (Microsoft fork of NCCL) | |
| | Two AMD ROCm6 nodes, 32 GPUs total (Sec 6, AMD evaluation) | |
| | Hand-optimized to match RCCL/TACCL channel count for fairness| |
| +---------------------------------------------------------------+ |
+--------------------------------------------------------------------+
^ Fig 1: TE-CCL toolkit. Three formulations share notation (Table 1),
capacity constraints, and destination constraints; they differ in
whether F/B are integers, whether the problem is solved as one
global MILP or as a sequence of greedy rounds, and whether copy
(in-network multicast) is supported. The hardware path reuses the
MSCCL runtime — TE-CCL contributes the schedule, not the runtime.
The deliberate factoring is what differentiates TE-CCL from prior CCL solvers. SCCL had one SMT formulation; TACCL had one MILP plus a heuristic ordering hack; TE-CCL exposes the optimality-vs-scale knob to the user as a choice between three formulations rather than as an internal heuristic. The MILP is for cases where copy matters; the LP is for AllToAll where each demand has a unique destination segment; the A* is for AllGather at scale where copy matters but the MILP is too slow. The user is expected to know their workload well enough to pick.
For DynamICCL this is the most important architectural lesson: the right answer is rarely "one formulation." Agent-2 should be able to select among multiple pre-synthesized schedules — each cell of the TE-CCL output catalog corresponds to one (algorithm, msg_size) action the agent can choose at runtime. The catalog itself is offline; the choice is online.
3. Target-Hardware / Topology Specimens
TE-CCL evaluates on six distinct topologies (Table 2 in the paper), spanning standard public hardware and proprietary cloud topologies. The structural diversity is the point: the same solver must handle topologies that range from 4 GPUs/2 edges (Internal 2 chassis) to 17 GPUs/32 edges (DGX2) to 16 GPUs/56 edges (AMD).
+--------------------- TE-CCL Topology Specimens --------------------+
| |
| Topology | GPUs/chassis | Edges/chassis | alpha (us) | use |
| ----------+--------------+---------------+------------+--------- |
| Internal-1| 4 | 8 | 0.6 / 0.75 | proprietary
| Internal-2| 2 | 2 | 0.6 / 0.75 | proprietary
| DGX-1 | 8 | 32 | (NCCL std) | classic NV |
| NDv2 | 8 | 32 | 0.7 | Azure |
| DGX-2 | 17 (16+swt) | 32 | 0.35 / 2.6 | NVSwitch |
| AMD | 16 | 56 | 0.6 / 0.75 | ROCm6 |
| |
| Notes from the paper: |
| - alpha is GPU<->GPU and GPU<->switch link latency in us |
| - DGX-1 is similar to a single chassis NDv2 (8 GPUs, no switch) |
| - DGX-2 has one NVSwitch per chassis (the "+1" GPU count is the |
| switch modeled as a node with zero buffer) |
| - AMD: 200 / 100 / 50 GBps GPU-GPU + 25 GBps switch links |
| (Fig 15) |
| - NDv2 (multi-chassis): 50 / 25 / 12.5 GBps with alpha=0.7us |
| intra-chassis and 1.3us inter-chassis-switch (Fig 14) |
| - DGX2 (2-chassis): 125 GBps direct + 12.5 GBps cross-chassis |
| dashed link with alpha=2.6us (Fig 13) |
| |
+-------------------------------------------------------------------+
^ Fig 2: TE-CCL topology atlas. The structural span is intentional:
Internal 2 has 2 GPUs/2 edges to test the corner case; DGX-2 with
17 nodes/32 edges and a switch tests the SHARP-style copy support;
AMD with 16 nodes/56 edges tests heterogeneous link bandwidth
(200/100/50/25 GBps mix); NDv2 4-chassis (32 GPUs) is the headline
TACCL comparison topology.
3.1 The DGX-2 two-chassis specimen (Fig 13 in paper)
+--------------+ +--------------+
| | 125 GBps, alpha=0.35us | |
| Chassis A |================================| Chassis B |
| 16 GPUs | (8 send-out, 8 recv-in) | 16 GPUs |
| | | |
| +-------+ | 12.5 GBps, alpha=2.6us | +-------+ |
| | NVSwt |--+----------- dashed ---+-------+--| NVSwt | |
| +-------+ | (sparse cross-chassis) | +-------+ |
| | | |
+--------------+ +--------------+
^ Fig 2a: DGX-2 2-chassis. The asymmetry — 8 GPUs send across, 8
GPUs receive — is what makes this a hard scheduling problem.
TACCL can synthesize for this; SCCL cannot (>1 chassis is out
of scope). TE-CCL's headline "2x improvement on TACCL"
measurement is on this exact topology (Fig 14 / Sec 1).
3.2 The NDv2 four-chassis specimen (Fig 14 in paper)
+-------- Chassis 1 --------+ +-------- Chassis 2 --------+
| 8 GPUs, intra-mesh | | 8 GPUs, intra-mesh |
| 50 GBps (solid), | | |
| 25 GBps (dashed) | | |
| alpha=0.7us | | |
+-------------+-------------+ +-------------+-------------+
| |
| NVSwitch (12.5 GBps dotted, alpha=1.3us)
| GPU 0 of each chassis -> GPU 1 of all others
| (TACCL: hyperedge constraint, only one of 3 active)
v v
+----------------------------------------------+
| Central NVSwitch |
+-----------------+----------------------------+
|
+-------- Chassis 3 --------+ +-------- Chassis 4 --------+
| 8 GPUs | | 8 GPUs |
+---------------------------+ +---------------------------+
^ Fig 2b: NDv2 4-chassis = 32 GPUs. TE-CCL's "scales significantly
further than TACCL" claim (Sec 1) is partly grounded on this:
TACCL ran out of memory on this topology for many sizes; TE-CCL
produced schedules.
3.3 The AMD two-chassis specimen (Fig 15 in paper)
+-- Chassis A (16 GPUs, ROCm6) --+ +-- Chassis B (16 GPUs) --+
| Pairs connected via small | | Pairs connected via |
| switches (200/100/50 GBps | | small switches |
| mix, alpha=0.6us) | | |
| | | |
| +----+ +----+ | | +----+ +----+ |
| |GPU | |GPU | ... | | |GPU | |GPU | |
| +----+ +----+ | | +----+ +----+ |
| ^ ^ | | |
| | small switch | | |
+----------+----------------------+ +-------------------------+
| |
v v
+-----------------------------------+
| Big inter-chassis switch |
| all switch links 25 GBps |
+-----------------------------------+
^ Fig 2c: AMD 2-chassis = 32 GPUs. This is the only topology TE-CCL
was deployed *on real hardware* (ROCm6, ROCm5.7 baseline). All
other reported numbers are simulator/solver-derived bandwidths.
AllGather measurement: 3x faster than RCCL at 1 MB; 1.5-2x faster
for >=4 MB transfers (Fig 8).
The hardware-vs-solver split matters for DynamICCL. The bulk of the paper's "TE-CCL outperformed TACCL by 2.14x" headline is derived from the schedules' transfer-time computed under the alpha-beta cost model, not from running on hardware. The AMD ROCm6 numbers are the sole real-hardware confirmation. Agent-2 must therefore treat TE-CCL-style cost-model output as a prior on what is achievable, not a guarantee — DynamICCL's reward signal (collective wall-clock microseconds) is the reality check.
4. Design-Space Diagram (axes swept by the evaluation, axes held fixed)
Section 6 of the paper sweeps a five-axis space. Each evaluation figure (Fig 5-12) fixes a subset and varies the rest.
DESIGN SPACE (5 axes + held fixed)
+---------------------------------------------------------------+
| |
| Axis 1: TOPOLOGY |
| [Internal 1, Internal 2, NDv2, DGX-2, DGX-1, AMD] |
| |
| Axis 2: COLLECTIVE |
| [AllGather (AG), AllToAll (AtoA)] |
| AllReduce: derived as AtoA + AG (Sec 8) |
| Reduce / Broadcast / ReduceScatter: not evaluated |
| |
| Axis 3: TRANSFER SIZE / OUTPUT BUFFER |
| [1 KB, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB, |
| 4 MB, 16 MB, 64 MB, 256 MB, 1 GB] |
| 11 size points across 6 orders of magnitude |
| |
| Axis 4: TE-CCL VARIANT |
| [General MILP (optimal AG + LP for AtoA) |
| A* form (AG only, multi-round) |
| Early-stop 30% (faster, sub-optimal)] |
| |
| Axis 5: # CHASSIS / # GPUs |
| [2 ch, 4 ch, 8 ch, 16 ch, 32 ch] |
| -> 16, 32, 64, 128, 256 GPUs |
| |
| Held FIXED (no sweep): |
| - Chunk size: 25 KB (Table 3) |
| - alpha for 25 KB sweep: 0.7 us |
| - Solver: Gurobi 9.5.2 |
| - Hardware: 80c/160t Xeon 8380, 512 GB RAM |
| - Optimality gap (most evals): 30% |
| - Switch model: TE-CCL flow-conservation by default, |
| TACCL hyper-edge for backwards-compat comparisons |
| - alpha, beta: profiled (using TACCL's profiler), input only|
| - Multi-tenancy / heterogeneous demand: formulated but not |
| benchmarked in the main eval |
| - Failures (link/GPU): ECMP at deployment, NOT in schedule |
| |
+---------------------------------------------------------------+
^ Fig 3: The 5-axis design space. The "held fixed" line is the
important one: chunk size 25 KB and alpha 0.7 us were chosen so
that TE-CCL's epoch tau is ~1 us, fine enough to capture the
delay/queue effects the paper argues prior work misses (Fig 3).
Multi-tenancy is *formulated* (Sec 5, Sec 8) but not measured.
The most consequential held-fixed knob is chunk size = 25 KB. This is much smaller than NCCL's typical 1 MB chunk and is what enables the fine-grained pipelining TE-CCL claims as its advantage over TACCL:
"TE-CCL can better pipeline chunks and so pays less alpha cost with larger transfers." — Sec 6.1, justifying why TE-CCL's least-steps result beats MSCCL on cells with >1 chunk.
For DynamICCL, this is a knob that already exists in the action
space: chunkSize. The paper validates that (chunk_size,
alpha, topology) have a non-trivial joint optimum and that the
choice of chunk size constrains everything downstream. Agent-2
must encode chunk_size as both a state feature (when caller selected a
fixed chunk size) and an action (when DynamICCL is allowed to
override).
5. The TE-CCL MILP — Variables, Constraints, Objective in Detail
Section 3.1 of the paper presents the general MILP. Reproducing it as a single block diagram with variables and constraints labeled:
+--------------------------------------------------------------------+
| TE-CCL General MILP (Sec 3.1) |
| |
| DECISION VARIABLES (Table 1): |
| F_{s,i,j,k,c} in {0,1} : chunk c of source s on link (i,j) |
| in epoch k |
| B_{s,i,k,c} in {0,1} : chunk c of source s in node i's |
| buffer at start of epoch k |
| R_{s,d,k,c} in {0,1} : chunk c delivered to destination d |
| by epoch k |
| |
| PARAMETERS: |
| T_{ij} : capacity of link (i,j) in chunks/epoch |
| alpha_{ij} : fixed latency on link (i,j) in us |
| delta_{ij} : ceil(alpha_{ij} / tau) -- epochs to traverse |
| tau : epoch duration |
| K : max epoch index (auto-bounded by Algorithm 1) |
| D_{s,c,d} : 1 if destination d wants chunk c from s |
| |
| CONSTRAINTS: |
| (1) CAPACITY |
| sum_{s in N} sum_{c in C} F_{s,i,j,k,c} <= T_{ij} * tau |
| for all (i,j) in E, all k in K |
| |
| (2) FLOW CONSERVATION (with delay, with copy) |
| B_{s,n,k,c} |
| + sum_{j: (j,n) in E} F_{s,j,n,k - delta_{jn},c} |
| >= max_{j: (n,j) in E} F_{s,n,j,k+1,c} |
| for all (s,n,k,c): n is not source, n is not switch |
| --> "what node n has at end of epoch k must be >= what it |
| sends out next epoch on EACH outgoing link" |
| (the max-over-outgoing is what allows COPY) |
| |
| (3) BUFFER UPDATE |
| B_{s,n,k,c} = B_{s,n,k-1,c} |
| + sum_{j: (j,n) in E} F_{s,j,n,k - delta_{jn} -1,c}|
| --> buffer accumulates incoming traffic |
| |
| (4) DESTINATION |
| R_{s,d,k,c} = min(D_{s,d,c}, B_{s,d,k+1,c}) |
| R_{s,d,K,c} = D_{s,d,c} |
| --> demand satisfied by epoch K |
| |
| (5) SWITCH NODES (n in S) |
| Buffer = 0; same-link copy disallowed; alpha = 2*alpha_link |
| --> traffic conservation through the switch |
| |
| OBJECTIVE: |
| maximize sum_{k in K, s, d in N: s != d} (1/(k+1)) * R_{s,d,k,c} |
| --> earlier delivery yields exponentially-decaying-by-1/k reward |
| |
| POST-PROCESSING (Sec 3.1, last 3 paragraphs): |
| - Multiple optima exist; some send extra "vampire" flows |
| - Reverse DFS from each destination tracks active flows backward |
| to source; zero out flows that don't appear in any DFS trace |
| - O(|V| + |E|) per source-destination pair |
| |
+--------------------------------------------------------------------+
^ Fig 4: The general MILP, in full. The key insight that distinguishes
it from a textbook TE multi-commodity flow LP is constraint (2)'s
max-over-outgoing-edges: this is what allows COPY (multicast) — a
node can forward the same chunk to multiple downstream nodes while
only "consuming" one input. Traditional TE conserves flow exactly,
forbidding copy.
5.1 The LP relaxation (Sec 4.1)
+--------------------------------------------------------------------+
| TE-CCL LP Form (Sec 4.1) |
| |
| Replace F, B integers -> real-valued in [0, T_{ij}] |
| |
| REMOVE COPY SUPPORT: rewrite (2) without max-over-outgoing |
| B_{s,n,k+1} + R_{s,n,K} + sum_{j: (n,j) in E} F_{s,n,j,k+1} |
| == sum_{j: (j,n) in E} F_{s,j,n,k - delta_{jn}} + B_{s,n,k} |
| --> classic flow conservation: in == buffer + consumed + out |
| |
| DESTINATION constraint becomes cumulative: |
| R_{s,d,k} = sum_{r=0..k} R_{s,d,r} |
| R_{s,d,K} = sum_{c} D_{s,d,c} |
| |
| DROPS SUBSCRIPT c: |
| We no longer track individual chunks (no copy -> no need) |
| This dramatically reduces variables |
| |
| OUTPUT: |
| A "rate allocation" per (source, link, epoch) |
| Convert to per-chunk paths via DFS (Sec 3, end paragraphs) |
| |
| USE: AllToAll (each demand has unique source segment, no copy |
| benefit anyway). The LP is convex, polynomial, and scales |
| to 256-GPU AllToAll (Table 4: Internal-2 256 GPU AtoA in |
| 2.8 hours) |
+--------------------------------------------------------------------+
^ Fig 4a: LP form. Sacrifices copy support to gain polynomial-time
scalability. AllToAll is the canonical workload that doesn't need
copy because every (source, destination) pair has a unique chunk.
5.2 The A* technique (Sec 4.2)
+--------------------------------------------------------------------+
| TE-CCL A* Technique (Sec 4.2) |
| |
| Inspired by Hart, Nilsson, Raphael 1968 [12] (A* search). |
| Partition time into ROUNDS; in each round solve a small MILP that |
| greedily moves chunks closer to destinations. |
| |
| Per-round MILP changes (vs general MILP): |
| - Drop "satisfy all demand by epoch K" constraint |
| (otherwise infeasibility blocks any progress) |
| - Add LOGICAL EDGES connecting every node n to every destination |
| d, with weight equal to alpha-delay along shortest path |
| (computed once via Floyd-Warshall [14]) |
| - New objective term: maximize discounted "distance-to-dest" |
| progress for each chunk currently in transit (variables P,Q) |
| |
| Algorithm 1 (paper's pseudocode): |
| Input: D, G(N,E), tau_opt, alpha, T, n_e (epoch count), C_T |
| For total_time in {4, 8, 12}: |
| For n_e in {fixed candidates}: |
| tau <- total_time / n_e |
| Try general_form(D, tau, alpha, ...) |
| If feasible -> n_e <- feasible_time / tau_opt; break |
| Returns: upper bound on epochs needed |
| |
| Look-ahead state Q_{s,n,c,k',r} carries chunks that arrive in |
| future rounds (after round r ends) so the next round's MILP can |
| account for them. |
| |
| USE: AllGather on large topologies (Internal-1 64-256 GPUs, Table 4)|
| Internal-2 128 GPUs in 7 hours; 256 GPUs in 2.8 hours |
| Sub-optimal but provably improves with more rounds |
+--------------------------------------------------------------------+
^ Fig 4b: A* technique. Same MILP machinery solved iteratively in
smaller chunks, with logical edges encoding the "heuristic distance
to goal" — exactly the h(n) function in classical A*. Trades
global optimality for tractability at >64 GPUs.
The three formulations form a Pareto curve in (optimality, scale):
optimality
^
| General MILP (with copy) <-- 16-32 GPUs, optimal
| *
| \
| * LP form (no copy, AtoA) <-- 128-256 GPUs, optimal
| \
| * A* (multi-round, all collectives) <-- 256+ GPUs,
| \ sub-optimal
| \
+----------------------------------> scale (GPUs)
^ Fig 4c: The Pareto front of the three formulations. The user
picks the rightmost feasible point given their compute budget.
For DynamICCL: this Pareto curve is the offline-synthesis curve. DynamICCL's runtime selects among schedules generated at different points on this curve. The curve also tells Agent-2's exploration budget where to invest — the LP/A* schedules are sub-optimal and have the most regret remaining for runtime knob tuning.
6. Algorithm / Control Flow — MILP Path, LP Path, A* Path, Schedule-Synthesis Runtime
START (user submits topology G(N,E), demand matrix D, collective name)
|
v
(1) Algorithm 1: bound the number of epochs K needed
for total_time in {4, 8, 12}:
for n_e in {fixed candidates}:
tau <- total_time / n_e
Try MILP feasibility (Gurobi)
If feasible: n_e <- feasible_time / tau_opt; break
Return n_e
|
v
(2) Choose formulation:
|
+-- Demand needs copy (e.g. AllGather)? ---+
| |
| YES | NO
| |
v v
(3a) General MILP (3b) LP form
- Integer F, B - Real F, B
- Max-over-outgoing flow conservation - Strict flow conservation
- Optimal in single solve - Polynomial; scales further
| |
| Solver too slow / too memory? |
| |
v v
(3c) A* technique (4) Solve to optimality (or
- Partition time into rounds 30% gap early-stop) with
- Per-round MILP w/ Floyd-Warshall Gurobi 9.5.2
logical-edge heuristic |
- Sub-optimal but bounded v
- Runs until all demands met (5) Reverse-DFS post-process
| (zero out non-load-bearing
| flows; O(|V|+|E|))
| |
+-------------+----------------------------+
|
v
(6) Materialize schedule:
- Convert F/B/R variables -> per-chunk path list
- Each chunk has (rank, src_buf, dst_buf, time_step) entries
- Emit MSCCL-XML (compatible with MSCCL runtime [5])
|
v
(7) Deployment (optional, only the AMD evaluation):
- Hand-tune channel count to match RCCL/TACCL fairness
- MSCCL runtime executes XML on hardware
|
v
END -> schedule consumed by NCCL/RCCL via MSCCL fork
^ Fig 5: End-to-end TE-CCL flow. Steps (1) and (2) are user choices;
(3a/3b/3c) are the three formulation paths; (4-6) are the offline
solve and serialize; (7) is the optional deployment. Steps (1-6)
run on the user's workstation (or a cloud VM); step (7) runs on
the production cluster.
6.1 The "discretized sends" modeling — Fig 4 (paper)
Epoch 0 (one chunk as 2 halves) Epoch 2
+---+ +---+
| s | ---0.5---> +---+ d1 | d3 |
+---+ +---+
| 0.5 +---+
+---------------> | s2| (Epoch 1)
+---+
Without integer F: solver thinks both halves in d3's buffer count
as a "full chunk delivered" -- WRONG.
With integer F: solver tracks individual chunks; it cannot
claim d3 received "the chunk" if it only got
one half from s and one half from s2 because
they are different copies.
^ Fig 6: Why TE-CCL needs integer variables in the MILP form.
Allowing fractional chunks + copy creates phantom completions where
the solver "delivers" a chunk by aggregating two unrelated halves.
This is the bug Sec 3.1's Fig 4 illustrates and that motivates the
general MILP formulation despite its higher solve cost.
6.2 The "in-network copies" modeling — Fig 2a (paper)
+-----+
| h | (intermediate node, GPU)
+-----+
/ | \
/ | \ alpha = 1 us each
/ | \
+---+ +---+ +---+
|d1 | |d2 | |d3 |
+---+ +---+ +---+
s sends 1 MB chunk to h. Without copy modeling:
s sends to d1 directly (1 link, 1 us alpha + transfer)
s sends to d2 directly (1 link, 1 us alpha + transfer)
s sends to d3 directly (1 link, 1 us alpha + transfer)
-> 3 separate transmissions = 4 sec total
With copy modeling:
s sends to h once (1 us alpha + 1 sec transfer)
h forwards to d1, d2, d3 (1 us alpha each, parallel)
-> 2 sec total (a 2x improvement)
^ Fig 7: Why TE-CCL needs in-network copy. Tree broadcast / reduce
patterns rely on intermediate nodes duplicating data. Traditional
TE forbids this because real WAN routers don't multicast at line
rate; GPU collectives CAN multicast (the SM that holds the chunk
can write to multiple outgoing buffers).
6.3 The "latency + queuing" modeling — Fig 2b (paper)
s1 ----alpha----> h1 ---alpha---> h2 ---alpha---> h3 ---alpha---> d
^
| alpha
|
s2
Two chunks (one from s1, one from s2) arrive at h3 at the same time.
ONE chunk must queue while the other transmits.
The queue delay is alpha_{2->3} (= 1 us).
Total time = 3*alpha + 2 (without queue model)
= 3*alpha + (1 us queue) + 2 (with queue model)
^ Fig 8: Why TE-CCL needs to model queue + propagation delay.
Steady-state TE assumes large traffic bundles where alpha is
amortized; CCL has small/medium chunks where alpha dominates.
Fig 3 (paper) shows the relative error in average throughput
estimate exceeds 100% when transfer size is below 0.1 MB if the
solver doesn't model alpha properly.
These three figures (Fig 2 in the paper) are the conceptual core of the paper's "TE doesn't directly apply" argument. Each motivates one modeling addition to the standard multi-commodity flow LP.
7. The Three CCL-vs-TE Differences (the paper's central insight)
The paper's contribution can be reduced to three modeling additions that distinguish a CCL-aware MILP from a textbook TE multi-commodity flow LP:
| TE assumption | CCL reality | TE-CCL fix |
|---|---|---|
| Sends are continuous bandwidth | Sends are discrete chunks of fixed bits | Integer F variables; epoch-discretized time |
| Flow conservation: in=out exactly | In-network multicast (copy) is free | max-over-outgoing flow conservation (Sec 3.1, eqn 2) |
| alpha is small vs sustained transfer | alpha dominates for small chunks | delta_{ij}=ceil(alpha/tau) added to flow constraints |
| Queues amortize over long flows | Small flows can collide and queue | Buffer constraint per node per epoch (Sec 3.1, eqn 3) |
| Demand is sustained / large-bundle | Demand is finite, must complete | Destination constraint w/ K=last epoch (Sec 3.1, eqn 4) |
| Switches are full-buffer routers | GPU switches have ~zero buffer | Special switch model (Sec 3.1, end of subsection) |
This table is the most exportable artifact of the paper. For DynamICCL: each row above corresponds to a modeling assumption that must hold at runtime as well. If the runtime system violates any of them (e.g., congestion pushes effective alpha 10x higher), the synthesized schedule's optimality vanishes. Agent-2 should monitor runtime alpha and trigger schedule re-selection when alpha drifts.
8. Quantitative Results — Empirical Findings by Regime
8.1 Headline numbers (from Abstract + Sec 1)
| Comparison | Result |
|---|---|
| TE-CCL vs TACCL on NDv2 2-chassis | 2x better algorithm bandwidth |
| TE-CCL vs TACCL on AMD GPU testbed | 2.14x algorithm bandwidth |
| TE-CCL vs RCCL on AMD GPU testbed | 3.18x algorithm bandwidth |
| TE-CCL solver time vs TACCL solver time | "approximately the same time as TACCL's heuristic" |
| TE-CCL scaling vs TACCL | "scales to larger topologies than TACCL" |
8.2 AllGather AB improvement vs TACCL (Fig 5 in paper)
| Topology | Best AB improvement | Conditions |
|---|---|---|
| NDv2 | 0.36% (mean) | up to 970% (max) for AllGather |
| DGX-2 | 12% (mean) | up to 471% (max) |
| Internal 1 | -5% (mean) | up to 689% (max), some cells worse |
| Internal 2 | 0.33% (mean) | up to 5759% (max) |
For AllToAll improvement vs TACCL:
| Topology | Best AB improvement | Conditions |
|---|---|---|
| NDv2 | 0.18% (mean) | up to 2919% (max) |
| DGX-2 | 9% (mean) | up to 2979% (max) |
| Internal 1 | 20% (mean) | up to 197% (max) |
| Internal 2 | 0.48% (mean) | up to 12322% (max) |
The wide max-vs-mean spread is the most interesting empirical pattern. Most cells are roughly equivalent to TACCL; a small number of cells are dramatically (2-100x) better. This implies TE-CCL's win is concentrated in regimes where TACCL is mis-tuned (sketch quality collapses for that topology) rather than uniform across the design space.
8.3 NDv2 2-chassis AllToAll (Table 8 — selected cells)
Output buffer | TE-CCL CT (us) | TACCL CT (us) | Improvement %
1 GB 320235.81 320049.40 -0.058% (TACCL wins)
1 MB 325.28 359.00 +10.366%
256 KB 85.52 115.72 +35.313%
64 KB 23.30 50.34 +116.052%
16 KB 7.27 35.76 +392.223%
4 KB 4.64 32.16 +592.134%
1 KB 4.24 36.80 +768.920%
The pattern is striking: at large buffers (1 GB, 256 MB), TACCL is narrowly better; at small buffers (1-16 KB), TE-CCL is 4-8x better. This is the small-message regime where alpha dominates and TACCL's heuristic ordering loses to TE-CCL's explicit pipeline modeling.
8.4 NDv2 2-chassis AllGather (Table 8 — selected cells)
Output buffer | TE-CCL AB (GB/s) | TACCL AB (GB/s) | Improvement %
1 GB 22.86 18.60 +22.90%
256 MB 22.86 20.49 +11.56%
64 MB 22.86 20.43 +11.90%
1 MB 20.51 16.09 +27.49%
256 KB 17.39 10.13 +71.60%
64 KB 10.53 4.89 +115.13%
16 KB 3.60 1.26 +185.59%
Here the win is uniform — TE-CCL beats TACCL at every size from 1 GB down to 16 KB on AllGather — but the ratio grows sharply at small messages. This is the same shape as 8.3 but lifted off zero.
8.5 AMD ROCm6 hardware results (Fig 8 in paper)
Buffer size | RCCL AB | TACCL AB | TE-CCL AB | Speedup over RCCL
16 KB ~0.8 ~1.0 ~2.5 ~3.1x
64 KB ~1.3 ~2.0 ~4.0 ~3.0x
256 KB ~2.2 ~3.5 ~6.0 ~2.7x
1 MB ~3.5 ~5.5 ~10.5 ~3.0x <-- "1.5-2x faster for >=4 MB"
4 MB ~6.0 ~9.0 ~14.0 ~2.3x
32 MB ~9.0 ~14.0 ~16.0 ~1.8x
128 MB ~10.5 ~15.0 ~16.5 ~1.6x
512 MB ~11.0 ~15.0 ~16.5 ~1.5x
1 GB ~11.5 ~15.5 ~16.5 ~1.4x
(Numbers are read off Fig 8 bar chart, paper does not publish a table.)
"TE-CCL outperformed RCCL on ROCm5.7 by a larger margin for small transfers, but RCCL improved their manually constructed schedules for small transfers in their recent update [ROCm6] and reduced the gap." — Sec 6.2
So the AMD measurement is itself a moving target: the very small message regime where TE-CCL won most has been partially closed by RCCL's hand tuning between ROCm5.7 and ROCm6. Synthesized schedules race against vendor hand-tuning; the gap closes as vendors invest.
8.6 Solver time vs TACCL (Fig 6 in paper)
"TE-CCL is faster than TACCL on 45% of AllToAll scenarios and 40% of AllGather scenarios (with early stop) on the NDv2 topology; 72% and 27% for DGX2; 72% and 83% for Internal 1; and 50% of AllGather for Internal 2."
Solver time is on par or favorable in most cells. The early-stop mechanism (30% optimality gap) is what makes the comparison fair — TACCL has its own analogous heuristic for ordering.
8.7 Scaling table (Table 4 in paper)
| Topology | Collective | # GPUs | EM | Solver time |
|---|---|---|---|---|
| Internal 1 | AG (A*) | 64 | 1 | 3000 s |
| Internal 1 | AG (A*) | 128 | 1 | 7 h |
| Internal 2 | AG (A*) | 128 | 1 | 1300 s |
| Internal 2 | AG (A*) | 256 | 1 | 2.8 h |
| Internal 1 | AtoA | 16 | 1 | 66 s |
| Internal 1 | AtoA | 32 | 1 | 215 s |
| Internal 1 | AtoA | 64 | 1 | 500 s |
| Internal 1 | AtoA | 128 | 2 | 800 s |
| Internal 2 | AtoA | 128 | 1 | 2600 s |
| Internal 2 | AtoA | 256 | 4 | 1500 s |
EM = epoch multiplier (factor by which epoch is coarsened to fit memory). TACCL "ran out of memory and did not produce a solution for almost all Internal 1 topologies (over 4 chassis)" (Sec 6.1).
8.8 Microbenchmarks — Copy benefit (Fig 9), Epoch size (Fig 10), Buffer (Fig 11), A* vs OPT
| Microbenchmark | Finding |
|---|---|
| Copy on/off (Fig 9) | Copy reduces transfer time by 50% on DGX1 and 12.5% on Internal-2 at largest sizes; no help for small transfers (enough capacity for direct sends). |
| Epoch size (Fig 10) | Smaller epochs improve solution quality on NDv2 / DGX2 (heterogeneous links) — slowest link dictates transfer time when epoch is large. Internal 1 (homogeneous links) shows no difference. |
| Buffers (Fig 11) | Buffers do not improve solution quality but reduce solver time by 61% (Internal 1 alpha=0), 71% (Internal 1 alpha>0), -28.46% (Internal 2 — slower with buffers), 0.23% (DGX1). |
| A* vs OPT (Sec 6.4) | At alpha=0: A* finishes in 86.5 s (263.3 s for 2 chunks); OPT finishes in 346 s (4392 s for 2 chunks). A* is 10% off optimal (6% in 2-chunk case). At alpha>0: A* in 137 s (901 s); OPT in 363 s (3047 s); A* is 20% off (8% in 2-chunk). |
These four microbenchmarks tell a consistent story: each TE-CCL modeling addition (copy, epoch tuning, buffers, A) is empirically justified, but each operates only in a specific regime*. Copy helps at large transfers; epoch tuning matters on heterogeneous-link topologies; buffers speed solver time but don't change quality; A* trades 6-20% optimality for 4-10x solver speedup.
For DynamICCL: this regime-specificity is exactly the kind of state-conditional optimal action that motivates RL. None of these TE-CCL knobs is monotonically beneficial; each has a regime where it helps and a regime where it hurts.
9. Configuration-Regime Trade-off Tables
9.1 Three TE-CCL formulations
| Dimension | General MILP | LP form | A* technique | Winner (DynamICCL) |
|---|---|---|---|---|
| Optimality | Optimal | Optimal* | Sub-optimal (6-20%) | MILP |
| Copy / multicast support | Yes | No | Yes | MILP, A* |
| Time complexity | Worst (NP-hard) | Polynomial | Multi-round | LP (when applicable) |
| Max GPUs (in eval) | 17 (DGX-2) | 256 (Internal-2) | 256 (Internal-2) | LP for AtoA, A* for AG |
| Schedule quality | Best | Best for AtoA | Slightly worse | MILP, LP per workload |
| User burden | Pick K | Pick K | Pick K, K' (round size) | LP (least burden, but no copy) |
| AllGather suitability | Optimal at <=64 GPUs | N/A (no copy) | Best at >64 GPUs | A* for big-AG cells |
| AllToAll suitability | OK at <=32 GPUs | Best at >32 | OK | LP |
*LP is optimal only for the no-copy subset of demand patterns.
For DynamICCL, prefer LP form at synthesis time when caller is issuing AllToAll, MILP form when caller issues AllGather/Broadcast on small clusters, and A form* when AllGather/Broadcast at large scale. This mapping (collective × scale -> formulation) is exactly the offline-catalog axis Agent-2 selects from at runtime.
9.2 TE-CCL vs SCCL vs TACCL vs MSCCLang/GC3
| Dimension | SCCL (PPoPP'21) | TACCL (NSDI'23) | MSCCLang (ASPLOS'23) | TE-CCL (SIGCOMM'24) | Winner |
|---|---|---|---|---|---|
| Auto-synthesis from topo | Yes (SMT) | Yes (MILP+heur) | No (DSL, manual) | Yes (MILP/LP/A*) | TE-CCL |
| User input | Topology | Topology + sketch | Topology + Python | Topology only | TE-CCL |
| Solver type | SMT (Z3) | MILP (Gurobi) | -- | MILP (Gurobi) | -- |
| Cost model | Hockney alpha-beta | Hockney | -- | Hockney + delta | TE-CCL |
| In-network copy | Yes | Yes (via switch hyperedge) | Yes (DSL prim) | Yes (max-over-out) | tied |
| Buffer/store-and-forward | Implicit | No | Yes (DSL) | Explicit (B vars) | TE-CCL |
| Queueing modeling | No | No | -- | Yes (delta_{ij}) | TE-CCL |
| Multi-cast TE roots | No | No | No | Yes (Sec 7) | TE-CCL |
| Max scale (eval) | ~30 GPUs (1 chassis) | 128 GPUs | 256 GPUs | 256 GPUs | tied (TE-CCL/MSCCL) |
| Multi-tenant support | No | No | No | Yes (Sec 5) | TE-CCL |
| Hardware deployment | DGX-1 sim | NDv2 + DGX-2 | A100 + DGX-2 | AMD ROCm6 | tied |
| Programmability | Low (synth-only) | Med (sketch) | High (DSL) | Low (MILP only) | MSCCLang |
| Adaptability at runtime | None | None | None | None | NONE — DynamICCL gap |
For DynamICCL, prefer TE-CCL as the synthesizer of record for multi-cell catalogs — it produces highest-quality schedules with least user input. But notice the bottom row: none of the four synthesizers adapts at runtime. They all produce static schedules. DynamICCL's runtime knob selection is orthogonal and complementary to all four — it composes with any of them.
9.3 Workload regime atlas
| Regime | Best TE-CCL formulation | Best knob choice | Winner (DynamICCL state) |
|---|---|---|---|
| AllGather, <=32 GPUs, large msg | General MILP | optimal epoch, buffers on | MILP-large schedule |
| AllGather, <=32 GPUs, small msg | General MILP | optimal epoch, EM=1 | MILP-small schedule |
| AllGather, 64-256 GPUs | A* (epoch multiplier ~ 2-4) | A* + early-stop 30% | A* schedule |
| AllToAll, <=64 GPUs | LP form | optimal epoch | LP-AtoA schedule |
| AllToAll, 128-256 GPUs | LP form | EM=2-4 | LP-AtoA-scaled schedule |
| AllReduce (any) | AtoA (LP) + AG (MILP/A*) | composed schedule | AR-composed schedule |
| Heterogeneous-link topology (NDv2) | Smaller epoch | optimal-epoch model | small-epoch schedule |
| Homogeneous-link (Internal-1) | Larger epoch (faster solver) | max-epoch | large-epoch schedule |
| Switch supports SHARP-style copy | Switch flow-conservation | enable copy through switch | with-copy-switch |
| Legacy switch (no copy at switch) | TACCL-style hyperedge model | explicit hyperedge constraints | hyperedge schedule |
Each row is a candidate offline-synthesis cell. The number of cells is manageable (tens to low hundreds), so DynamICCL's runtime can carry all of them in a lookup table indexed by (collective, msg_size_bin, nGPU_bin, topology_fingerprint). Agent-2 selects the cell at call time; the cell selection is the action.
9.4 Epoch duration knob
| Epoch tau | Solver time | Solution quality | Use when |
|---|---|---|---|
| Large (max-epoch) | Fast | Lower | Homogeneous links |
| Small (fastest-link) | Slow | Higher | Heterogeneous links |
| Medium (5x scaling) | Moderate | Moderate | alpha > 200*tau cases |
Sec 6 of the paper explicitly recommends set tau = bandwidth of fastest link for most evaluations. When alpha is large relative to tau, increase tau by 5x to avoid making large models infeasible.
10. Bottlenecks & Insights Surfaced by the Measurements
10.1 Small messages are where alpha modeling pays off
The Fig 3 plot — relative error in algorithm bandwidth estimate vs transfer size — shows that without modeling alpha, the error exceeds 100% below 0.1 MB transfer size. This validates the entire TE-CCL premise: queueing and propagation delay dominate for small chunks, and a TE solver that ignores them produces wildly inaccurate schedules in the regime where most CCL traffic actually lives.
For DynamICCL, this bottleneck appears as a state-feature requirement. Agent-2 must know:
alpha_estimate(current best estimate of per-link latency)chunkSize(caller's choice)effective_bandwidth_at_chunk_size = bw / (1 + alpha/(chunk/bw))
The third quantity is the closed-form "small-message tax" that TE-CCL encodes via delta_{ij}.
10.2 In-network copy is a large-transfer optimization, not a
universal one
Fig 9: copy reduces transfer time by 50% on DGX1 at largest sizes, 12.5% on Internal-2, and zero benefit at small sizes (because small transfers have enough direct-link capacity to multi-cast from source). This is the opposite of what naive intuition suggests — "multicast must always help."
For DynamICCL: copy/multicast schedules should be in the action space
only for large messages. Below some threshold, direct point-to-point
schedules win because alpha overhead of intermediate hops exceeds the
multi-link savings. Agent-2 must encode msg_size_bin and
learn this threshold.
10.3 Buffers reduce solver time, not solution quality
Fig 11: buffer presence improves solver time by 61-71% on Internal 1 and DGX1, but solution quality is essentially unchanged. The paper's hypothesis (Sec 6.4):
"Buffers do not impact the solution quality but only the solver time ... because each node needs the same amount of traffic as it has to forward, it can interleave consuming traffic with forwarding it to compensate for the lack of buffers."
So buffers don't unlock new optimal schedules; they just give
the solver more search-space slack to find the existing optima
faster. This is a counter-intuitive empirical fact that informs
Agent-2's prior: do not over-explore buffer-related runtime knobs (which
in NCCL would be numChannels and chunkSize —
they affect buffer occupancy). Their leverage is small except as a
search-time accelerator for the offline solver.
10.4 The slowest link dictates transfer time when epoch is too large
Fig 10: on heterogeneous-link topologies (NDv2, DGX2 with 4x bandwidth disparity between fast and slow links), using a large epoch wastes capacity on fast links. Smaller epochs let chunks traverse fast links multiple times per "epoch-equivalent" and capture the heterogeneity. On homogeneous topologies (Internal 1) this doesn't matter.
For DynamICCL: this argues for topology-fingerprint as a state feature that distinguishes homogeneous from heterogeneous link distributions. NCCL's algorithm choice (Ring vs Tree) is the runtime analog: Ring assumes homogeneous bandwidth; Tree handles heterogeneity better. Agent-2 should pick Ring on homogeneous fingerprints, Tree on heterogeneous ones.
10.5 The vendor-tuning arms race
Sec 6.2: "TE-CCL outperformed RCCL on ROCm5.7 by a larger margin for small transfers, but RCCL improved their manually constructed schedules for small transfers in their recent update [ROCm6] and reduced the gap." This is an honest acknowledgment that synthesized schedules race against vendor hand-tuning and the relative win depends on the vendor's investment cycle.
For DynamICCL: this is the "no static catalog wins forever" insight. The right design is a runtime tuner that continuously re-evaluates which schedule is best — exactly what Agent-2 does. Synthesized schedules become the action vocabulary for an adaptive policy, not the policy itself.
10.6 Multi-tenant support is formulated but not tested
Sec 5 ("Use in multi-tenant clusters") and Sec 8 (limitations) both acknowledge that multi-tenant scheduling is a single MILP at the sum-demand level and that hardware execution of such schedules "likely introduces additional technical challenges, but we leave these for future work." The paper formulates it; nobody has tested whether the synthesized multi-tenant schedules execute correctly.
For DynamICCL: this is a regime where Agent-2 must observe
concurrent collective calls across tenants and adapt — a state
feature like concurrent_tenant_count or
cross_tenant_traffic_mix would be relevant here. Static
synthesis doesn't capture this; runtime sees it directly.
11. Limitations of the Methodology
| Limitation | Implication for DynamICCL |
|---|---|
| Schedules are static (no runtime adaptation) | DynamICCL's runtime tuning is strictly complementary |
| Most numbers are solver-derived, not hardware-measured | Cost-model bias means real hardware may diverge from claimed AB |
| Hardware deployment only on AMD 32 GPUs | NVIDIA, NDv2, large-scale numbers are simulated only |
| No congestion / failure modeling in solver | Real clusters have transient hot links that solver assumes static |
| alpha, beta are inputs (profiled by user) | Wrong alpha calibration silently produces wrong schedules |
| Multi-tenant only formulated, not benchmarked | Cross-tenant interference is the regime DynamICCL must own |
| Chunk size held at 25 KB for most cells | Other chunk sizes are not characterized; agent must explore |
| Optimality gap of 30% in many evals | "Optimal" claim is approximate; remaining 30% is RL's territory |
| Solver time grows with K (number of epochs) | Long-running schedules cannot be re-synthesized at runtime |
| Switch model is bimodal (SHARP vs hyperedge) | Real switches sit in between; DynamICCL must measure not assume |
| Failure handling delegated to ECMP | Solver assumes perfect link health; real topology drifts |
| AllReduce only via AtoA + AG decomposition | Other AllReduce algorithms (Tree, CollNet, NVLS) not in catalog |
| Compute cost not accounted for in AllReduce | Reduce-during-relay cost ignored |
| RAM ceiling at 350 GB for AllToAll on large topologies | Synthesis itself becomes a resource-budget problem |
| TACCL comparison runs the public TACCL artifact | Comparison fairness depends on TACCL's tuning state at submission |
| 4 alpha test bins (4, 8, 12 sec) in Algorithm 1 | Coarse epoch search; finer search could find smaller K |
The most consequential limitation for DynamICCL is the second and
third combined: almost all reported speedup numbers are derived
from the solver's own cost model, not from hardware
measurement. The solver could be uniformly off by a factor — its
optimality conclusions are valid only relative to the same cost model.
DynamICCL must validate any schedule it selects against a real reward
signal (-collective_wall_clock_us) and not trust the
synthesizer's predicted bandwidth.
12. What to Borrow for DynamICCL — Composability vs Tension with Run-Time Knob Selection
This section addresses the prompt's central question: TE-CCL produces optimized static schedules, while DynamICCL selects NCCL knobs at runtime per-collective. Are these complementary or in tension?
12.1 The two-layer composition picture
+--------------------------------------------------------------------+
| Caller (PyTorch DDP) |
| issues collective(s, msg_size, model, batch, ...) |
+----------------------------+---------------------------------------+
|
v
+--------------------------------------------------------------------+
| DynamICCL Tuner Plugin (RUNTIME, per-call) |
| state_t = (msg_size_bin, model_intensity_I, batch_M, |
| topology_fingerprint, recent_timing_LSTM, |
| synthesizer_catalog_id) |
| action_t = (algorithm, protocol, nChannels, numThreads, |
| chunkSize, schedule_id_from_catalog) |
| reward_t = -wall_clock_us |
+----------------------------+---------------------------------------+
|
| selects schedule_id from catalog
v
+--------------------------------------------------------------------+
| Pre-synthesized Schedule Catalog (OFFLINE, per-cluster) |
| |
| +------------------+ +------------------+ +-----------------+ |
| | TE-CCL MILP | | TE-CCL LP | | TE-CCL A* | |
| | AG, <=32 GPUs | | AtoA, all sizes | | AG, >64 GPUs | |
| +------------------+ +------------------+ +-----------------+ |
| +------------------+ +------------------+ +-----------------+ |
| | TACCL sketch | | MSCCLang DSL | | NCCL Ring | |
| | DGX-2, AG | | Hierarchical AR | | (default) | |
| +------------------+ +------------------+ +-----------------+ |
| +------------------+ +------------------+ |
| | NCCL Tree | | SCCL synth | |
| | (default) | | DGX-1, AG | |
| +------------------+ +------------------+ |
+--------------------------------------------------------------------+
|
v
+--------------------------------------------------------------------+
| NCCL Core / RCCL / MSCCL Runtime |
| executes selected schedule with chosen knobs |
+--------------------------------------------------------------------+
^ Fig 9: Two-layer composition. TE-CCL contributes entries to the
offline catalog (the bottom box). DynamICCL is the runtime selector
(the middle box). The two layers do not compete; they are
factorings of the same problem at different time scales.
12.2 What DynamICCL borrows directly
| TE-CCL artifact / insight | DynamICCL adoption |
|---|---|
| Three-formulation Pareto curve (MILP, LP, A*) | Schedule catalog has 3 entries per (collective, topology) cell |
| 25 KB chunk size as the alpha-vs-bw inflection point | chunkSize_25K is a critical action-space anchor for
small messages |
| delta_{ij}=ceil(alpha/tau) modeling | Agent-2 state must include current_alpha_estimate_us
per link |
| Switch flow-conservation vs hyperedge as binary choice | Topology fingerprint encodes switch model; Agent-2 conditions on it |
| Reverse-DFS post-processing | Pattern reuse: prune actions whose flow doesn't reach the destination |
| 30% early-stop optimality gap | Agent-2 should not over-train on cells where the catalog is already 30% sub-optimal — RL has at least 30% headroom there |
| Buffers improve solver time, not quality | Reduce exploration of
numChannels/chunkSize as quality knobs in
homogeneous-link settings; treat them as latency knobs |
| Heterogeneous links benefit from small epochs | Agent-2 picks Tree algo on heterogeneous topology fingerprints |
| AllReduce = AtoA + AG decomposition | Agent-2's AR action space inherits this composition (implicitly via NCCL) |
| Multi-tenant multi-tenant demand-sum formulation | Agent-2's state includes
concurrent_tenant_demand_proxy |
| TACCL profiler used to get alpha, beta | DynamICCL can reuse the same profiling harness for cluster bring-up |
| Floyd-Warshall as the heuristic distance function | Pre-compute per-(src,dst) shortest-path distances and cache as state feat |
| Logical edges encode goal-distance reward shaping | Agent-2 reward shaping:
reward = -wallclock - lambda * remaining_distance |
12.3 Where TE-CCL and DynamICCL are in tension
The composition picture above hides three real conflicts:
Tension A — Catalog completeness vs runtime regret. TE-CCL's catalog is finite. If Agent-2 selects a schedule from the catalog and the cluster state is outside any catalog cell's profile, the schedule will under-perform. DynamICCL's RL approach addresses this partially — by selecting the closest catalog cell — but fundamentally limits regret to the gap between best-catalog-cell and true-optimal-for-current-state. The mitigation: include "synthesize on-demand" as an action with a high cost, so Agent-2 can request a fresh schedule when no catalog entry is good enough. Cost should reflect the synthesizer's compile time (LP: minutes; MILP: hours).
Tension B — TE-CCL's cost model assumes a static (alpha,
beta). In production, alpha and beta vary with congestion, NIC
throttling, and thermal events. TE-CCL's optimality is conditional on
the input profile. DynamICCL's runtime alpha estimate may drift away
from the catalog's training alpha, silently invalidating optimality. The
mitigation: Agent-2 monitors
alpha_drift = |current_alpha - catalog_alpha| / catalog_alpha
and triggers a "re-profile or fall-back-to-NCCL-default" when drift
exceeds a threshold (e.g., 2x).
Tension C — TE-CCL's MILP horizon (K epochs) vs RL's per-call
decision. TE-CCL plans the entire collective in one
optimization. DynamICCL decides one config per collective call but the
next call's collective is itself another K-epoch plan. If two
consecutive calls overlap (pipelining), the catalog schedules are
optimized in isolation while runtime sees concurrent execution. The
mitigation: Agent-2 state includes
recent_collective_in_flight_ratio and penalizes schedules
with high concurrent-bandwidth peaks during pipelined regions. This is a
tension that pure-static synthesis cannot resolve; only runtime
adaptation can.
12.4 State-vector additions justified by TE-CCL
Add to Agent-2 state vector s_t:
+------------------------------------------------------------+
| alpha_per_link_estimate : float (us, profiled) |
| beta_per_link_estimate : float (sec/byte, profiled) |
| alpha_drift_ratio : float (vs catalog baseline) |
| topology_fingerprint : enum |
| {homogeneous_links, heterogeneous_links_2x, |
| heterogeneous_links_4x, switch_with_SHARP, |
| switch_legacy_hyperedge} |
| switch_count : int |
| concurrent_tenant_demand : float (relative to topology) |
| catalog_optimality_gap_30 : float (last cell's gap) |
| small_msg_floor_active : bool (msg_size < 256 KB) |
| copy_benefit_estimate : float (large-transfer score) |
| schedule_age_at_runtime : int (epochs since synthesis) |
+------------------------------------------------------------+
^ Fig 10: TE-CCL-justified state features. Most are simple scalar
derivations of the topology and recent profiling data; alpha_drift
is the runtime-adaptive feature that makes the static catalog
trustworthy.
12.5 Action-space priors justified by TE-CCL
| Action | Prior bias direction | Triggering state |
|---|---|---|
| Pick MSCCL/TE-CCL synthesized schedule | Prefer for AG/AtoA, skip for small AR | catalog_cell exists AND msg_size > 256 KB |
| Pick NCCL Ring (default) | Default for homogeneous links | topology = homogeneous_links |
| Pick NCCL Tree | Default for small-msg latency | msg_size_bin < 64 KB |
| Pick NVLS / CollNet | Default for switch-supports-SHARP | switch_with_SHARP |
| Increase nChannels | Solver-time accelerator only | catalog_optimality_gap_30 > 0.2 |
| Decrease chunkSize | Small-msg latency optimization | msg_size_bin <= 16 KB AND alpha_drift_ratio < 1.5 |
| Increase chunkSize to 25 KB | TE-CCL's default chunk for solver | offline-catalog re-synthesis bridge |
| Re-synthesize via TE-CCL on-demand | High-cost action, last resort | catalog_optimality_gap_30 > 0.5 OR alpha_drift_ratio > 2 |
12.6 Reward shaping justified by TE-CCL
The reward DynamICCL uses is the wall-clock metric. TE-CCL's
objective is sum_{k} 1/(k+1) * R_{s,d,k} — i.e.,
earlier delivery is worth more. This is a discounted-completion-time
reward, which maps cleanly onto Agent-2's per-call reward:
r_t = -wallclock_us
- lambda_chunk_progress * (1 - chunks_delivered_by_first_half / total_chunks)
- lambda_drift * max(0, alpha_drift_ratio - 1)
where:
wallclock_us is the dominant term (TE-CCL's actual goal)
lambda_chunk_progress shapes for "fast partial completion" (TE-CCL's discount)
lambda_drift penalizes schedule-vs-runtime mismatch (the catalog-stale signal)
Lambda values can be set to lambda_chunk_progress = 0.1,
lambda_drift = 0.5 based on a sweep — TE-CCL's own
evaluation suggests a 30% optimality gap is acceptable, so the
chunk-progress shaping coefficient should not exceed roughly that
fraction of the wall-clock term.
12.7 Exploration-budget allocation justified by TE-CCL
TE-CCL's win-distribution is heavy-tailed (Sec 8.2: max-vs-mean spread is 100-12000x). Most cells are equivalent to TACCL; a few are massively better. Exploration budget should follow this same heavy tail:
| Regime where TE-CCL beats TACCL by >10x | Agent-2 exploration policy |
|---|---|
| AllToAll small messages (1-16 KB) | High exploration; small-msg algo+proto regime |
| AllGather small-medium messages (16-256 KB) | High exploration; chunk size + nChannels |
| Heterogeneous-link topology | High exploration; algorithm selection |
| AllToAll on Internal-2 (max win 12000%) | Highest exploration; this is where catalog matters most |
| Regime where TE-CCL is on par or loses | Agent-2 exploration policy |
|---|---|
| AllToAll, 1 GB transfers | Low exploration; bandwidth-saturated |
| AllGather, 256 MB+ on NDv2 | Low exploration; near-optimal already |
| Homogeneous links + large messages | Low exploration; few algos suffice |
This atlas is the offline regret map. Agent-2 should be aggressive where the synthesized catalog yields large but variable wins (big absolute gain available) and conservative where the catalog is already close to the bandwidth ceiling (small absolute headroom).
12.8 The "no human in the loop" architectural lesson
TE-CCL's central architectural advance over TACCL is removing the sketch — the human-supplied hint about which links to use. This is the same architectural ambition DynamICCL has at the runtime layer: no human in the loop, RL discovers the policy from reward. The proof point: TE-CCL produces schedules of TACCL-quality with no sketch, so an automatic procedure can subsume a hand-tuned hint. DynamICCL's analog is producing per-collective configs of expert-tuned quality with no static profile.
The methodology to reuse: train Agent-2 on a corpus of TE-CCL-
generated schedules with their solver-predicted bandwidths as a prior
reward. Then fine-tune on real hardware with wallclock_us.
This is policy distillation followed by online RL — exactly the pattern
that worked for AlphaZero (offline self-play -> online fine-tuning)
and that is justified here by the existence of a high-quality offline
synthesizer.
12.9 The compile-time-vs-runtime composition principle
Both layers must obey one rule: the runtime layer can only refine, never invalidate, the compile-time layer's assumptions. TE-CCL's schedules are correct only if (alpha, beta, topology) hold. Agent-2 must not select a schedule whose preconditions are violated by current cluster state. This is the same discipline as a JIT compiler that may de-optimize when speculative assumptions fail — DynamICCL's "fall-back-to-NCCL-default" is the de-optimization step.
IF all preconditions hold:
pick best-catalog-cell schedule (TE-CCL, TACCL, MSCCLang, etc.)
ELSE IF some preconditions violated:
pick most robust action (NCCL Ring/Tree default + safe knobs)
ELSE IF severe drift detected:
trigger re-synthesis (high cost, async)
This three-tier policy is the cleanest way to compose static synthesis and runtime adaptation.
13. Analogy
TE-CCL is a city traffic-routing department that gets handed a new neighborhood layout each Monday and produces a static route plan for the week. The city has three planners on staff:
- Senior planner (General MILP): considers every possible route including multi-vehicle convoys (multicast/copy). Slow but produces the best plans. Used for small or medium neighborhoods.
- Junior planner (LP): runs through every route in parallel using a simpler model (no convoys allowed, each driver goes alone). Much faster. Used for large neighborhoods where the senior planner would take days.
- Expert dispatcher (A technique):* breaks the week into successive shifts and plans only one shift at a time, using "how far is each driver from their destination" as a heuristic. Slightly sub-optimal but scales to very large neighborhoods.
The department hands the published plan to the bus drivers (the MSCCL runtime), who execute it for the week. The plan never changes mid-week — even if a road gets congested or a bridge closes, drivers follow the original plan.
DynamICCL is the traffic operations center that watches the city in real time. It cannot redraw the plan (synthesizing is too slow), but it can:
- Choose which of three planners' plans to follow this morning (offline-synthesized catalog selection).
- Tell drivers which side of the road to drive on (NCCL Ring vs Tree), what speed (chunk size), how many simultaneous drivers (nChannels), and which streets to avoid (numThreads-based congestion awareness).
- Detect when current conditions (alpha drift, congestion) make the catalog plan stale and order a fall-back to the simplest known-good route (NCCL default).
- Periodically request the planning department to produce new plans when no existing plan fits today's neighborhood state.
The relationship is hierarchical adaptation: TE-CCL plans the streets; DynamICCL drives the cars. TE-CCL's contribution is the planning machinery — the multi-commodity-flow MILP that lets a small team produce city-wide plans without a sketch. DynamICCL's contribution is the operations machinery — the RL policy that chooses among plans and tunes execution at the second-by-second level. Neither replaces the other; together they realize the old WAN-engineering principle that traffic engineering and traffic operations are different jobs run on different timescales by different agents.
The closing intuition pump: when a synthesizer says "the optimal schedule for AllGather on NDv2-2 with 1 KB messages is X," it has committed to a static optimum. When a runtime says "I will execute X for the next 50 microseconds and watch what happens," it has committed to an adaptive trajectory that converges to the static optimum and to whatever the synthesizer didn't model. The synthesizer's static optimum is the runtime's prior. The runtime's adaptive trajectory is the synthesizer's reality check.
Summary of Borrowed Patterns
| Pattern from Liu et al. (TE-CCL, SIGCOMM 2024) | DynamICCL application |
|---|---|
| Multi-commodity-flow MILP as the right CCL abstraction | Schedule catalog populated by TE-CCL; Agent-2 selects from it |
| Three formulations (MILP / LP / A*) on a Pareto curve | Three-tier offline catalog: optimal-small, scalable-AtoA, A*-large-AG |
| 25 KB chunk size as alpha-vs-bw inflection | chunkSize_25K is the small-msg anchor in the action
space |
| delta_{ij}=ceil(alpha/tau) for delay modeling | alpha_estimate_us and alpha_drift_ratio in
state vector |
| Reverse-DFS to prune non-load-bearing flows | Action-space pruning: drop schedules that don't reach destinations |
| Floyd-Warshall logical edges as A* heuristic | Pre-compute shortest-path-distance per (src,dst) for reward shaping |
| Buffers improve solver time, not quality | Treat numChannels/chunkSize as latency knobs, not quality knobs |
| Heterogeneous-link topology benefits from small epochs | Topology fingerprint state feature; Tree algo on heterogeneous, Ring on homogeneous |
| Switch model (SHARP-style copy vs TACCL hyperedge) | Topology fingerprint distinguishes switch types |
| Multi-tenant via demand-sum MILP (formulated, untested) | Concurrent-tenant state feature; runtime sees what synthesizer didn't test |
| 30% optimality gap early-stop | Acknowledged catalog headroom; Agent-2 reward has at least 30% recoverable margin |
| AllReduce = AllToAll + AllGather decomposition | Agent-2's AR action space inherits this composition |
| RAM ceiling ~350 GB for AllToAll on Internal-1 large topologies | Schedule re-synthesis is a high-cost action; reserve for severe drift |
| Schedules are static; no failure / congestion modeling | DynamICCL's runtime adaptation is strictly complementary; same schedule + drift |
| 2x improvement vs TACCL on NDv2 2-chassis | Schedule-quality prior; Agent-2 should NOT explore equivalent NCCL defaults here |
| 3.18x vs RCCL on AMD ROCm6 | Vendor-tuning baseline updates over time; Agent-2 must re-evaluate periodically |
| TACCL profiler used for alpha, beta inputs | Reuse TACCL profiler for cluster bring-up alpha/beta calibration |
| Heavy-tailed win distribution (max-vs-mean 100-12000x spread) | Heavy-tailed exploration: invest most exploration where catalog wins are largest |
| MSCCL runtime as deployment target | DynamICCL outputs map to MSCCL XML or NCCL-direct configs |
| "No human in the loop" architectural goal | DynamICCL extends the same goal one layer down (runtime knob selection) |
| Microsoft GitHub release (microsoft/TE-CCL) | DynamICCL benchmark harness must be open source for reproducibility |
| TE roots: PoP, NCFlow, BLASTSHIELD, etc. for solver scalability | Borrow scalability tactics from TE literature, not just CCL literature |