TE-CCL: Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem — Detailed Summary

Xuting Liu (UPenn), Behnaz Arzani (MSR), Siva Kesava Reddy Kakarla (MSR), Liangyu Zhao (UW), Vincent Liu (UPenn), Miguel Castro (OpenAI), Srikanth Kandula (Microsoft), Luke Marshall (MSR) | ACM SIGCOMM 2024 | DOI: 10.1145/3651890.3672249

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

"Cloud operators utilize collective communication optimizers to enhance the efficiency of the single-tenant, centrally managed training clusters they manage. However, current optimizers struggle to scale for such use cases and often compromise solution quality for scalability. Our solution, TE-CCL, adopts a traffic-engineering-based approach to collective communication. Compared to a state-of-the-art optimizer, TACCL, TE-CCL produced schedules with 2x better performance on topologies TACCL supports (and its solver took a similar amount of time as TACCL's heuristic-based approach). TE-CCL additionally scales to larger topologies than TACCL. On our GPU testbed, TE-CCL outperformed TACCL by 2.14x and RCCL by 3.18x in terms of algorithm bandwidth."

The paper targets the synthesis of CCL schedules for centrally-managed GPU training clusters at the scale of hundreds of GPUs.
Main contributions: (i) reformulation of CCL as a multi-commodity flow (MCF) problem, (ii) MILP encoding capturing chunked sends, in-network copies, alpha/beta latency-bandwidth model, and store-and-forward buffering, (iii) two scaling techniques — LP relaxation (when no copy is needed) and time-partitioned A* search — that take the optimizer to 256-GPU regimes.
Headline empirical claims: 2x schedule quality vs. TACCL (NDv2 simulation), 2.14x vs. TACCL and 3.18x vs. RCCL on a real AMD MI250x cluster. Solver time competitive with TACCL's heuristic.

1. Introduction

Why CCL synthesis is needed:

Distributed training of large ML models is dominated by communication. Cloud operators (Microsoft, OpenAI in this paper's author list) run enormous single-tenant training clusters where collective scheduling is a high-leverage optimization.
Vendor libraries (NCCL, RCCL) ship a fixed catalog of algorithms (Ring, double-binary Tree) tuned for canonical topologies. They are sub-optimal on heterogeneous fabrics and at small/medium message sizes.
CCL synthesizers (MSCCL/SCCL, TACCL, Blink) try to generate better schedules but face a precision-vs-scalability trade-off.

The two prior camps:

Precise solvers: SCCL (and predecessors) use SAT solvers and produce bandwidth-optimal schedules. They scale only to 1-2 chassis; beyond that the SAT search space explodes super-exponentially.
Heuristic solvers: TACCL uses hand-designed "communication sketches" — human-supplied hints that constrain the search to a tractable subspace. Scales further but (a) requires expert hand-tuning per topology, (b) produces schedules up to 2x slower than truly optimal.

TE-CCL's positioning:

Borrow from network traffic engineering (TE), which has solved WAN-scale routing as MCF for decades.
Adapt MCF to CCL specifics: chunked discrete sends, in-network multicast, propagation latency, GPU-memory storage.
Result: scales like TACCL (heuristic-fast) yet produces near-optimal schedules like SCCL.

Headline numbers (introduced here, detailed later):

2x improvement over TACCL on NDv2 across collectives.
2.14x over TACCL and 3.18x over RCCL on AMD MI250x testbed (algorithm bandwidth).
Solver scales to 256 GPUs (Internal-2 topology) where TACCL crashes and MSCCL is infeasible.

2. Background and Motivation

2.1 The Need for Fast Collective Scheduling

Large-model training (data, model, pipeline, expert parallelism) is comm-bound; communication is now the dominant cost component.
Cluster operators need an automated optimizer because:
- Topologies are heterogeneous (DGX1 vs. DGX2 vs. NDv2 vs. AMD)
- Workloads change (different parallelism strategies, varying collective patterns AllGather/AllToAll/AllReduce)
- Manual scheduling per (topology, workload) is intractable.

2.2 Relationship with TE Solutions

The paper draws a structural analogy between CCL scheduling and WAN traffic engineering. The MCF substrate of TE matches CCL's needs: inputs (topology, demands), constraints (capacity, flow conservation), objectives (max throughput / min completion time). But four CCL-specific differences must be handled:

Discretized Sends: "TE problems mostly focus on traffic bundles with high rates and the problem of allocating a fixed fraction of link capacity to each demand. Instead, CCLs have to schedule small- to medium-sized demands, which introduces more structure and adds new and, in some cases, hard-to-model constraints and dependencies."

In-Network Copies: "TE problems often assume flow conservation as a fundamental constraint; in contrast, collectives benefit significantly from copying data at intermediate GPUs, e.g., for tree broadcast/reduce patterns."

Latency and Queuing: "TE problems get away with focusing on steady-state effects... because they assume large traffic bundles. In contrast, we cannot ignore the effects of propagation and queuing delay for small transfers; modeling them is essential to CCL scheduling."

Support for Storage and Caching: "TE problems generally assume that data is received and sent as soon as possible... we can speed up solvers substantially if we use the available GPU memory."

These four extensions structure the rest of the paper.

3. The TE-CCL Model

3.1 The General Model (MILP)

Variables (Table 1 notation):

Variable	Description
`N`, `S`	Set of GPU nodes and switches
`E`	Set of unidirectional edges
`C`	Chunk IDs `{0, 1, ..., C}`
`D`	Demand function `N x N x C -> {0,1}`
`tau`	Epoch duration
`K`	Number of epochs
`F_{s,i,j,k,c}`	Flow of source-`s` chunk `c` over link `(i,j)` at epoch `k`
`B_{s,i,k,c}`	Buffer at node `i` of source-`s` chunk `c` at start of epoch `k`
`T_{ij}`	Capacity of link `(i,j)`
`alpha_{ij}`	Fixed latency of link `(i,j)`
`delta_{ij}`	Number of epochs in `alpha_{ij}` (i.e. `ceil(alpha_{ij}/tau)`)
`R_{s,d,k}`	Chunks from `s` read by `d` in epoch `k`

Constraints (in order, paraphrased to ASCII):

Source-buffer initial condition: B_{s,n,0,c} = max_{d in N} D_{s,d,c} if n == s else 0 The source initializes its buffer with the maximum chunk demand destined for any node; non-source nodes start empty.
Capacity: sum_{s in N} sum_{c in C} F_{s,i,j,k,c} <= T_{ij} * tau Total chunks traversing link (i,j) in epoch k cannot exceed the link's per-epoch capacity.
Flow conservation with delay: `B_{s,n,k,c} + sum_{j: (j,n) in E} F_{s,j,n, k - ceil(delta_{jn}), c}

= max_{j: (n,j) in E} F_{s,n,j, k+1, c}What is in noden's buffer plus arriving flow (delayed by the link's propagation latency) must cover what nsends out next epoch. Themax` operator on the right side encodes multicast: the same chunk can be replicated to multiple outgoing links without duplicating consumption.
Buffer evolution (store-and-forward): B_{s,n,k,c} = B_{s,n,k-1,c} + sum_{j: (j,n) in E} F_{s,j,n, k - ceil(delta_{jn}) - 1, c} Buffer at the start of epoch k equals buffer at start of k-1 plus all chunks that arrive in the interim.
Destination: R_{s,d,k,c} = min(D_{s,d,c}, B_{s,d,k+1,c}) R_{s,d,K,c} = D_{s,d,c} for the final epoch K Demand must be fully satisfied by the last epoch.
Multicast/copy: encoded implicitly via the max on the right-hand side of (3) — the same buffered chunk can be sent on multiple outgoing edges in the same epoch.
Reward variable: R_{s,d,k} is the count of source-s chunks delivered to destination d by epoch k, summed across c.
Objective: maximize sum_{k in K, s,d in N, s != d} (1/(k+1)) * R_{s,d,k} Rewards earlier delivery; the 1/(k+1) weight pushes the optimizer to finish chunks as quickly as possible while preserving correctness (full demand satisfaction by epoch K).

Why integer (MILP) for the general case:

Without integer constraints, fractional flow can lead to "half-chunk" delivery: a destination thinks it has received the chunk because its buffer count reaches the integer threshold, even though no single upstream node actually sent the entire chunk. Figure 4 in the paper illustrates this pathology.

3.2 Three Aspects That the Model Captures

(Figure 2 in the paper) — three things the model must represent for correctness:

(a) In-network copies: without modeling copy, broadcast time doubles. Figure 2(a) shows transferring 1 chunk to 2 destinations: sequential-only takes 4 epochs, with copy it takes 2.
(b) Latency / queuing: chunks cannot be forwarded until fully received. Without alpha modeling, the optimizer over-pipelines and over-counts throughput.
(c) Store-and-forward: GPU memory permits intermediate buffering; enlarges the feasible-schedule space dramatically.

Figure 3 quantifies the latency point: ignoring alpha produces relative throughput error >100% for transfers <10 KB.

4. Scaling

4.1 LP Conversion (for collectives without in-network copy)

For AllToAll, in-network copy is never useful (each chunk has a unique destination). The integer constraints are unnecessary.
Drop integer constraints; the MILP becomes a pure LP.
Solver time drops by orders of magnitude.
A reverse-DFS post-processing pass removes "wasteful" flow that the LP relaxation may have introduced (slack flow that does not contribute to demand satisfaction).

4.2 The A* Technique (for general MILP at scale)

The MILP for AllGather and AllReduce-with-copy does not relax to LP. The A* technique partitions the problem in time into sequential rounds, each round solving a smaller MILP.

Step-by-step:

Pre-processing: Compute Floyd-Warshall shortest-path distances FW_{n,d} between every pair of nodes on the topology graph (weighted by alpha).
Logical clique edges: For each round, augment the topology with logical edges from every node n to every destination d, weighted by FW_{n,d}. These are virtual; they do not consume real link capacity.
Round objective: Solve a K_round-epoch MILP that maximizes the same rewarded delivery objective, but the logical-edge weights give partial credit for moving a chunk closer to its destination even if it does not arrive in this round.
Inter-round state: Late-arriving chunks from this round (Q_{s,c,k',r}) are carried forward as initial buffer state to round r+1.
Demand update: chunks satisfied this round are removed from the demand matrix D for round r+1.
Termination: repeat until all demands are satisfied.

The Floyd-Warshall reward is the A heuristic*: each round greedily makes progress toward the destinations; rounds chained together produce a globally near-optimal schedule.

4.3 Algorithm 1 — Epoch Bound Finder

The MILP requires an a-priori choice of K (number of epochs). Too small and the problem is infeasible; too large and solver time balloons. Algorithm 1 finds the smallest feasible K:

Iterate over candidate completion times C_tau.
For each C_tau, test epoch counts n_e in {4, 8, 12} (preset).
Use Gurobi's "feasibility" mode: returns yes/no quickly without full optimization.
Binary-search to identify the smallest n_e that admits a feasible schedule for the given C_tau.

This is much faster than running the full MILP at every candidate K.

5. Important Considerations

5.1 Epoch Duration `tau`

Sweeping tau over a small grid produces better schedules than any single fixed value.
Default: tau set to the time the fastest link takes to deliver one chunk.
Smaller tau increases epoch count K and solver time; larger tau coarsens the schedule (risking missed pipelining opportunities).

5.2 Number of Epochs `K`

Determined via the binary search of Algorithm 1 (above).
Gurobi's feasibility-only mode terminates quickly even for large MILPs, making the search tractable.

5.3 Switch Models

Two abstractions are supported, matching different real-world fabrics:

SHARP-style switch:
- Switch is a node with B = 0 (no buffer).
- Supports flow conservation with in-network copy.
- Models NVSwitch + SHARP (NVIDIA in-network reduction) and multicast-capable fabrics.
Legacy hyper-edge switch:
- Switch is removed; replaced by a fully-connected clique among all GPUs sharing the switch.
- Capacity constraints ensure the sum of concurrently active hyper-edges does not exceed the physical switch's port capacity.
- Matches TACCL's switch model — useful for apples-to-apples comparisons.

5.4 Buffer Sizing

Chunk size in evaluation: 25 KB (default).
Buffer sizes swept: 1 KB to 1 GB — covers the small-to-large message regime that NCCL/RCCL ship algorithms for.

6. Evaluation

6.1 Topologies (Table 2)

Topology	GPUs/chassis	Edges/chassis
Internal 1	4	8
Internal 2	2	2
DGX1	8	32
NDv2	8	32
DGX2	17	32
AMD	16	56

DGX1, DGX2, NDv2 are publicly-known NVIDIA topologies. Internal-1 and Internal-2 are proprietary cloud topologies (Internal-1 has alpha = 0.6 / 0.75 us per link). AMD is a real testbed configuration (2 chassis, 32 MI250x GPUs, ROCm 6.0).

6.2 Solver Times — TE-CCL vs. Baselines

Direct head-to-head (single-chassis, small chunk count):

AllToAll, 8 chunks: TE-CCL 1.88 s vs. MSCCL timeout at 10,032 s (>5000x speedup; same or better schedule quality).

TACCL solver behavior:

AllToAll on 2-chassis DGX2: TACCL timed out after 4+4 hours (8 hours total).
TACCL fails on heterogeneous topologies and at >2 chassis frequently.

TE-CCL solver scaling (Table 4):

Topology	Collective	# GPUs	Solver Time
Internal 1	AllGather (A*)	64	3000 s
Internal 1	AllGather (A*)	128	7 h
Internal 2	AllGather (A*)	128	1300 s
Internal 2	AllGather (A*)	256	2.8 h
Internal 1	AllToAll	16	66 s
Internal 1	AllToAll	128	800 s
Internal 2	AllToAll	256	1500 s

AllToAll uses LP relaxation; AllGather uses A* with MILP. AllToAll solves much faster because LP scales much better than MILP.

6.3 Schedule Quality — Algorithm Bandwidth

Single-chassis simulation (NDv2, DGX2, DGX1):

TE-CCL produces schedules with 2x better algorithm bandwidth than TACCL on supported topologies.
Figure 5: small-message regime on NDv2 / DGX2 — algorithm-bandwidth improvement vs. TACCL peaks above 3000% (>30x) for some configurations where TACCL's sketches collapse.

Pipelining wins (Table 3, DGX1, K=10, 25 KB chunks):

Collective	# chunks	MSCCL (us)	TE-CCL (us)	Pipelining?
AllGather	1	3.4	4.0	No
AllGather	2	5.1	5.0	Yes
AllGather	3	8.0	6.1	Yes
AllToAll	1	3.4	4.0	No

1-chunk regime: MSCCL marginally faster (no pipelining possible; TE-CCL's epoch overhead shows).
3-chunk regime: TE-CCL 24% faster by overlapping chunk transfers across links.
This is the only regime where MSCCL beats TE-CCL.

6.4 AMD MI250x Real-Hardware Testbed (Figure 8)

Configuration: 2 chassis, 32 MI250x GPUs, ROCm 6.0, RCCL.
Workloads: AllReduce, AllGather, AllToAll across 1 KB - 1 GB.
Headline:
- TE-CCL 2.14x faster than TACCL
- TE-CCL 3.18x faster than RCCL (algorithm bandwidth)
For 1 MB transfers: ~3x faster than RCCL; for larger sizes (~100 MB - 1 GB): 1.5-2x faster.

6.5 In-Network Copy Wins

For large-message AllReduce / AllGather (workloads that admit multicast):

DGX1 at 0.21 GB transfer: in-network copy reduces transfer time by up to 50%.
Internal-1 at 0.21 GB: similar 50% reduction.
For AllToAll (no copy possible): copy modeling is irrelevant; LP relaxation is the win.

6.6 Latency `alpha` Modeling Sensitivity (Figure 3)

Without modeling alpha per link, the optimizer's predicted throughput differs from actual throughput by:
- 100% relative error at <10 KB transfers
- ~50% error at 100 KB
- ~10% error at >1 MB
Confirms that for the small-message regime, link-latency modeling is the binding correctness constraint.

The paper organizes prior work into three families:

(A) Multicast TE / WAN TE — predecessors to TE-CCL's MCF approach, historically applied to network routing rather than collectives.

(B) Prior CCL synthesizers:

SCCL / MSCCL (SAT-based): bandwidth-optimal, scales to ~1 chassis. TE-CCL replaces SAT with MILP for both faster solving and multi-chassis scalability.
TACCL (sketch heuristic): scales further than SCCL but requires hand-tuned communication sketches and is up to 2x sub-optimal. TE-CCL eliminates the need for sketches.
Blink: bandwidth-aware tree generation; less general than TE-CCL's MCF formulation.
MSCCLang / GC3: complementary — they are codegen layers (DSL + compiler for instruction-level fusion and threadblock scheduling), while TE-CCL is the upstream schedule synthesizer. The two compose: TE-CCL produces the algorithm; MSCCLang lowers it to efficient CUDA.

(C) NCCL / RCCL fixed-template libraries: Ring and double-binary Tree only; tuned for canonical NVLink-symmetric DGX1/DGX2 topologies. TE-CCL beats them on heterogeneous fabrics by 2-3x.

8. Limitations and Future Work

8.1 Failure Handling

The MCF formulation simplifies Clos topologies into a "big-switch" abstraction. Intra-chassis link failures are not represented in the model.
A failure invalidates the pre-computed XML schedule.
Future direction: robust optimization or online re-synthesis.

8.2 Cloud Dynamics / Multi-Tenancy

Assumes static alpha and beta per link.
In multi-tenant or shared environments, these fluctuate due to competing traffic, congestion, and topology changes.
Future direction: online learning of link parameters; uncertainty- aware MCF; closed-loop adaptation.

8.3 Compute Cost in AllReduce

The reduction operation (sum/avg of gradients on the GPU) takes real time, especially on small messages, but is treated as free in the MILP.
Future direction: multi-stage demand matrices that include compute as a node-internal "edge" with its own latency.

8.4 Hardware-Specific Mapping

TE-CCL produces algorithmic schedules but does not model threadblock budgets, register pressure, memory-channel contention, or kernel-launch overhead.
Future direction: integration with MSCCLang / GC3-style codegen.

8.5 Static-Schedule Assumption

The XML schedule is open-loop: executed deterministically once computed.
No runtime adaptation to stragglers, transient interference, or topology change.
Robustness today is "coarse" — pad alpha/beta with safety margins.

9. Discussion of NCCL / RCCL / MSCCL / TACCL

9.1 NCCL and RCCL

Use manually-defined Ring and double-binary Tree algorithms.
TE-CCL generates custom schedules that exploit non-uniform topologies — particularly heterogeneous bandwidth between intra-chassis NVLink and inter-chassis IB/Ethernet.
TE-CCL output (XML) feeds directly into NCCL/RCCL via the MSCCL runtime, replacing the default Ring/Tree templates.

9.2 MSCCL / SCCL

MSCCL/SCCL use SAT solvers; precise but capped at ~1 chassis.
TE-CCL's MCF/MILP approach is much faster and scales to hundreds of GPUs while preserving near-optimal schedule quality.
MSCCL's main schedule-quality advantage (1-chunk regime) comes from its ability to exhaustively search small problems.

9.3 TACCL

TACCL relies on human-supplied "communication sketches" to make the search tractable.
TACCL does not model latency precisely, so its schedules are sub-optimal on small messages.
TE-CCL eliminates sketches and improves performance by 2x while matching TACCL's solver speed.

9.4 Pipelining as the Key Distinction

TE-CCL's epoch-based formulation explicitly models pipelining: a node can be receiving one chunk while forwarding another.
MSCCL's formulation uses rigid barriers — a chunk must be fully received before any forwarding.
This is the source of TE-CCL's 24% AllGather-3-chunk win in Table 3.

9.5 In-Network Multicast (SHARP, NVSwitch)

The SHARP-style switch model is a first-class citizen in the MILP.
For AllReduce / broadcast workloads on NVSwitch + SHARP fabrics, copy reduces transfer time by up to 50%.
For AllToAll (no benefit from copy), the LP relaxation path bypasses this machinery entirely.

10. Cross-Cutting Empirical Take-Aways

Take-away	Derived from
MILP scales further than SAT for CCL	TE-CCL solves 256-GPU AllToAll where MSCCL/SCCL is infeasible
Sketches are not needed if MILP/A* is used	TE-CCL beats TACCL by 2x with no human input
Latency `alpha` modeling is decisive at small messages	Figure 3: >100% error without alpha at <10 KB
Pipelining beats rigid barriers	Table 3: AllGather 3-chunk drops 8.0 us -> 6.1 us
In-network copy halves AllReduce/AllGather at large sizes	DGX1 / Internal-1 at 0.21 GB: 50% reduction
Real hardware confirms simulation: 3.18x vs RCCL on AMD	Figure 8
LP relaxation is correct only when no copy is needed	AllToAll uses LP; AllReduce/AllGather use MILP
Static schedules suffice for centrally-managed clusters	Open-loop execution via MSCCL XML

11. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer. Its policy selects, per collective invocation, the algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time. State features: log-binned message size, model intensity I = C/D, local batch size, topology fingerprint, and an LSTM-encoded recent-collective timing window. Reward is -collective_wall_clock_us. DynamICCL operates inside NCCL via the tuner-plugin API.

TE-CCL connects to DynamICCL across multiple design axes:

Direct mappings to DynamICCL design:

TE-CCL finding	DynamICCL design implication
2x improvement over TACCL purely from explicit pipelining	DynamICCL's `chunkSize` / `numPipeOps` action axis is real and matters; never collapse it.
Latency `alpha` causes >100% throughput error at <10 KB	Strong action prior: at small messages, prefer LL + Tree (low-latency path). DynamICCL should encode `alpha`-vs-`beta` regime as a state feature (effectively log-binned message size).
In-network copy halves time at >100 MB on NVSwitch+SHARP	DynamICCL should expose CollNet / NVLS as an action when the topology fingerprint indicates SHARP-capable fabric; otherwise mask it out.
MILP on 128-GPU AllGather takes 7 hours	Synthesis cost is large; DynamICCL's RL-trained policy amortizes this across many invocations and can pick a TE-CCL-emitted schedule slot at run time without re-solving.
TE-CCL emits MSCCL XML, executable via NCCL	DynamICCL's action space includes a "use MSCCL schedule X" option. The two systems compose; DynamICCL is the run-time selector over a library that includes TE-CCL outputs alongside vanilla Ring/Tree.
Static schedules cannot adapt to dynamic alpha/beta	DynamICCL's online-RL formulation directly solves this gap — the recent-collective timing window in state lets the policy detect congestion.
Topology heterogeneity (DGX1/DGX2/NDv2/AMD) is the primary win	Topology fingerprint must be an explicit state feature, not assumed away. Train per-fabric policy heads or feed topology as embedding.

Specific design priors for the RL agent:

Action space layering. DynamICCL's primary action space is the NCCL parameter tuple (algorithm, protocol, nChannels, numThreads, chunkSize). To incorporate TE-CCL, add a categorical sub-action "MSCCL schedule slot in {none, ring_default, te_ccl_AG_v1, te_ccl_A2A_v1, ...}". The policy learns when the synthesized schedule beats the vendor default.
Reward consistency. TE-CCL's MILP objective rewards earlier delivery, summed 1/(k+1) * R_{s,d,k}. DynamICCL's -collective_wall_clock_us is a stricter version of the same intent — minimize completion time. Cross-validation: both papers agree that completion-time, not throughput, is the right metric.
State features grounded by TE-CCL evidence.
- Log-binned message size (small / medium / large) corresponds to the alpha-bound / transition / beta-bound regimes TE-CCL identifies via Figure 3.
- Topology fingerprint corresponds to TE-CCL's distinct schedule wins per fabric (DGX1 vs. DGX2 vs. NDv2 vs. AMD).
- SHARP-capable / multicast-capable boolean: maps to TE-CCL's SHARP-style switch model selector and gates the CollNet/NVLS action.
- Recent-collective timing window: TE-CCL's open problem ("dynamic alpha/beta") is exactly this signal. DynamICCL observes it; TE-CCL does not.
Exploration prior on (algorithm, protocol):
- msg < 16 KiB: Tree + LL (alpha-bound; TE-CCL Figure 3 evidence)
- 16 KiB - 1 MiB: Tree or Ring + LL128 (transition zone)
- msg >= 1 MiB: Ring + Simple + max nChannels (beta-bound; pipelining-friendly)
- On SHARP-capable fabric and AllReduce, also explore CollNet/NVLS above 1 MiB (TE-CCL's "in-network copy halves time" regime).
Exploration budget allocation.
- More exploration budget on heterogeneous fabrics where TE-CCL showed Ring/Tree are far from optimal (NDv2, DGX2, AMD).
- Less on canonical NVLink-symmetric DGX1 where Ring is already near-optimal.
- Topology-fingerprint-conditioned exploration noise.
Reward shaping.
- Primary: r = -collective_wall_clock_us.
- Optional secondary: penalize p99 tail latency to discourage the "lucky-fast" schedule that becomes catastrophic under interference — exactly the failure mode TE-CCL flags as future work.
Research positioning. TE-CCL is a static, open-loop schedule synthesizer for centrally-managed clusters with stable topology. DynamICCL is the run-time, closed-loop, congestion-adaptive layer that picks among schedules (including TE-CCL-emitted ones) and tunes NCCL parameters. The papers are orthogonal layers in the same stack — their composition (TE-CCL synthesizes; DynamICCL selects + tunes) is the natural deployment.
Open-problem alignment. TE-CCL Section 8 lists four open problems. DynamICCL directly addresses three:
- Multi-tenant / unknown topology: DynamICCL's online RL learns from observed runtime feedback regardless of topology change.
- Compute-aware AllReduce: DynamICCL's reward is end-to-end wall-clock, which already includes reduction-kernel time.
- Stragglers / failures: DynamICCL's recent-timing-window feature lets the policy detect and route around interference. The fourth problem (hardware-specific lowering) belongs to the MSCCLang/GC3 layer, not DynamICCL.