TE-CCL: Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

Xuting Liu (UPenn), Behnaz Arzani (MSR), Siva Kesava Reddy Kakarla (MSR), Liangyu Zhao (UW), Vincent Liu (UPenn), Miguel Castro (OpenAI), Srikanth Kandula (Microsoft), Luke Marshall (MSR) | ACM SIGCOMM 2024 | DOI: 10.1145/3651890.3672249

Problem

Cloud operators of single-tenant, centrally-managed GPU training clusters need a collective-communication-library (CCL) optimizer that can produce near-optimal schedules at the scale of today's deployments — hundreds of GPUs across tens of chassis on heterogeneous topologies (DGX1, DGX2, NDv2, AMD MI250x). Existing CCL synthesis falls into two camps and neither covers this regime. Precise solvers (SCCL with SAT, MSCCL's prior MILP variants) are bandwidth-optimal but cannot scale beyond one or two chassis — solver times explode super-exponentially. Heuristic solvers (TACCL with hand-designed "communication sketches") scale better but require human-in-the-loop sketching and produce schedules that are up to 2x slower than truly optimal. The result: ML systems teams either accept NCCL/RCCL's hard-coded Ring/Tree templates (which ignore irregular fabrics), or they pay synthesis-time costs incompatible with operations at production scale. There is no optimizer that simultaneously (a) scales to many chassis, (b) produces near-optimal algorithm bandwidth, and (c) requires no manual topology hints.

Core Insight

CCL scheduling is structurally a multi-commodity flow (MCF) problem from traffic engineering — the long-studied LP/MILP framework used to route demands across WAN backbones — but with four CCL-specific extensions: (i) discretized chunked sends (integer flow variables to prevent half-chunk delivery), (ii) in-network multicast/copy (intermediate GPUs replicate data, breaking flow conservation), (iii) explicit per-link latency alpha and bandwidth beta modeling so small-message regimes are correct, and (iv) store-and-forward buffering in deep GPU memory rather than shallow-buffered routers. Adopting MCF as the substrate enables the authors to plug in classic TE scaling techniques — LP relaxation (when copies are not needed, e.g. AllToAll) and time-partitioned A* search with Floyd-Warshall reward shaping — and reach 256-GPU scale where SAT/SAT-like CCL solvers fail.

Method

TE-CCL is a three-stage pipeline:

+------------------------------------------------------------------+
| INPUT                                                            |
|   Topology graph (nodes N, switches S, edges E,                  |
|     link capacity T_ij, link latency alpha_ij)                   |
|   Demand matrix D : N x N x C -> {0,1} (chunk-level)             |
|   Epoch duration tau, # epochs K                                 |
+----------------------------+-------------------------------------+
                             v
+------------------------------------------------------------------+
| OPTIMIZER                                                        |
|   General MILP (multicast / AllReduce / AllGather)               |
|   - integer flow vars F_{s,i,j,k,c}  (chunk c at epoch k on i->j)|
|   - buffer vars B_{s,n,k,c}                                      |
|   - reward vars R_{s,d,k}                                        |
|   Reduction to LP for AllToAll (no copy required)                |
|                                                                  |
|   Constraints:                                                   |
|     [C1] Source-buffer initial conditions                        |
|     [C2] Capacity         sum F_{*,i,j,k,*} <= T_ij * tau        |
|     [C3] Flow conservation with delay (delta_ij = ceil(alpha/tau))|
|     [C4] Buffer evolution (store-and-forward)                    |
|     [C5] Destination       R_{s,d,K,c} = D_{s,d,c}               |
|     [C6] Multicast/copy via max/min operators                    |
|                                                                  |
|   Objective:  max sum_{k,s,d} (1/(k+1)) * R_{s,d,k}              |
|               (rewards earlier delivery; preserves correctness)  |
+----------------------------+-------------------------------------+
                             v
+------------------------------------------------------------------+
| SCALING                                                          |
|   (a) LP relaxation         -> AllToAll, AllGather (no copy)     |
|   (b) A* time partitioning  -> general MILP at scale             |
|         pre: Floyd-Warshall FW_{n,d} on alpha-weighted graph     |
|         add logical clique edges weighted by FW                  |
|         solve round-by-round; carry late-arriving chunks Q       |
|   (c) Algorithm 1: binary search over (C_tau, n_e in {4,8,12})   |
|       to find minimum feasible epoch count                       |
+----------------------------+-------------------------------------+
                             v
+------------------------------------------------------------------+
| POST-PROCESS                                                     |
|   Reverse-DFS prune: zero out flows that don't contribute        |
|   to satisfying demand (cleans LP-relaxation slack)              |
+----------------------------+-------------------------------------+
                             v
+------------------------------------------------------------------+
| OUTPUT: MSCCL-compatible XML schedule                            |
|         executable on NCCL/RCCL via MSCCL runtime                |
+------------------------------------------------------------------+

Two switch abstractions are supported: SHARP-style (switch is a zero-buffer node that supports copy, modeling NVSwitch + SHARP / NVSwitch + multicast) and legacy hyper-edge (switch removed, replaced by a clique of direct GPU-GPU edges with port-capacity constraints — matches TACCL's switch model).

Experimental Setup

Component	Value
Solver	Gurobi (MILP and LP)
Output format	MSCCL-XML, executed on NCCL/RCCL
Simulation topologies	DGX1 (8 GPUs/32 edges), DGX2 (17/32), NDv2 (8/32), Internal-1 (4/8), Internal-2 (2/2)
Real testbed	2-chassis AMD cluster, 32x MI250x GPUs, ROCm 6.0
Baselines	NCCL, RCCL, MSCCL/SCCL (SAT-based), TACCL (sketch heuristic)
Workloads	AllGather, AllToAll, AllReduce
Chunk size	25 KB (standard); buffer sizes swept 1 KB to 1 GB
Metric	Algorithm bandwidth (GB/s), end-to-end completion time (us), solver wall-clock

Headline Quantitative Results

Schedule quality vs. TACCL (the heuristic state of the art):

2x better algorithm bandwidth on NDv2 across collectives
AMD MI250x testbed: 2.14x faster than TACCL, 3.18x faster than RCCL
Figure 5 shows >3000% (30x) improvement vs. TACCL for small transfers on NDv2 / DGX2 in some regimes (TACCL's sketches collapse there)

Schedule quality vs. MSCCL (the precise SAT-based competitor):

AllToAll, 8 chunks: TE-CCL solves in 1.88 s vs. MSCCL timeout at 10,032 s (>5000x speedup with comparable schedule quality)
Table 3 (DGX1, K=10, 25 KB chunks): for 3-chunk AllGather, TE-CCL delivers in 6.1 us vs. MSCCL 8.0 us (~24% faster) by exposing pipelining MSCCL cannot see
1-chunk AllGather: MSCCL 3.4 us vs. TE-CCL 4.0 us — MSCCL wins marginally when no pipelining is possible (single-chunk regime is TE-CCL's only loss case)

Solver scalability (Table 4, A technique):*

Internal-1 AllGather, 64 GPUs: 3000 s
Internal-1 AllGather, 128 GPUs: 7 hours
Internal-2 AllGather, 256 GPUs: 2.8 hours
Internal-1 AllToAll, 128 GPUs: 800 s
Internal-2 AllToAll, 256 GPUs: 1500 s
TACCL on 2-chassis DGX2 AllToAll: timed out after 4+4 hours
MSCCL on >2 chassis: not feasible (SAT explodes)

Per-regime conclusions:

Small messages: alpha (link latency) modeling is decisive — without it, throughput error exceeds 100% for transfers <10 KB (Figure 3)
Large messages: in-network copy reduces AllReduce/AllGather time by up to 50% at 0.21 GB on DGX1 and Internal-1 topologies
Pipelining: TE-CCL's epoch-based formulation lets nodes forward chunks while still receiving others, the key win over MSCCL's rigid-barrier formulation

Limitations

Failure handling: The MCF formulation simplifies Clos topologies to a "big-switch" abstraction; intra-chassis link failures are not modeled and would require resolving.
Cloud dynamics: Assumes stable per-link alpha and beta; in multi-tenant or shared environments these fluctuate, breaking the pre-computed schedule.
Compute cost ignored: AllReduce treats reduction math as free; GPU-side reduction kernel overhead (a real cost on small messages) is outside the model.
Hardware mapping: No accounting for thread-block budget, register pressure, memory-channel contention — the gap that MSCCLang/GC3 fill on the codegen side.
Static schedules only: Open-loop execution of a pre-computed XML schedule. No runtime adaptation to stragglers or topology change (only coarse robustness via inflated alpha/beta margins).
Single-tenant assumption: The whole optimizer presumes a centrally-managed cluster where topology and demand are known.

Open Problems Called Out

Multi-tenant / shared networks: when topology and per-link latency are unknown or change, MCF inputs become uncertain — needs an online-learning or robust-optimization extension.
Compute-aware AllReduce: explicit modeling of reduction-kernel cost via multi-stage demand matrices.
Robust schedules under stragglers / failures: schedules that degrade gracefully rather than fail when an intra-chassis link drops.
Hardware-specific lowering: integration with MSCCLang/GC3-style threadblock-aware codegen so the algorithmic schedule TE-CCL produces is matched by an equally good kernel.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective algorithm (Ring/Tree/CollNet/NVLS), protocol (LL/LL128/Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time on HPC GPU clusters. State features include log-binned message size, model intensity I = C/D, local batch size, topology fingerprint, and an LSTM-encoded recent-collective timing window. Reward is -collective_wall_clock_us. TE-CCL provides DynamICCL with both action-space priors and reward-design validation:

TE-CCL is a synthesizer; DynamICCL is a selector. They compose, not compete. TE-CCL produces XML schedules consumable by NCCL via MSCCL. DynamICCL's action space includes an "MSCCL/CollNet schedule slot" — TE-CCL is one of the schedule generators that populates that slot. The papers are orthogonal layers.
Topology fingerprint is the right state feature. TE-CCL's 2x gains over TACCL come from not assuming a uniform topology: NDv2, DGX2 and AMD have different intra/inter-chassis bandwidth ratios and different alpha per link. DynamICCL must observe an explicit topology embedding (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) so its policy can specialize per fabric.
Message-size regimes are real and decisive. TE-CCL Table 3 and Figure 3 confirm the phase structure DynamICCL already exploits: small messages are alpha-bound (latency-dominated), large messages are beta-bound (bandwidth/pipelining-dominated). DynamICCL's exploration prior should match: protocol=LL + algorithm=Tree below ~64 KiB; protocol=Simple + algorithm=Ring + max nChannels above ~1 MiB.
Pipelining = chunkSize/numPipeOps action axis. TE-CCL's win over MSCCL is explicit pipelining within an epoch model. DynamICCL tunes chunkSize directly — this is the same lever, exposed at the tuner level. The TE-CCL evidence (3-chunk AllGather drops from 8.0 us to 6.1 us purely from pipelining) is empirical justification for keeping chunkSize as a first-class action dimension.
Reward should be collective wall-clock, not algBw proxy. TE-CCL's MILP objective rewards earlier delivery (sum of (1/(k+1)) * R_{s,d,k}), not raw bandwidth. DynamICCL's reward -collective_wall_clock_us matches this design choice — both systems reject peak-throughput proxies in favor of completion-time metrics. This is a useful cross-validation.
Solver-time vs. exploration-budget analogy. TE-CCL spends 7 hours synthesizing one 128-GPU schedule; DynamICCL spends comparable wall-clock on RL training. The DynamICCL design must amortize that cost across many (workload, message-size) inquiries — and TE-CCL's experience that solver time grows with topology size suggests DynamICCL should split its policy network by topology cluster rather than train one monolith.
Static schedules are a strong baseline, but DynamICCL is the adaptive layer. TE-CCL itself flags as future work the regime where alpha/beta fluctuate — exactly DynamICCL's value proposition. An RL agent observing a recent-collective timing window can adapt to congestion or interference, while TE-CCL's pre-computed XML cannot. This is the natural research positioning: TE-CCL handles "synthesize the best schedule for the steady-state topology", DynamICCL handles "pick the right schedule and parameters per collective at run time."
Open Problem #1 ("multi-tenant / unknown topology") maps directly onto DynamICCL's online-RL formulation. The DynamICCL tuner-plugin observes runtime feedback and adapts — this is the answer TE-CCL gestures at but does not implement.