TE-CCL: Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
Xuting Liu (UPenn), Behnaz Arzani (MSR), Siva Kesava Reddy Kakarla (MSR), Liangyu Zhao (UW), Vincent Liu (UPenn), Miguel Castro (OpenAI), Srikanth Kandula (Microsoft), Luke Marshall (MSR) | ACM SIGCOMM 2024 | DOI: 10.1145/3651890.3672249
Problem
Cloud operators of single-tenant, centrally-managed GPU training clusters need a collective-communication-library (CCL) optimizer that can produce near-optimal schedules at the scale of today's deployments — hundreds of GPUs across tens of chassis on heterogeneous topologies (DGX1, DGX2, NDv2, AMD MI250x). Existing CCL synthesis falls into two camps and neither covers this regime. Precise solvers (SCCL with SAT, MSCCL's prior MILP variants) are bandwidth-optimal but cannot scale beyond one or two chassis — solver times explode super-exponentially. Heuristic solvers (TACCL with hand-designed "communication sketches") scale better but require human-in-the-loop sketching and produce schedules that are up to 2x slower than truly optimal. The result: ML systems teams either accept NCCL/RCCL's hard-coded Ring/Tree templates (which ignore irregular fabrics), or they pay synthesis-time costs incompatible with operations at production scale. There is no optimizer that simultaneously (a) scales to many chassis, (b) produces near-optimal algorithm bandwidth, and (c) requires no manual topology hints.
Core Insight
CCL scheduling is structurally a multi-commodity flow (MCF)
problem from traffic engineering — the long-studied LP/MILP
framework used to route demands across WAN backbones — but with four
CCL-specific extensions: (i) discretized chunked sends
(integer flow variables to prevent half-chunk delivery), (ii)
in-network multicast/copy (intermediate GPUs replicate data,
breaking flow conservation), (iii) explicit per-link latency
alpha and bandwidth beta modeling so
small-message regimes are correct, and (iv) store-and-forward
buffering in deep GPU memory rather than shallow-buffered
routers. Adopting MCF as the substrate enables the authors to plug in
classic TE scaling techniques — LP relaxation (when copies are not
needed, e.g. AllToAll) and time-partitioned A* search with
Floyd-Warshall reward shaping — and reach 256-GPU scale where
SAT/SAT-like CCL solvers fail.
Method
TE-CCL is a three-stage pipeline:
+------------------------------------------------------------------+
| INPUT |
| Topology graph (nodes N, switches S, edges E, |
| link capacity T_ij, link latency alpha_ij) |
| Demand matrix D : N x N x C -> {0,1} (chunk-level) |
| Epoch duration tau, # epochs K |
+----------------------------+-------------------------------------+
v
+------------------------------------------------------------------+
| OPTIMIZER |
| General MILP (multicast / AllReduce / AllGather) |
| - integer flow vars F_{s,i,j,k,c} (chunk c at epoch k on i->j)|
| - buffer vars B_{s,n,k,c} |
| - reward vars R_{s,d,k} |
| Reduction to LP for AllToAll (no copy required) |
| |
| Constraints: |
| [C1] Source-buffer initial conditions |
| [C2] Capacity sum F_{*,i,j,k,*} <= T_ij * tau |
| [C3] Flow conservation with delay (delta_ij = ceil(alpha/tau))|
| [C4] Buffer evolution (store-and-forward) |
| [C5] Destination R_{s,d,K,c} = D_{s,d,c} |
| [C6] Multicast/copy via max/min operators |
| |
| Objective: max sum_{k,s,d} (1/(k+1)) * R_{s,d,k} |
| (rewards earlier delivery; preserves correctness) |
+----------------------------+-------------------------------------+
v
+------------------------------------------------------------------+
| SCALING |
| (a) LP relaxation -> AllToAll, AllGather (no copy) |
| (b) A* time partitioning -> general MILP at scale |
| pre: Floyd-Warshall FW_{n,d} on alpha-weighted graph |
| add logical clique edges weighted by FW |
| solve round-by-round; carry late-arriving chunks Q |
| (c) Algorithm 1: binary search over (C_tau, n_e in {4,8,12}) |
| to find minimum feasible epoch count |
+----------------------------+-------------------------------------+
v
+------------------------------------------------------------------+
| POST-PROCESS |
| Reverse-DFS prune: zero out flows that don't contribute |
| to satisfying demand (cleans LP-relaxation slack) |
+----------------------------+-------------------------------------+
v
+------------------------------------------------------------------+
| OUTPUT: MSCCL-compatible XML schedule |
| executable on NCCL/RCCL via MSCCL runtime |
+------------------------------------------------------------------+
Two switch abstractions are supported: SHARP-style (switch is a zero-buffer node that supports copy, modeling NVSwitch + SHARP / NVSwitch + multicast) and legacy hyper-edge (switch removed, replaced by a clique of direct GPU-GPU edges with port-capacity constraints — matches TACCL's switch model).
Experimental Setup
| Component | Value |
|---|---|
| Solver | Gurobi (MILP and LP) |
| Output format | MSCCL-XML, executed on NCCL/RCCL |
| Simulation topologies | DGX1 (8 GPUs/32 edges), DGX2 (17/32), NDv2 (8/32), Internal-1 (4/8), Internal-2 (2/2) |
| Real testbed | 2-chassis AMD cluster, 32x MI250x GPUs, ROCm 6.0 |
| Baselines | NCCL, RCCL, MSCCL/SCCL (SAT-based), TACCL (sketch heuristic) |
| Workloads | AllGather, AllToAll, AllReduce |
| Chunk size | 25 KB (standard); buffer sizes swept 1 KB to 1 GB |
| Metric | Algorithm bandwidth (GB/s), end-to-end completion time (us), solver wall-clock |
Headline Quantitative Results
Schedule quality vs. TACCL (the heuristic state of the art):
- 2x better algorithm bandwidth on NDv2 across collectives
- AMD MI250x testbed: 2.14x faster than TACCL, 3.18x faster than RCCL
- Figure 5 shows >3000% (30x) improvement vs. TACCL for small transfers on NDv2 / DGX2 in some regimes (TACCL's sketches collapse there)
Schedule quality vs. MSCCL (the precise SAT-based competitor):
- AllToAll, 8 chunks: TE-CCL solves in 1.88 s vs. MSCCL timeout at 10,032 s (>5000x speedup with comparable schedule quality)
- Table 3 (DGX1, K=10, 25 KB chunks): for 3-chunk AllGather, TE-CCL delivers in 6.1 us vs. MSCCL 8.0 us (~24% faster) by exposing pipelining MSCCL cannot see
- 1-chunk AllGather: MSCCL 3.4 us vs. TE-CCL 4.0 us — MSCCL wins marginally when no pipelining is possible (single-chunk regime is TE-CCL's only loss case)
Solver scalability (Table 4, A technique):*
- Internal-1 AllGather, 64 GPUs: 3000 s
- Internal-1 AllGather, 128 GPUs: 7 hours
- Internal-2 AllGather, 256 GPUs: 2.8 hours
- Internal-1 AllToAll, 128 GPUs: 800 s
- Internal-2 AllToAll, 256 GPUs: 1500 s
- TACCL on 2-chassis DGX2 AllToAll: timed out after 4+4 hours
- MSCCL on >2 chassis: not feasible (SAT explodes)
Per-regime conclusions:
- Small messages: alpha (link latency) modeling is decisive — without it, throughput error exceeds 100% for transfers <10 KB (Figure 3)
- Large messages: in-network copy reduces AllReduce/AllGather time by up to 50% at 0.21 GB on DGX1 and Internal-1 topologies
- Pipelining: TE-CCL's epoch-based formulation lets nodes forward chunks while still receiving others, the key win over MSCCL's rigid-barrier formulation
Limitations
- Failure handling: The MCF formulation simplifies Clos topologies to a "big-switch" abstraction; intra-chassis link failures are not modeled and would require resolving.
- Cloud dynamics: Assumes stable per-link
alphaandbeta; in multi-tenant or shared environments these fluctuate, breaking the pre-computed schedule. - Compute cost ignored: AllReduce treats reduction math as free; GPU-side reduction kernel overhead (a real cost on small messages) is outside the model.
- Hardware mapping: No accounting for thread-block budget, register pressure, memory-channel contention — the gap that MSCCLang/GC3 fill on the codegen side.
- Static schedules only: Open-loop execution of a pre-computed XML schedule. No runtime adaptation to stragglers or topology change (only coarse robustness via inflated alpha/beta margins).
- Single-tenant assumption: The whole optimizer presumes a centrally-managed cluster where topology and demand are known.
Open Problems Called Out
- Multi-tenant / shared networks: when topology and per-link latency are unknown or change, MCF inputs become uncertain — needs an online-learning or robust-optimization extension.
- Compute-aware AllReduce: explicit modeling of reduction-kernel cost via multi-stage demand matrices.
- Robust schedules under stragglers / failures: schedules that degrade gracefully rather than fail when an intra-chassis link drops.
- Hardware-specific lowering: integration with MSCCLang/GC3-style threadblock-aware codegen so the algorithmic schedule TE-CCL produces is matched by an equally good kernel.
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects
per-collective algorithm (Ring/Tree/CollNet/NVLS),
protocol (LL/LL128/Simple), nChannels,
numThreads, and chunkSize to minimize
collective wall-clock time on HPC GPU clusters. State features include
log-binned message size, model intensity I = C/D, local batch size,
topology fingerprint, and an LSTM-encoded recent-collective timing
window. Reward is -collective_wall_clock_us. TE-CCL
provides DynamICCL with both action-space priors and
reward-design validation:
TE-CCL is a synthesizer; DynamICCL is a selector. They compose, not compete. TE-CCL produces XML schedules consumable by NCCL via MSCCL. DynamICCL's action space includes an "MSCCL/CollNet schedule slot" — TE-CCL is one of the schedule generators that populates that slot. The papers are orthogonal layers.
Topology fingerprint is the right state feature. TE-CCL's 2x gains over TACCL come from not assuming a uniform topology: NDv2, DGX2 and AMD have different intra/inter-chassis bandwidth ratios and different
alphaper link. DynamICCL must observe an explicit topology embedding (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) so its policy can specialize per fabric.Message-size regimes are real and decisive. TE-CCL Table 3 and Figure 3 confirm the phase structure DynamICCL already exploits: small messages are alpha-bound (latency-dominated), large messages are beta-bound (bandwidth/pipelining-dominated). DynamICCL's exploration prior should match: protocol=LL + algorithm=Tree below ~64 KiB; protocol=Simple + algorithm=Ring + max nChannels above ~1 MiB.
Pipelining = chunkSize/numPipeOps action axis. TE-CCL's win over MSCCL is explicit pipelining within an epoch model. DynamICCL tunes
chunkSizedirectly — this is the same lever, exposed at the tuner level. The TE-CCL evidence (3-chunk AllGather drops from 8.0 us to 6.1 us purely from pipelining) is empirical justification for keeping chunkSize as a first-class action dimension.Reward should be collective wall-clock, not algBw proxy. TE-CCL's MILP objective rewards earlier delivery (sum of
(1/(k+1)) * R_{s,d,k}), not raw bandwidth. DynamICCL's reward-collective_wall_clock_usmatches this design choice — both systems reject peak-throughput proxies in favor of completion-time metrics. This is a useful cross-validation.Solver-time vs. exploration-budget analogy. TE-CCL spends 7 hours synthesizing one 128-GPU schedule; DynamICCL spends comparable wall-clock on RL training. The DynamICCL design must amortize that cost across many (workload, message-size) inquiries — and TE-CCL's experience that solver time grows with topology size suggests DynamICCL should split its policy network by topology cluster rather than train one monolith.
Static schedules are a strong baseline, but DynamICCL is the adaptive layer. TE-CCL itself flags as future work the regime where alpha/beta fluctuate — exactly DynamICCL's value proposition. An RL agent observing a recent-collective timing window can adapt to congestion or interference, while TE-CCL's pre-computed XML cannot. This is the natural research positioning: TE-CCL handles "synthesize the best schedule for the steady-state topology", DynamICCL handles "pick the right schedule and parameters per collective at run time."
Open Problem #1 ("multi-tenant / unknown topology") maps directly onto DynamICCL's online-RL formulation. The DynamICCL tuner-plugin observes runtime feedback and adapts — this is the answer TE-CCL gestures at but does not implement.