TE-CCL: Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem — Detailed Summary

Xuting Liu (UPenn), Behnaz Arzani (MSR), Siva Kesava Reddy Kakarla (MSR), Liangyu Zhao (UW), Vincent Liu (UPenn), Miguel Castro (OpenAI), Srikanth Kandula (Microsoft), Luke Marshall (MSR) | ACM SIGCOMM 2024 | DOI: 10.1145/3651890.3672249

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract

"Cloud operators utilize collective communication optimizers to enhance the efficiency of the single-tenant, centrally managed training clusters they manage. However, current optimizers struggle to scale for such use cases and often compromise solution quality for scalability. Our solution, TE-CCL, adopts a traffic-engineering-based approach to collective communication. Compared to a state-of-the-art optimizer, TACCL, TE-CCL produced schedules with 2x better performance on topologies TACCL supports (and its solver took a similar amount of time as TACCL's heuristic-based approach). TE-CCL additionally scales to larger topologies than TACCL. On our GPU testbed, TE-CCL outperformed TACCL by 2.14x and RCCL by 3.18x in terms of algorithm bandwidth."


1. Introduction

Why CCL synthesis is needed:

The two prior camps:

TE-CCL's positioning:

Headline numbers (introduced here, detailed later):


2. Background and Motivation

2.1 The Need for Fast Collective Scheduling

2.2 Relationship with TE Solutions

The paper draws a structural analogy between CCL scheduling and WAN traffic engineering. The MCF substrate of TE matches CCL's needs: inputs (topology, demands), constraints (capacity, flow conservation), objectives (max throughput / min completion time). But four CCL-specific differences must be handled:

Discretized Sends: "TE problems mostly focus on traffic bundles with high rates and the problem of allocating a fixed fraction of link capacity to each demand. Instead, CCLs have to schedule small- to medium-sized demands, which introduces more structure and adds new and, in some cases, hard-to-model constraints and dependencies."

In-Network Copies: "TE problems often assume flow conservation as a fundamental constraint; in contrast, collectives benefit significantly from copying data at intermediate GPUs, e.g., for tree broadcast/reduce patterns."

Latency and Queuing: "TE problems get away with focusing on steady-state effects... because they assume large traffic bundles. In contrast, we cannot ignore the effects of propagation and queuing delay for small transfers; modeling them is essential to CCL scheduling."

Support for Storage and Caching: "TE problems generally assume that data is received and sent as soon as possible... we can speed up solvers substantially if we use the available GPU memory."

These four extensions structure the rest of the paper.


3. The TE-CCL Model

3.1 The General Model (MILP)

Variables (Table 1 notation):

Variable Description
N, S Set of GPU nodes and switches
E Set of unidirectional edges
C Chunk IDs {0, 1, ..., C}
D Demand function N x N x C -> {0,1}
tau Epoch duration
K Number of epochs
F_{s,i,j,k,c} Flow of source-s chunk c over link (i,j) at epoch k
B_{s,i,k,c} Buffer at node i of source-s chunk c at start of epoch k
T_{ij} Capacity of link (i,j)
alpha_{ij} Fixed latency of link (i,j)
delta_{ij} Number of epochs in alpha_{ij} (i.e. ceil(alpha_{ij}/tau))
R_{s,d,k} Chunks from s read by d in epoch k

Constraints (in order, paraphrased to ASCII):

  1. Source-buffer initial condition: B_{s,n,0,c} = max_{d in N} D_{s,d,c} if n == s else 0 The source initializes its buffer with the maximum chunk demand destined for any node; non-source nodes start empty.

  2. Capacity: sum_{s in N} sum_{c in C} F_{s,i,j,k,c} <= T_{ij} * tau Total chunks traversing link (i,j) in epoch k cannot exceed the link's per-epoch capacity.

  3. Flow conservation with delay: `B_{s,n,k,c} + sum_{j: (j,n) in E} F_{s,j,n, k - ceil(delta_{jn}), c}

    = max_{j: (n,j) in E} F_{s,n,j, k+1, c}What is in noden's buffer plus arriving flow (delayed by the link's propagation latency) must cover what nsends out next epoch. Themax` operator on the right side encodes multicast: the same chunk can be replicated to multiple outgoing links without duplicating consumption.

  4. Buffer evolution (store-and-forward): B_{s,n,k,c} = B_{s,n,k-1,c} + sum_{j: (j,n) in E} F_{s,j,n, k - ceil(delta_{jn}) - 1, c} Buffer at the start of epoch k equals buffer at start of k-1 plus all chunks that arrive in the interim.

  5. Destination: R_{s,d,k,c} = min(D_{s,d,c}, B_{s,d,k+1,c}) R_{s,d,K,c} = D_{s,d,c} for the final epoch K Demand must be fully satisfied by the last epoch.

  6. Multicast/copy: encoded implicitly via the max on the right-hand side of (3) — the same buffered chunk can be sent on multiple outgoing edges in the same epoch.

  7. Reward variable: R_{s,d,k} is the count of source-s chunks delivered to destination d by epoch k, summed across c.

  8. Objective: maximize sum_{k in K, s,d in N, s != d} (1/(k+1)) * R_{s,d,k} Rewards earlier delivery; the 1/(k+1) weight pushes the optimizer to finish chunks as quickly as possible while preserving correctness (full demand satisfaction by epoch K).

Why integer (MILP) for the general case:

3.2 Three Aspects That the Model Captures

(Figure 2 in the paper) — three things the model must represent for correctness:

Figure 3 quantifies the latency point: ignoring alpha produces relative throughput error >100% for transfers <10 KB.


4. Scaling

4.1 LP Conversion (for collectives without in-network copy)

4.2 The A* Technique (for general MILP at scale)

The MILP for AllGather and AllReduce-with-copy does not relax to LP. The A* technique partitions the problem in time into sequential rounds, each round solving a smaller MILP.

Step-by-step:

  1. Pre-processing: Compute Floyd-Warshall shortest-path distances FW_{n,d} between every pair of nodes on the topology graph (weighted by alpha).
  2. Logical clique edges: For each round, augment the topology with logical edges from every node n to every destination d, weighted by FW_{n,d}. These are virtual; they do not consume real link capacity.
  3. Round objective: Solve a K_round-epoch MILP that maximizes the same rewarded delivery objective, but the logical-edge weights give partial credit for moving a chunk closer to its destination even if it does not arrive in this round.
  4. Inter-round state: Late-arriving chunks from this round (Q_{s,c,k',r}) are carried forward as initial buffer state to round r+1.
  5. Demand update: chunks satisfied this round are removed from the demand matrix D for round r+1.
  6. Termination: repeat until all demands are satisfied.

The Floyd-Warshall reward is the A heuristic*: each round greedily makes progress toward the destinations; rounds chained together produce a globally near-optimal schedule.

4.3 Algorithm 1 — Epoch Bound Finder

The MILP requires an a-priori choice of K (number of epochs). Too small and the problem is infeasible; too large and solver time balloons. Algorithm 1 finds the smallest feasible K:

This is much faster than running the full MILP at every candidate K.


5. Important Considerations

5.1 Epoch Duration tau

5.2 Number of Epochs K

5.3 Switch Models

Two abstractions are supported, matching different real-world fabrics:

5.4 Buffer Sizing


6. Evaluation

6.1 Topologies (Table 2)

Topology GPUs/chassis Edges/chassis
Internal 1 4 8
Internal 2 2 2
DGX1 8 32
NDv2 8 32
DGX2 17 32
AMD 16 56

DGX1, DGX2, NDv2 are publicly-known NVIDIA topologies. Internal-1 and Internal-2 are proprietary cloud topologies (Internal-1 has alpha = 0.6 / 0.75 us per link). AMD is a real testbed configuration (2 chassis, 32 MI250x GPUs, ROCm 6.0).

6.2 Solver Times — TE-CCL vs. Baselines

Direct head-to-head (single-chassis, small chunk count):

TACCL solver behavior:

TE-CCL solver scaling (Table 4):

Topology Collective # GPUs Solver Time
Internal 1 AllGather (A*) 64 3000 s
Internal 1 AllGather (A*) 128 7 h
Internal 2 AllGather (A*) 128 1300 s
Internal 2 AllGather (A*) 256 2.8 h
Internal 1 AllToAll 16 66 s
Internal 1 AllToAll 128 800 s
Internal 2 AllToAll 256 1500 s

AllToAll uses LP relaxation; AllGather uses A* with MILP. AllToAll solves much faster because LP scales much better than MILP.

6.3 Schedule Quality — Algorithm Bandwidth

Single-chassis simulation (NDv2, DGX2, DGX1):

Pipelining wins (Table 3, DGX1, K=10, 25 KB chunks):

Collective # chunks MSCCL (us) TE-CCL (us) Pipelining?
AllGather 1 3.4 4.0 No
AllGather 2 5.1 5.0 Yes
AllGather 3 8.0 6.1 Yes
AllToAll 1 3.4 4.0 No

6.4 AMD MI250x Real-Hardware Testbed (Figure 8)

6.5 In-Network Copy Wins

For large-message AllReduce / AllGather (workloads that admit multicast):

6.6 Latency alpha Modeling Sensitivity (Figure 3)


The paper organizes prior work into three families:

(A) Multicast TE / WAN TE — predecessors to TE-CCL's MCF approach, historically applied to network routing rather than collectives.

(B) Prior CCL synthesizers:

(C) NCCL / RCCL fixed-template libraries: Ring and double-binary Tree only; tuned for canonical NVLink-symmetric DGX1/DGX2 topologies. TE-CCL beats them on heterogeneous fabrics by 2-3x.


8. Limitations and Future Work

8.1 Failure Handling

8.2 Cloud Dynamics / Multi-Tenancy

8.3 Compute Cost in AllReduce

8.4 Hardware-Specific Mapping

8.5 Static-Schedule Assumption


9. Discussion of NCCL / RCCL / MSCCL / TACCL

9.1 NCCL and RCCL

9.2 MSCCL / SCCL

9.3 TACCL

9.4 Pipelining as the Key Distinction

9.5 In-Network Multicast (SHARP, NVSwitch)


10. Cross-Cutting Empirical Take-Aways

Take-away Derived from
MILP scales further than SAT for CCL TE-CCL solves 256-GPU AllToAll where MSCCL/SCCL is infeasible
Sketches are not needed if MILP/A* is used TE-CCL beats TACCL by 2x with no human input
Latency alpha modeling is decisive at small messages Figure 3: >100% error without alpha at <10 KB
Pipelining beats rigid barriers Table 3: AllGather 3-chunk drops 8.0 us -> 6.1 us
In-network copy halves AllReduce/AllGather at large sizes DGX1 / Internal-1 at 0.21 GB: 50% reduction
Real hardware confirms simulation: 3.18x vs RCCL on AMD Figure 8
LP relaxation is correct only when no copy is needed AllToAll uses LP; AllReduce/AllGather use MILP
Static schedules suffice for centrally-managed clusters Open-loop execution via MSCCL XML

11. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer. Its policy selects, per collective invocation, the algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time. State features: log-binned message size, model intensity I = C/D, local batch size, topology fingerprint, and an LSTM-encoded recent-collective timing window. Reward is -collective_wall_clock_us. DynamICCL operates inside NCCL via the tuner-plugin API.

TE-CCL connects to DynamICCL across multiple design axes:

Direct mappings to DynamICCL design:

TE-CCL finding DynamICCL design implication
2x improvement over TACCL purely from explicit pipelining DynamICCL's chunkSize / numPipeOps action axis is real and matters; never collapse it.
Latency alpha causes >100% throughput error at <10 KB Strong action prior: at small messages, prefer LL + Tree (low-latency path). DynamICCL should encode alpha-vs-beta regime as a state feature (effectively log-binned message size).
In-network copy halves time at >100 MB on NVSwitch+SHARP DynamICCL should expose CollNet / NVLS as an action when the topology fingerprint indicates SHARP-capable fabric; otherwise mask it out.
MILP on 128-GPU AllGather takes 7 hours Synthesis cost is large; DynamICCL's RL-trained policy amortizes this across many invocations and can pick a TE-CCL-emitted schedule slot at run time without re-solving.
TE-CCL emits MSCCL XML, executable via NCCL DynamICCL's action space includes a "use MSCCL schedule X" option. The two systems compose; DynamICCL is the run-time selector over a library that includes TE-CCL outputs alongside vanilla Ring/Tree.
Static schedules cannot adapt to dynamic alpha/beta DynamICCL's online-RL formulation directly solves this gap — the recent-collective timing window in state lets the policy detect congestion.
Topology heterogeneity (DGX1/DGX2/NDv2/AMD) is the primary win Topology fingerprint must be an explicit state feature, not assumed away. Train per-fabric policy heads or feed topology as embedding.

Specific design priors for the RL agent:

  1. Action space layering. DynamICCL's primary action space is the NCCL parameter tuple (algorithm, protocol, nChannels, numThreads, chunkSize). To incorporate TE-CCL, add a categorical sub-action "MSCCL schedule slot in {none, ring_default, te_ccl_AG_v1, te_ccl_A2A_v1, ...}". The policy learns when the synthesized schedule beats the vendor default.

  2. Reward consistency. TE-CCL's MILP objective rewards earlier delivery, summed 1/(k+1) * R_{s,d,k}. DynamICCL's -collective_wall_clock_us is a stricter version of the same intent — minimize completion time. Cross-validation: both papers agree that completion-time, not throughput, is the right metric.

  3. State features grounded by TE-CCL evidence.

    • Log-binned message size (small / medium / large) corresponds to the alpha-bound / transition / beta-bound regimes TE-CCL identifies via Figure 3.
    • Topology fingerprint corresponds to TE-CCL's distinct schedule wins per fabric (DGX1 vs. DGX2 vs. NDv2 vs. AMD).
    • SHARP-capable / multicast-capable boolean: maps to TE-CCL's SHARP-style switch model selector and gates the CollNet/NVLS action.
    • Recent-collective timing window: TE-CCL's open problem ("dynamic alpha/beta") is exactly this signal. DynamICCL observes it; TE-CCL does not.
  4. Exploration prior on (algorithm, protocol):

    • msg < 16 KiB: Tree + LL (alpha-bound; TE-CCL Figure 3 evidence)
    • 16 KiB - 1 MiB: Tree or Ring + LL128 (transition zone)
    • msg >= 1 MiB: Ring + Simple + max nChannels (beta-bound; pipelining-friendly)
    • On SHARP-capable fabric and AllReduce, also explore CollNet/NVLS above 1 MiB (TE-CCL's "in-network copy halves time" regime).
  5. Exploration budget allocation.

    • More exploration budget on heterogeneous fabrics where TE-CCL showed Ring/Tree are far from optimal (NDv2, DGX2, AMD).
    • Less on canonical NVLink-symmetric DGX1 where Ring is already near-optimal.
    • Topology-fingerprint-conditioned exploration noise.
  6. Reward shaping.

    • Primary: r = -collective_wall_clock_us.
    • Optional secondary: penalize p99 tail latency to discourage the "lucky-fast" schedule that becomes catastrophic under interference — exactly the failure mode TE-CCL flags as future work.
  7. Research positioning. TE-CCL is a static, open-loop schedule synthesizer for centrally-managed clusters with stable topology. DynamICCL is the run-time, closed-loop, congestion-adaptive layer that picks among schedules (including TE-CCL-emitted ones) and tunes NCCL parameters. The papers are orthogonal layers in the same stack — their composition (TE-CCL synthesizes; DynamICCL selects + tunes) is the natural deployment.

  8. Open-problem alignment. TE-CCL Section 8 lists four open problems. DynamICCL directly addresses three:

    • Multi-tenant / unknown topology: DynamICCL's online RL learns from observed runtime feedback regardless of topology change.
    • Compute-aware AllReduce: DynamICCL's reward is end-to-end wall-clock, which already includes reduction-kernel time.
    • Stragglers / failures: DynamICCL's recent-timing-window feature lets the policy detect and route around interference. The fourth problem (hardware-specific lowering) belongs to the MSCCLang/GC3 layer, not DynamICCL.