TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
Aashaka Shah (UT Austin), Vijay Chidambaram (UT Austin / VMware Research), Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi (Microsoft Research), Rachee Singh (Microsoft / Cornell) | NSDI '23 | USENIX, April 17-19, 2023
Problem
Inter-GPU communication has become the dominant scaling bottleneck for distributed model training. The paper measures BERT idling 11% and DeepLight 63% of the time waiting on the network. Production libraries (NCCL, MSCCL) ship a small set of fixed algorithm templates — Ring, double-binary Tree, NVLink-aware variants — that were hand-designed for a few canonical topologies. On modern heterogeneous clusters (Azure NDv2, Nvidia DGX-2) where intra-node NVLink (up to 300 GB/s) coexists with inter-node InfiniBand (12.5-25 GB/s) and per-node GPU/NIC ratios vary, these templates leave large performance gaps. Prior synthesis work (SCCL) used SMT to search the full algorithm space but stalls beyond 8 GPUs / a single node, because the search space grows exponentially with a quadratic power of cluster size.
Core Insight
Algorithm synthesis becomes tractable at multi-node scale if a human designer first sketches the shape of the algorithm — which links to use, how to handle switches, and what symmetries to enforce — leaving only the routing, ordering, and chunk-contiguity decisions to a MILP solver. A "communication sketch" plus a three-stage MILP decomposition turns a previously 24+ hour search into ~minutes for 80 GPUs.
Method
TACCL splits collective synthesis into a three-step MILP pipeline guided by a user-provided sketch:
+------------------+ +------------------+ +-----------------+
| Communication | | Physical | | Collective |
| Sketch | | Topology | | (input/output |
| (logical topo, | | (DGX-2, NDv2, | | postcondition) |
| switch policy, | | NVLink, IB) | | |
| symmetry, | +--------+---------+ +--------+--------+
| input size, | | |
| chunk count) | v v
+--------+---------+ +------------------------------------+
| | Step 1: Routing (MILP, relaxed) |
+------------>| - bandwidth-relaxed lower bound |
| - per-chunk path on each link |
+------------------+-----------------+
v
+------------------------------------+
| Step 2: Heuristic Ordering |
| - longest-remaining-path first |
| - tie-break: shortest-traversed |
+------------------+-----------------+
v
+------------------------------------+
| Step 3: Contiguity Scheduling |
| (MILP, exact) |
| - merge chunks to amortize alpha |
| - or pipeline for beta utilizn |
+------------------+-----------------+
v
+------------------------------------+
| TACCL-EF (XML) |
| single-kernel NCCL interpreter |
+------------------------------------+
The sketch supplies four fields: (1) logical
topology — a subset of the physical graph that bounds search,
(2) switch-hyperedge policy — uc-max
(maximize concurrent links, latency-optimal) or uc-min
(minimize concurrent links, congestion-avoiding for bandwidth), (3)
algorithm symmetry — automorphisms that fold the search
space, and (4) expected input size to instantiate the
alpha-beta cost model.
Experimental Setup
| Component | Value |
|---|---|
| Platform A | Azure NDv2 (8x V100 / node, NVLink intra, 100 Gb/s IB inter, no GPUDirect RDMA) |
| Platform B | Nvidia DGX-2 (16x V100 / node, NVSwitch fully-connected, 8 IB NICs shared) |
| GPUs swept | 8 (1 node), 16 (1-2 nodes), 32 (4 nodes), up to 80 |
| Collectives | AllGather, AllToAll, AllReduce |
| Buffer sizes | 1 KB to 1 GB |
| Baselines | NCCL v2.8.4-1, SCCL, MSCCL |
| Workloads | Transformer-XL, BERT, Mixture-of-Experts |
| Cost model | alpha-beta: NVLink alpha=0.7 us; beta=8 us/MB (DGX-2), 46 us/MB (NDv2); IB alpha=1.7 us, beta=106 us/MB |
| Solver | MILP via Gurobi |
| Backend | TACCL-EF XML; NCCL-embedded interpreter; single-kernel launch per collective |
| PyTorch integration | 1-line change to torch.distributed swap |
Headline Quantitative Results
Microbenchmark speedups vs. NCCL v2.8.4-1:
| Collective | Platform | Buffer regime | Speedup |
|---|---|---|---|
| AllGather | DGX-2 (2 nodes) | 1 KB - 1 MB | 4.9x - 6.7x |
| AllGather | DGX-2 | 2 MB - 64 MB | 10% - 3.8x |
| AllGather | DGX-2 | 256 MB - 1 GB | 20% - 25% |
| AllGather | NDv2 (2 nodes) | 1 KB - 1 MB | 12% - 35% |
| AllGather | NDv2 | > 1 MB | 61% - 3.4x |
| AllToAll | DGX-2 | 1 KB - 16 KB | up to 55% |
| AllToAll | DGX-2 | >= 2 MB | up to 15% |
| AllToAll | NDv2 | 16 MB - 1 GB | 53% - 66% |
| AllReduce | DGX-2 | 1 KB - 4 MB | 49% - 6.4x |
| AllReduce | DGX-2 | 16 MB - 256 MB | 2% - 37% |
| AllReduce | DGX-2 | >= 512 MB | -9% (slower) |
| AllReduce | NDv2 | <= 1 MB | up to 28% |
| AllReduce | NDv2 | large | 28% - 2.7x |
End-to-end training (samples/sec or steps/sec, vs. NCCL):
| Workload | GPUs | Speedup range |
|---|---|---|
| Transformer-XL | 16 (2 nodes) | 11% - 1.94x |
| Transformer-XL | 32 (4 nodes) | 2% - 1.44x |
| BERT | 16 | 12% - 2.36x |
| BERT | 32 | 7% - 1.74x |
| Mixture-of-Experts | 16 | 17% throughput |
Synthesis cost (Table 2 — selected):
| Collective | Sketch | Time |
|---|---|---|
| AllGather | dgx2-sk-1 | 35.8 s |
| AllGather | ndv2-sk-1 | 2.6 s |
| AllToAll | dgx2-sk-2 | 92.5 s |
| AllToAll | ndv2-sk-1 | 1809.8 s (feasible at 4m 14s) |
| AllReduce | dgx2-sk-1 | 6.1 s |
| AllReduce | ndv2-sk-1 | 0.3 s |
SCCL fails to synthesize any multi-node algorithm within 24 hours; TACCL completes 80-GPU synthesis in under 8 minutes.
Limitations
- No fused communication ops. TACCL-EF lowers to
separate
recv,reduce,copy,sendinstructions. NCCL's fusedreceive-reduce-copy-sendkernel beats TACCL by ~9% on AllReduce= 512 MB on DGX-2.
- Still NP-hard. Sketches cut search dramatically but the underlying MILP grows quadratically with cluster size; hierarchical composition is left as future work.
- Sketch authorship is human. No automated controller picks switch-hyperedge policy, chunk count, or symmetry — the designer must understand the topology.
- Flow-conservation violation. Non-source GPUs may forward a chunk over multiple links while having received it only once, breaking classical network-flow assumptions and forcing a custom MILP.
- Limited topology generality. Non-hierarchical topologies have fewer expressible sketches than hierarchical ones (NDv2/DGX-2).
Open Problems Called Out
- Hierarchical composition. Compose synthesized algorithms across levels (intra-node, intra-rack, inter-rack) to scale to thousands of GPUs without re-solving the full MILP.
- Automated sketch search. A controller that picks logical topology, switch policy, and symmetry from features of the physical topology — replacing the human designer.
- Lowering to fused instructions. Extend TACCL-EF to
emit NCCL's
recv-reduce-copy-sendstyle fused kernels and recover the large- message AllReduce gap. - Adaptive runtime selection. TACCL synthesis is offline; choosing among multiple synthesized algorithms at runtime based on observed message sizes / topology load is unaddressed.
Relevance to DynamICCL
TACCL operates as an offline, ahead-of-time algorithm generator; DynamICCL operates as a runtime, per-collective algorithm selector. They sit at orthogonal layers and are complementary — TACCL expands the catalog of available algorithms, DynamICCL learns to pick among them. The following implications shape DynamICCL's design:
Action-space expansion. TACCL produces algorithms that are neither Ring nor Tree but switch-hyperedge-policy-conditioned hybrids. DynamICCL's action space (
algorithmenum) should be extended beyond the four NCCL built-ins to include "TACCL-EF synthesized" entries keyed by sketch identifier — turning each TACCL output into a discrete RL action.State feature: switch-hyperedge utilization regime. TACCL shows the optimal
uc-maxvsuc-minchoice depends on message size crossing a sharp threshold (small -> max, large -> min). DynamICCL should encode an analogous "concurrent-link-pressure" feature in its state vector — derivable from message size and topology fingerprint.Reward shaping consistent with alpha-beta. TACCL's MILP minimizes
time = alpha + beta * sizewhich exactly matches DynamICCL's reward-collective_wall_clock_us. Reward sign and unit are compatible; no transformation needed when comparing TACCL-synthesized algorithms against NCCL Ring/Tree under DynamICCL's policy.Topology fingerprint must be more granular. TACCL's per-platform alpha-beta values differ by ~6x (DGX-2 NVLink beta = 8, NDv2 NVLink beta = 46) due to per-link bandwidth differences. DynamICCL's topology embedding should capture link-class beta values, not just the four-class fingerprint (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet).
Exploration prior on chunk count. TACCL's
input_chunkupparameter (often 2 for AllGather, finer for AllReduce) maps directly to NCCL'schunkSize/numPipeOps. DynamICCL's exploration prior should bias toward 2-4 chunks for small messages, more chunks (8-16) for large messages — matching TACCL's observed sweet spots.Research positioning — synthesis vs. selection. TACCL leaves "adaptive runtime selection" as an open problem (see Limitations item above). DynamICCL is precisely that runtime selector; a joint TACCL
- DynamICCL stack — TACCL pre-generates a library of algorithms, DynamICCL chooses among them at each invocation — is the natural composition and a clear positioning argument for DynamICCL's contribution.
Synthesis-time as exploration budget context. TACCL takes 30 seconds to 30 minutes per algorithm. DynamICCL's exploration must amortize against a fixed library (cannot synthesize on the fly), so the action set should be discrete-and-precomputed rather than parametric-and-continuous.