TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Aashaka Shah (UT Austin), Vijay Chidambaram (UT Austin / VMware Research), Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi (Microsoft Research), Rachee Singh (Microsoft / Cornell) | NSDI '23 | USENIX, April 17-19, 2023

Problem

Inter-GPU communication has become the dominant scaling bottleneck for distributed model training. The paper measures BERT idling 11% and DeepLight 63% of the time waiting on the network. Production libraries (NCCL, MSCCL) ship a small set of fixed algorithm templates — Ring, double-binary Tree, NVLink-aware variants — that were hand-designed for a few canonical topologies. On modern heterogeneous clusters (Azure NDv2, Nvidia DGX-2) where intra-node NVLink (up to 300 GB/s) coexists with inter-node InfiniBand (12.5-25 GB/s) and per-node GPU/NIC ratios vary, these templates leave large performance gaps. Prior synthesis work (SCCL) used SMT to search the full algorithm space but stalls beyond 8 GPUs / a single node, because the search space grows exponentially with a quadratic power of cluster size.

Core Insight

Algorithm synthesis becomes tractable at multi-node scale if a human designer first sketches the shape of the algorithm — which links to use, how to handle switches, and what symmetries to enforce — leaving only the routing, ordering, and chunk-contiguity decisions to a MILP solver. A "communication sketch" plus a three-stage MILP decomposition turns a previously 24+ hour search into ~minutes for 80 GPUs.

Method

TACCL splits collective synthesis into a three-step MILP pipeline guided by a user-provided sketch:

+------------------+   +------------------+   +-----------------+
| Communication    |   | Physical         |   | Collective      |
| Sketch           |   | Topology         |   | (input/output   |
| (logical topo,   |   | (DGX-2, NDv2,    |   |  postcondition) |
|  switch policy,  |   |  NVLink, IB)     |   |                 |
|  symmetry,       |   +--------+---------+   +--------+--------+
|  input size,     |            |                      |
|  chunk count)    |            v                      v
+--------+---------+   +------------------------------------+
         |             |   Step 1: Routing (MILP, relaxed)  |
         +------------>|   - bandwidth-relaxed lower bound  |
                       |   - per-chunk path on each link    |
                       +------------------+-----------------+
                                          v
                       +------------------------------------+
                       |   Step 2: Heuristic Ordering       |
                       |   - longest-remaining-path first   |
                       |   - tie-break: shortest-traversed  |
                       +------------------+-----------------+
                                          v
                       +------------------------------------+
                       |   Step 3: Contiguity Scheduling    |
                       |   (MILP, exact)                    |
                       |   - merge chunks to amortize alpha |
                       |   - or pipeline for beta utilizn   |
                       +------------------+-----------------+
                                          v
                       +------------------------------------+
                       |   TACCL-EF (XML)                   |
                       |   single-kernel NCCL interpreter   |
                       +------------------------------------+

The sketch supplies four fields: (1) logical topology — a subset of the physical graph that bounds search, (2) switch-hyperedge policy — uc-max (maximize concurrent links, latency-optimal) or uc-min (minimize concurrent links, congestion-avoiding for bandwidth), (3) algorithm symmetry — automorphisms that fold the search space, and (4) expected input size to instantiate the alpha-beta cost model.

Experimental Setup

Component	Value
Platform A	Azure NDv2 (8x V100 / node, NVLink intra, 100 Gb/s IB inter, no GPUDirect RDMA)
Platform B	Nvidia DGX-2 (16x V100 / node, NVSwitch fully-connected, 8 IB NICs shared)
GPUs swept	8 (1 node), 16 (1-2 nodes), 32 (4 nodes), up to 80
Collectives	AllGather, AllToAll, AllReduce
Buffer sizes	1 KB to 1 GB
Baselines	NCCL v2.8.4-1, SCCL, MSCCL
Workloads	Transformer-XL, BERT, Mixture-of-Experts
Cost model	alpha-beta: NVLink alpha=0.7 us; beta=8 us/MB (DGX-2), 46 us/MB (NDv2); IB alpha=1.7 us, beta=106 us/MB
Solver	MILP via Gurobi
Backend	TACCL-EF XML; NCCL-embedded interpreter; single-kernel launch per collective
PyTorch integration	1-line change to `torch.distributed` swap

Headline Quantitative Results

Microbenchmark speedups vs. NCCL v2.8.4-1:

Collective	Platform	Buffer regime	Speedup
AllGather	DGX-2 (2 nodes)	1 KB - 1 MB	4.9x - 6.7x
AllGather	DGX-2	2 MB - 64 MB	10% - 3.8x
AllGather	DGX-2	256 MB - 1 GB	20% - 25%
AllGather	NDv2 (2 nodes)	1 KB - 1 MB	12% - 35%
AllGather	NDv2	> 1 MB	61% - 3.4x
AllToAll	DGX-2	1 KB - 16 KB	up to 55%
AllToAll	DGX-2	>= 2 MB	up to 15%
AllToAll	NDv2	16 MB - 1 GB	53% - 66%
AllReduce	DGX-2	1 KB - 4 MB	49% - 6.4x
AllReduce	DGX-2	16 MB - 256 MB	2% - 37%
AllReduce	DGX-2	>= 512 MB	-9% (slower)
AllReduce	NDv2	<= 1 MB	up to 28%
AllReduce	NDv2	large	28% - 2.7x

End-to-end training (samples/sec or steps/sec, vs. NCCL):

Workload	GPUs	Speedup range
Transformer-XL	16 (2 nodes)	11% - 1.94x
Transformer-XL	32 (4 nodes)	2% - 1.44x
BERT	16	12% - 2.36x
BERT	32	7% - 1.74x
Mixture-of-Experts	16	17% throughput

Synthesis cost (Table 2 — selected):

Collective	Sketch	Time
AllGather	dgx2-sk-1	35.8 s
AllGather	ndv2-sk-1	2.6 s
AllToAll	dgx2-sk-2	92.5 s
AllToAll	ndv2-sk-1	1809.8 s (feasible at 4m 14s)
AllReduce	dgx2-sk-1	6.1 s
AllReduce	ndv2-sk-1	0.3 s

SCCL fails to synthesize any multi-node algorithm within 24 hours; TACCL completes 80-GPU synthesis in under 8 minutes.

Limitations

No fused communication ops. TACCL-EF lowers to separate recv, reduce, copy, send instructions. NCCL's fused receive-reduce-copy-send kernel beats TACCL by ~9% on AllReduce

= 512 MB on DGX-2.
Still NP-hard. Sketches cut search dramatically but the underlying MILP grows quadratically with cluster size; hierarchical composition is left as future work.
Sketch authorship is human. No automated controller picks switch-hyperedge policy, chunk count, or symmetry — the designer must understand the topology.
Flow-conservation violation. Non-source GPUs may forward a chunk over multiple links while having received it only once, breaking classical network-flow assumptions and forcing a custom MILP.
Limited topology generality. Non-hierarchical topologies have fewer expressible sketches than hierarchical ones (NDv2/DGX-2).

Open Problems Called Out

Hierarchical composition. Compose synthesized algorithms across levels (intra-node, intra-rack, inter-rack) to scale to thousands of GPUs without re-solving the full MILP.
Automated sketch search. A controller that picks logical topology, switch policy, and symmetry from features of the physical topology — replacing the human designer.
Lowering to fused instructions. Extend TACCL-EF to emit NCCL's recv-reduce-copy-send style fused kernels and recover the large- message AllReduce gap.
Adaptive runtime selection. TACCL synthesis is offline; choosing among multiple synthesized algorithms at runtime based on observed message sizes / topology load is unaddressed.

Relevance to DynamICCL

TACCL operates as an offline, ahead-of-time algorithm generator; DynamICCL operates as a runtime, per-collective algorithm selector. They sit at orthogonal layers and are complementary — TACCL expands the catalog of available algorithms, DynamICCL learns to pick among them. The following implications shape DynamICCL's design:

Action-space expansion. TACCL produces algorithms that are neither Ring nor Tree but switch-hyperedge-policy-conditioned hybrids. DynamICCL's action space (algorithm enum) should be extended beyond the four NCCL built-ins to include "TACCL-EF synthesized" entries keyed by sketch identifier — turning each TACCL output into a discrete RL action.
State feature: switch-hyperedge utilization regime. TACCL shows the optimal uc-max vs uc-min choice depends on message size crossing a sharp threshold (small -> max, large -> min). DynamICCL should encode an analogous "concurrent-link-pressure" feature in its state vector — derivable from message size and topology fingerprint.
Reward shaping consistent with alpha-beta. TACCL's MILP minimizes time = alpha + beta * size which exactly matches DynamICCL's reward -collective_wall_clock_us. Reward sign and unit are compatible; no transformation needed when comparing TACCL-synthesized algorithms against NCCL Ring/Tree under DynamICCL's policy.
Topology fingerprint must be more granular. TACCL's per-platform alpha-beta values differ by ~6x (DGX-2 NVLink beta = 8, NDv2 NVLink beta = 46) due to per-link bandwidth differences. DynamICCL's topology embedding should capture link-class beta values, not just the four-class fingerprint (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet).
Exploration prior on chunk count. TACCL's input_chunkup parameter (often 2 for AllGather, finer for AllReduce) maps directly to NCCL's chunkSize / numPipeOps. DynamICCL's exploration prior should bias toward 2-4 chunks for small messages, more chunks (8-16) for large messages — matching TACCL's observed sweet spots.
Research positioning — synthesis vs. selection. TACCL leaves "adaptive runtime selection" as an open problem (see Limitations item above). DynamICCL is precisely that runtime selector; a joint TACCL
- DynamICCL stack — TACCL pre-generates a library of algorithms, DynamICCL chooses among them at each invocation — is the natural composition and a clear positioning argument for DynamICCL's contribution.
Synthesis-time as exploration budget context. TACCL takes 30 seconds to 30 minutes per algorithm. DynamICCL's exploration must amortize against a fixed library (cannot synthesize on the fly), so the action set should be discrete-and-precomputed rather than parametric-and-continuous.