TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Aashaka Shah (UT Austin), Vijay Chidambaram (UT Austin / VMware Research), Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi (Microsoft Research), Rachee Singh (Microsoft / Cornell) | NSDI '23 | USENIX, April 17-19, 2023


Problem

Inter-GPU communication has become the dominant scaling bottleneck for distributed model training. The paper measures BERT idling 11% and DeepLight 63% of the time waiting on the network. Production libraries (NCCL, MSCCL) ship a small set of fixed algorithm templates — Ring, double-binary Tree, NVLink-aware variants — that were hand-designed for a few canonical topologies. On modern heterogeneous clusters (Azure NDv2, Nvidia DGX-2) where intra-node NVLink (up to 300 GB/s) coexists with inter-node InfiniBand (12.5-25 GB/s) and per-node GPU/NIC ratios vary, these templates leave large performance gaps. Prior synthesis work (SCCL) used SMT to search the full algorithm space but stalls beyond 8 GPUs / a single node, because the search space grows exponentially with a quadratic power of cluster size.


Core Insight

Algorithm synthesis becomes tractable at multi-node scale if a human designer first sketches the shape of the algorithm — which links to use, how to handle switches, and what symmetries to enforce — leaving only the routing, ordering, and chunk-contiguity decisions to a MILP solver. A "communication sketch" plus a three-stage MILP decomposition turns a previously 24+ hour search into ~minutes for 80 GPUs.


Method

TACCL splits collective synthesis into a three-step MILP pipeline guided by a user-provided sketch:

+------------------+   +------------------+   +-----------------+
| Communication    |   | Physical         |   | Collective      |
| Sketch           |   | Topology         |   | (input/output   |
| (logical topo,   |   | (DGX-2, NDv2,    |   |  postcondition) |
|  switch policy,  |   |  NVLink, IB)     |   |                 |
|  symmetry,       |   +--------+---------+   +--------+--------+
|  input size,     |            |                      |
|  chunk count)    |            v                      v
+--------+---------+   +------------------------------------+
         |             |   Step 1: Routing (MILP, relaxed)  |
         +------------>|   - bandwidth-relaxed lower bound  |
                       |   - per-chunk path on each link    |
                       +------------------+-----------------+
                                          v
                       +------------------------------------+
                       |   Step 2: Heuristic Ordering       |
                       |   - longest-remaining-path first   |
                       |   - tie-break: shortest-traversed  |
                       +------------------+-----------------+
                                          v
                       +------------------------------------+
                       |   Step 3: Contiguity Scheduling    |
                       |   (MILP, exact)                    |
                       |   - merge chunks to amortize alpha |
                       |   - or pipeline for beta utilizn   |
                       +------------------+-----------------+
                                          v
                       +------------------------------------+
                       |   TACCL-EF (XML)                   |
                       |   single-kernel NCCL interpreter   |
                       +------------------------------------+

The sketch supplies four fields: (1) logical topology — a subset of the physical graph that bounds search, (2) switch-hyperedge policyuc-max (maximize concurrent links, latency-optimal) or uc-min (minimize concurrent links, congestion-avoiding for bandwidth), (3) algorithm symmetry — automorphisms that fold the search space, and (4) expected input size to instantiate the alpha-beta cost model.


Experimental Setup

Component Value
Platform A Azure NDv2 (8x V100 / node, NVLink intra, 100 Gb/s IB inter, no GPUDirect RDMA)
Platform B Nvidia DGX-2 (16x V100 / node, NVSwitch fully-connected, 8 IB NICs shared)
GPUs swept 8 (1 node), 16 (1-2 nodes), 32 (4 nodes), up to 80
Collectives AllGather, AllToAll, AllReduce
Buffer sizes 1 KB to 1 GB
Baselines NCCL v2.8.4-1, SCCL, MSCCL
Workloads Transformer-XL, BERT, Mixture-of-Experts
Cost model alpha-beta: NVLink alpha=0.7 us; beta=8 us/MB (DGX-2), 46 us/MB (NDv2); IB alpha=1.7 us, beta=106 us/MB
Solver MILP via Gurobi
Backend TACCL-EF XML; NCCL-embedded interpreter; single-kernel launch per collective
PyTorch integration 1-line change to torch.distributed swap

Headline Quantitative Results

Microbenchmark speedups vs. NCCL v2.8.4-1:

Collective Platform Buffer regime Speedup
AllGather DGX-2 (2 nodes) 1 KB - 1 MB 4.9x - 6.7x
AllGather DGX-2 2 MB - 64 MB 10% - 3.8x
AllGather DGX-2 256 MB - 1 GB 20% - 25%
AllGather NDv2 (2 nodes) 1 KB - 1 MB 12% - 35%
AllGather NDv2 > 1 MB 61% - 3.4x
AllToAll DGX-2 1 KB - 16 KB up to 55%
AllToAll DGX-2 >= 2 MB up to 15%
AllToAll NDv2 16 MB - 1 GB 53% - 66%
AllReduce DGX-2 1 KB - 4 MB 49% - 6.4x
AllReduce DGX-2 16 MB - 256 MB 2% - 37%
AllReduce DGX-2 >= 512 MB -9% (slower)
AllReduce NDv2 <= 1 MB up to 28%
AllReduce NDv2 large 28% - 2.7x

End-to-end training (samples/sec or steps/sec, vs. NCCL):

Workload GPUs Speedup range
Transformer-XL 16 (2 nodes) 11% - 1.94x
Transformer-XL 32 (4 nodes) 2% - 1.44x
BERT 16 12% - 2.36x
BERT 32 7% - 1.74x
Mixture-of-Experts 16 17% throughput

Synthesis cost (Table 2 — selected):

Collective Sketch Time
AllGather dgx2-sk-1 35.8 s
AllGather ndv2-sk-1 2.6 s
AllToAll dgx2-sk-2 92.5 s
AllToAll ndv2-sk-1 1809.8 s (feasible at 4m 14s)
AllReduce dgx2-sk-1 6.1 s
AllReduce ndv2-sk-1 0.3 s

SCCL fails to synthesize any multi-node algorithm within 24 hours; TACCL completes 80-GPU synthesis in under 8 minutes.


Limitations


Open Problems Called Out

  1. Hierarchical composition. Compose synthesized algorithms across levels (intra-node, intra-rack, inter-rack) to scale to thousands of GPUs without re-solving the full MILP.
  2. Automated sketch search. A controller that picks logical topology, switch policy, and symmetry from features of the physical topology — replacing the human designer.
  3. Lowering to fused instructions. Extend TACCL-EF to emit NCCL's recv-reduce-copy-send style fused kernels and recover the large- message AllReduce gap.
  4. Adaptive runtime selection. TACCL synthesis is offline; choosing among multiple synthesized algorithms at runtime based on observed message sizes / topology load is unaddressed.

Relevance to DynamICCL

TACCL operates as an offline, ahead-of-time algorithm generator; DynamICCL operates as a runtime, per-collective algorithm selector. They sit at orthogonal layers and are complementary — TACCL expands the catalog of available algorithms, DynamICCL learns to pick among them. The following implications shape DynamICCL's design:

  1. Action-space expansion. TACCL produces algorithms that are neither Ring nor Tree but switch-hyperedge-policy-conditioned hybrids. DynamICCL's action space (algorithm enum) should be extended beyond the four NCCL built-ins to include "TACCL-EF synthesized" entries keyed by sketch identifier — turning each TACCL output into a discrete RL action.

  2. State feature: switch-hyperedge utilization regime. TACCL shows the optimal uc-max vs uc-min choice depends on message size crossing a sharp threshold (small -> max, large -> min). DynamICCL should encode an analogous "concurrent-link-pressure" feature in its state vector — derivable from message size and topology fingerprint.

  3. Reward shaping consistent with alpha-beta. TACCL's MILP minimizes time = alpha + beta * size which exactly matches DynamICCL's reward -collective_wall_clock_us. Reward sign and unit are compatible; no transformation needed when comparing TACCL-synthesized algorithms against NCCL Ring/Tree under DynamICCL's policy.

  4. Topology fingerprint must be more granular. TACCL's per-platform alpha-beta values differ by ~6x (DGX-2 NVLink beta = 8, NDv2 NVLink beta = 46) due to per-link bandwidth differences. DynamICCL's topology embedding should capture link-class beta values, not just the four-class fingerprint (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet).

  5. Exploration prior on chunk count. TACCL's input_chunkup parameter (often 2 for AllGather, finer for AllReduce) maps directly to NCCL's chunkSize / numPipeOps. DynamICCL's exploration prior should bias toward 2-4 chunks for small messages, more chunks (8-16) for large messages — matching TACCL's observed sweet spots.

  6. Research positioning — synthesis vs. selection. TACCL leaves "adaptive runtime selection" as an open problem (see Limitations item above). DynamICCL is precisely that runtime selector; a joint TACCL

    • DynamICCL stack — TACCL pre-generates a library of algorithms, DynamICCL chooses among them at each invocation — is the natural composition and a clear positioning argument for DynamICCL's contribution.
  7. Synthesis-time as exploration budget context. TACCL takes 30 seconds to 30 minutes per algorithm. DynamICCL's exploration must amortize against a fixed library (cannot synthesize on the fly), so the action set should be discrete-and-precomputed rather than parametric-and-continuous.