TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches — Detailed Summary
Aashaka Shah (UT Austin), Vijay Chidambaram (UT Austin / VMware Research), Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi (Microsoft Research), Rachee Singh (Microsoft / Cornell) | NSDI '23 | USENIX, April 17-19, 2023 | ISBN 978-1-939133-33-5
Per-section breakdown that mirrors the paper's headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them. Section headings preserved verbatim:
- Introduction, 2. Background and Motivation, 3. Communication Sketches, 4. Physical Topologies of GPU Systems, 5. TACCL Synthesizer,
- Backend, 7. Evaluation, 8. Related Work, 9. Conclusion and Future Work.
Abstract
- ML training is increasingly distributed across many GPUs and
servers; collectives (
ALLREDUCE,ALLTOALL,ALLGATHER) are the dominant scaling bottleneck. - TACCL is a synthesis tool that generates collective algorithms automatically, guided by a high-level designer-supplied communication sketch.
- Sketches reduce the search space drastically, enabling synthesis on multi-node clusters where prior SMT-based synthesis (SCCL) timed out.
- Headline outcome: up to 6.7x speedup over NCCL on microbenchmarks; end-to-end training of Transformer-XL and BERT speeds up 11% to 2.36x.
1. Introduction
Problem statement:
- Inter-GPU communication is the dominant scaling bottleneck. The paper measures BERT spending 11% and DeepLight 63% of step time idle on network communication.
- NCCL ships fixed templates: Ring (bandwidth-optimal, latency-bad) and double-binary Tree (latency-good, bandwidth-suboptimal). These templates are hand-tuned for a small set of canonical topologies and do not adapt to heterogeneous clusters.
Motivation:
- Modern clusters mix intra-node NVLink (hundreds of GB/s) with inter-node InfiniBand (12.5-25 GB/s). The optimal algorithm shape depends on this asymmetry — and on switch behavior, GPU/NIC ratios, and message size.
- SCCL (NSDI '21, also Microsoft) showed full-search synthesis is feasible at small scale (8 GPUs / 1 node) but does not scale: SCCL fails to synthesize multi-node algorithms within 24 hours.
Contributions (verbatim from Sec. 1):
- Communication sketches — a new abstraction that lets a designer provide high-level intuition (logical topology, routing, switch policy, symmetry).
- Scalable synthesis — a three-step MILP decomposition (Routing, Heuristic Ordering, Contiguity Scheduling) that scales beyond the single-node limit of SCCL.
- TACCL runtime (TACCL-EF) — an XML execution format and an NCCL-embedded interpreter that runs the synthesized algorithm in a single GPU kernel launch.
- Empirical results — significant speedups on Azure NDv2 (V100, IB) and Nvidia DGX-2 (V100, NVSwitch) for AllGather, AllToAll, and AllReduce, plus end-to-end training gains.
2. Background and Motivation
Collectives:
ALLREDUCE: every GPU contributes one tensor, all GPUs end with the reduction (sum) of all tensors. Critical for data-parallel gradient averaging.ALLTOALL: rankisends slicejto rankj. Critical for Mixture-of-Experts and DLRM (sparse-feature shuffling).ALLGATHER: every GPU contributes its slice; all GPUs end with the full concatenation. Used in ZeRO-style optimizer-state sharding.
NCCL templates:
- Ring AllReduce: O(N) hops, full bisection bandwidth; optimal bandwidth utilization at large message sizes.
- Double-binary Tree AllReduce: O(log N) hops; latency-optimal at small message sizes.
- Both are pre-coded; selecting between them is NCCL's runtime responsibility.
Topology heterogeneity (motivating numbers):
- NVLink intra-node bandwidth up to ~300 GB/s.
- InfiniBand inter-node bandwidth 12.5-25 GB/s effective.
- Per-link alpha (latency) and beta (per-byte cost) differ by orders of magnitude across the link types.
Prior synthesis work:
- SCCL (NSDI '21) — SMT-based; complete search; limited to 8 GPUs / single node due to exponential growth.
- Blink — heuristic spanning-tree packing; does not search the algorithm space; useful when NCCL cannot form a clean ring.
- Plink — locality-aware hierarchical reduction; uses fixed primitives.
- Program sketching (Solar-Lezama, ASPLOS '06) — TACCL borrows the conceptual frame of "partial program with holes filled by a solver".
3. Communication Sketches
A sketch is a partially specified program with holes. The designer fills in the topology-aware structural choices; the solver fills in routing, ordering, and contiguity. The four sketch fields:
3.1 Logical Topology
- A user-defined subset of the physical topology graph.
- Switches are abstracted as either explicit hyperedges or replaced by a small set of GPU-to-GPU direct links (the "relay" strategy).
- Purpose: bound the routing search space and avoid over-subscribing shared resources (NICs, root-complex PCIe lanes).
3.2 Switch-Hyperedge Policy
For each switch (NVSwitch on DGX-2, IBSwitch in the inter-node fabric) the sketch declares a contention policy:
| Policy | Behavior | Best regime |
|---|---|---|
uc-max |
Maximize concurrent connections through the switch | Small messages, latency-bound |
uc-min |
Minimize concurrent connections | Large messages, bandwidth-bound |
The MILP encodes this as a signed penalty term in the objective:
Minimize time + gamma * sum(is_util) where gamma < 0 for
uc-max and gamma > 0 for uc-min (Eq.
11).
3.3 Algorithm Symmetry
- Designer supplies automorphisms of the rank set / chunk set.
- Common symmetries: rotational (ring-style), hierarchical (intra-node symmetric, inter-node symmetric).
- Effect: the MILP is solved on the quotient by the symmetry group, shrinking the variable count and search time.
3.4 Expected Input Size and Chunk Partitioning
- Designer hints the synthesizer about target message size (e.g.,
input_size: "1M"). - Chunk count
Cpartitions the buffer into atomic scheduling units (sketch fieldinput_chunkup). Typical values 2-16.
3.5 Concrete Sketches (Listing 1, paraphrased)
dgx2-sk-1— high-bandwidth AllGather on DGX-2:intranode_sketch: strategy "switch", all 16 GPUs in one NVSwitch group,switch_hyperedge_strategy: ["uc-min"].internode_sketch: strategy "relay",internode_conn: {"1": [0], "3": [2], ...}— GPU 1 in node A talks only to GPUs {0, 2} in node B.beta_split: {"1": 1}— relay GPUs do not split bandwidth.hyperparameters:input_chunkup: 2,input_size: "1M".
dgx2-sk-2— low-latency variant:- Chunk size 1 KB, switch policy
uc-max, IB beta cost doubled (2 * beta_IB) to model shared-NIC contention.
- Chunk size 1 KB, switch policy
ndv2-sk-1:- Dedicated sender/receiver GPUs are picked by PCIe proximity to the NIC, avoiding the over-subscribed CPU-root-complex link.
4. Physical Topologies of GPU Systems
The paper measures and models two production topologies (Figure 5):
4.1 Azure NDv2
- 8 V100 GPUs per node.
- Intra-node NVLink graph: ring/mesh hybrid (not fully connected).
- PCIe graph: hierarchical tree under two CPU root complexes.
- Inter-node: 100 Gb/s InfiniBand. No GPUDirect RDMA — payloads must stage through host memory, doubling effective beta.
4.2 Nvidia DGX-2
- 16 V100 GPUs per node, fully connected via two NVSwitches.
- Inter-node: 8 IB NICs, shared across all 16 GPUs (hence the doubled-beta sketch trick).
4.3 Bandwidth Degradation under Concurrency (Figure 4)
- NVSwitch and IBSwitch effective per-pair bandwidth drops as the number of concurrent connections grows (queueing delay).
- This is exactly the empirical justification for
uc-minon bandwidth-bound regimes.
5. TACCL Synthesizer
5.1 Cost Model
The alpha-beta model for a single link transfer of s
bytes:
Cost(src -> r, s) = alpha(src, r) + beta(src, r) * s
Per-platform values (Table 1):
| Link Type | Platform | alpha (us) | beta (us/MB) |
|---|---|---|---|
| NVLink | Azure NDv2 | 0.7 | 46 |
| NVLink | Nvidia DGX-2 | 0.7 | 8 |
| InfiniBand | Azure NDv2 | 1.7 | 106 |
| InfiniBand | Nvidia DGX-2 | 1.7 | 106 |
5.2 Three-Step Decomposition
The MILP for the full collective is too large at multi-node scale. TACCL splits it:
Step 1: Routing Step 2: Ordering Step 3: Contiguity
+--------------------+ +-------------------+ +--------------------+
| MILP (relaxed): | | Heuristic greedy: | | MILP (exact): |
| - per-chunk path | -----> | - longest | ----> | - merge for alpha |
| - bandwidth- | | remaining-path | | - pipeline for beta|
| relaxed lower | | first | | - schedule chunks |
| bound on time | | - tie: shortest | | |
| | | traversed first | | |
+--------------------+ +-------------------+ +--------------------+
5.3 MILP Variables
| Variable | Type | Meaning |
|---|---|---|
time |
continuous | Total execution time (objective to minimize) |
start[c, r] |
continuous | Time chunk c becomes available at rank r |
send[c, src, r] |
continuous | Time c starts being sent src->r |
is_sent[c, src, r] |
binary | 1 if chunk c traverses link src->r |
is_util[src, r] |
binary | 1 if a switch-hyperedge link is used |
is_together[c, o, r] |
binary | 1 if chunks c and o are merged for contiguous transfer to r |
5.4 Selected Constraints (verbatim numbering)
(Eq. 1) Minimize time
(Eq. 2) time >= start[c, r] for all (c, r) in postcondition
(Eq. 4) send[c, src, r] >= start[c, src] (cannot forward before receive)
(Eq. 5) is_sent[c, src, r] => start[c, r] = send[c, src, r] + lat(src, r)
(Eq. 6) time >= sum_{c in C} ( lat(src, r) * is_sent[c, src, r] )
for all (src, r) in L (relaxed bandwidth lower bound)
(Eq. 11) Minimize time + gamma * sum(is_util) (switch-hyperedge policy term)
(Eq. 15) sum over inter-node links is_sent[c, src, r] >= 1
for chunks crossing nodes (inter-node existence constraint)
(Eq. 16) is_together[c, o, r] => send[c, src, r] = send[o, src, r]
(Eq. 17) lat[c, src, r] = alpha(src, r) + beta(src, r)
* sum_o is_together[c, o, r]
Eq. 6 is the bandwidth relaxation — it allows transfers on the same link to overlap in Step 1, restoring exact serialization in Step 3. This relaxation is the key scalability trick over SCCL's monolithic SMT.
5.5 Logical Topology Encoder
- Converts physical graph to a smaller GPU-to-GPU graph.
- Switches become "switch-hyperedges": a set of potential direct links
with a cap on simultaneous activity (modeled by
is_util). - This is what makes the MILP tractable: instead of modeling every switch-internal queue, the policy term in Eq. 11 captures the effective latency/bandwidth trade-off.
5.6 Flow-Conservation Violation
A non-source GPU may send the same chunk over multiple
outgoing links while having received it only once. Classical
network-flow formulations do not allow this; TACCL's bespoke MILP must
encode forwarding explicitly via is_sent and
start decoupling, rather than borrowing standard
min-cost-flow constraints.
5.7 Scalability Numbers
- SCCL: cannot synthesize any multi-node collective within 24 hours.
- TACCL: synthesizes 80-GPU collectives in under 8 minutes.
- Per-collective synthesis times in Table 2 range from 0.3 s (AllReduce / NDv2) to 30 minutes (AllToAll / NDv2 with full optimality; feasible solution returned in 4m 14s).
6. Backend
6.1 TACCL-EF (Execution Format)
- XML format describing the collective as a set of "GPU programs".
- Each GPU program is a list of threadblocks; each threadblock contains ordered instructions.
- Instructions:
send,receive(with optional reduction),local copy. No fusedrecv-reduce-copy-send(this is the source of the large-message AllReduce gap). - Buffers (input, output, scratch) are sliced into equal chunks matching the synthesized chunk count.
6.2 Interpreter
- Embedded in NCCL — reuses NCCL's transport (IB verbs, NVLink, PCIe-P2P) and its connection-establishment logic.
- Replaces NCCL's hard-coded ring/tree state machine with a generic XML-driven dispatcher.
- Single kernel launch: the entire collective runs in one CUDA kernel, eliminating per-step launch overhead. Critical for small messages where TACCL's biggest gains live.
6.3 Threading Model
- Each threadblock is restricted to communicating with at most one peer GPU — simplifies synchronization, avoids cross-peer fences.
- Multiple threadblocks per GPU enable concurrent inter-GPU links.
6.4 Instances
- To saturate high-bandwidth NVLinks the algorithm can be replicated
in parallel
ntimes by sub-dividing chunks intonsub-chunks. - Trade-off: more instances = better link utilization, more synch overhead.
6.5 PyTorch Integration
- One-line change to
torch.distributed: register TACCL as the collective backend in place of NCCL. - API-compatible with NCCL — drop-in for any framework that uses NCCL.
7. Evaluation
All speedups are vs. NCCL v2.8.4-1 unless stated otherwise.
7.1 AllGather (Figure 6)
| Platform | Buffer | Speedup |
|---|---|---|
| DGX-2 (2 nodes) | 1 KB - 1 MB | 4.9x - 6.7x |
| DGX-2 | 2 MB - 64 MB | 10% - 3.8x |
| DGX-2 | 256 MB - 1 GB | 20% - 25% |
| NDv2 (2 nodes) | 1 KB - 1 MB | 12% - 35% |
| NDv2 | > 1 MB | 61% - 3.4x |
Insight: AllGather has no reduction, so TACCL's lack of fused-ops penalty is irrelevant; gains are largest where NCCL's small-message overhead dominates.
7.2 AllToAll (Figure 7)
| Platform | Buffer | Speedup |
|---|---|---|
| DGX-2 | 1 KB - 16 KB | up to 55% |
| DGX-2 | >= 2 MB | up to 15% |
| NDv2 | 16 MB - 1 GB | 53% - 66% |
Insight: AllToAll is bandwidth-heavy on NDv2 because every rank pair
must communicate; TACCL's uc-min and per-NIC routing matter
most.
7.3 AllReduce (Figure 8)
| Platform | Buffer | Speedup |
|---|---|---|
| DGX-2 | 1 KB - 4 MB | 49% - 6.4x |
| DGX-2 | 16 MB - 256 MB | 2% - 37% |
| DGX-2 | >= 512 MB | -9% (TACCL slower) |
| NDv2 | <= 1 MB | up to 28% |
| NDv2 | large | 28% - 2.7x |
Insight: large-message AllReduce on DGX-2 is where TACCL loses to
NCCL because NCCL's recv-reduce-copy-send fused kernel
saves a register- file round-trip per chunk; TACCL's separate
instructions cannot match.
7.4 Sensitivity Analysis (Figure 9)
- Varying the sketch's
n(IB connections) shifts the curve; too few connections starve bandwidth, too many over-saturate the shared NIC. - Varying chunk size: small chunks favor latency regime, large chunks favor bandwidth regime — confirms the alpha-beta trade-off.
7.5 End-to-End Training (Figure 10)
| Workload | Cluster | Speedup vs. NCCL |
|---|---|---|
| Transformer-XL | 16 GPU (2 nodes) | 11% - 1.94x (batch-size dependent) |
| Transformer-XL | 32 GPU (4 nodes) | 2% - 1.44x |
| BERT | 16 GPU | 12% - 2.36x |
| BERT | 32 GPU | 7% - 1.74x |
| Mixture-of-Experts | 16 GPU | 17% throughput |
Insight: end-to-end gains are highest at small batch sizes (more collectives per second, latency dominates) and shrink at large batch sizes (compute dominates).
7.6 Synthesis Time (Table 2)
| Collective | Sketch | Time |
|---|---|---|
| AllGather | dgx2-sk-1 | 35.8 s |
| AllGather | ndv2-sk-1 | 2.6 s |
| AllToAll | dgx2-sk-2 | 92.5 s |
| AllToAll | ndv2-sk-1 | 1809.8 s (feasible at 254 s) |
| AllReduce | dgx2-sk-1 | 6.1 s |
| AllReduce | ndv2-sk-1 | 0.3 s |
For comparison, SCCL times out at 24 hours for all multi-node configurations.
8. Related Work
| System | Approach | Limitation TACCL improves |
|---|---|---|
| SCCL [9] | SMT, full search | Single node / 8 GPUs only |
| Blink [51] | Spanning-tree packing | Doesn't search algorithm space |
| Plink | Hierarchical reduction with fixed primitives | No synthesis |
| NCCL | Hand-tuned templates | No automation, no per-topology adaptation |
| MSCCL | Microsoft's NCCL fork; supports custom XML programs (similar runtime to TACCL-EF) | Provides runtime but not synthesis — TACCL is the synthesis layer that produces inputs MSCCL can consume |
| Program Sketching [Solar-Lezama, 47] | General partial-program synthesis | Conceptual ancestor, not domain-specific |
The MSCCL/TACCL relationship is the most subtle: MSCCL is a runtime library descended from SCCL's runtime that can execute custom collectives expressed in XML. TACCL fills the missing piece — how to generate those XML programs scalably.
9. Conclusion and Future Work
- Sketches make collective synthesis tractable at multi-node scale.
- TACCL beats NCCL by up to 6.7x on microbenchmarks and accelerates end-to-end Transformer-XL/BERT training by up to 2.36x.
- Future work explicitly listed:
- Hierarchical composition of synthesized algorithms to scale to thousands of GPUs.
- Automated controllers for sketch-space exploration (no human in the loop).
- Lowering to fused communication instructions to recover the large-message AllReduce gap.
- Adaptive runtime selection among synthesized algorithms based on observed message size and topology load — explicitly named as a missing piece.
10. Limitations of the Methodology
- No fused ops in TACCL-EF. Verbatim: "NCCL uses the
more optimized fused communication instructions (such as
receive-reduce-copy-send) in itsALLREDUCEcommunication which are unavailable in TACCL's lowering." Result: -9% on AllReduce >= 512 MB on DGX-2. - NP-hard core. Verbatim: "While TACCL scales to multi-node topologies, the synthesis technique is still based on solving an NP-hard problem that grows exponentially with a quadratic power with scale."
- Topological generality. Verbatim: "The amount of exploration we can do with different communication sketches may be more limited in [non-hierarchical] cases than for hierarchical topologies."
- Flow-conservation violation. Verbatim: "Non-source GPUs in a collective can send the same chunk over different links in parallel while having received that chunk only once, which violates an important flow-conservation property used extensively in network flow problem literature."
- One platform generation. V100 only. A100/H100 + NVSwitch generation 3, NVLink 4.0, NDR IB are unmeasured.
- One framework. PyTorch only; integration with TensorFlow/JAX collectives left as engineering future work.
11. Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
Switch-hyperedge policy is regime-dependent: uc-max for
small, uc-min for large messages |
Sec. 3.2 + Figure 4 |
| Routing + Ordering + Contiguity decomposition makes 80-GPU MILP tractable | Sec. 5 + Table 2 |
| Single-kernel-launch interpreter is essential for small-message gains | Sec. 6 + 6.7x AllGather speedup at 1 KB |
Fused recv-reduce-copy-send is the missing instruction
for large-message AllReduce parity |
Sec. 7.3 + Sec. 10 limitations |
| Sketch authorship is human and topology-specific; not yet automated | Sec. 3, 4, 9 |
12. Discussion of NCCL
- TACCL-EF runs inside NCCL — reusing transport, ranks, and channels.
- TACCL replaces the NCCL algorithm chooser at synthesis time, not at runtime — once a TACCL-EF XML is loaded, every collective of that type runs the synthesized algorithm.
- Because TACCL plugs in below the framework but above the wire, it occupies the same architectural slot that NCCL's tuner-plugin API exposes, but with full algorithm-replacement semantics rather than parameter tuning.
13. Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads, chunkSize) at runtime via the NCCL tuner-plugin API. TACCL is an offline synthesizer that produces algorithms; DynamICCL is an online selector that picks among algorithms. The two layers are complementary: TACCL expands the action vocabulary, DynamICCL learns when to use each entry.
| TACCL finding | DynamICCL design implication |
|---|---|
| Up to 6.7x speedup on small-message AllGather via custom switch-hyperedge routing | Action space must include TACCL-synthesized algorithms as discrete actions, not only the four NCCL built-ins (Ring/Tree/CollNet/NVLS) |
uc-max vs uc-min switching is sharply
size-dependent |
State feature: size-bin should drive a learned "concurrent-link-pressure" feature; reward signal will reflect contention crossover |
| alpha-beta cost model with platform-specific values (DGX-2 NVLink beta = 8 us/MB vs NDv2 NVLink beta = 46 us/MB) | Topology fingerprint must encode link-class beta values, not just NVLink-vs-PCIe-vs-IB-vs-Ethernet — a 6x within-class variation matters |
| Single-kernel-launch interpreter dominates per-step launch | Action choice should track expected per-collective overhead; reward
function -collective_wall_clock_us already captures
this |
| Synthesis takes seconds to tens of minutes | Action set must be discrete and precomputed; DynamICCL cannot synthesize on the fly. Library-of-algorithms model: TACCL pre-generates per-(topology, message-size-band, collective) algorithms, DynamICCL chooses |
| End-to-end Transformer-XL / BERT speedups are batch-size sensitive | State feature: local batch size is correctly already in DynamICCL's state vector — confirmed important by TACCL's data |
Sketch field input_chunkup (2-16 chunks) maps to NCCL
chunkSize |
Exploration prior on chunkSize: 2 chunks for small messages, 8-16 for large — matches TACCL sweet spots |
| Future-work item: "adaptive runtime selection among synthesized algorithms" | DynamICCL is literally the system TACCL's future-work section calls for. Position the contribution as the runtime-selection counterpart to TACCL's offline-synthesis layer |
Specific design priors derived from TACCL:
Action-space expansion. Add TACCL-EF synthesized algorithms as first-class discrete actions in DynamICCL's action enum. The action space becomes {Ring, Tree, CollNet, NVLS, TACCL-1, TACCL-2, ...} per collective type. Each TACCL entry corresponds to a (sketch, target- size-band) pre-synthesized binary.
State-vector feature: per-link beta and alpha. Replace the coarse 4-class topology fingerprint with a vector of effective alpha and beta values per link class on the actual cluster. TACCL's Table 1 shows within-class variation up to 6x — coarse fingerprints throw away signal.
State-vector feature: switch concurrency regime. Add a feature that reflects whether the current message-size band favors
uc-max-style routing (small) oruc-min-style (large). This can be a learned scalar conditioned on log-binned message size and topology.Reward shaping consistency. TACCL's MILP minimizes
time = alpha + beta * size; DynamICCL's reward-collective_wall_clock_usis identical in unit and sign. No transformation needed when running TACCL-synthesized algorithms inside DynamICCL's evaluation loop.Exploration budget allocation. Spend more exploration on the transition zones TACCL identifies (1-16 KB AllGather; 16 MB - 1 GB AllToAll; 1-4 MB AllReduce). At deep small / deep large extremes the optimum is more stable; the policy can converge faster there.
Research positioning. TACCL's paper explicitly leaves "adaptive runtime selection" as future work in Sec. 9. DynamICCL is precisely that runtime selector. A clean joint-stack story is:
- TACCL (offline) builds a library of high-quality algorithms per (topology, collective, message-size-band).
- DynamICCL (online) learns a policy that picks among them per invocation, conditioned on dynamic state (recent timing, batch size, model intensity).
- Together they form a complete synthesize-then-select pipeline, bridging the offline/online gap that neither system addresses alone.
Open-problem alignment. TACCL's listed future-work items map to DynamICCL design choices:
- "Hierarchical composition" -> DynamICCL's per-level action decomposition (intra-node algorithm and inter-node algorithm as separate sub-actions).
- "Automated sketch search" -> DynamICCL's policy could in principle produce sketch hints, but in the near term DynamICCL consumes human-authored sketches as fixed inputs.
- "Lowering to fused instructions" -> orthogonal to DynamICCL; TACCL's compiler problem, not the RL agent's.