AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training — Brief Summary
Authors: Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, Zewen Jin (USTC); Youshan Miao (Microsoft Research); Cheng Li (USTC)
Venue: USENIX NSDI 2025, Philadelphia, PA, April 28–30, 2025
URL: https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin
Code: https://github.com/gbxu/autoccl
Problem
Collective communication libraries (NCCL, RCCL) are universally treated as pre-tuned black boxes by the distributed ML training community. Most communication optimization research — scheduling, compression, topology-aware algorithms — assumes the underlying library is already optimally configured. AutoCCL's empirical study shows this assumption is wrong: NCCL's default configurations are sub-optimal, and tuning six low-level parameters can improve bandwidth by up to 35%. The challenge is twofold: (1) the combinatorial search space across six parameters spans millions of candidate configurations, making exhaustive search impractical; (2) computation and communication execute concurrently during training, so communication performance measured in isolation does not predict performance under training's computational interference.
Core Insight
Decouple NCCL's six performance-sensitive parameters into two categories — implementation parameters (Algorithm A, Protocol P, Transport T), which define topology and data-movement semantics and have a small search space — and resource parameters (Nchannel NC, Nthread NT, Chunk size C), which govern parallelism allocation and have a large, continuous search space. The joint impact of NC, NT, C on communication bandwidth is characterized by a unimodal function with a sweet spot. This unimodal structure enables coordinate descent search within each (A, P, T) subspace rather than exhaustive enumeration. Embed the search into the iterative structure of DNN training (online tuning) to amortize overhead and capture true performance under computational interference.
Method
AutoCCL operates in three phases:
1. Parameter classification and subspace division. The six NCCL parameters are split:
- Implementation parameters {A, P, T}: small search space (1 x 3 x 3 = 9 combinations), each combination defines an independent subspace.
- Resource parameters {NC, NT, C}: large, continuous search space (up to millions), tuned within each subspace via coordinate descent.
2. Performance modeling for resource parameters. Within a fixed (A, P, T) subspace, the collective execution has two phases: phase 0 (transport — data read from peer GPUs to local buffer) and phase 1 (protocol — data from buffer to SM, reduction, writeback). The effective bandwidth is:
beta(NC, NT, C) = min(beta_0(NC, C), beta_1(NC, NT))
beta_0(NC, C) = (NC * C) / (alpha_0 + NC*C / (beta_bar_0 * gamma))
beta_1(NC, NT) = (NC * NT) / (alpha_1 + NC*NT / (beta_bar_1 * gamma))
where gamma = congestion coefficient (increases with NC, NT, C)
This unimodal model justifies coordinate descent: fixing any two of {NC, NT, C}, bandwidth first rises and then falls as the third increases, so there is always a single peak along each axis (a sketch of this search appears after this list).
3. Online tuning. AutoCCL integrates the tuning loop into the first k training iterations (k << N total iterations). One GPU per communication group is designated the Leader; it runs the Optimizer (coordinate descent search) and Coordinator (broadcasts optimal config to all Workers via atomic update). Workers use default NCCL config until the Leader broadcasts a tuned config. The online approach captures computational interference naturally — profiling runs during real training, so resource contention from concurrent GEMM operations is reflected in the measured bandwidth.
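The sketch below illustrates the search described in steps 2–3: coordinate descent over the resource parameters {NC, NT, C} inside one fixed (A, P, T) subspace, which AutoCCL repeats for each of the 9 subspaces. This is a minimal illustration rather than AutoCCL's code; it assumes bandwidth is unimodal along each coordinate, as the model above argues, and the grid values and toy_bandwidth function are placeholders standing in for AutoCCL's online measurement of real collectives.

```python
# Minimal sketch (not AutoCCL's implementation) of coordinate descent over
# {NC, NT, C} within one fixed (A, P, T) subspace. Assumes bandwidth is
# unimodal along each axis; measure_bandwidth stands in for online profiling.
from typing import Callable, Dict, List

Config = Dict[str, int]

def hill_climb(config: Config, axis: str, values: List[int],
               measure_bandwidth: Callable[[Config], float]) -> Config:
    """Move along one axis while bandwidth keeps improving, then stop (single peak)."""
    best, best_bw = dict(config), measure_bandwidth(config)
    start = values.index(config[axis])
    for direction in (1, -1):                    # try larger values, then smaller
        i = start
        while 0 <= i + direction < len(values):
            candidate = dict(best, **{axis: values[i + direction]})
            bw = measure_bandwidth(candidate)
            if bw <= best_bw:                    # past the peak on this axis
                break
            best, best_bw, i = candidate, bw, i + direction
    return best

def coordinate_descent(initial: Config, grid: Dict[str, List[int]],
                       measure_bandwidth: Callable[[Config], float],
                       max_rounds: int = 5) -> Config:
    """Cycle through NC, NT, C until a full round brings no further improvement."""
    config = dict(initial)
    for _ in range(max_rounds):
        previous = dict(config)
        for axis in ("NC", "NT", "C"):
            config = hill_climb(config, axis, grid[axis], measure_bandwidth)
        if config == previous:
            break
    return config

# Illustrative value ranges (not AutoCCL's actual grid).
grid = {"NC": list(range(1, 33)),                # number of channels
        "NT": [64, 128, 256, 512],               # threads per channel
        "C": [1 << k for k in range(14, 23)]}    # chunk sizes, 16 KiB to 4 MiB

# Synthetic, separable unimodal score peaking at NC=8, NT=256, C=1 MiB,
# standing in for bandwidth measured during real training iterations.
def toy_bandwidth(cfg: Config) -> float:
    return (100.0 - (cfg["NC"] - 8) ** 2
            - ((cfg["NT"] - 256) / 64.0) ** 2
            - (cfg["C"].bit_length() - 21) ** 2)

best = coordinate_descent({"NC": 1, "NT": 64, "C": 1 << 14}, grid, toy_bandwidth)
print(best)  # {'NC': 8, 'NT': 256, 'C': 1048576}
```

Because AutoCCL's Leader runs this style of search online, timing real collectives issued by the training job, the measured bandwidth already reflects interference from concurrent computation.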
Architecture:
- NCCL (peer-to-peer design): all GPUs are identical Peers; each Peer independently selects its configuration from NCCL's cost model.
- AutoCCL: one GPU per communication group is the Leader [Optimizer + Coordinator]; the remaining GPUs are Workers. The Leader runs the tuning, broadcasts the optimal configuration atomically, and the Workers switch to the tuned configuration transparently.
AutoCCL is implemented on NCCL v2.18.3 (9,176 lines of C++) and is loaded as a preloadable dynamic library, so no code changes to PyTorch or Megatron-LM are required.
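For reference, the sketch below gives a rough mapping from the six tuned parameters to standard NCCL environment variables. This is not AutoCCL's interface (AutoCCL's preloaded library applies configurations during training rather than at launch); it only makes the parameters concrete, and the values shown are placeholders, not measured optima.

```python
# Hypothetical example: pinning a configuration via standard NCCL environment
# variables. AutoCCL itself tunes these inside a preloaded library at runtime;
# this static mapping is shown only to make the six parameters concrete.
import os

tuned = {
    "NCCL_ALGO": "Ring",            # implementation parameter A (algorithm)
    "NCCL_PROTO": "Simple",         # implementation parameter P (protocol)
    # Transport (T) has no single environment knob; NCCL selects it per peer pair.
    "NCCL_MIN_NCHANNELS": "4",      # resource parameter NC (lower bound)
    "NCCL_MAX_NCHANNELS": "4",      # resource parameter NC (upper bound)
    "NCCL_NTHREADS": "256",         # resource parameter NT (threads per channel)
    "NCCL_BUFFSIZE": str(4 << 20),  # buffer size, which bounds the chunk size C
}
os.environ.update(tuned)  # must be set before the first NCCL communicator is
                          # created, e.g., before torch.distributed.init_process_group
```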
Key Results
Cluster A: 2 nodes, 8x A40 GPUs each, 400 Gbps NVLink intra-node, 400 Gbps InfiniBand inter-node. Cluster B: 4 nodes, 8x A40 GPUs each, PCIe intra-node, 100 Gbps InfiniBand inter-node.
Communication microbenchmarks (vs. NCCL and AFNFA):
- AllGather on PCIe-8GPU: AutoCCL +22.66% vs. NCCL; AFNFA ≈ NCCL
- ReduceScatter on PCIe-8GPU: AutoCCL +27.52% vs. NCCL; AFNFA ≈ NCCL
- AllReduce, PCIe-8GPU through PCIe-32GPU: 1.28x down to 1.15x vs. NCCL
- NVLink-8GPU: 1.29x–1.22x vs. NCCL (gains persist even on the highly optimized NVLink path)
- With computational interference (GEMM running concurrently): AutoCCL achieves 1.29x (AllGather), 1.50x (ReduceScatter), 1.38x (AllReduce) vs. NCCL; AFNFA shows no improvement for AllGather/ReduceScatter under interference
End-to-end DNN training (vs. NCCL and AFNFA):

| Model | Cluster | AutoCCL vs. NCCL | AutoCCL vs. AFNFA |
|---|---|---|---|
| Phi-2-2B | PCIe-8GPU | 1.14x | 1.10x |
| Phi-2-2B | NVLink-8GPU | 1.00x | 1.07x |
| Llama-3.1-8B | PCIe-8GPU | 1.18x | 1.10x |
| Yi-1.5-34B | PCIe-32GPU | 1.07x | 1.07x |
| VGG-19 | PCIe-8GPU | ~1.05x | — |
Best case: up to a 1.32x per-iteration training-time speedup vs. both NCCL and AFNFA, with improvements of up to 32% on PCIe clusters. On NVLink, end-to-end gains are modest: communication there largely overlaps with computation, and pushing communication optimization further can even slow overall throughput.
Online tuning convergence: For LLM models with thousands of repeated microbatch communication operations per iteration, AutoCCL converges within just a few training iterations (< 10). For VGG-19 (~150 ms/iteration), convergence takes no more than 10 minutes.
Limitations
- Does not optimize transport-specific parameters (e.g., NCCL_IB_SPLIT_DATA_ON_QPS, NCCL_NET_GDR_LEVEL, P2P CUDA memcpy) due to risk of deadlocks and system-specific dependencies.
- Assumes fixed (A, P, T) subspaces are independent. In practice, the optimal resource parameters within one subspace may depend on the implementation parameters of adjacent subspaces in a complex workload.
- Unimodal model is theoretical. The performance model is validated empirically on AllGather/NVLink but may not hold for all primitives, message sizes, and topologies.
- Online tuning adds overhead to early training iterations. For short jobs (few iterations), this overhead may not be fully amortized.
- No support for dynamic reconfiguration during training. If network conditions or computation load change after convergence, AutoCCL does not re-tune. Optimal configuration is determined once and broadcast.
- Single communication group leader design can be a bottleneck for very large clusters with many distinct communication groups.
- P2P transport disabled. In Cluster B (PCIe), P2P is unavailable for some GPU pairs; AutoCCL avoids P2P transport to prevent failures, leaving some optimization potential unrealized.
Relevance to DynamICCL
AutoCCL is the closest existing work to DynamICCL and is the primary technical baseline and differentiation point.
Shared goal: Both AutoCCL and DynamICCL aim to automatically tune NCCL low-level parameters to reduce collective communication latency during distributed DNN training.
Key differentiation of DynamICCL:
RL vs. coordinate descent. AutoCCL uses coordinate descent guided by a unimodal performance model. DynamICCL uses a DQN agent, which can learn multi-modal or non-stationary performance landscapes without analytical modeling assumptions.
Reactive congestion detection. AutoCCL has no congestion awareness; it profiles and tunes once, then broadcasts a fixed config. DynamICCL's Agent-1 (LSTM+CUSUM) continuously monitors network traffic for congestion and triggers Agent-2 to reconfigure NCCL dynamically. This makes DynamICCL adaptive to real-time network state changes that AutoCCL cannot handle.
Dynamic re-tuning. AutoCCL converges to a static optimal config. DynamICCL's RL agent can reconfigure at runtime as training phase, workload intensity, or network congestion evolves.
Parameter space. AutoCCL tunes {A, P, T, NC, NT, C}. DynamICCL's action space for Agent-2 is: algorithm (ring/tree/collnet_direct/collnet_chain/nvls/nvls_tree/pat), protocol (ll/ll128/simple), nChannels (1–8), and numThreads. This covers AutoCCL's algorithm, protocol, channel-count, and thread-count dimensions while adding more algorithm variants, but it does not include transport or chunk size.
Congestion-aware vs. congestion-agnostic. AutoCCL's online tuning captures interference on average but does not detect or respond to transient congestion events. DynamICCL's CUSUM-based detector identifies specific congestion events and triggers targeted reconfiguration.
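To make the contrast concrete, the sketch below enumerates Agent-2's action space as listed above and shows the kind of one-sided CUSUM trigger Agent-1 could apply to per-collective latencies. It is an illustration under stated assumptions, not DynamICCL's implementation; the thresholds, the synthetic latency trace, and the random action choice (standing in for the trained DQN) are all placeholders.

```python
# Illustrative sketch only (not DynamICCL's code): Agent-2's action space as
# described above, plus a one-sided CUSUM detector of the kind Agent-1 could use
# to flag rising collective latency and trigger re-tuning. Thresholds, the
# synthetic latency trace, and the random action choice are placeholders.
import itertools
import random

ALGORITHMS = ["ring", "tree", "collnet_direct", "collnet_chain",
              "nvls", "nvls_tree", "pat"]
PROTOCOLS = ["ll", "ll128", "simple"]
NCHANNELS = list(range(1, 9))          # 1-8 channels
NTHREADS = [64, 128, 256, 512]         # assumed discretization of numThreads

ACTION_SPACE = list(itertools.product(ALGORITHMS, PROTOCOLS, NCHANNELS, NTHREADS))

class CusumDetector:
    """One-sided CUSUM on per-collective latency; alarms when latency drifts up."""
    def __init__(self, target_ms: float, slack_ms: float, threshold: float):
        self.target = target_ms        # expected latency under the current config
        self.slack = slack_ms          # drift tolerated before accumulating
        self.threshold = threshold     # accumulated drift that raises an alarm
        self.stat = 0.0

    def update(self, latency_ms: float) -> bool:
        self.stat = max(0.0, self.stat + (latency_ms - self.target - self.slack))
        if self.stat > self.threshold:
            self.stat = 0.0            # reset so the next event can be detected
            return True                # congestion suspected -> trigger Agent-2
        return False

detector = CusumDetector(target_ms=2.0, slack_ms=1.0, threshold=5.0)
for step in range(1000):
    # Synthetic trace: latency jumps from ~2.5 ms to ~5.5 ms at step 600.
    latency = 2.0 + (3.0 if step >= 600 else 0.0) + random.random()
    if detector.update(latency):
        # A trained DQN would select the action here; random.choice stands in.
        algo, proto, nchannels, nthreads = random.choice(ACTION_SPACE)
        print(f"step {step}: congestion detected, reconfigure to "
              f"{algo}/{proto}, {nchannels} channels, {nthreads} threads")
        break
```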
AutoCCL's empirical study (Section 3) — establishing that NCCL default configs are sub-optimal, that six parameters matter, and that computational interference substantially changes optimal configuration — provides direct motivation and quantitative grounding for DynamICCL's design.