AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training — Detailed Summary

Authors: Guanbin Xu†, Zhihao Le†, Yinhe Chen†, Zhiqi Lin†, Zewen Jin† (USTC); Youshan Miao‡ (Microsoft Research); Cheng Li†◦ (USTC)
Venue: 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 2025), April 28–30, 2025, Philadelphia, PA
URL: https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin
Code: https://github.com/gbxu/autoccl (open source)


Abstract

AutoCCL is an automated NCCL parameter tuning tool that improves collective communication performance during distributed DNN training without requiring additional hardware investment or code changes. It identifies six performance-sensitive NCCL parameters, classifies them into implementation parameters (small search space, fixed per subspace) and resource parameters (large search space, tuned via coordinate descent within each subspace), and proposes an online tuning approach that embeds the search into the early iterations of training. This captures computational interference effects and amortizes tuning overhead. AutoCCL achieves 1.24–1.29x bandwidth improvement over NCCL and 1.15–1.22x over AFNFA on communication microbenchmarks, and up to 1.32x end-to-end training speedup on three LLMs and one vision model.


1. Motivation and Problem Statement

1.1 Communication as a Training Bottleneck

Large-scale DNN training distributes computation across GPU clusters and relies on collective communication primitives (AllReduce, AllGather, ReduceScatter) to synchronize model parameters and activations. Communication is a well-documented bottleneck limiting distributed training scalability. The workloads in Table 1 illustrate the scale:

Model Parallelism AllGather ReduceScatter AllReduce
Phi-2-2B TP(8)+DP(4) (80 MB, 8, 3120), (80 MB, 8, 2064) same (632 MB, 4, 1)
Llama-3.1-8B TP(8)+DP(4) (128 MB, 8, 6240), (857 MB, 4, 1) (128 MB, 8, 4128), (1710 MB, 4, 1)
Yi-1.5-34B TP(8)+PP(4) (56 MB, 8, 61440) (56 MB, 8, 93184)
VGG-19-0.14B DP(32) (15–392 MB, 32, 6)

Format: (message size, group size, count per iteration). Yi-1.5-34B, for example, executes AllGather 61,440 times per training iteration, each call passing 56 MB across 8 GPUs.

1.2 NCCL as a Black Box

Most research optimizing communication assumes NCCL is already well-tuned. The field has focused on compute-communication overlap scheduling, gradient compression, and topology-aware collective algorithm design. These approaches build on top of NCCL. AutoCCL's empirical study in Section 3 directly falsifies the "well-tuned" assumption.

1.3 Two Core Challenges

Challenge 1 — Search space explosion. NCCL has 28 performance-sensitive parameters in total; AutoCCL identifies the 6 with the most impact. The combined search space for these 6 parameters is illustrated in Figure 2.

Exhaustively testing AllGather with an 80 MB message on a single PCIe-8GPU node takes several hours.

Challenge 2 — Computational interference. During training, GPU SMs run compute kernels (GEMM for attention and FFN layers) and communication kernels concurrently, competing for SM cores, L2 cache, and global memory bandwidth. This interference significantly degrades communication performance.


2. Background

2.1 NCCL Parameters

When a training framework calls a collective primitive (e.g., ncclAllGather), NCCL creates a communication task and assigns it a configuration tuple <A, P, T, NC, NT, C>:

Parameter Meaning Values
A (Algorithm) Collective topology: how data flows Tree, Ring
P (Protocol) Data movement method between buffer and SM LL (Low-Latency), LL128, Simple
T (Transport) Physical data transfer mechanism P2P (peer-to-peer, NVLink/PCIe), SHM (shared memory)
NC (Nchannel) Number of parallel data partitions / threadblocks 1 ≤ NC ≤ 128
NT (Nthread) Threads per threadblock (32 × i, i in {1..20}) 32–640
C (Chunk size) Data granularity per chunk in each serial transmission step 256 × i KB (i in {1..8K})

Algorithm (A): Determines the reduction topology.

Protocol (P): Determines how data is managed in GPU memory during the collective.

Transport (T): Determines the physical link used.

NC (Nchannel): NCCL partitions the message across NC channels, each an independent threadblock pipeline transmitting concurrently. Increasing NC increases parallelism but also increases congestion on shared resources (L2, DRAM, NIC).

NT (Nthread): Threads per threadblock. More threads raise SM occupancy and speed up protocol processing, but also increase SM resource pressure.

C (Chunk size): Granularity of each serial transmission step within a channel. Larger C reduces overhead per step but increases memory pressure. The total number of serial steps = message_size / (NC × C).
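The trade-off between NC and C can be made concrete with the serial-step arithmetic above (a quick sketch; the two configurations are those from Section 3's Table 3, and the step counts follow the formula, not measured NCCL behavior):

```python
def serial_steps(message_bytes: int, nc: int, chunk_bytes: int) -> int:
    """Serial transmission steps per collective: message_size / (NC * C),
    rounded up so a partial final chunk still counts as a step."""
    return -(-message_bytes // (nc * chunk_bytes))  # ceiling division

MB, KB = 1 << 20, 1 << 10

# NCCL's PCIe default from Section 3 (NC=2, C=2 MB): few large steps.
default_steps = serial_steps(80 * MB, nc=2, chunk_bytes=2 * MB)   # 20 steps
# Tuned C_P3 config (NC=8, C=32 KB): many small, well-pipelined steps.
tuned_steps = serial_steps(80 * MB, nc=8, chunk_bytes=32 * KB)    # 320 steps
```

Fewer, larger steps amortize the per-step latency; many smaller steps keep the transfer pipeline full. Tuning balances the two.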

2.2 Distributed Training Communication Patterns

Primitive Usage context Typical sizes
AllReduce Data parallelism gradient sync 15 MB – 1.7 GB
AllGather Tensor parallelism forward pass (weight gathering) 56 MB – 1.7 GB
ReduceScatter Tensor parallelism backward pass (gradient scatter) 56 MB – 1.7 GB

In tensor parallelism, AllGather and ReduceScatter are on the critical path of every microbatch forward and backward pass respectively. In data parallelism, AllReduce occurs once per mini-batch for gradient synchronization.


3. NCCL Tuning Opportunities (Section 3 — Empirical Study)

3.1 Stand-alone Communication (No Computational Interference)

AllGather, 80 MB, 8-GPU PCIe (Table 3):

Config A P T NC NT C Bandwidth
C_P0 (NCCL default) Ring Simple SHM 2 256 2 MB baseline
C_P1 Ring Simple P2P 2 96 32 KB
C_P2 Ring Simple P2P 8 256 32 KB
C_P3 Ring Simple P2P 8 96 32 KB 2.69x NCCL default

C_P3 achieves 2.69x the bandwidth of C_P0. The difference: C_P3 uses P2P transport (bypassing CPU) and splits the message into more partitions (NC=8 vs. 2) with smaller chunks (C=32 KB vs. 2 MB). The NCCL default uses SHM transport — suboptimal for this hardware.

AllGather, 80 MB, 8-GPU NVLink (Table 4):

Config A P T NC NT C
C_N0 (NCCL default) Ring Simple P2P 8 512 2 MB
C_N3 (best) Ring Simple SHM 64 512 108 KB

C_N3 achieves 1.28x C_N0. On NVLink, SHM transport outperforms P2P — the opposite of the PCIe case. This confirms that optimal transport is hardware-dependent and NCCL's default is not universally correct.

AllReduce, 8-GPU PCIe (Table 5):

Task (size, group) Config A NC NT C Bandwidth
64 MB, 16 C_0 (NCCL default) Ring 2 256 512 KB 4.0 GB/s
64 MB, 16 C_2 Tree 8 160 59 KB 8.9 GB/s (2.23x)
15 MB, 8 C_0 (NCCL default) Ring 2 256 512 KB
15 MB, 8 C_4 Ring 10 128 27 KB 8.8 GB/s

The NCCL default selects Ring for both tasks. Tree is better for the larger task (64 MB, group 16). The optimal algorithm is task-specific and NCCL's alpha-beta model does not always predict this correctly.

Key finding: NCCL's default is only 81.2% of the best achievable for AllGather on PCIe-8GPU; up to 2.69x improvement is achievable.

3.2 Computational Interference (Table 6)

AllGather, 80 MB, 8-GPU NVLink, with concurrent GEMM (A[3456,128] × B[128,3456] + sigmoid):

Condition NCCL Default BW Tuned (AutoCCL) BW
Communication-only 30.08 GB/s 38.62 GB/s (+28.4%)
Light computation 26.74 GB/s 35.14 GB/s (+31.4%)
Heavy computation 18.26 GB/s 32.44 GB/s (+77.7%)

Three important observations from this table:

  1. Even light computation degrades NCCL default bandwidth by 11.1% (30.08 to 26.74 GB/s).
  2. Heavy computation degrades NCCL default by 39.3%.
  3. AutoCCL tuning recovers performance dramatically — under heavy interference, the tuned config achieves 77.7% more bandwidth than NCCL default, and is even 7.8–16.8% higher than NCCL default without any interference.

4. Communication Tuning Method (AutoCCL Algorithm)

4.1 Parameter Classification

AutoCCL divides the six parameters into two categories:

Implementation parameters {A, P, T}: These define the topology, communication algorithm, and physical transport mechanism. They have a small combined search space (2 × 3 × 3 = 18 combinations at most, reduced to feasible subsets by hardware). They are a prerequisite for resource parameter selection — the optimal resource parameters depend on which (A, P, T) subspace is used. The paper finds that prior guidelines for selecting A (Tree vs. Ring based on message size) and T (P2P vs. SHM based on interconnect) are not always correct.

Resource parameters {NC, NT, C}: These govern the degree of parallelism and data granularity. They have a large search space (NC up to 128, NT up to 640, C up to 2048 × 256 = 512 MB). The key analytical finding: their joint impact on bandwidth is unimodal — for any fixed pair, bandwidth as a function of the third parameter first rises then falls, with a single peak.

4.2 Performance Model for Resource Parameters

The collective execution has two serial phases:

Phase 0 (Transport): Reading data from peer GPUs into the local buffer. NC parallel channels, each transferring a chunk of size C per serial step. Number of serial steps = M / (NC × C).

t_0(NC, C) = M/(NC*C) * (alpha_0 + NC*C / (beta_bar_0 / gamma))

beta_0(NC, C) = M / t_0(NC, C)
             = (NC * C) / (alpha_0 + NC*C / (beta_bar_0 / gamma))

where alpha_0 = initialization latency, beta_bar_0 = peak transport bandwidth, and gamma = congestion coefficient (monotonically increasing in NC, NT, C), so beta_bar_0 / gamma is the effective bandwidth under congestion.

Phase 1 (Protocol): Loading data from buffer to SM, performing reduction, writing back. NT threads per channel.

beta_1(NC, NT) = (NC * NT) / (alpha_1 + NC*NT / (beta_bar_1 / gamma))

Since transport and protocol stages are serial and data-dependent:

beta(NC, NT, C) = min(beta_0(NC, C), beta_1(NC, NT))    ... (Equation 1)

Both beta_0 and beta_1 individually exhibit unimodal behavior: they increase then saturate then potentially decrease as NC or C or NT increases (due to growing gamma). Their minimum also exhibits unimodal behavior, enabling coordinate descent to find the global maximum.
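The min-of-phases model can be sketched numerically to see the unimodal shape. All constants below (the alphas, the beta_bars, and the form of gamma) are illustrative placeholders, not values from the paper; the sketch treats gamma as a penalty dividing the peak bandwidth so that growing congestion degrades throughput:

```python
def gamma(nc, nt, c):
    """Toy congestion coefficient: >= 1 and monotonically increasing
    in NC, NT, and C (constants are illustrative, not fitted)."""
    return 1.0 + 0.02 * nc + 0.001 * (nt / 32) + c / (64 << 20)

def beta0(nc, c, nt=256, alpha0=20e-6, bbar0=25e9):
    """Transport-phase bandwidth; congestion divides the peak rate."""
    eff = bbar0 / gamma(nc, nt, c)
    return (nc * c) / (alpha0 + (nc * c) / eff)

def beta1(nc, nt, c=1 << 20, alpha1=5e-6, bbar1=2e9):
    """Protocol-phase 'bandwidth' in the same alpha-beta form."""
    eff = bbar1 / gamma(nc, nt, c)
    return (nc * nt) / (alpha1 + (nc * nt) / eff)

# Equation 1: end-to-end bandwidth is the min of the two serial phases.
bw = [min(beta0(nc, 32 << 10), beta1(nc, 256, c=32 << 10))
      for nc in range(1, 129)]
peak = bw.index(max(bw))   # rises, peaks at an interior NC, then falls
```

Sweeping any one of NC, NT, or C with the other two fixed yields the same rise-then-fall curve, which is the property coordinate descent exploits.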

4.3 Tuning Algorithm

Algorithm 1: Subspace-Directed Tuning

Input: Task w
optimum <- nil
for each subspace s in [A × P × T]:
    config <- CoordinateDescentSearch(s)      // Algorithm 2
    if config.BwDelta(optimum) > 0:
        optimum <- config
return optimum

The outer loop iterates through all (A, P, T) subspaces — at most 18 combinations. For each subspace, coordinate descent searches the optimal (NC, NT, C).

Algorithm 2: Coordinate Descent Search

Input: Subspace s
M <- 3     // number of resource parameters: NC, NT, C
p <- random config from subspace s
optimum <- p
dim, tuned_dim, lr <- 0, 0, 0.01
while tuned_dim < M:
    p.ProfileBw()
    if p.BwDelta(optimum) > 0:
        lr <- p.BwDelta(optimum) / p.Bw()    // learning rate = relative improvement
        optimum, tuned_dim <- p, 0
    else:
        tuned_dim <- tuned_dim + 1
        dim, lr <- (dim + 1) mod M, 0.01     // move to next dimension (wrapping around)
    p[dim] <- p[dim] + lr                    // step along current dimension
return optimum

This is coordinate-wise hill climbing guided by measured bandwidth: the step size adapts to the magnitude of improvement, and when a dimension stops improving the search moves to the next one. The search terminates once all M = 3 dimensions have been tuned without further improvement.
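A simplified, self-contained version of Algorithm 2 can be sketched as follows. `coordinate_search` and its fixed fractional step are illustrative simplifications (the paper uses an adaptive learning rate and real bandwidth profiling), and `toy_profile` is a synthetic unimodal surface standing in for measurements:

```python
def coordinate_search(profile, lo, hi, start, max_probes=200):
    """Hill-climb one resource dimension at a time, resetting the
    tuned-dimension counter whenever a step improves bandwidth."""
    best, best_bw = list(start), profile(start)
    m = len(start)
    dim, tuned = 0, 0
    while tuned < m and max_probes > 0:
        step = max(1, (hi[dim] - lo[dim]) // 16)
        improved = False
        for delta in (step, -step):          # try up, then down
            cand = list(best)
            cand[dim] = min(hi[dim], max(lo[dim], cand[dim] + delta))
            bw = profile(cand)
            max_probes -= 1
            if bw > best_bw:
                best, best_bw, tuned, improved = cand, bw, 0, True
                break
        if not improved:
            dim, tuned = (dim + 1) % m, tuned + 1   # next dimension
    return best, best_bw

# Toy unimodal bandwidth surface with its peak at NC=8, NT=96, C=32 KB.
def toy_profile(cfg):
    nc, nt, c_kb = cfg
    return 100.0 - (nc - 8) ** 2 - ((nt - 96) / 32) ** 2 - ((c_kb - 32) / 8) ** 2

best, bw = coordinate_search(
    toy_profile, lo=[1, 32, 8], hi=[128, 640, 2048], start=[2, 256, 2048])
```

Starting from a default-like point (NC=2, NT=256, C=2 MB), the search converges to the neighborhood of the peak in a few dozen probes, mirroring the fast convergence reported in Section 5.4.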

4.4 Online Tuning Architecture

                    Training Cluster
         ┌──────────────────────────────────────┐
         │  Communication Group                 │
         │                                      │
         │  ┌────────────────────────────────┐  │
         │  │        Leader Node             │  │
         │  │  ┌──────────┐ ┌───────────┐    │  │
         │  │  │ Optimizer│ │Coordinator│    │  │
         │  │  └─────┬────┘ └─────┬─────┘    │  │
         │  │        │            │ broadcast│  │
         │  │  ┌─────▼────────────▼─────┐    │  │
         │  │  │ Executor               │    │  │
         │  │  │ [Config Table:         │    │  │
         │  │  │  Default, Tuned]       │    │  │
         │  │  │ [History Table]        │    │  │
         │  │  └────────────────────────┘    │  │
         │  └────────────────────────────────┘  │
         │                                      │
         │  ┌────────────────────────────────┐  │
         │  │        Worker Node(s)          │  │
         │  │  ┌────────────────────────┐    │  │
         │  │  │ Executor               │    │  │
         │  │  │ [Config Table:         │    │  │
         │  │  │  Default, Tuned]       │    │  │
         │  │  └────────────────────────┘    │  │
         │  └────────────────────────────────┘  │
         └──────────────────────────────────────┘

Flow:
  Iteration k:  Leader Executor profiles current config -> Optimizer updates History Table
                Optimizer runs one coordinate descent step -> generates candidate config
                If candidate better: Coordinator broadcasts new config to all Workers
                Workers switch to new config atomically for subsequent iterations
  Iteration k+1..N-k: All nodes use tuned config (beneficial iterations)
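The leader-side bookkeeping implied by the flow above might look like the following sketch. `LeaderTuner`, its method names, and the string actions are hypothetical, not AutoCCL's actual interfaces; configs are hashable tuples so they can key the History Table:

```python
class LeaderTuner:
    """Illustrative leader-side state: a Config Table (best-so-far)
    and a History Table (config -> measured bandwidth)."""

    def __init__(self, candidates):
        self.candidates = iter(candidates)  # e.g. from coordinate descent
        self.history = {}                   # History Table
        self.tuned = None                   # Config Table: best-so-far
        self.best_bw = float("-inf")

    def on_iteration(self, cfg, measured_bw):
        """Called once per training iteration with the profiled bandwidth.
        A better config triggers a 'broadcast' (Coordinator -> Workers)."""
        self.history[cfg] = measured_bw
        if measured_bw > self.best_bw:
            self.tuned, self.best_bw = cfg, measured_bw
            return ("broadcast", cfg)
        return ("keep", self.tuned)

    def next_candidate(self):
        """Next config to trial; fall back to best once the search is done."""
        return next(self.candidates, self.tuned)
```

Because tuning is piggybacked on real training iterations, every iteration both makes training progress and yields one profiling sample for the optimizer.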

Key design choices:

4.5 Implementation Details


5. Evaluation

5.1 Hardware Setup

Cluster A (NVLink/InfiniBand):

Cluster B (PCIe/InfiniBand):

DNN models for end-to-end evaluation (Table 8):

Model MBS GBS TP PP DP
Phi-2-2B 8 512 8 1 1–4
Llama-3.1-8B 2 256 8 1 1–4
Yi-1.5-34B 1 1,024 8 4 1
VGG-19-0.14B 32 32×[8,16,32] 1 1 8–32

5.2 Communication Microbenchmarks

AllGather and ReduceScatter (PCIe-8GPU, no interference — Figure 7):

AllGather and ReduceScatter (NVLink-8GPU, no interference):

AllReduce (PCIe-8GPU through PCIe-32GPU — Figure 8):

With computational interference (GEMM, Figure 9 — 128 MB message):

With variable interference (Figure 10 — scaling message size and GEMM size together):

5.3 End-to-End Training Results (Figure 11)

Model Setup vs. NCCL vs. AFNFA
Phi-2-2B PCIe-8GPU 1.14x 1.10x
Phi-2-2B PCIe-16GPU 1.00x 1.07x
Phi-2-2B NVLink-8GPU 1.07x 1.07x
Phi-2-2B NVLink-16GPU ~1.00x ~1.00x
Llama-3.1-8B PCIe-8GPU 1.18x 1.10x
Llama-3.1-8B PCIe-16GPU 1.00x 1.10x
Llama-3.1-8B NVLink-8GPU ~1.00x 1.07x
Yi-1.5-34B PCIe-32GPU 1.07x 1.07x
VGG-19 PCIe-8GPU 1.05x

Best case: up to 1.32x speedup over both NCCL and AFNFA in some PCIe configurations. On NVLink, gains are modest because (a) NVLink provides much higher bandwidth, so communication is less of a bottleneck, and (b) over-optimizing communication can slow down computation in overlapped scenarios.

AFNFA sometimes degrades below NCCL in end-to-end training because its offline-tuned global configuration is not adapted to the specific computational interference patterns of individual training tasks.

5.4 Online Tuning Convergence (Figure 12)

For Phi-2-2B and Llama-3.1-8B, AutoCCL converges within the first 4–7 training iterations. This is because each training iteration executes thousands of identical AllGather/ReduceScatter operations (one per microbatch per layer), giving AutoCCL abundant profiling samples per iteration.

For VGG-19 (~150–200 ms/iteration, only 6 AllReduce calls per iteration), convergence takes up to 10 minutes. Tuning slightly extends each of the first k iterations; since k << N (the total iteration count), the amortized overhead is negligible.
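The amortization argument is easy to make concrete. The numbers below are hypothetical, chosen only to illustrate the k << N claim:

```python
def tuning_overhead(k, n_iters, per_iter_s, tuning_slowdown=0.10):
    """Fraction of total training time spent on tuning, assuming each
    of the k tuning iterations runs `tuning_slowdown` (10%) longer."""
    extra = k * per_iter_s * tuning_slowdown
    total = n_iters * per_iter_s + extra
    return extra / total

# e.g. 7 tuning iterations out of 100,000 at 10 s/iteration:
# overhead is well under 0.001% of total training time.
frac = tuning_overhead(7, 100_000, 10.0)
```

Even a pessimistic slowdown assumption leaves the amortized cost far below measurement noise.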


6. Design Choices and Limitations

6.1 Parameter Selection and Scope (Section 8.1)

AutoCCL explicitly excludes transport-specific parameters like:

Reasons: enabling P2P CUDA memcpy can cause deadlocks in some configurations; IB-specific parameters depend on hardware and are typically already optimal for the NIC type; the study found these provide limited gains. These are left for future work.

6.2 Task Failures and Valid Configuration Bounds (Section 8.2)

Two types of failures observed:

  1. Implementation parameter failures (e.g., P2P on hardware that doesn't support it) — AutoCCL avoids by excluding P2P in Cluster B.
  2. Resource saturation failures (excessively large NC, NT, C values exhaust GPU memory or cause crashes) — mitigated by imposing upper bounds on resource parameters, which is safe due to the unimodal nature of the performance function (very large values never improve performance).
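The second mitigation can be sketched as a simple clamp. The bound values follow the parameter ranges quoted in this summary (the exact C maximum differs between sections and is NCCL-version dependent; 512 MB is used here), and `clamp_config` is an illustrative helper, not AutoCCL code:

```python
# Illustrative upper/lower bounds for the resource parameters.
BOUNDS = {"NC": (1, 128), "NT": (32, 640), "C_KB": (256, 512 * 1024)}

def clamp_config(cfg):
    """Clip resource parameters into valid ranges before trialing them.
    Safe under the unimodal assumption: values beyond the peak never
    outperform it, so clipping cannot exclude the optimum."""
    return {k: min(max(v, BOUNDS[k][0]), BOUNDS[k][1]) for k, v in cfg.items()}
```

A search step that would overshoot (e.g. NC far past 128) is pulled back to the boundary instead of crashing the collective.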

7. Related Work

Category Representative works Relationship to AutoCCL
Compute-communication overlap scheduling Horovod, Centauri, TicTac Orthogonal; AutoCCL improves the underlying primitive that schedulers invoke
Gradient compression HiPress, ZeRO++, Espresso Complementary; AutoCCL can be used to tune the compressed collective
Collective algorithm design Swing (NSDI 24), MCCLang, Blink Orthogonal; these design new algorithms; AutoCCL selects among existing ones
Network tuning / NCCL configuration AFNFA (APNet 2023), NCCL built-in cost model Direct competitors; AutoCCL outperforms both by online profiling under real interference
Topology-aware training Topoopt (NSDI 23), TCCL (ASPLOS 24) Higher-level placement; AutoCCL optimizes the per-task communication config

AFNFA (Wang et al., APNet 2023): The closest prior work to AutoCCL. AFNFA automates NCCL configuration exploration using a random forest trained on offline-sampled data. Key limitations: (1) its offline-tuned global configuration does not account for per-task differences or computational interference; (2) collecting the offline sample (about 1% of the configuration space) and training the random forest is expensive; (3) its configurations degrade under interference, since the model was trained without it. AutoCCL's online tuning approach explicitly addresses all three.


8. Relevance to DynamICCL

AutoCCL is the primary technical baseline and differentiation target for DynamICCL. The two systems share the same fundamental goal — automatic NCCL parameter selection to reduce collective communication latency during distributed DNN training — but differ fundamentally in mechanism, scope, and adaptability.

8.1 Parameter Space Alignment

AutoCCL tunes: {Algorithm A, Protocol P, Transport T, Nchannel NC, Nthread NT, Chunk size C}

DynamICCL Agent-2 action space: {algorithm (ring/tree/collnet_direct/collnet_chain/nvls/nvls_tree/pat), protocol (ll/ll128/simple), nChannels (1–8), numThreads}

The overlap is near-complete. DynamICCL adds more algorithm variants (collnet, nvls, nvls_tree, pat) not present in AutoCCL, reflecting more recent NCCL capabilities. DynamICCL omits Chunk size as a direct action, instead relying on NCCL's internal chunk selection given the other parameters.

8.2 Key Design Differences

Dimension AutoCCL DynamICCL
Search method Coordinate descent (analytical unimodal model) DQN reinforcement learning
Congestion awareness Implicit: average interference captured via online profiling, no explicit congestion detection Explicit: LSTM+CUSUM Agent-1 detects congestion events
Reconfiguration Static: converges once, broadcasts fixed config Dynamic: RL agent can reconfigure at any time
Performance model assumption Unimodal bandwidth function None — RL learns the mapping without analytical assumptions
Interference modeling Captured implicitly via online profiling Explicit congestion state feeds into Agent-2's state
Collective types AllGather, ReduceScatter, AllReduce AllReduce, AllGather, ReduceScatter
Algorithm choices Tree, Ring Ring, Tree, CollNet_Direct, CollNet_Chain, NVLS, NVLS_Tree, PAT
Deployment overhead Near-zero (preload library, transparent) Requires NCCL tuner plugin API integration
Tuning convergence Fast (few iterations for LLMs) DQN typically requires more environment interaction

8.3 AutoCCL Results as DynamICCL Motivation

AutoCCL's empirical Section 3 provides quantitative grounding for DynamICCL's design decisions:

8.4 DynamICCL Advantages over AutoCCL

  1. Congestion detection and reactive reconfiguration. AutoCCL tunes once and holds configuration static. DynamICCL's LSTM+CUSUM Agent-1 detects transient network congestion events and triggers Agent-2 to reconfigure NCCL in response. This is critical in multi-tenant HPC environments (Chameleon Cloud) where network conditions change during a training run.

  2. No unimodal assumption. Coordinate descent requires the performance surface to be unimodal. Multi-tenant network contention, non-linear GPU cache effects, and concurrent pipeline parallelism can create multi-modal performance surfaces. DynamICCL's DQN agent makes no such assumption.

  3. Extended algorithm space. CollNet, NVLS, and PAT algorithms are available in recent NCCL versions for specific hardware configurations (e.g., SHARP in-network compute, NVLink Switch). AutoCCL does not include these. DynamICCL's action space covers them.

  4. Phase-aware optimization. DynamICCL can learn different configurations for different phases of training (warm-up, steady-state, large-batch finalization) as the RL policy learns temporal patterns. AutoCCL's one-time convergence cannot differentiate phases.


Citation

@inproceedings{xu2025autoccl,
  title     = {AutoCCL: Automated Collective Communication Tuning for
               Accelerating Distributed and Parallel DNN Training},
  author    = {Xu, Guanbin and Le, Zhihao and Chen, Yinhe and Lin, Zhiqi and
               Jin, Zewen and Miao, Youshan and Li, Cheng},
  booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation
               (NSDI 25)},
  pages     = {667--683},
  year      = {2025},
  publisher = {USENIX Association}
}