AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training — Detailed Summary

Authors: Guanbin Xu†, Zhihao Le†, Yinhe Chen†, Zhiqi Lin†, Zewen Jin† (USTC); Youshan Miao‡ (Microsoft Research); Cheng Li†◦ (USTC)
Venue: 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 2025), April 28–30, 2025, Philadelphia, PA
URL: https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin
Code: https://github.com/gbxu/autoccl (open source)


Abstract

AutoCCL is an automated NCCL parameter tuning tool that improves collective communication performance during distributed DNN training without requiring additional hardware investment or code changes. It identifies six performance-sensitive NCCL parameters, classifies them into implementation parameters (small search space, fixed per subspace) and resource parameters (large search space, tuned via coordinate descent within each subspace), and proposes an online tuning approach that embeds the search into the early iterations of training. This captures computational interference effects and amortizes tuning overhead. AutoCCL achieves 1.24–1.29x bandwidth improvement over NCCL and 1.15–1.22x over AFNFA on communication microbenchmarks, and up to 1.32x end-to-end training speedup on three LLMs and one vision model.


1. Motivation and Problem Statement

1.1 Communication as a Training Bottleneck

Large-scale DNN training distributes computation across GPU clusters and relies on collective communication primitives (AllReduce, AllGather, ReduceScatter) to synchronize model parameters and activations. Communication is a well-documented bottleneck limiting distributed training scalability. The workloads in Table 1 illustrate the scale:

Model Parallelism AllGather ReduceScatter AllReduce
Phi-2-2B TP(8)+DP(4) (80 MB, 8, 3120), (80 MB, 8, 2064) same (632 MB, 4, 1)
Llama-3.1-8B TP(8)+DP(4) (128 MB, 8, 6240), (857 MB, 4, 1) (128 MB, 8, 4128), (1710 MB, 4, 1)
Yi-1.5-34B TP(8)+PP(4) (56 MB, 8, 61440) (56 MB, 8, 93184)
VGG-19-0.14B DP(32) (15–392 MB, 32, 6)

Format: (message size, group size, count per iteration). Yi-1.5-34B, for example, executes AllGather 61,440 times per training iteration, each call passing 56 MB across 8 GPUs.

1.2 NCCL as a Black Box

Most research optimizing communication assumes NCCL is already well-tuned. The field has focused on compute-communication overlap scheduling, gradient compression, and topology-aware collective algorithm design. These approaches build on top of NCCL. AutoCCL's empirical study in Section 3 directly falsifies the "well-tuned" assumption.

1.3 Two Core Challenges

Challenge 1 — Search space explosion. NCCL has 28 performance-sensitive parameters in total; AutoCCL identifies the 6 with the most impact. The combined search space for these 6 parameters is illustrated in Figure 2.

Exhaustively testing AllGather with an 80 MB message on a single PCIe-8GPU node takes several hours.

Challenge 2 — Computational interference. During training, GPU SMs run compute kernels (GEMM for attention and FFN layers) and communication kernels concurrently, competing for SM cores, L2 cache, and global memory bandwidth. This interference significantly degrades communication performance.


2. Background

2.1 NCCL Parameters

When a training framework calls a collective primitive (e.g., ncclAllGather), NCCL creates a communication task and assigns it a configuration tuple <A, P, T, NC, NT, C>:

Parameter Meaning Values
A (Algorithm) Collective topology: how data flows Tree, Ring
P (Protocol) Data movement method between buffer and SM LL (Low-Latency), LL128, Simple
T (Transport) Physical data transfer mechanism P2P (peer-to-peer, NVLink/PCIe), SHM (shared memory)
NC (Nchannel) Number of parallel data partitions / threadblocks 1 ≤ NC ≤ 128
NT (Nthread) Threads per threadblock (32 × i, i in {1..20}) 32–640
C (Chunk size) Data granularity per chunk in each serial transmission step 256 × i KB (i in {1..8K})

Algorithm (A): Determines the reduction topology.

Protocol (P): Determines how data is managed in GPU memory during the collective.

Transport (T): Determines the physical link used.

NC (Nchannel): NCCL partitions the message across NC channels, each an independent threadblock pipeline transmitting concurrently. Increasing NC increases parallelism but also increases congestion on shared resources (L2, DRAM, NIC).

NT (Nthread): Threads per threadblock. More threads raise SM occupancy and speed up protocol processing, but also increase SM resource pressure.

C (Chunk size): Granularity of each serial transmission step within a channel. Larger C reduces overhead per step but increases memory pressure. The total number of serial steps = message_size / (NC × C).
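The trade-off between NC and C can be made concrete with the serial-step arithmetic above (a quick sketch; the two configurations are those from Section 3's Table 3, and the step counts follow the formula, not measured NCCL behavior):

```python
def serial_steps(message_bytes: int, nc: int, chunk_bytes: int) -> int:
    """Serial transmission steps per collective: message_size / (NC * C),
    rounded up so a partial final chunk still counts as a step."""
    return -(-message_bytes // (nc * chunk_bytes))  # ceiling division

MB, KB = 1 << 20, 1 << 10

# NCCL's PCIe default from Section 3 (NC=2, C=2 MB): few large steps.
default_steps = serial_steps(80 * MB, nc=2, chunk_bytes=2 * MB)   # 20 steps
# Tuned C_P3 config (NC=8, C=32 KB): many small, well-pipelined steps.
tuned_steps = serial_steps(80 * MB, nc=8, chunk_bytes=32 * KB)    # 320 steps
```

Fewer, larger steps amortize the per-step latency; many smaller steps keep the transfer pipeline full. Tuning balances the two.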

2.2 Distributed Training Communication Patterns

Primitive Usage context Typical sizes
AllReduce Data parallelism gradient sync 15 MB – 1.7 GB
AllGather Tensor parallelism forward pass (weight gathering) 56 MB – 1.7 GB
ReduceScatter Tensor parallelism backward pass (gradient scatter) 56 MB – 1.7 GB

In tensor parallelism, AllGather and ReduceScatter are on the critical path of every microbatch forward and backward pass respectively. In data parallelism, AllReduce occurs once per mini-batch for gradient synchronization.


3. NCCL Tuning Opportunities (Section 3 — Empirical Study)

3.1 Stand-alone Communication (No Computational Interference)

AllGather, 80 MB, 8-GPU PCIe (Table 3):

Config A P T NC NT C Bandwidth
C_P0 (NCCL default) Ring Simple SHM 2 256 2 MB baseline
C_P1 Ring Simple P2P 2 96 32 KB
C_P2 Ring Simple P2P 8 256 32 KB
C_P3 Ring Simple P2P 8 96 32 KB 2.69x NCCL default

C_P3 achieves 2.69x the bandwidth of C_P0. The difference: C_P3 uses P2P transport (bypassing CPU) and splits the message into more partitions (NC=8 vs. 2) with smaller chunks (C=32 KB vs. 2 MB). The NCCL default uses SHM transport — suboptimal for this hardware.

AllGather, 80 MB, 8-GPU NVLink (Table 4):

Config A P T NC NT C
C_N0 (NCCL default) Ring Simple P2P 8 512 2 MB
C_N3 (best) Ring Simple SHM 64 512 108 KB

C_N3 achieves 1.28x C_N0. On NVLink, SHM transport outperforms P2P — the opposite of the PCIe case. This confirms that optimal transport is hardware-dependent and NCCL's default is not universally correct.

AllReduce, 8-GPU PCIe (Table 5):

Task (size, group) Config A NC NT C Bandwidth
64 MB, 16 C_0 (NCCL default) Ring 2 256 512 KB 4.0 GB/s
64 MB, 16 C_2 Tree 8 160 59 KB 8.9 GB/s (2.23x)
15 MB, 8 C_0 (NCCL default) Ring 2 256 512 KB
15 MB, 8 C_4 Ring 10 128 27 KB 8.8 GB/s

The NCCL default selects Ring for both tasks. Tree is better for the larger task (64 MB, group 16). The optimal algorithm is task-specific and NCCL's alpha-beta model does not always predict this correctly.

Key finding: NCCL's default is only 81.2% of the best achievable for AllGather on PCIe-8GPU; up to 2.69x improvement is achievable.

3.2 Computational Interference (Table 6)

AllGather, 80 MB, 8-GPU NVLink, with concurrent GEMM (A[3456,128] × B[128,3456] + sigmoid):

Condition NCCL Default BW Tuned (AutoCCL) BW
Communication-only 30.08 GB/s 38.62 GB/s (+28.4%)
Light computation 26.74 GB/s 35.14 GB/s (+31.4%)
Heavy computation 18.26 GB/s 32.44 GB/s (+77.7%)

Three important observations from this table:

  1. Even light computation degrades NCCL default bandwidth by 11.1% (30.08 to 26.74 GB/s).
  2. Heavy computation degrades NCCL default by 39.3%.
  3. AutoCCL tuning recovers performance dramatically — under heavy interference, the tuned config achieves 77.7% more bandwidth than NCCL default, and is even 7.8–16.8% higher than NCCL default without any interference.

4. Communication Tuning Method (AutoCCL Algorithm)

4.1 Parameter Classification

AutoCCL divides the six parameters into two categories:

Implementation parameters {A, P, T}: These define the topology, communication algorithm, and physical transport mechanism. They have a small combined search space (2 × 3 × 3 = 18 combinations at most, reduced to feasible subsets by hardware). They are a prerequisite for resource parameter selection — the optimal resource parameters depend on which (A, P, T) subspace is used. The paper finds that prior guidelines for selecting A (Tree vs. Ring based on message size) and T (P2P vs. SHM based on interconnect) are not always correct.

Resource parameters {NC, NT, C}: These govern the degree of parallelism and data granularity. They have a large search space (NC up to 128, NT up to 640, C up to 2048 × 256 = 512 MB). The key analytical finding: their joint impact on bandwidth is unimodal — for any fixed pair, bandwidth as a function of the third parameter first rises then falls, with a single peak.

4.2 Performance Model for Resource Parameters

The collective execution has two serial phases:

Phase 0 (Transport): Reading data from peer GPUs into the local buffer. NC parallel channels, each transferring a chunk of size C per serial step. Number of serial steps = M / (NC × C).

t_0(NC, C) = M/(NC*C) * (alpha_0 + NC*C / (beta_bar_0 / gamma))

beta_0(NC, C) = M / t_0(NC, C)
             = (NC * C) / (alpha_0 + NC*C / (beta_bar_0 / gamma))

where alpha_0 = initialization latency, beta_bar_0 = peak transport bandwidth, and gamma = congestion coefficient (monotonically increasing in NC, NT, C), so beta_bar_0 / gamma is the effective bandwidth under congestion.

Phase 1 (Protocol): Loading data from buffer to SM, performing reduction, writing back. NT threads per channel.

beta_1(NC, NT) = (NC * NT) / (alpha_1 + NC*NT / (beta_bar_1 / gamma))

Since transport and protocol stages are serial and data-dependent:

beta(NC, NT, C) = min(beta_0(NC, C), beta_1(NC, NT))    ... (Equation 1)

Both beta_0 and beta_1 individually exhibit unimodal behavior: they increase then saturate then potentially decrease as NC or C or NT increases (due to growing gamma). Their minimum also exhibits unimodal behavior, enabling coordinate descent to find the global maximum.
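The min-of-phases model can be sketched numerically to see the unimodal shape. All constants below (the alphas, the beta_bars, and the form of gamma) are illustrative placeholders, not values from the paper; the sketch treats gamma as a penalty dividing the peak bandwidth so that growing congestion degrades throughput:

```python
def gamma(nc, nt, c):
    """Toy congestion coefficient: >= 1 and monotonically increasing
    in NC, NT, and C (constants are illustrative, not fitted)."""
    return 1.0 + 0.02 * nc + 0.001 * (nt / 32) + c / (64 << 20)

def beta0(nc, c, nt=256, alpha0=20e-6, bbar0=25e9):
    """Transport-phase bandwidth; congestion divides the peak rate."""
    eff = bbar0 / gamma(nc, nt, c)
    return (nc * c) / (alpha0 + (nc * c) / eff)

def beta1(nc, nt, c=1 << 20, alpha1=5e-6, bbar1=2e9):
    """Protocol-phase 'bandwidth' in the same alpha-beta form."""
    eff = bbar1 / gamma(nc, nt, c)
    return (nc * nt) / (alpha1 + (nc * nt) / eff)

# Equation 1: end-to-end bandwidth is the min of the two serial phases.
bw = [min(beta0(nc, 32 << 10), beta1(nc, 256, c=32 << 10))
      for nc in range(1, 129)]
peak = bw.index(max(bw))   # rises, peaks at an interior NC, then falls
```

Sweeping any one of NC, NT, or C with the other two fixed yields the same rise-then-fall curve, which is the property coordinate descent exploits.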

4.3 Tuning Algorithm

Algorithm 1: Subspace-Directed Tuning

Input: Task w
optimum <- nil
for each subspace s in [A × P × T]:
    config <- CoordinateDescentSearch(s)      // Algorithm 2
    if config.BwDelta(optimum) > 0:
        optimum <- config
return optimum

The outer loop iterates through all (A, P, T) subspaces — at most 18 combinations. For each subspace, coordinate descent searches the optimal (NC, NT, C).

Algorithm 2: Coordinate Descent Search

Input: Subspace s
M <- 3     // number of resource parameters: NC, NT, C
p <- random config from subspace s
optimum <- p
dim, tuned_dim, lr <- 0, 0, 0.01
while tuned_dim < M:
    p.ProfileBw()
    if p.BwDelta(optimum) > 0:
        lr <- p.BwDelta(optimum) / p.Bw()    // learning rate = relative improvement
        optimum, tuned_dim <- p, 0
    else:
        tuned_dim <- tuned_dim + 1
        dim, lr <- (dim + 1) mod M, 0.01     // move to next dimension (wrapping around)
    p[dim] <- p[dim] + lr                    // step along current dimension
return optimum

This is coordinate-wise hill climbing guided by measured bandwidth: the step size adapts to the magnitude of improvement, and when a dimension stops improving the search moves to the next one. The search terminates once all M = 3 dimensions have been tuned without further improvement.
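A simplified, self-contained version of Algorithm 2 can be sketched as follows. `coordinate_search` and its fixed fractional step are illustrative simplifications (the paper uses an adaptive learning rate and real bandwidth profiling), and `toy_profile` is a synthetic unimodal surface standing in for measurements:

```python
def coordinate_search(profile, lo, hi, start, max_probes=200):
    """Hill-climb one resource dimension at a time, resetting the
    tuned-dimension counter whenever a step improves bandwidth."""
    best, best_bw = list(start), profile(start)
    m = len(start)
    dim, tuned = 0, 0
    while tuned < m and max_probes > 0:
        step = max(1, (hi[dim] - lo[dim]) // 16)
        improved = False
        for delta in (step, -step):          # try up, then down
            cand = list(best)
            cand[dim] = min(hi[dim], max(lo[dim], cand[dim] + delta))
            bw = profile(cand)
            max_probes -= 1
            if bw > best_bw:
                best, best_bw, tuned, improved = cand, bw, 0, True
                break
        if not improved:
            dim, tuned = (dim + 1) % m, tuned + 1   # next dimension
    return best, best_bw

# Toy unimodal bandwidth surface with its peak at NC=8, NT=96, C=32 KB.
def toy_profile(cfg):
    nc, nt, c_kb = cfg
    return 100.0 - (nc - 8) ** 2 - ((nt - 96) / 32) ** 2 - ((c_kb - 32) / 8) ** 2

best, bw = coordinate_search(
    toy_profile, lo=[1, 32, 8], hi=[128, 640, 2048], start=[2, 256, 2048])
```

Starting from a default-like point (NC=2, NT=256, C=2 MB), the search converges to the neighborhood of the peak in a few dozen probes, mirroring the fast convergence reported in Section 5.4.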

4.4 Online Tuning Architecture

                    Training Cluster
         ┌──────────────────────────────────────┐
         │  Communication Group                 │
         │                                      │
         │  ┌────────────────────────────────┐  │
         │  │        Leader Node             │  │
         │  │  ┌──────────┐ ┌───────────┐    │  │
         │  │  │ Optimizer│ │Coordinator│    │  │
         │  │  └─────┬────┘ └─────┬─────┘    │  │
         │  │        │            │ broadcast│  │
         │  │  ┌─────▼────────────▼─────┐    │  │
         │  │  │ Executor               │    │  │
         │  │  │ [Config Table:         │    │  │
         │  │  │  Default, Tuned]       │    │  │
         │  │  │ [History Table]        │    │  │
         │  │  └────────────────────────┘    │  │
         │  └────────────────────────────────┘  │
         │                                      │
         │  ┌────────────────────────────────┐  │
         │  │        Worker Node(s)          │  │
         │  │  ┌────────────────────────┐    │  │
         │  │  │ Executor               │    │  │
         │  │  │ [Config Table:         │    │  │
         │  │  │  Default, Tuned]       │    │  │
         │  │  └────────────────────────┘    │  │
         │  └────────────────────────────────┘  │
         └──────────────────────────────────────┘

Flow:
  Iteration k:  Leader Executor profiles current config -> Optimizer updates History Table
                Optimizer runs one coordinate descent step -> generates candidate config
                If candidate better: Coordinator broadcasts new config to all Workers
                Workers switch to new config atomically for subsequent iterations
  Iteration k+1..N-k: All nodes use tuned config (beneficial iterations)
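The leader-side bookkeeping implied by the flow above might look like the following sketch. `LeaderTuner`, its method names, and the string actions are hypothetical, not AutoCCL's actual interfaces; configs are hashable tuples so they can key the History Table:

```python
class LeaderTuner:
    """Illustrative leader-side state: a Config Table (best-so-far)
    and a History Table (config -> measured bandwidth)."""

    def __init__(self, candidates):
        self.candidates = iter(candidates)  # e.g. from coordinate descent
        self.history = {}                   # History Table
        self.tuned = None                   # Config Table: best-so-far
        self.best_bw = float("-inf")

    def on_iteration(self, cfg, measured_bw):
        """Called once per training iteration with the profiled bandwidth.
        A better config triggers a 'broadcast' (Coordinator -> Workers)."""
        self.history[cfg] = measured_bw
        if measured_bw > self.best_bw:
            self.tuned, self.best_bw = cfg, measured_bw
            return ("broadcast", cfg)
        return ("keep", self.tuned)

    def next_candidate(self):
        """Next config to trial; fall back to best once the search is done."""
        return next(self.candidates, self.tuned)
```

Because tuning is piggybacked on real training iterations, every iteration both makes training progress and yields one profiling sample for the optimizer.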

Key design choices:

4.5 Implementation Details


5. Evaluation

5.1 Hardware Setup

Cluster A (NVLink/InfiniBand):

Cluster B (PCIe/InfiniBand):

DNN models for end-to-end evaluation (Table 8):

Model MBS GBS TP PP DP
Phi-2-2B 8 512 8 1 1–4
Llama-3.1-8B 2 256 8 1 1–4
Yi-1.5-34B 1 1,024 8 4 1
VGG-19-0.14B 32 32×[8,16,32] 1 1 8–32

5.2 Communication Microbenchmarks

AllGather and ReduceScatter (PCIe-8GPU, no interference — Figure 7):

AllGather and ReduceScatter (NVLink-8GPU, no interference):

AllReduce (PCIe-8GPU through PCIe-32GPU — Figure 8):

With computational interference (GEMM, Figure 9 — 128 MB message):

With variable interference (Figure 10 — scaling message size and GEMM size together):

5.3 End-to-End Training Results (Figure 11)

Model Setup vs. NCCL vs. AFNFA
Phi-2-2B PCIe-8GPU 1.14x 1.10x
Phi-2-2B PCIe-16GPU 1.00x 1.07x
Phi-2-2B NVLink-8GPU 1.07x 1.07x
Phi-2-2B NVLink-16GPU ~1.00x ~1.00x
Llama-3.1-8B PCIe-8GPU 1.18x 1.10x
Llama-3.1-8B PCIe-16GPU 1.00x 1.10x
Llama-3.1-8B NVLink-8GPU ~1.00x 1.07x
Yi-1.5-34B PCIe-32GPU 1.07x 1.07x
VGG-19 PCIe-8GPU 1.05x

Best case: up to 1.32x speedup over both NCCL and AFNFA in some PCIe configurations. On NVLink, gains are modest because (a) NVLink provides much higher bandwidth, so communication is less of a bottleneck, and (b) over-optimizing communication can slow down computation in overlapped scenarios.

AFNFA sometimes degrades below NCCL in end-to-end training because its offline-tuned global configuration is not adapted to the specific computational interference patterns of individual training tasks.

5.4 Online Tuning Convergence (Figure 12)

For Phi-2-2B and Llama-3.1-8B, AutoCCL converges within the first 4–7 training iterations. This is because each training iteration executes thousands of identical AllGather/ReduceScatter operations (one per microbatch per layer), giving AutoCCL abundant profiling samples per iteration.

For VGG-19 (~150–200 ms/iteration, only 6 AllReduce calls per iteration), convergence takes up to 10 minutes. Tuning slightly extends each of the first k iterations; since k << N (the total iteration count), the amortized overhead is negligible.
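The amortization argument is easy to make concrete. The numbers below are hypothetical, chosen only to illustrate the k << N claim:

```python
def tuning_overhead(k, n_iters, per_iter_s, tuning_slowdown=0.10):
    """Fraction of total training time spent on tuning, assuming each
    of the k tuning iterations runs `tuning_slowdown` (10%) longer."""
    extra = k * per_iter_s * tuning_slowdown
    total = n_iters * per_iter_s + extra
    return extra / total

# e.g. 7 tuning iterations out of 100,000 at 10 s/iteration:
# overhead is well under 0.001% of total training time.
frac = tuning_overhead(7, 100_000, 10.0)
```

Even a pessimistic slowdown assumption leaves the amortized cost far below measurement noise.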


6. Design Choices and Limitations

6.1 Parameter Selection and Scope (Section 8.1)

AutoCCL explicitly excludes transport-specific parameters like:

Reasons: enabling P2P CUDA memcpy can cause deadlocks in some configurations; IB-specific parameters depend on hardware and are typically already optimal for the NIC type; the study found these provide limited gains. These are left for future work.

6.2 Task Failures and Valid Configuration Bounds (Section 8.2)

Two types of failures observed:

  1. Implementation parameter failures (e.g., P2P on hardware that doesn't support it) — AutoCCL avoids by excluding P2P in Cluster B.
  2. Resource saturation failures (excessively large NC, NT, C values exhaust GPU memory or cause crashes) — mitigated by imposing upper bounds on resource parameters, which is safe due to the unimodal nature of the performance function (very large values never improve performance).
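The second mitigation can be sketched as a simple clamp. The bound values follow the parameter ranges quoted in this summary (the exact C maximum differs between sections and is NCCL-version dependent; 512 MB is used here), and `clamp_config` is an illustrative helper, not AutoCCL code:

```python
# Illustrative upper/lower bounds for the resource parameters.
BOUNDS = {"NC": (1, 128), "NT": (32, 640), "C_KB": (256, 512 * 1024)}

def clamp_config(cfg):
    """Clip resource parameters into valid ranges before trialing them.
    Safe under the unimodal assumption: values beyond the peak never
    outperform it, so clipping cannot exclude the optimum."""
    return {k: min(max(v, BOUNDS[k][0]), BOUNDS[k][1]) for k, v in cfg.items()}
```

A search step that would overshoot (e.g. NC far past 128) is pulled back to the boundary instead of crashing the collective.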

7. Related Work

Category Representative works Relationship to AutoCCL
Compute-communication overlap scheduling Horovod, Centauri, TicTac Orthogonal; AutoCCL improves the underlying primitive that schedulers invoke
Gradient compression HiPress, ZeRO++, Espresso Complementary; AutoCCL can be used to tune the compressed collective
Collective algorithm design Swing (NSDI 24), MCCLang, Blink Orthogonal; these design new algorithms; AutoCCL selects among existing ones
Network tuning / NCCL configuration AFNFA (APNet 2023), NCCL built-in cost model Direct competitors; AutoCCL outperforms both by online profiling under real interference
Topology-aware training Topoopt (NSDI 23), TCCL (ASPLOS 24) Higher-level placement; AutoCCL optimizes the per-task communication config

AFNFA (Wang et al., APNet 2023): The closest prior work to AutoCCL. AFNFA automates NCCL configuration exploration using a random forest trained on offline-sampled data. Key limitations: (1) its offline-tuned global configuration does not account for per-task differences or computational interference; (2) collecting the offline sample (about 1% of the configuration space) and training the random forest is expensive; (3) its configurations degrade under interference, since the model was trained without it. AutoCCL's online tuning approach explicitly addresses all three.


8. Relevance to DynamICCL

AutoCCL is the primary technical baseline and differentiation target for DynamICCL. The two systems share the same fundamental goal — automatic NCCL parameter selection to reduce collective communication latency during distributed DNN training — but differ fundamentally in mechanism, scope, and adaptability.

8.1 Parameter Space Alignment

AutoCCL tunes: {Algorithm A, Protocol P, Transport T, Nchannel NC, Nthread NT, Chunk size C}

DynamICCL Agent-2 action space: {algorithm (ring/tree/collnet_direct/collnet_chain/nvls/nvls_tree/pat), protocol (ll/ll128/simple), nChannels (1–8), numThreads}

The overlap is near-complete. DynamICCL adds more algorithm variants (collnet, nvls, nvls_tree, pat) not present in AutoCCL, reflecting more recent NCCL capabilities. DynamICCL omits Chunk size as a direct action, instead relying on NCCL's internal chunk selection given the other parameters.

8.2 Key Design Differences

Dimension AutoCCL DynamICCL
Search method Coordinate descent (analytical unimodal model) DQN reinforcement learning
Congestion awareness Implicit: average interference captured via online profiling, no explicit congestion detection Explicit: LSTM+CUSUM Agent-1 detects congestion events
Reconfiguration Static: converges once, broadcasts fixed config Dynamic: RL agent can reconfigure at any time
Performance model assumption Unimodal bandwidth function None — RL learns the mapping without analytical assumptions
Interference modeling Captured implicitly via online profiling Explicit congestion state feeds into Agent-2's state
Collective types AllGather, ReduceScatter, AllReduce AllReduce, AllGather, ReduceScatter
Algorithm choices Tree, Ring Ring, Tree, CollNet_Direct, CollNet_Chain, NVLS, NVLS_Tree, PAT
Deployment overhead Near-zero (preload library, transparent) Requires NCCL tuner plugin API integration
Tuning convergence Fast (few iterations for LLMs) DQN typically requires more environment interaction

8.3 AutoCCL Results as DynamICCL Motivation

AutoCCL's empirical Section 3 provides quantitative grounding for DynamICCL's design decisions:

8.4 DynamICCL Advantages over AutoCCL

  1. Congestion detection and reactive reconfiguration. AutoCCL tunes once and holds configuration static. DynamICCL's LSTM+CUSUM Agent-1 detects transient network congestion events and triggers Agent-2 to reconfigure NCCL in response. This is critical in multi-tenant HPC environments (Chameleon Cloud) where network conditions change during a training run.

  2. No unimodal assumption. Coordinate descent requires the performance surface to be unimodal. Multi-tenant network contention, non-linear GPU cache effects, and concurrent pipeline parallelism can create multi-modal performance surfaces. DynamICCL's DQN agent makes no such assumption.

  3. Extended algorithm space. CollNet, NVLS, and PAT algorithms are available in recent NCCL versions for specific hardware configurations (e.g., SHARP in-network compute, NVLink Switch). AutoCCL does not include these. DynamICCL's action space covers them.

  4. Phase-aware optimization. DynamICCL can learn different configurations for different phases of training (warm-up, steady-state, large-batch finalization) as the RL policy learns temporal patterns. AutoCCL's one-time convergence cannot differentiate phases.


Citation

@inproceedings{xu2025autoccl,
  title     = {AutoCCL: Automated Collective Communication Tuning for
               Accelerating Distributed and Parallel DNN Training},
  author    = {Xu, Guanbin and Le, Zhihao and Chen, Yinhe and Lin, Zhiqi and
               Jin, Zewen and Miao, Youshan and Li, Cheng},
  booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation
               (NSDI 25)},
  pages     = {667--683},
  year      = {2025},
  publisher = {USENIX Association}
}