AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training — Detailed Summary
Authors: Guanbin Xu†, Zhihao Le†, Yinhe Chen†, Zhiqi Lin†, Zewen Jin† (USTC); Youshan Miao‡ (Microsoft Research); Cheng Li†◦ (USTC)
Venue: 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 2025), April 28–30, 2025, Philadelphia, PA
URL: https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin
Code: https://github.com/gbxu/autoccl (open source)
Abstract
AutoCCL is an automated NCCL parameter tuning tool that improves collective communication performance during distributed DNN training without requiring additional hardware investment or code changes. It identifies six performance-sensitive NCCL parameters, classifies them into implementation parameters (small search space, fixed per subspace) and resource parameters (large search space, tuned via coordinate descent within each subspace), and proposes an online tuning approach that embeds the search into the early iterations of training. This captures computational interference effects and amortizes tuning overhead. AutoCCL achieves 1.24–1.29x bandwidth improvement over NCCL and 1.15–1.22x over AFNFA on communication microbenchmarks, and up to 1.32x end-to-end training speedup on three LLMs and one vision model.
1. Motivation and Problem Statement
1.1 Communication as a Training Bottleneck
Large-scale DNN training distributes computation across GPU clusters and relies on collective communication primitives (AllReduce, AllGather, ReduceScatter) to synchronize model parameters and activations. Communication is a well-documented bottleneck limiting distributed training scalability. The workloads in Table 1 illustrate the scale:
| Model | Parallelism | AllGather | ReduceScatter | AllReduce |
|---|---|---|---|---|
| Phi-2-2B | TP(8)+DP(4) | (80 MB × 8 × 3,120), (80 MB × 8 × 2,064) | same | (632 MB × 4 × 1) |
| Llama-3.1-8B | TP(8)+DP(4) | (128 MB × 8 × 6,240), (857 MB × 4 × 1), (128 MB × 8 × 4,128), (1,710 MB × 4 × 1) | — | — |
| Yi-1.5-34B | TP(8)+PP(4) | (56 MB × 8 × 61,440), (56 MB × 8 × 93,184) | — | — |
| VGG-19-0.14B | DP(32) | — | — | (15–392 MB × 32 × 6) |
Format: (message size × group size × count per iteration). For example, Yi-1.5-34B executes one AllGather task 61,440 times per training iteration, each call passing 56 MB across 8 GPUs.
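A back-of-envelope calculation (ours, not from the paper) makes the scale concrete:

```python
# Rough per-iteration AllGather volume for Yi-1.5-34B's 61,440-count task,
# counting each 56 MB message once per call (a lower bound on bytes moved).
calls, msg_mb = 61_440, 56
print(f"{calls * msg_mb / 1e6:.2f} TB per training iteration")  # ~3.44 TB
```

At this call frequency, even a few percent of per-call overhead compounds into substantial iteration time.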
1.2 NCCL as a Black Box
Most research on optimizing communication assumes NCCL is already well tuned, focusing instead on compute-communication overlap scheduling, gradient compression, and topology-aware collective algorithm design; all of these build on top of NCCL. AutoCCL's empirical study in Section 3 directly falsifies the "well-tuned" assumption:
- NCCL's default configuration for AllGather on an 8-GPU A40 PCIe node achieves only 81.2% of the bandwidth of the best attainable configuration.
- Tuned configurations outperform NCCL default by up to 35% bandwidth.
- NCCL's built-in cost model (an alpha-beta network model) selects the correct algorithm in some cases but frequently misjudges, particularly when hardware and workload do not match the model's assumptions.
1.3 Two Core Challenges
Challenge 1 — Search space explosion. NCCL has 28 performance-sensitive parameters total; AutoCCL identifies 6 that have the most impact. The combined search space for these 6 parameters is shown in Figure 2:
- AllGather: millions of combinations at 256 KB; tens of millions at 1 GB
- ReduceScatter: similar scale
- AllReduce: smaller but still large
Exhaustively testing AllGather with an 80 MB message on a single PCIe-8GPU node takes several hours.
Challenge 2 — Computational interference. During training, GPU SMs run both compute kernels (GEMM for attention, FFN layers) and communication kernels concurrently. They compete for SM cores, L2 cache, and global memory bandwidth. This interference significantly degrades communication performance:
- AllGather bandwidth drops 12.8% under light GEMM interference and 39.3% under heavy GEMM interference (Table 6).
- After AutoCCL tuning, the same AllGather achieves 28–78% higher bandwidth than the NCCL default across all interference levels (Table 6).
- Critically, the optimal NCCL configuration changes under interference. A config that is optimal in isolation may be far from optimal under training load.
2. Background
2.1 NCCL Parameters
When a training framework calls a collective primitive (e.g., ncclAllGather), NCCL creates a communication task and assigns it a configuration tuple <A, P, T, NC, NT, C>:
| Parameter | Meaning | Values |
|---|---|---|
| A (Algorithm) | Collective topology: how data flows | Tree, Ring |
| P (Protocol) | Data movement method between buffer and SM | LL (Low-Latency), LL128, Simple |
| T (Transport) | Physical data transfer mechanism | P2P (peer-to-peer, NVLink/PCIe), SHM (shared memory), NET (inter-node network, e.g., InfiniBand) |
| NC (Nchannel) | Number of parallel data partitions / threadblocks | 1 ≤ NC ≤ 128 |
| NT (Nthread) | Threads per threadblock (32 × i, i in {1..20}) | 32–640 |
| C (Chunk size) | Data granularity per chunk in each serial transmission step | 256 × i bytes (i in {1..8K}), i.e., 256 B – 2 MB |
Algorithm (A): Determines the reduction topology.
- Ring: data passes sequentially around a ring of GPUs. Bandwidth scales with (N-1)/N for N GPUs; latency scales with N-1.
- Tree: binary tree broadcast/reduce. Logarithmic latency; more efficient for small messages.
Protocol (P): Determines how data is managed in GPU memory during the collective.
- LL (Low-Latency): optimized for small messages; minimizes startup latency.
- LL128: extended low-latency for 128-byte aligned transfers.
- Simple: optimized for large messages; maximum throughput.
Transport (T): Determines the physical link used.
- P2P: direct GPU-to-GPU transfer via NVLink or PCIe DMA, bypassing CPU.
- SHM: CPU shared memory; used when P2P is unavailable or slower.
NC (Nchannel): The number of channels; each channel is an independent pipeline (one threadblock per channel) transmitting concurrently. Increasing NC increases parallelism but also increases congestion on shared resources (L2, DRAM, NIC).
NT (Nthread): Threads per threadblock. More threads = more SM occupancy = faster protocol computation, but also more SM resource pressure.
C (Chunk size): Granularity of each serial transmission step within a channel. Larger C reduces overhead per step but increases memory pressure. The total number of serial steps = message_size / (NC × C).
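To fix names and ranges, here is a plain-Python rendering (ours) of the <A, P, T, NC, NT, C> tuple and the serial-step relation above. AutoCCL manages this tuple inside its modified NCCL, so the type and helper below are purely illustrative:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class NcclConfig:
    algorithm: str    # A: "Ring" | "Tree"
    protocol: str     # P: "LL" | "LL128" | "Simple"
    transport: str    # T: "P2P" | "SHM" | "NET"
    nchannels: int    # NC: 1..128
    nthreads: int     # NT: 32..640, in multiples of 32
    chunk_bytes: int  # C: 256 * i bytes

def serial_steps(message_bytes: int, cfg: NcclConfig) -> int:
    # Serial transmission steps per channel: message_size / (NC * C).
    return math.ceil(message_bytes / (cfg.nchannels * cfg.chunk_bytes))

# Example: the tuned PCIe config C_P3 from Table 3 moves an 80 MB
# AllGather message in 320 serial steps.
c_p3 = NcclConfig("Ring", "Simple", "P2P", 8, 96, 32 * 1024)
print(serial_steps(80 * 2**20, c_p3))  # -> 320
```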
2.2 Distributed Training Communication Patterns
| Primitive | Usage context | Typical sizes |
|---|---|---|
| AllReduce | Data parallelism gradient sync | 15 MB – 1.7 GB |
| AllGather | Tensor parallelism forward pass (weight gathering) | 56 MB – 1.7 GB |
| ReduceScatter | Tensor parallelism backward pass (gradient scatter) | 56 MB – 1.7 GB |
In tensor parallelism, AllGather and ReduceScatter are on the critical path of every microbatch forward and backward pass respectively. In data parallelism, AllReduce occurs once per mini-batch for gradient synchronization.
3. NCCL Tuning Opportunities (Section 3 — Empirical Study)
3.1 Stand-alone Communication (No Computational Interference)
AllGather, 80 MB, 8-GPU PCIe (Table 3):
| Config | A | P | T | NC | NT | C | Bandwidth |
|---|---|---|---|---|---|---|---|
| C_P0 (NCCL default) | Ring | Simple | SHM | 2 | 256 | 2 MB | baseline |
| C_P1 | Ring | Simple | P2P | 2 | 96 | 32 KB | — |
| C_P2 | Ring | Simple | P2P | 8 | 256 | 32 KB | — |
| C_P3 | Ring | Simple | P2P | 8 | 96 | 32 KB | 2.69x NCCL default |
C_P3 achieves 2.69x the bandwidth of C_P0. The difference: C_P3 uses P2P transport (bypassing CPU) and splits the message into more partitions (NC=8 vs. 2) with smaller chunks (C=32 KB vs. 2 MB). The NCCL default uses SHM transport — suboptimal for this hardware.
AllGather, 80 MB, 8-GPU NVLink (Table 4):
| Config | A | P | T | NC | NT | C |
|---|---|---|---|---|---|---|
| C_N0 (NCCL default) | Ring | Simple | P2P | 8 | 512 | 2 MB |
| C_N3 (best) | Ring | Simple | SHM | 64 | 512 | 108 KB |
C_N3 achieves 1.28x C_N0. On NVLink, SHM transport outperforms P2P — the opposite of the PCIe case. This confirms that optimal transport is hardware-dependent and NCCL's default is not universally correct.
AllReduce, 8-GPU PCIe (Table 5):
| Task (message size, group size) | Config | A | NC | NT | C | Bandwidth |
|---|---|---|---|---|---|---|
| 64 MB, 16 | C_0 (NCCL default) | Ring | 2 | 256 | 512 KB | 4.0 GB/s |
| 64 MB, 16 | C_2 | Tree | 8 | 160 | 59 KB | 8.9 GB/s (2.23x) |
| 15 MB, 8 | C_0 (NCCL default) | Ring | 2 | 256 | 512 KB | — |
| 15 MB, 8 | C_4 | Ring | 10 | 128 | 27 KB | 8.8 GB/s |
The NCCL default selects Ring for both tasks. Tree is better for the larger task (64 MB, group 16). The optimal algorithm is task-specific and NCCL's alpha-beta model does not always predict this correctly.
Key finding: NCCL's default is only 81.2% of the best achievable for AllGather on PCIe-8GPU; up to 2.69x improvement is achievable.
3.2 Computational Interference (Table 6)
AllGather, 80 MB, 8-GPU NVLink, with concurrent GEMM (A[3456,128] × B[128,3456] + sigmoid):
| Condition | NCCL Default BW | Tuned (AutoCCL) BW |
|---|---|---|
| Communication-only | 30.08 GB/s | 38.62 GB/s (+28.4%) |
| Light computation | 26.74 GB/s | 35.14 GB/s (+31.4%) |
| Heavy computation | 18.26 GB/s | 32.44 GB/s (+77.7%) |
Three important observations from this table:
- Even light computation degrades NCCL default bandwidth by 12.8%.
- Heavy computation degrades NCCL default by 39.3%.
- AutoCCL tuning recovers performance dramatically: under heavy interference, the tuned config achieves 77.7% more bandwidth than the NCCL default under the same load, and tuned bandwidth under interference (32.44–35.14 GB/s) is still 7.8–16.8% higher than the NCCL default with no interference at all (30.08 GB/s).
4. Communication Tuning Method (AutoCCL Algorithm)
4.1 Parameter Classification
AutoCCL divides the six parameters into two categories:
Implementation parameters {A, P, T}: These define the topology, communication algorithm, and physical transport mechanism. They have a small combined search space (2 × 3 × 3 = 18 combinations at most, reduced to feasible subsets by hardware). They are a prerequisite for resource parameter selection — the optimal resource parameters depend on which (A, P, T) subspace is used. The paper finds that prior guidelines for selecting A (Tree vs. Ring based on message size) and T (P2P vs. SHM based on interconnect) are not always correct.
Resource parameters {NC, NT, C}: These govern the degree of parallelism and data granularity. They have a large search space (NC up to 128, NT up to 640, C up to 8K × 256 B = 2 MB). The key analytical finding: their joint impact on bandwidth is unimodal — for any fixed pair, bandwidth as a function of the third parameter first rises then falls, with a single peak.
4.2 Performance Model for Resource Parameters
The collective execution has two serial phases:
Phase 0 (Transport): Reading data from peer GPUs into the local buffer. NC parallel channels, each transferring a chunk of size C per serial step. Number of serial steps = M / (NC × C).
t_0(NC, C) = M/(NC*C) * (alpha_0 + NC*C/(beta_bar_0 * gamma))
beta_0(NC, C) = M / t_0(NC, C)
= (NC * C) / (alpha_0 + NC*C/(beta_bar_0 * gamma))
where alpha_0 = initialization latency, beta_bar_0 = peak transport bandwidth, and gamma in (0, 1] = congestion coefficient that discounts the peak bandwidth (gamma shrinks monotonically as NC, NT, and C grow, reflecting contention for shared resources).
Phase 1 (Protocol): Loading data from buffer to SM, performing reduction, writing back. NT threads per channel.
beta_1(NC, NT) = (NC * NT) / (alpha_1 + NC*NT/(beta_bar_1 * gamma))
Since transport and protocol stages are serial and data-dependent:
beta(NC, NT, C) = min(beta_0(NC, C), beta_1(NC, NT)) ... (Equation 1)
Both beta_0 and beta_1 are individually unimodal: each rises, then saturates, then potentially falls as NC, C, or NT increases (as the congestion discount gamma shrinks). Their minimum is therefore also unimodal, which lets coordinate descent find the global maximum.
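The constants matter less than the shape. A minimal numeric sketch (ours) of Equation 1 with made-up alpha, beta-bar, and congestion-discount values reproduces the rise-then-fall behavior when sweeping NC:

```python
# Illustrative constants only; none are measured values from the paper.
def gamma(nc, nt, c):
    # Congestion discount in (0, 1]: shrinks as NC, NT, C grow, modeling
    # contention for L2 cache, DRAM bandwidth, and the interconnect.
    return 1.0 / (1.0 + 2e-3 * nc + 5e-4 * nt + 5e-8 * c)

def beta0(nc, c, a0=5e-6, bbar0=30e9):
    # Transport phase: beta_0 = NC*C / (alpha_0 + NC*C / (bbar_0 * gamma))
    return (nc * c) / (a0 + (nc * c) / (bbar0 * gamma(nc, 0, c)))

def beta1(nc, nt, a1=5e-8, bbar1=40e9):
    # Protocol phase: beta_1 = NC*NT / (alpha_1 + NC*NT / (bbar_1 * gamma))
    return (nc * nt) / (a1 + (nc * nt) / (bbar1 * gamma(nc, nt, 0)))

# Sweep NC with NT = 256 and C = 32 KB fixed: min(beta0, beta1) rises,
# peaks, then falls -- the unimodal shape that makes coordinate descent sound.
for nc in (1, 2, 4, 8, 16, 32, 64, 128):
    bw = min(beta0(nc, 32 * 2**10), beta1(nc, 256))
    print(f"NC={nc:3d}  {bw / 1e9:5.1f} GB/s")
```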
4.3 Tuning Algorithm
Algorithm 1: Subspace-Directed Tuning
Input: Task w
optimum <- nil
for each subspace s in [A × P × T]:
config <- CoordinateDescentSearch(s) // Algorithm 2
if config.BwDelta(optimum) > 0:
optimum <- config
return optimum
The outer loop iterates through all (A, P, T) subspaces — at most 18 combinations. For each subspace, coordinate descent searches for the optimal (NC, NT, C); a compact sketch follows.
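In this sketch (ours), feasible() and search() are hypothetical callbacks standing in for hardware feasibility checks and Algorithm 2:

```python
from itertools import product

ALGORITHMS = ("Ring", "Tree")
PROTOCOLS = ("LL", "LL128", "Simple")
TRANSPORTS = ("P2P", "SHM", "NET")

def tune(task, feasible, search):
    # Enumerate every (A, P, T) subspace (at most 18), run coordinate
    # descent within each, and keep the best configuration found.
    best, best_bw = None, 0.0
    for a, p, t in product(ALGORITHMS, PROTOCOLS, TRANSPORTS):
        if not feasible(task, a, p, t):  # e.g., skip P2P where unsupported
            continue
        cfg, bw = search(task, (a, p, t))
        if bw > best_bw:
            best, best_bw = cfg, bw
    return best, best_bw
```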
Algorithm 2: Coordinate Descent Search
Input: Subspace s
M <- 3 // number of resource parameters: NC, NT, C
p <- random config from subspace s
optimum <- p
dim, tuned_dim, lr <- 0, 0, 0.01
while tuned_dim < M:
p.ProfileBw()
if p.BwDelta(optimum) > 0:
lr <- p.BwDelta(optimum) / p.Bw() // learning rate = relative improvement
optimum, tuned_dim <- p, 0
else:
tuned_dim <- tuned_dim + 1
dim, lr <- (dim + 1) mod M, 0.01 // move to next dimension (wrap around)
p[dim] <- p[dim] + lr // step along current dimension
return optimum
This is coordinate-wise ascent along each resource-parameter dimension, guided by measured bandwidth improvement. The learning rate adapts to the magnitude of improvement; when a dimension stops improving, the algorithm moves to the next, and it terminates once all M = 3 dimensions have been tuned without further improvement. A runnable simplification follows.
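This sketch (ours) replaces the adaptive-learning-rate stepping with an exhaustive line search per dimension, and reuses the beta0/beta1 helpers from the Section 4.2 sketch in place of live profiling. The grids are illustrative:

```python
GRIDS = {
    "NC": [2**i for i in range(8)],         # 1..128
    "NT": [32 * i for i in range(1, 21)],   # 32..640
    "C":  [256 * 2**i for i in range(14)],  # 256 B..2 MB
}

def profile_bw(p):
    # Stand-in for online profiling; see the model sketch in Section 4.2.
    return min(beta0(p["NC"], p["C"]), beta1(p["NC"], p["NT"]))

def coordinate_descent(start):
    dims = list(GRIDS)
    best, best_bw = dict(start), profile_bw(start)
    dim, stalled = 0, 0
    while stalled < len(dims):     # stop once no dimension improves
        name = dims[dim % len(dims)]
        improved = False
        for v in GRIDS[name]:      # line search along one dimension
            cand = dict(best, **{name: v})
            bw = profile_bw(cand)
            if bw > best_bw:
                best, best_bw, improved = cand, bw, True
        stalled = 0 if improved else stalled + 1
        dim += 1
    return best, best_bw

cfg, bw = coordinate_descent({"NC": 2, "NT": 256, "C": 2 * 2**20})
print(cfg, f"{bw / 1e9:.1f} GB/s")
```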
4.4 Online Tuning Architecture
Training Cluster
┌─────────────────────────────────────┐
│ Communication Group │
│ │
│ ┌──────────────────────────────┐ │
│ │ Leader Node │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Optimizer│ │Coordinator│ │ │
│ │ └──────┬───┘ └────┬─────┘ │ │
│ │ │ │ broadcast│ │
│ │ ┌──────▼──────────▼──────┐ │ │
│ │ │ Executor │ │ │
│ │ │ [Config Table: Default,│ │ │
│ │ │ Tuned] │ │ │
│ │ │ [History Table] │ │ │
│ └──┴────────────────────────┴──┘ │
│ │
│ ┌──────────────────────────────┐ │
│ │ Worker Node(s) │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ Executor │ │ │
│ │ │ [Config Table: Default│ │ │
│ │ │ Tuned] │ │ │
│ │ └──────────────────────┘ │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
Flow:
Iteration k: Leader Executor profiles current config -> Optimizer updates History Table
Optimizer runs one coordinate descent step -> generates candidate config
If candidate better: Coordinator broadcasts new config to all Workers
Workers switch to new config atomically for subsequent iterations
Iterations k+1..N: All nodes use the tuned config (the remaining N - k beneficial iterations)
Key design choices:
- Leader election: One GPU per communication group is designated Leader; the others are Workers. The Leader runs an additional Optimizer thread that measures wall-clock execution times of communication kernels (necessary because CUDA kernels execute asynchronously with respect to the host).
- Atomic config update: The Coordinator uses a broadcast with synchronous semantics to update the Config Table on both Leader and all Workers simultaneously, preventing inconsistent states.
- History Table: The Leader maintains a History Table recording execution times per (task, config) pair across iterations. The Optimizer uses this to decide whether to initiate a new tuning round (see the sketch after this list).
- Convergence behavior: For LLM training (thousands of repetitions of the same communication task per iteration due to microbatching), AutoCCL converges within just a few training iterations. For VGG-19 (six AllReduce calls per iteration, ~150–200 ms/iter), convergence takes no more than 10 minutes.
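The Leader's per-iteration control loop can be sketched as follows (ours). profile(), next_candidate(), and broadcast() are hypothetical stand-ins for the wall-clock kernel timing, one coordinate-descent move, and the Coordinator's socket broadcast; configs are assumed to be hashable tuples. This is not AutoCCL's actual API:

```python
from collections import defaultdict

history = defaultdict(list)  # (task, config) -> measured execution times

def leader_step(task, trial, best, profile, next_candidate, broadcast):
    # Time the config used in the live iteration and record it.
    history[(task, trial)].append(profile(task, trial))
    # Promote the trial config if its best observed time beats the incumbent.
    if min(history[(task, trial)]) < min(history.get((task, best), [float("inf")])):
        best = trial
        broadcast(best)  # Coordinator: atomic config-table update everywhere
    # Return the incumbent plus the next candidate to try.
    return best, next_candidate(best)
```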
4.5 Implementation Details
- Built on NCCL v2.18.3; the implementation comprises 9,176 lines of C++.
- Supports AllGather, AllReduce, ReduceScatter.
- Modified NCCL's configuration generation and transport initialization to allow flexible configuration selection.
- Leader spawns an additional C++ thread for Optimizer; Coordinator uses Linux sockets for inter-node broadcast.
- Loaded as a preloadable dynamic library (LD_PRELOAD): zero code changes in PyTorch, MegatronLM, or user training scripts (a usage sketch follows below).
- Open-sourced at https://github.com/gbxu/autoccl.
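Deployment then amounts to preloading the library at launch time. A hypothetical invocation (ours; the .so path is a placeholder, not a documented install location):

```python
import os
import subprocess

# Launch unmodified Megatron/PyTorch training with AutoCCL preloaded.
env = dict(os.environ, LD_PRELOAD="/opt/autoccl/libautoccl.so")
subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"], env=env, check=True)
```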
5. Evaluation
5.1 Hardware Setup
Cluster A (NVLink/InfiniBand):
- 2 nodes, 8x A40 GPU each (48 GB GDDR6, PCIe 4.0)
- Intra-node: 400 Gbps NVLink (GPUs connected in pairs via NVLink bridges)
- Inter-node: 400 Gbps InfiniBand
- AMD EPYC 9654 CPU, 64 GB host memory
Cluster B (PCIe/InfiniBand):
- 4 nodes, 8x A40 GPU each
- Intra-node: PCIe 4.0 (no NVLink)
- Inter-node: 100 Gbps InfiniBand
- Intel Xeon Gold 5320 CPU, 504 GB host memory
- P2P communication unavailable between some GPUs
DNN models for end-to-end evaluation (Table 8):
| Model | MBS | GBS | TP | PP | DP |
|---|---|---|---|---|---|
| Phi-2-2B | 8 | 512 | 8 | 1 | 1–4 |
| Llama-3.1-8B | 2 | 256 | 8 | 1 | 1–4 |
| Yi-1.5-34B | 1 | 1,024 | 8 | 4 | 1 |
| VGG-19-0.14B | 32 | 32×[8,16,32] | 1 | 1 | 8–32 |
5.2 Communication Microbenchmarks
AllGather and ReduceScatter (PCIe-8GPU, no interference — Figure 7):
- AutoCCL: +22.66% vs. NCCL (AllGather), +27.52% vs. NCCL (ReduceScatter)
- AFNFA: nearly identical to NCCL (offline-tuned config does not adapt to per-task needs)
AllGather and ReduceScatter (NVLink-8GPU, no interference):
- AutoCCL: 1.38x (AllGather), 1.39x (ReduceScatter) vs. NCCL
- Even on NVLink — which NCCL is heavily optimized for — AutoCCL achieves significant gains
AllReduce (PCIe-8GPU through PCIe-32GPU — Figure 8):
- AutoCCL: 1.28x vs. NCCL at PCIe-8GPU, declining to 1.15x at PCIe-32GPU
- NVLink-8GPU and NVLink-16GPU: 1.29x and 1.22x vs. NCCL, respectively
With computational interference (GEMM, Figure 9 — 128 MB message):
- AllGather: AutoCCL 1.29x vs. NCCL; AFNFA 1.00x (no improvement)
- ReduceScatter: AutoCCL 1.50x vs. NCCL; AFNFA 1.00x
- AllReduce: AutoCCL 1.38x vs. NCCL; AFNFA 1.02x
- AFNFA's offline config is optimized for no-interference conditions; under interference, it either matches NCCL or degrades below it.
With variable interference (Figure 10 — scaling message size and GEMM size together):
- AutoCCL maintains 1.11–1.76x vs. NCCL for AllGather above 32 MB
- AutoCCL 1.16–1.39x vs. NCCL for AllReduce above 32 MB
- AFNFA shows benefits only for small messages (<16 MB) for AllGather/AllReduce
5.3 End-to-End Training Results (Figure 11)
| Model | Setup | vs. NCCL | vs. AFNFA |
|---|---|---|---|
| Phi-2-2B | PCIe-8GPU | 1.14x | 1.10x |
| Phi-2-2B | PCIe-16GPU | 1.00x | 1.07x |
| Phi-2-2B | NVLink-8GPU | 1.07x | 1.07x |
| Phi-2-2B | NVLink-16GPU | ~1.00x | ~1.00x |
| Llama-3.1-8B | PCIe-8GPU | 1.18x | 1.10x |
| Llama-3.1-8B | PCIe-16GPU | 1.00x | 1.10x |
| Llama-3.1-8B | NVLink-8GPU | ~1.00x | 1.07x |
| Yi-1.5-34B | PCIe-32GPU | 1.07x | 1.07x |
| VGG-19 | PCIe-8GPU | 1.05x | — |
Best case: 1.32x speedup (AutoCCL vs. NCCL and AFNFA, in some PCIe configurations). On NVLink, gains are modest because (a) NVLink provides much higher bandwidth, so communication is less of a bottleneck, and (b) over-optimizing communication can slow down overlapped computation.
AFNFA sometimes degrades below NCCL in end-to-end training because its offline-tuned global configuration is not adapted to the specific computational interference patterns of individual training tasks.
5.4 Online Tuning Convergence (Figure 12)
For Phi-2-2B and Llama-3.1-8B, AutoCCL converges within the first 4–7 training iterations. This is because each training iteration executes thousands of identical AllGather/ReduceScatter operations (one per microbatch per layer), giving AutoCCL abundant profiling samples per iteration.
For VGG-19 (~150–200 ms/iteration, only 6 AllReduce per iteration), convergence takes up to 10 minutes. Tuning overhead per iteration is a slight extension of iteration time; since k << N total iterations, the amortized overhead is negligible.
6. Design Choices and Limitations
6.1 Parameter Selection and Scope (Section 8.1)
AutoCCL explicitly excludes transport-specific parameters like:
- NCCL_IB_SPLIT_DATA_ON_QPS, NCCL_NET_GDR_LEVEL (InfiniBand-specific)
- P2P_USE_CUDA_MEMCPY (enables CUDA memcpy for P2P)
Reasons: enabling P2P CUDA memcpy can cause deadlocks in some configurations; IB-specific parameters depend on hardware and are typically already optimal for the NIC type; the study found these provide limited gains. These are left for future work.
6.2 Task Failures and Valid Configuration Bounds (Section 8.2)
Two types of failures observed:
- Implementation parameter failures (e.g., P2P on hardware that does not support it), which AutoCCL avoids by excluding P2P in Cluster B.
- Resource saturation failures (excessively large NC, NT, C values exhaust GPU memory or cause crashes) — mitigated by imposing upper bounds on resource parameters, which is safe due to the unimodal nature of the performance function (very large values never improve performance).
7. Related Work
| Category | Representative works | Relationship to AutoCCL |
|---|---|---|
| Compute-communication overlap scheduling | Horovod, Centauri, TicTac | Orthogonal; AutoCCL improves the underlying primitive that schedulers invoke |
| Gradient compression | HiPress, ZeRO++, Espresso | Complementary; AutoCCL can be used to tune the compressed collective |
| Collective algorithm design | Swing (NSDI 24), MCCLang, Blink | Orthogonal; these design new algorithms; AutoCCL selects among existing ones |
| Network tuning / NCCL configuration | AFNFA (APNet 2023), NCCL built-in cost model | Direct competitors; AutoCCL outperforms both by online profiling under real interference |
| Topology-aware training | Topoopt (NSDI 23), TCCL (ASPLOS 24) | Higher-level placement; AutoCCL optimizes the per-task communication config |
AFNFA (Wang et al., APNet 2023): The closest prior work to AutoCCL. AFNFA automates NCCL configuration exploration using a random forest trained on offline-sampled data. Key limitations: (1) its offline-tuned global configuration does not account for per-task differences or computational interference; (2) collecting offline samples (even 1% of the space) and training the random forest is expensive; (3) its configurations degrade under interference, since the model was trained without it. AutoCCL's online tuning approach explicitly addresses all three.
8. Relevance to DynamICCL
AutoCCL is the primary technical baseline and differentiation target for DynamICCL. The two systems share the same fundamental goal — automatic NCCL parameter selection to reduce collective communication latency during distributed DNN training — but differ fundamentally in mechanism, scope, and adaptability.
8.1 Parameter Space Alignment
AutoCCL tunes: {Algorithm A, Protocol P, Transport T, Nchannel NC, Nthread NT, Chunk size C}
DynamICCL Agent-2 action space: {algorithm (ring/tree/collnet_direct/collnet_chain/nvls/nvls_tree/pat), protocol (ll/ll128/simple), nChannels (1–8), numThreads}
The overlap is near-complete. DynamICCL adds more algorithm variants (collnet, nvls, nvls_tree, pat) not present in AutoCCL, reflecting more recent NCCL capabilities. DynamICCL omits Chunk size as a direct action, instead relying on NCCL's internal chunk selection given the other parameters.
8.2 Key Design Differences
| Dimension | AutoCCL | DynamICCL |
|---|---|---|
| Search method | Coordinate descent (analytical unimodal model) | DQN reinforcement learning |
| Congestion awareness | None — captures average interference via online profiling | Explicit: LSTM+CUSUM Agent-1 detects congestion events |
| Reconfiguration | Static: converges once, broadcasts fixed config | Dynamic: RL agent can reconfigure at any time |
| Performance model assumption | Unimodal bandwidth function | None — RL learns the mapping without analytical assumptions |
| Interference modeling | Captured implicitly via online profiling | Explicit congestion state feeds into Agent-2's state |
| Collective types | AllGather, ReduceScatter, AllReduce | AllReduce, AllGather, ReduceScatter |
| Algorithm choices | Tree, Ring | Ring, Tree, CollNet_Direct, CollNet_Chain, NVLS, NVLS_Tree, PAT |
| Deployment overhead | Near-zero (preload library, transparent) | Requires NCCL tuner plugin API integration |
| Tuning convergence | Fast (a few iterations for LLMs) | DQN requires more environment interaction |
8.3 AutoCCL Results as DynamICCL Motivation
AutoCCL's empirical Section 3 provides quantitative grounding for DynamICCL's design decisions:
- The fact that NCCL default achieves only 81.2% of optimal bandwidth validates DynamICCL's premise that tuning is necessary and impactful.
- The finding that optimal configuration changes under computational interference (Table 6) directly motivates DynamICCL's Agent-1 congestion detector and dynamic reconfiguration capability.
- The finding that AFNFA's offline-tuned configuration degrades below NCCL under interference (Figure 9, 10) validates DynamICCL's choice of online, reactive tuning via RL rather than offline policy learning.
- The different optimal configurations for the same task on PCIe vs. NVLink hardware (Tables 3, 4) validate DynamICCL's position that hardware-specific, workload-specific tuning is essential.
8.4 DynamICCL Advantages over AutoCCL
Congestion detection and reactive reconfiguration. AutoCCL tunes once and holds configuration static. DynamICCL's LSTM+CUSUM Agent-1 detects transient network congestion events and triggers Agent-2 to reconfigure NCCL in response. This is critical in multi-tenant HPC environments (Chameleon Cloud) where network conditions change during a training run.
No unimodal assumption. Coordinate descent requires the performance surface to be unimodal. Multi-tenant network contention, non-linear GPU cache effects, and concurrent pipeline parallelism can create multi-modal performance surfaces. DynamICCL's DQN agent makes no such assumption.
Extended algorithm space. CollNet, NVLS, and PAT algorithms are available in recent NCCL versions for specific hardware configurations (e.g., SHARP in-network compute, NVLink Switch). AutoCCL does not include these. DynamICCL's action space covers them.
Phase-aware optimization. DynamICCL can learn different configurations for different phases of training (warm-up, steady-state, large-batch finalization) as the RL policy learns temporal patterns. AutoCCL's one-time convergence cannot differentiate phases.
Citation
@inproceedings{xu2025autoccl,
title = {AutoCCL: Automated Collective Communication Tuning for
Accelerating Distributed and Parallel DNN Training},
author = {Xu, Guanbin and Le, Zhihao and Chen, Yinhe and Lin, Zhiqi and
Jin, Zewen and Miao, Youshan and Li, Cheng},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation
(NSDI 25)},
pages = {667--683},
year = {2025},
publisher = {USENIX Association}
}