AutoCCL — Block Diagram Analysis
Paper: "AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training"
Authors: Xu, Le, Chen, Lin, Jin, Miao, Li — USTC + Microsoft Research
Venue: USENIX NSDI 2025
Fig 1: System Overview — AutoCCL Architecture vs Vanilla NCCL
┌──────────────────────────────────────────────────────────────┐
│ Vanilla NCCL (peer-to-peer, no tuner) │
│ │
│ GPU 0 (Peer) GPU 1 (Peer) GPU N (Peer) │
│ ┌──────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Executor │ │ Executor │ │ Executor │ │
│ │ ① lookup │ │ ① lookup │ │ ① lookup │ │
│ │ Default Cfg │ │ Default Cfg│ │ Default Cfg│ │
│ └──────┬───────┘ └─────┬──────┘ └─────┬──────┘ │
│ └──────────── ② communicate ──────────┘ │
│ (all peers use identical cost-model config, independently) │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ AutoCCL (Leader/Worker model) │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Leader GPU (one designated node per comm. group) │ │
│ │ │ │
│ │ ┌────────────┐ metrics ┌────────────────────────┐ │ │
│ │ │ Executor │══════════►│ Optimizer │ │ │
│ │ │ ① lookup │ │ - profiles History Tbl │ │ │
│ │ │ Tuned Cfg │◄═════════│ - runs Coord. Descent │ │ │
│ │ └────────────┘ config │ - initiates tuning │ │ │
│ │ ║ └─────────┬──────────────┬─┘ │ │
│ │ ② communicate notify│ update│ │ │
│ │ ║ ▼ ▼ │ │
│ │ ║ ┌──────────────────────────┐│ │
│ │ ║ │ Coordinator ││ │
│ │ ║ │ - broadcasts new config ││ │
│ │ ║ │ via async bcast ││ │
│ │ ║ └────────────┬─────────────┘│ │
│ └────────╫───────────────────────────────╫──────────────┘ │
│ ║ ║ broadcast │
│ ┌────────╨───────────────────────────────╨──────────────┐ │
│ │ Worker GPU(s) (all other nodes in comm. group) │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ Executor ① lookup Tuned Config Table │ │ │
│ │ │ (uses tuned config if present, else default) │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ ② communicate │ │
│ └───────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
▲ Fig 1: AutoCCL vs NCCL — one Leader per group runs the Optimizer
and Coordinator; Workers remain thin executors with a config table.
The asymmetric Leader/Worker split is the central architectural choice. NCCL's peer-to-peer design is symmetric — every node independently computes the same cost-model config. AutoCCL breaks this symmetry by centralizing tuning on one node, which enables online profiling and iterative refinement while keeping Workers unchanged. The consequence is that Workers get better configs without any code modification, but the Leader bears extra CPU/thread overhead.
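The split can be condensed into a few lines. This is an illustrative sketch only: `run_collective`, `tune`, and `bcast` are stand-ins for the Executor, Optimizer, and Coordinator, not AutoCCL's actual API.

```python
# Condensed sketch of the Fig 1 asymmetry. Only the Leader runs tuning;
# Workers just read whatever config the last broadcast installed.

def leader_iteration(table, key, run_collective, tune, bcast):
    cfg = table.get(key, "default")
    exec_ms = run_collective(cfg)        # step 2, timed on the Leader
    new_cfg = tune(key, cfg, exec_ms)    # Optimizer; None if no improvement
    if new_cfg is not None:
        bcast(key, new_cfg)              # Coordinator pushes to every rank

def worker_iteration(table, key, run_collective):
    # Workers stay thin: lookup plus execute, no tuning logic at all.
    run_collective(table.get(key, "default"))
```

The point of the shape: the Worker path is identical to vanilla NCCL except for the table lookup, which is what lets AutoCCL leave Worker code unchanged.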
Fig 2: Key Architecture Diagram — Parameter Space & Subspace Division
NCCL Configuration Space: <A, P, T, NC, NT, C>
════════════════════════════════════════════════
┌──────────────────────────────────────────────────────────────┐
│ 6 Parameters — classified into 2 categories │
│ │
│ Implementation Params Resource Params │
│ (define topology/logic) (define parallelism degree) │
│ ┌─────────────────────┐ ┌────────────────────────┐ │
│ │ A : Algorithm │ │ NC : Nchannel (1-128) │ │
│ │ {Ring, Tree} │ │ NT : Nthread (32x, │ │
│ │ │ │ i∈{1..20}) │ │
│ │ P : Protocol │ │ C : Chunk size │ │
│ │ {LL, LL128,Simple} │ │ (256x, 8K values) │ │
│ │ │ └────────────────────────┘ │
│ │ T : Transport │ ↑ search space: ~millions │
│ │ {P2P, SHM} │ ↑ joint impact: unimodal │
│ └─────────────────────┘ ↑ modeled analytically │
│ ↑ small # of combos (6) │
│ ↑ prerequisite for Resource params │
└──────────────────────────────────────────────────────────────┘
│
▼ Subspace-Directed Tuning (Algorithm 1)
┌──────────────────────────────────────────────────────────────┐
│ for each subspace s ∈ [A × P × T]: │
│ optimal(NC,NT,C) ← CoordinateDescentSearch(s) │
│ if better than current global best → update │
│ return global optimum <A,P,T,NC,NT,C> │
└──────────────────────────────────────────────────────────────┘
│
▼ Coordinate Descent Search (Algorithm 2)
┌──────────────────────────────────────────────────────────────┐
│ M = 3 (NC, NT, C are three dimensions) │
│ Random init config p in subspace s │
│ while tuned_dim < M: │
│ ProfileBw(p) │
│ if BwDelta(p, optimum) > 0: │
│ lr ← BwDelta / Bw(p); optimum ← p │
│ else: │
│ tuned_dim++; lr ← 0.01 │
│      p[dim] += lr         (step along the current axis)     │
│ return optimum │
└──────────────────────────────────────────────────────────────┘
▲ Fig 2: Two-level search — exhaustive over small impl. subspaces,
coordinate descent over large resource parameter space.
This design places exhaustive search on the small-cardinality implementation parameters (only 6 combinations of A×P×T), then uses coordinate descent on the three resource parameters within each subspace. The key enabling insight is that β(NC, NT, C) is unimodal — it rises to a peak then falls — which guarantees that coordinate descent finds the global maximum within a subspace without getting trapped in local optima.
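The two levels can be sketched as follows, assuming a caller-supplied `bw_of(subspace, resources)` profiling callback (all names here are illustrative, not the paper's code). The per-axis hill climb stops at the first bandwidth decline, which is safe only because the surface is unimodal within a subspace:

```python
# Sketch of the two-level search of Algorithms 1-2: exhaustive over the
# small implementation subspaces, coordinate descent over (NC, NT, C).

def coord_descent(bw, init, axes):
    """Hill-climb one axis at a time. axes[d] is the ordered list of legal
    values for resource dimension d; bw() is a (profiled) bandwidth fn."""
    best = list(init)
    best_bw = bw(tuple(best))
    for d in range(len(axes)):
        for direction in (1, -1):            # walk up, then down, the axis
            i = axes[d].index(best[d])
            while True:
                j = i + direction
                if j < 0 or j >= len(axes[d]):
                    break
                cand = list(best)
                cand[d] = axes[d][j]
                cand_bw = bw(tuple(cand))
                if cand_bw <= best_bw:
                    break                    # unimodal: first decline ends axis
                best, best_bw, i = cand, cand_bw, j
    return tuple(best), best_bw

def subspace_search(subspaces, bw_of, axes, init):
    """Exhaustive over A x P x T; coord descent inside each subspace."""
    best_cfg, best_bw = None, float("-inf")
    for s in subspaces:
        cfg, b = coord_descent(lambda r, s=s: bw_of(s, r), init, axes)
        if b > best_bw:
            best_cfg, best_bw = (s, cfg), b
    return best_cfg, best_bw
```

In AutoCCL each `bw()` call is one in-band profiled collective; here it is just a function, which is enough to see why unimodality makes the early-stop correct.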
Fig 3: Control Flow — One Collective Call, End-to-End
START: training framework calls AllReduce / AllGather / ReduceScatter
│
▼
① [Worker Executor: lookup Config Table for (type, size, group)]
│
├── config found (tuned)? ─────────────────────────────────┐
│ │
└── not found → use NCCL default config ──────────────────┤
│
② [Execute collective with selected config] │◄──┘
│ (runs concurrently with computation — GEMM etc.) │
▼ │
③ [Executor on Leader: record execution time → Optimizer] │
│ │
▼ │
④ [Optimizer: update History Table] │
│ │
├── subspace already converged? → idle │
│ │
└── not converged → run one step of CoordDescentSearch │
│ (one ProfileBw call in-band with training) │
▼ │
⑤ [If new config improves bandwidth → notify Coordinator] │
│ │
▼ │
⑥ [Coordinator: async broadcast new config to all Workers] │
│      (broadcast is async; the table update itself is atomic) │
▼ │
⑦ [All Workers update their Config Table] │
│ │
└── next identical collective → step ① with new config ───┘
▲ Fig 3: Control flow — tuning is interlaced with normal training
iterations; no separate offline profiling phase required.
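Steps ① and ⑦ reduce to a per-task table with a default fallback. A minimal sketch, with an illustrative key layout and made-up `DEFAULT_CFG` values (not AutoCCL's actual structures):

```python
# Sketch of steps 1 and 7 of Fig 3: a per-task config table that falls
# back to the NCCL default when no tuned entry exists yet.

DEFAULT_CFG = {"A": "Ring", "P": "Simple", "T": "P2P",
               "NC": 2, "NT": 256, "C": 1 << 17}

class ConfigTable:
    def __init__(self):
        self._tuned = {}  # (coll_type, msg_size, group) -> tuned config

    def lookup(self, coll, size, group):
        # Step 1: tuned config if present, else the default (untuned path).
        return self._tuned.get((coll, size, group), DEFAULT_CFG)

    def update(self, coll, size, group, cfg):
        # Step 7: installed on broadcast from the Coordinator; a single
        # dict assignment, so a reader never sees a half-written entry.
        self._tuned[(coll, size, group)] = cfg
```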
Fig 4: Data Flow — History Table, Config Table, and Broadcast
┌──────────────────────────────────────────────────────────────┐
│ LEADER NODE │
│ │
│ ┌────────────────────────────────────┐ │
│ │ History Table (Leader-only) │ │
│ │ task_id │ config │ exec_time(ms) │ │
│ │ task_2 │ cfg_1 │ 16 ms │◄═ Executor sends │
│ │ task_2 │ cfg_2 │ 14 ms ← new │ measured time │
│ └────────────────┬───────────────────┘ │
│ │ historical exec times │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Optimizer │ │
│ │ CoordDescentSearch → new cfg │ │
│ └──────────────┬──────────────────┘ │
│ │ notify (new config) │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Coordinator │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ┌──────────────▼──────────────────┐ │
│ │ Config Table (Leader + Worker) │ │
│ │ task_id │ default │ tuned │ │
│ │ task_2 │ cfg_1 │ cfg_2 │ │
│ └──────────────────────────────── ┘ │
└────────────────────╫────────────────────────────────────────┘
║ async broadcast (Linux socket)
┌────────────────────╨────────────────────────────────────────┐
│ WORKER NODE(S) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Config Table (replica, updated by broadcast) │ │
│ │ task_id │ default │ tuned │ │
│ │ task_2 │ cfg_1 │ cfg_2 ← updated atomically │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
▲ Fig 4: Data flow — History Table lives only on Leader; Config Table
is replicated to all Workers via atomic broadcast from Coordinator.
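The Leader-only History Table can be sketched as a plain dict (illustrative layout, not AutoCCL's actual format); the Optimizer's read path is just "fastest config measured so far for this task":

```python
# Sketch of Fig 4's History Table. Only the Leader keeps one; the
# Executor appends every measured execution time to it.

from collections import defaultdict

class HistoryTable:
    def __init__(self):
        self._runs = defaultdict(list)  # task_id -> [(config, exec_ms), ...]

    def record(self, task_id, config, exec_ms):
        # Executor reports one measured execution (Fig 4, left arrow).
        self._runs[task_id].append((config, exec_ms))

    def best(self, task_id):
        # Fastest config observed so far; None before any measurement.
        runs = self._runs.get(task_id)
        return min(runs, key=lambda r: r[1])[0] if runs else None
```

Replaying the figure's numbers: cfg_2 at 14 ms beats cfg_1 at 16 ms, so cfg_2 becomes the tuned entry that the Coordinator replicates.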
Fig 5: State Machine — AutoCCL Tuning Lifecycle per Subspace
new_task (subspace s, not yet tuned)
[IDLE] ──────────────────────────────────► [PROBING]
▲ │
│ one coord-descent step
│ per training iteration
│ │
│ BwDelta > 0?
│ ┌────┴────┐
│ yes no
│ │ │
│ ▼ ▼
│ [IMPROVING] [ADVANCE DIM]
│ │ │
│ update opt tuned_dim++
│ │ │
│ tuned_dim == M? │
│ └────┬───────────┘
│ yes
│ │
└────────── broadcast optimum ◄────┘
[CONVERGED]
(subspace done;
move to next
subspace or idle)
▲ Fig 5: Per-subspace tuning state machine — PROBING advances one
coordinate dimension per training iteration; convergence broadcasts
the optimum and idles until a new task arrives.
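The lifecycle above is small enough to render executably. Transition rules follow the diagram; the bandwidth bookkeeping is simplified (no per-axis state), so treat this as a sketch, not AutoCCL's implementation:

```python
# Executable rendering of the Fig 5 per-subspace state machine.

IDLE, PROBING, CONVERGED = "IDLE", "PROBING", "CONVERGED"

class SubspaceTuner:
    def __init__(self, M=3):
        self.state = IDLE
        self.M = M              # number of resource dimensions (NC, NT, C)
        self.tuned_dim = 0
        self.best_bw = 0.0

    def new_task(self):
        # IDLE -> PROBING when an untuned subspace first appears.
        self.state, self.tuned_dim = PROBING, 0

    def step(self, measured_bw):
        # One coordinate-descent step per training iteration.
        assert self.state == PROBING
        if measured_bw > self.best_bw:   # BwDelta > 0: IMPROVING
            self.best_bw = measured_bw
        else:                            # no gain: ADVANCE DIM
            self.tuned_dim += 1
        if self.tuned_dim == self.M:     # all dims exhausted
            self.state = CONVERGED       # (broadcast the optimum here)
        return self.state
```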
Fig 6: Layered Stack — AutoCCL Software Insertion Points
┌──────────────────────────────────────────────────────────┐
│ Training Framework (PyTorch / MegatronLM) │
│ calls AllReduce, AllGather, ReduceScatter — unchanged │
├──────────────────────────────────────────────────────────┤
│ AutoCCL Shim Layer (dynamic library preload) │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ Config Table Lookup│ │ Background tuning thread│ │
│ │ (per task key) │ │ (Optimizer + Coordinator│ │
│ └─────────────────────┘ └──────────────────────────┘ │
│ Intercepts NCCL calls; selects tuned or default config │
├──────────────────────────────────────────────────────────┤
│ NCCL 2.18.3 Core │
│ AllReduce / AllGather / ReduceScatter kernels │
│ Config generation module (modified to allow flex cfg) │
├──────────────────────────────────────────────────────────┤
│ Transport layer (P2P / SHM / NVLink / IB) │
├──────────────────────────────────────────────────────────┤
│ Hardware (NVLink, PCIe, InfiniBand 400 Gbps) │
└──────────────────────────────────────────────────────────┘
▲ Fig 6: AutoCCL inserts between the training framework and NCCL
core via dynamic library preload — zero framework code changes.
Fig 7: Sequence Diagram — Online Tuning Across Training Iterations
Iteration: k-1 k k+1 k+2 N
│ │ │ │ │
Leader │ │ │ │ │
Executor ─────┼── exec c1 ─┼── exec c1 ─┼── exec c2 ─┼─────┼─►
│ 16 ms │ 14 ms │ │ │
│ │ │ │ │
Optimizer ─────┼────────────┼ b1:update │ │ │
│ │ History Tbl│ │ │
│ │ b2:fetch │ │ │
│ │ history │ │ │
│ │ b3:tune → │ │ │
│ │ gen cfg_2 │ │ │
│ │ b4:notify │ │ │
│ │ Coord │ │ │
│ │ │ │ │
Coordinator────┼────────────┼────────────┼ b5:bcast │ │
│ │ │ cfg_2 → │ │
│ │ │ Workers │ │
│ │ │ │ │
Worker │ │ │ │ │
Executor ─────┼── exec c1 ─┼── exec c1 ─┼── exec c2 ─┼─────┼─►
│ │ │ (updated) │ │
◄─── k tuning iters (k<<N) ──►◄ N-k beneficial iters►
▲ Fig 7: Online tuning sequence — tuning runs during early training
iterations; once converged, all subsequent identical collectives
benefit from the tuned config with no overhead.
Fig 8: Performance Model — Bandwidth Derivation
Collective execution = two serial phases:
Phase 0: Transport (read data from other GPUs → local buffer)
Phase 1: Protocol (load buffer → SM → reduce → store back)
┌──────────────────────────────────────────────────────────┐
│ Transport bandwidth β₀(NC, C): │
│ │
│   Message M split across NC channels → M/(NC×C) serial steps │
│   t₀(NC,C) = M/(NC×C) × (α₀ + NC×C/(β̄₀×γ))                 │
│ ↑ ↑ │
│ init latency transmission time │
│ per step under congestion γ │
│ β₀ = M / t₀(NC,C) │
│ = NC×C / (α₀ + NC×C/(β̄₀×γ)) │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Protocol bandwidth β₁(NC, NT): │
│ │
│ β₁ = NC×NT / (α₁ + NC×NT/(β̄₁×γ)) │
│ (depends on NC and NT, not C — SMs process in cache) │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Combined bandwidth (bottleneck): │
│ β(NC,NT,C) = min(β₀(NC,C), β₁(NC,NT)) │
│ ↑ This is unimodal — enables coordinate descent │
└──────────────────────────────────────────────────────────┘
▲ Fig 8: Analytic bandwidth model — the bottleneck of transport and
protocol stages; unimodality enables gradient-free coord descent.
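The model is easy to evaluate numerically. In the sketch below every constant (α, β̄, and the decay inside `gamma`) is a made-up illustrative number, not a measured NCCL value; the assumed `gamma(NC)` models contention growing with channel count, which is what produces the rise-then-fall shape:

```python
# Toy evaluation of the Fig 8 bandwidth model (illustrative constants).

def gamma(nc):
    # Assumption: effective efficiency degrades as channels contend for
    # SMs and links; this term is why the curve falls after its peak.
    return 1.0 / (1.0 + 0.05 * (nc - 1))

def beta0(nc, c, alpha0=4.0, bbar0=100.0):
    # Transport stage: per-step init latency alpha0 amortized over NC*C.
    x = nc * c
    return x / (alpha0 + x / (bbar0 * gamma(nc)))

def beta1(nc, nt, alpha1=2.0, bbar1=120.0):
    # Protocol stage: scales with total threads NC*NT, independent of C.
    y = nc * nt
    return y / (alpha1 + y / (bbar1 * gamma(nc)))

def combined(nc, nt, c):
    # End-to-end bandwidth is the bottleneck of the two serial stages.
    return min(beta0(nc, c), beta1(nc, nt))
```

Sweeping NC with C fixed, β₀ first rises while the init latency is amortized, peaks, then falls as the assumed congestion term dominates: a single interior maximum, which is exactly the property that lets coordinate descent stop at the first decline.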
Fig 9: Design Trade-off Analysis
| Decision | Alternative A | Alternative B (AutoCCL) | Winner | Why |
|---|---|---|---|---|
| Tuning timing | Offline (pre-training) | Online (first k iters) | B | Offline can't capture computational interference from training; runtime scheduling dynamics change optimal config |
| Search strategy | Exhaustive / brute-force | Subspace divide + coord descent | B | Exhaustive takes hours for AllGather 80MB alone; coord descent converges in <10 mins with 1-step-per-iter amortization |
| Config scope | Global env vars (NCCL default) | Per-task (type,size,group) key | B | Same primitive type with different sizes needs different configs; global config degrades small-message performance |
| Impl. vs resource params | Treat all 6 uniformly | Split: exhaustive impl, model resource | B | Impl space is small (6 combos), resource space is huge (millions); splits exploit structural asymmetry |
| Comp. interference model | Ignore (AFNFA / NCCL default) | Capture via online profiling | B | Offline tuners degrade under interference; online profiling captures actual SM/cache/bandwidth contention |
| Broadcast mechanism | P2P gossip per node | Leader-to-group bcast (Linux socket) | B | Single atomic broadcast ensures consistent config table; no inconsistency window within a training step |
| Transport selection | Fixed T (e.g., always P2P) | T as implementation param, searched | B | T=SHM outperforms T=P2P on NVLink by up to 1.28x; ignoring T wastes topology-specific bandwidth |
| Leader overhead | Inline with comm | Separate background thread | B | Async measurement + tuning avoids blocking the collective's critical path; overhead hidden in iteration slack |
For DynamICCL, prefer B across all rows. AutoCCL's online, per-task, asymmetric-role design directly mirrors what DynamICCL's RL agent needs: a regime where configs are discovered and refined during actual training rather than pre-characterized offline, and where computational interference is a first-class input to the selection mechanism rather than an afterthought.
What to Borrow for DynamICCL
Pattern 1 — Leader/Worker asymmetry as the control plane design. DynamICCL currently treats all ranks symmetrically for config selection. AutoCCL demonstrates that designating one node as the tuning Leader and broadcasting configs to Workers via an async atomic broadcast is both low-overhead and correctness-safe. DynamICCL's RL agent (Agent-2) should run on a single designated Leader rank, and its config decisions should propagate to Workers via a Coordinator broadcast rather than requiring every rank to independently query the agent, cutting policy inference from one query per rank to one per communication group.
Pattern 2 — Per-task keying: (type, message_size_bin, group_size). AutoCCL keys its Config Table on (primitive type, message size, comm. group). DynamICCL's state vector already includes message_size_bin; the group_size dimension should be added explicitly. Two AllReduce calls of the same size but different group sizes (e.g., TP group vs DP group) have different optimal configs and must be tracked separately.
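A possible keying helper for DynamICCL, following AutoCCL's (type, size, group) scheme; the power-of-two size binning below is an assumption for illustration, not AutoCCL's documented policy:

```python
# Hypothetical task keying for DynamICCL's config table.

def task_key(coll, nbytes, group_size):
    # Round the message size up to the next power of two so nearby sizes
    # share one tuned entry; keep group size explicit so TP and DP groups
    # of different widths never alias to the same key.
    size_bin = 1 << (nbytes - 1).bit_length()
    return (coll, size_bin, group_size)
```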
Pattern 3 — Unimodal bandwidth surface enables gradient-based tuning. The β(NC, NT, C) model reveals that the bandwidth surface is unimodal over resource parameters given fixed implementation parameters. DynamICCL's RL action space discretizes nChannels (1-8) and numThreads — the RL agent can exploit this unimodality by biasing its exploration: once the policy has found a peak, it should only explore neighboring configurations (±1 step), not random exploration over the full action space.
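One way to encode that bias (a hypothetical restriction for DynamICCL's agent, not from the paper): once a peak is known, propose only one-step neighbors along a single axis, which is sound precisely because the bandwidth surface is unimodal within a subspace:

```python
# Hypothetical neighbor-only exploration for a unimodal bandwidth surface.

def neighbor_actions(cfg, axes):
    """All configs exactly one step away from cfg along a single axis.

    axes[d] is the ordered list of legal values for dimension d
    (e.g. nChannels 1..8, numThreads in multiples of 32)."""
    out = []
    for d, vals in enumerate(axes):
        i = vals.index(cfg[d])
        for j in (i - 1, i + 1):
            if 0 <= j < len(vals):
                nxt = list(cfg)
                nxt[d] = vals[j]
                out.append(tuple(nxt))
    return out
```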
Pattern 4 — Computational interference is a first-class tuning signal. AutoCCL's central insight is that NCCL's default cost model ignores SM and bandwidth contention from concurrent GEMM operations. DynamICCL's Agent-1 (LSTM+CUSUM) detects congestion as a binary signal. This should be promoted to a continuous interference level fed into Agent-2's state vector — matching AutoCCL's observation that interference changes the optimal NC, NT, C values significantly (up to 39% bandwidth degradation under heavy interference).
Pattern 5 — Amortize tuning over repeated identical collectives. DNN training repeats the same collective pattern thousands of times per epoch. AutoCCL exploits this by spending the first k<<N iterations on tuning, then benefiting for N-k iterations. DynamICCL should similarly implement a "warm-up exploration" phase per collective key during the first epoch, after which the policy freezes its config selection for that key and relies on CUSUM to trigger re-exploration only when congestion changes.
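A warm-up-then-freeze policy for DynamICCL could look like the following sketch (an assumption about DynamICCL's design, not AutoCCL code); the `WARMUP` count and the congestion flag are illustrative:

```python
# Hypothetical per-key warm-up policy: explore for the first WARMUP calls,
# then freeze the best config until the congestion detector fires.

class PerKeyPolicy:
    WARMUP = 5  # probes per key before freezing (illustrative value)

    def __init__(self):
        self._seen = {}  # key -> (calls, best_cfg, best_bw)

    def select(self, key, explore, congestion_changed=False):
        calls, best_cfg, best_bw = self._seen.get(key, (0, None, 0.0))
        if congestion_changed:
            calls = 0                     # CUSUM fired: re-open exploration
        if calls < self.WARMUP:
            cfg, bw = explore(key)        # one measured probe
            if bw > best_bw:
                best_cfg, best_bw = cfg, bw
            self._seen[key] = (calls + 1, best_cfg, best_bw)
            return cfg
        return best_cfg                   # frozen: zero tuning overhead
```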
Pattern 6 — Separate transport T from resource parameters in the action space. AutoCCL shows that T (P2P vs SHM) is an implementation parameter that must be selected before tuning NC, NT, C — because NC's optimal direction reverses depending on T (more NC is better for SHM but not for P2P at the same scale). DynamICCL's action space should treat the algorithm+protocol+transport triple as a categorical choice (the impl subspace), and nChannels+numThreads as continuous resource choices within that subspace.