AutoCCL — Block Diagram Analysis

Paper: "AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training" Authors: Xu, Le, Chen, Lin, Jin, Miao, Li — USTC + Microsoft Research Venue: USENIX NSDI 2025


Fig 1: System Overview — AutoCCL Architecture vs Vanilla NCCL

  ┌──────────────────────────────────────────────────────────────┐
  │           Vanilla NCCL (peer-to-peer, no tuner)              │
  │                                                              │
  │   GPU 0 (Peer)          GPU 1 (Peer)     GPU N (Peer)        │
  │  ┌──────────────┐      ┌────────────┐   ┌────────────┐       │
  │  │  Executor    │      │  Executor  │   │  Executor  │       │
  │  │  ① lookup   │      │  ① lookup  │   │  ① lookup  │       │
  │  │  Default Cfg │      │ Default Cfg│   │ Default Cfg│       │
  │  └──────┬───────┘      └─────┬──────┘   └─────┬──────┘       │
  │         └──────────── ② communicate ──────────┘              │
  │  (all peers use identical cost-model config, independently)  │
  └──────────────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────────────┐
  │                 AutoCCL (Leader/Worker model)                 │
  │                                                              │
  │  ┌───────────────────────────────────────────────────────┐   │
  │  │  Leader GPU (one designated node per comm. group)     │   │
  │  │                                                       │   │
  │  │  ┌────────────┐  metrics  ┌────────────────────────┐  │   │
  │  │  │  Executor  │══════════►│  Optimizer             │  │   │
  │  │  │  ① lookup  │          │  - profiles History Tbl │  │   │
  │  │  │  Tuned Cfg │◄═════════│  - runs Coord. Descent  │  │   │
  │  │  └────────────┘  config  │  - initiates tuning     │  │   │
  │  │        ║                 └─────────┬──────────────┬─┘  │   │
  │  │  ② communicate              notify│         update│    │   │
  │  │        ║                         ▼               ▼    │   │
  │  │        ║                  ┌──────────────────────────┐│   │
  │  │        ║                  │  Coordinator             ││   │
  │  │        ║                  │  - broadcasts new config ││   │
  │  │        ║                  │    via async bcast       ││   │
  │  │        ║                  └────────────┬─────────────┘│   │
  │  └────────╫───────────────────────────────╫──────────────┘   │
  │           ║                               ║ broadcast        │
  │  ┌────────╨───────────────────────────────╨──────────────┐   │
  │  │  Worker GPU(s) (all other nodes in comm. group)       │   │
  │  │  ┌──────────────────────────────────────────────────┐ │   │
  │  │  │  Executor    ① lookup Tuned Config Table         │ │   │
  │  │  │  (uses tuned config if present, else default)    │ │   │
  │  │  └──────────────────────────────────────────────────┘ │   │
  │  │                   ② communicate                       │   │
  │  └───────────────────────────────────────────────────────┘   │
  └──────────────────────────────────────────────────────────────┘
▲ Fig 1: AutoCCL vs NCCL — one Leader per group runs the Optimizer
  and Coordinator; Workers remain thin executors with a config table.

The asymmetric Leader/Worker split is the central architectural choice. NCCL's peer-to-peer design is symmetric — every node independently computes the same cost-model config. AutoCCL breaks this symmetry by centralizing tuning on one node, which enables online profiling and iterative refinement while keeping Workers unchanged. The consequence is that Workers get better configs without any code modification, but the Leader bears extra CPU/thread overhead.
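
To make the split concrete, here is a minimal Python sketch of the role division. The class and field names (TunedConfig, WorkerExecutor, LeaderNode, coordinator_broadcast) are invented for this note rather than taken from AutoCCL's code, and the tuning step itself is elided (see the Fig 2 sketch).

```python
# Illustrative sketch of the Leader/Worker role split; names are placeholders,
# not AutoCCL's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class TunedConfig:
    algorithm: str    # A: Ring / Tree
    protocol: str     # P: LL / LL128 / Simple
    transport: str    # T: P2P / SHM
    nchannels: int    # NC
    nthreads: int     # NT
    chunk_size: int   # C

class WorkerExecutor:
    """Thin executor: look up a tuned config, fall back to the default."""
    def __init__(self):
        self.config_table = {}                       # task_key -> TunedConfig

    def select_config(self, task_key, default):
        return self.config_table.get(task_key, default)

    def on_broadcast(self, task_key, cfg):
        self.config_table[task_key] = cfg            # atomic swap in practice

class LeaderNode(WorkerExecutor):
    """Leader = Worker + Optimizer (profiling, search) + Coordinator hook."""
    def __init__(self, coordinator_broadcast):
        super().__init__()
        self.history = []                            # (key, cfg, exec_time_ms)
        self.coordinator_broadcast = coordinator_broadcast

    def on_measurement(self, task_key, cfg, exec_time_ms):
        self.history.append((task_key, cfg, exec_time_ms))
        new_cfg = self.tune_one_step(task_key)       # one step per iteration
        if new_cfg is not None:
            self.on_broadcast(task_key, new_cfg)     # update own table
            self.coordinator_broadcast(task_key, new_cfg)  # fan out to Workers

    def tune_one_step(self, task_key):
        ...                                          # see the Fig 2 sketch
```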


Fig 2: Key Architecture Diagram — Parameter Space & Subspace Division

  NCCL Configuration Space: <A, P, T, NC, NT, C>
  ════════════════════════════════════════════════

  ┌──────────────────────────────────────────────────────────────┐
  │   6 Parameters — classified into 2 categories               │
  │                                                              │
  │  Implementation Params         Resource Params               │
  │  (define topology/logic)       (define parallelism degree)   │
  │  ┌─────────────────────┐       ┌────────────────────────┐    │
  │  │ A : Algorithm        │       │ NC : Nchannel (1-128)  │    │
  │  │   {Ring, Tree}       │       │ NT : Nthread (32x,     │    │
  │  │                      │       │      i∈{1..20})        │    │
  │  │ P : Protocol         │       │ C  : Chunk size        │    │
  │  │   {LL, LL128,Simple} │       │      (256x, 8K values) │    │
  │  │                      │       └────────────────────────┘    │
  │  │ T : Transport        │       ↑ search space: ~millions     │
  │  │   {P2P, SHM}         │       ↑ joint impact: unimodal      │
  │  └─────────────────────┘       ↑ modeled analytically         │
  │  ↑ small # of combos (6)                                      │
  │  ↑ prerequisite for Resource params                           │
  └──────────────────────────────────────────────────────────────┘
                        │
                        ▼ Subspace-Directed Tuning (Algorithm 1)
  ┌──────────────────────────────────────────────────────────────┐
  │  for each subspace s ∈ [A × P × T]:                         │
  │     optimal(NC,NT,C) ← CoordinateDescentSearch(s)           │
  │     if better than current global best → update             │
  │  return global optimum <A,P,T,NC,NT,C>                      │
  └──────────────────────────────────────────────────────────────┘
                        │
                        ▼ Coordinate Descent Search (Algorithm 2)
  ┌──────────────────────────────────────────────────────────────┐
  │  M = 3 (NC, NT, C are three dimensions)                     │
  │  Random init config p in subspace s                         │
  │  while tuned_dim < M:                                        │
  │     ProfileBw(p)                                             │
  │     if BwDelta(p, optimum) > 0:                              │
  │        lr ← BwDelta / Bw(p);  optimum ← p                   │
  │     else:                                                    │
  │        tuned_dim++;  lr ← 0.01                               │
  │     p[dim] += lr   ← gradient step along one axis           │
  │  return optimum                                              │
  └──────────────────────────────────────────────────────────────┘
▲ Fig 2: Two-level search — exhaustive over small impl. subspaces,
  coordinate descent over large resource parameter space.

This design places exhaustive search on the small-cardinality implementation parameters (only 6 combinations of A×P×T), then uses coordinate descent on the three resource parameters within each subspace. The key enabling insight is that β(NC, NT, C) is unimodal — it rises to a peak then falls — which guarantees that coordinate descent finds the global maximum within a subspace without getting trapped in local optima.
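
A compact sketch of the two-level search follows. It simplifies Algorithms 1 and 2: the adaptive learning-rate logic from Fig 2 is replaced with fixed per-dimension step sizes, probing only moves in the increasing direction, and profile_bw is an assumed in-band measurement callback.

```python
# Simplified two-level search from Fig 2: exhaustive over the implementation
# subspaces, hill-climbing coordinate descent over the resource parameters.
# `profile_bw(subspace, cfg)` is an assumed callback returning measured
# bandwidth; AutoCCL itself derives an adaptive step from BwDelta.
from itertools import product

IMPL_SUBSPACES = list(product(("Ring", "Tree"),             # A
                              ("LL", "LL128", "Simple"),    # P
                              ("P2P", "SHM")))              # T
RESOURCE_STEPS = {"nchannels": 1, "nthreads": 32, "chunk_size": 256}

def coord_descent(subspace, init, profile_bw):
    """Climb the unimodal bandwidth surface one resource dimension at a time."""
    best = dict(init)
    best_bw = profile_bw(subspace, best)
    for dim, step in RESOURCE_STEPS.items():
        while True:
            cand = dict(best)
            cand[dim] += step
            bw = profile_bw(subspace, cand)          # one in-band probe
            if bw <= best_bw:                        # past the peak: next dim
                break
            best, best_bw = cand, bw
    return best, best_bw

def subspace_directed_tuning(init, profile_bw):
    """Algorithm 1: keep the best (A, P, T, NC, NT, C) across all subspaces."""
    global_best, global_bw = None, float("-inf")
    for s in IMPL_SUBSPACES:                         # small categorical space
        cfg, bw = coord_descent(s, dict(init), profile_bw)
        if bw > global_bw:
            global_best, global_bw = (s, cfg), bw
    return global_best
```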


Fig 3: Control Flow — One Collective Call, End-to-End

  START: training framework calls AllReduce / AllGather / ReduceScatter
      │
      ▼
  ① [Worker Executor: lookup Config Table for (type, size, group)]
      │
      ├── config found (tuned)? ─────────────────────────────────┐
      │                                                          │
      └── not found → use NCCL default config ──────────────────┤
                                                                 │
  ② [Execute collective with selected config]                   │◄──┘
      │  (runs concurrently with computation — GEMM etc.)        │
      ▼                                                          │
  ③ [Executor on Leader: record execution time → Optimizer]     │
      │                                                          │
      ▼                                                          │
  ④ [Optimizer: update History Table]                           │
      │                                                          │
      ├── subspace already converged? → idle                     │
      │                                                          │
      └── not converged → run one step of CoordDescentSearch    │
               │  (one ProfileBw call in-band with training)     │
               ▼                                                 │
  ⑤ [If new config improves bandwidth → notify Coordinator]     │
      │                                                          │
      ▼                                                          │
  ⑥ [Coordinator: async broadcast new config to all Workers]    │
      │  (uses synchronous semantics → atomic update)            │
      ▼                                                          │
  ⑦ [All Workers update their Config Table]                     │
      │                                                          │
      └── next identical collective → step ① with new config ───┘
▲ Fig 3: Control flow — tuning is interlaced with normal training
  iterations; no separate offline profiling phase required.
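
A per-call sketch of steps ① through ④, under the assumption that wall-clock timing stands in for the CUDA-event-based measurement a real implementation would use; run_collective and tuner are placeholders, not AutoCCL functions.

```python
# Sketch of the Fig 3 hot path: key lookup, timed execution, and (on the
# Leader only) recording the measurement for the Optimizer.
import time

def collective_call(coll_type, nbytes, group_id, config_table, default_cfg,
                    run_collective, tuner=None):
    key = (coll_type, nbytes, group_id)              # per-task key (step 1)
    cfg = config_table.get(key, default_cfg)         # tuned if present
    t0 = time.perf_counter()
    run_collective(cfg)                              # step 2: execute
    exec_ms = (time.perf_counter() - t0) * 1e3
    if tuner is not None:                            # Leader only: steps 3-4
        tuner.record(key, cfg, exec_ms)              # History Table update
    return exec_ms
```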

Fig 4: Data Flow — History Table, Config Table, and Broadcast

  ┌──────────────────────────────────────────────────────────────┐
  │  LEADER NODE                                                 │
  │                                                              │
  │  ┌────────────────────────────────────┐                      │
  │  │  History Table (Leader-only)       │                      │
  │  │  task_id │ config │ exec_time(ms)  │                      │
  │  │  task_2  │ cfg_1  │ 16 ms          │◄═ Executor sends    │
  │  │  task_2  │ cfg_2  │ 14 ms  ← new  │   measured time     │
  │  └────────────────┬───────────────────┘                      │
  │                   │ historical exec times                    │
  │                   ▼                                          │
  │  ┌─────────────────────────────────┐                         │
  │  │  Optimizer                      │                         │
  │  │  CoordDescentSearch → new cfg   │                         │
  │  └──────────────┬──────────────────┘                         │
  │                 │ notify (new config)                        │
  │                 ▼                                            │
  │  ┌─────────────────────────────────┐                         │
  │  │  Coordinator                    │                         │
  │  └──────────────┬──────────────────┘                         │
  │                 │                                            │
  │  ┌──────────────▼──────────────────┐                         │
  │  │  Config Table (Leader + Worker) │                         │
  │  │  task_id │ default  │ tuned     │                         │
  │  │  task_2  │ cfg_1    │ cfg_2     │                         │
  │  └─────────────────────────────────┘                         │
  └────────────────────╫────────────────────────────────────────┘
                       ║ async broadcast (Linux socket)
  ┌────────────────────╨────────────────────────────────────────┐
  │  WORKER NODE(S)                                              │
  │  ┌──────────────────────────────────────────────────────┐   │
  │  │  Config Table (replica, updated by broadcast)        │   │
  │  │  task_id │ default  │ tuned                          │   │
  │  │  task_2  │ cfg_1    │ cfg_2   ← updated atomically   │   │
  │  └──────────────────────────────────────────────────────┘   │
  └──────────────────────────────────────────────────────────────┘
▲ Fig 4: Data flow — History Table lives only on Leader; Config Table
  is replicated to all Workers via atomic broadcast from Coordinator.
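
The Worker-side update can be sketched as below; the atomicity claim refers to CPython's GIL in this toy version, whereas a C implementation would swap a pointer with release/acquire ordering.

```python
# Toy Config Table with the update semantics Fig 4 highlights: readers on the
# collective hot path never observe a torn entry.
import threading

class ConfigTable:
    def __init__(self):
        self._entries = {}                    # task_key -> immutable config
        self._write_lock = threading.Lock()   # serializes broadcast updates

    def lookup(self, key, default):
        return self._entries.get(key, default)   # lock-free read path

    def apply_broadcast(self, key, cfg):
        with self._write_lock:                # Coordinator receiver thread
            self._entries[key] = cfg          # single reference swap
```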

Fig 5: State Machine — AutoCCL Tuning Lifecycle per Subspace

          new_task (subspace s, not yet tuned)
  [IDLE] ──────────────────────────────────► [PROBING]
     ▲                                            │
     │                               one coord-descent step
     │                               per training iteration
     │                                            │
     │                                    BwDelta > 0?
     │                                   ┌────┴────┐
     │                                  yes        no
     │                                   │          │
     │                                   ▼          ▼
     │                             [IMPROVING] [ADVANCE DIM]
     │                                   │          │
     │                              update opt  tuned_dim++
     │                                   │          │
     │                             tuned_dim == M?  │
     │                             └────┬───────────┘
     │                                 yes
     │                                  │
     └────────── broadcast optimum ◄────┘
                                  [CONVERGED]
                                  (subspace done;
                                   move to next
                                   subspace or idle)
▲ Fig 5: Per-subspace tuning state machine — PROBING advances one
  coordinate dimension per training iteration; convergence broadcasts
  the optimum and idles until a new task arrives.
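
A state-machine sketch of Fig 5, assuming one tuner step per training iteration and eliding the probing math (covered by the Fig 2 sketch); state names mirror the diagram.

```python
# Per-subspace lifecycle from Fig 5. The caller invokes on_iteration() once per
# training step with the bandwidth delta of the latest probe.
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    PROBING = auto()
    CONVERGED = auto()

class SubspaceLifecycle:
    M = 3                                     # resource dimensions: NC, NT, C

    def __init__(self):
        self.state = State.IDLE
        self.tuned_dim = 0

    def on_new_task(self):
        if self.state is State.IDLE:
            self.state, self.tuned_dim = State.PROBING, 0

    def on_iteration(self, bw_delta):
        if self.state is not State.PROBING:
            return
        if bw_delta > 0:
            return                            # IMPROVING: keep probing this dim
        self.tuned_dim += 1                   # ADVANCE DIM
        if self.tuned_dim == self.M:
            self.state = State.CONVERGED      # broadcast optimum, then idle
```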

Fig 6: Layered Stack — AutoCCL Software Insertion Points

┌──────────────────────────────────────────────────────────┐
│  Training Framework (PyTorch / MegatronLM)               │
│  calls AllReduce, AllGather, ReduceScatter — unchanged   │
├──────────────────────────────────────────────────────────┤
│  AutoCCL Shim Layer (dynamic library preload)            │
│  ┌─────────────────────┐  ┌──────────────────────────┐   │
│  │  Config Table Lookup│  │  Background tuning thread│   │
│  │  (per task key)     │  │  (Optimizer + Coordinator│   │
│  └─────────────────────┘  └──────────────────────────┘   │
│  Intercepts NCCL calls; selects tuned or default config  │
├──────────────────────────────────────────────────────────┤
│  NCCL 2.18.3 Core                                        │
│  AllReduce / AllGather / ReduceScatter kernels           │
│  Config generation module (modified to allow flex cfg)   │
├──────────────────────────────────────────────────────────┤
│  Transport layer (P2P / SHM / NVLink / IB)               │
├──────────────────────────────────────────────────────────┤
│  Hardware (NVLink, PCIe, InfiniBand 400 Gbps)            │
└──────────────────────────────────────────────────────────┘
▲ Fig 6: AutoCCL inserts between the training framework and NCCL
  core via dynamic library preload — zero framework code changes.

Fig 7: Sequence Diagram — Online Tuning Across Training Iterations

  Iteration:    k-1          k           k+1         k+2     N
                 │            │            │            │      │
  Leader         │            │            │            │      │
  Executor  ─────┼── exec c1 ─┼── exec c1 ─┼── exec c2 ─┼─────┼─►
                 │   16 ms    │   14 ms    │            │      │
                 │            │            │            │      │
  Optimizer ─────┼────────────┼ b1:update  │            │      │
                 │            │ History Tbl│            │      │
                 │            │ b2:fetch   │            │      │
                 │            │ history    │            │      │
                 │            │ b3:tune →  │            │      │
                 │            │ gen cfg_2  │            │      │
                 │            │ b4:notify  │            │      │
                 │            │ Coord      │            │      │
                 │            │            │            │      │
  Coordinator────┼────────────┼────────────┼ b5:bcast   │      │
                 │            │            │ cfg_2 →    │      │
                 │            │            │ Workers    │      │
                 │            │            │            │      │
  Worker         │            │            │            │      │
  Executor  ─────┼── exec c1 ─┼── exec c1 ─┼── exec c2 ─┼─────┼─►
                 │            │            │  (updated) │      │
                 ◄─── k tuning iters (k<<N) ──►◄ N-k beneficial iters►
▲ Fig 7: Online tuning sequence — tuning runs during early training
  iterations; once converged, all subsequent identical collectives
  benefit from the tuned config with no overhead.

Fig 8: Performance Model — Bandwidth Derivation

  Collective execution = two serial phases:

  Phase 0: Transport (read data from other GPUs → local buffer)
  Phase 1: Protocol  (load buffer → SM → reduce → store back)

  ┌──────────────────────────────────────────────────────────┐
  │  Transport bandwidth β₀(NC, C):                          │
  │                                                          │
  │  Message M split into NC channels × NC×C serial steps    │
  │  t₀(NC,C) = M/(NC×C) × (α₀ + NC×C / β̄₀×γ)             │
  │                          ↑          ↑                    │
  │                    init latency  transmission time       │
  │                    per step      under congestion γ      │
  │  β₀ = M / t₀(NC,C)                                      │
  │     = NC×C / (α₀ + NC×C/(β̄₀×γ))                        │
  └──────────────────────────────────────────────────────────┘
  ┌──────────────────────────────────────────────────────────┐
  │  Protocol bandwidth β₁(NC, NT):                          │
  │                                                          │
  │  β₁ = NC×NT / (α₁ + NC×NT/(β̄₁×γ))                      │
  │  (depends on NC and NT, not C — SMs process in cache)    │
  └──────────────────────────────────────────────────────────┘
  ┌──────────────────────────────────────────────────────────┐
  │  Combined bandwidth (bottleneck):                        │
  │  β(NC,NT,C) = min(β₀(NC,C), β₁(NC,NT))                  │
  │  ↑ This is unimodal — enables coordinate descent         │
  └──────────────────────────────────────────────────────────┘
▲ Fig 8: Analytic bandwidth model — the bottleneck of transport and
  protocol stages; unimodality enables gradient-free coord descent.
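
Restating Fig 8's formulas in conventional notation (same symbols as the figure; no new quantities introduced):

```latex
t_0(NC, C) = \frac{M}{NC\,C}\left(\alpha_0 + \frac{NC\,C}{\bar{\beta}_0\,\gamma}\right),
\qquad
\beta_0(NC, C) = \frac{M}{t_0(NC, C)} = \frac{NC\,C}{\alpha_0 + NC\,C/(\bar{\beta}_0\,\gamma)}

\beta_1(NC, NT) = \frac{NC\,NT}{\alpha_1 + NC\,NT/(\bar{\beta}_1\,\gamma)},
\qquad
\beta(NC, NT, C) = \min\!\bigl(\beta_0(NC, C),\ \beta_1(NC, NT)\bigr)
```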

Fig 9: Design Trade-off Analysis

| Decision | Alternative A | Alternative B (AutoCCL) | Winner | Why |
|---|---|---|---|---|
| Tuning timing | Offline (pre-training) | Online (first k iters) | B | Offline can't capture computational interference from training; runtime scheduling dynamics change the optimal config |
| Search strategy | Exhaustive / brute-force | Subspace divide + coord descent | B | Exhaustive takes hours for 80 MB AllGather alone; coord descent converges in <10 mins with 1-step-per-iter amortization |
| Config scope | Global env vars (NCCL default) | Per-task (type, size, group) key | B | Same primitive type with different sizes needs different configs; a global config degrades small-message performance |
| Impl. vs resource params | Treat all 6 uniformly | Split: exhaustive impl, model resource | B | Impl space is small (6 combos), resource space is huge (millions); the split exploits this structural asymmetry |
| Comp. interference model | Ignore (AFNFA / NCCL default) | Capture via online profiling | B | Offline tuners degrade under interference; online profiling captures actual SM/cache/bandwidth contention |
| Broadcast mechanism | P2P gossip per node | Leader-to-group bcast (Linux socket) | B | Single atomic broadcast ensures a consistent config table; no inconsistency window within a training step |
| Transport selection | Fixed T (e.g., always P2P) | T as implementation param, searched | B | T=SHM outperforms T=P2P on NVLink by up to 1.28x; ignoring T wastes topology-specific bandwidth |
| Leader overhead | Inline with comm | Separate background thread | B | Async measurement + tuning avoids blocking the collective's critical path; overhead hidden in iteration slack |

For DynamICCL, prefer B across all rows because AutoCCL's online, per-task, asymmetric-role design directly mirrors what DynamICCL's RL agent needs: a regime where configs are discovered and refined during actual training rather than pre-characterized offline, and where computational interference is a first-class input to the selection mechanism rather than an afterthought.


What to Borrow for DynamICCL

Pattern 1 — Leader/Worker asymmetry as the control plane design. DynamICCL currently treats all ranks symmetrically for config selection. AutoCCL demonstrates that designating one node as the tuning Leader and broadcasting configs to Workers via an asynchronous atomic broadcast is both low-overhead and correctness-safe. DynamICCL's RL agent (Agent-2) should run on a single designated Leader rank, and its config decisions should propagate to Workers via a Coordinator broadcast rather than every rank independently querying the agent; this reduces policy inference from one query per rank to a single query per group per decision.
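
One possible (synchronous, simplified) realization, assuming PyTorch process groups are already initialized; policy.act and task_key are placeholders, and AutoCCL itself pushes configs asynchronously off the critical path rather than at decision time.

```python
# Leader-only policy inference with a group-wide fan-out of the decision,
# using torch.distributed.broadcast_object_list. This is a simplified,
# synchronous stand-in for AutoCCL's async Coordinator broadcast.
import torch.distributed as dist

def select_and_propagate(policy, state, task_key, leader_rank=0, group=None):
    payload = [None]
    if dist.get_rank() == leader_rank:
        payload[0] = (task_key, policy.act(state))   # one inference per group
    dist.broadcast_object_list(payload, src=leader_rank, group=group)
    return payload[0]                                # (task_key, chosen config)
```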

Pattern 2 — Per-task keying: (type, message_size_bin, group_size). AutoCCL keys its Config Table on (primitive type, message size, comm. group). DynamICCL's state vector already includes message_size_bin; the group_size dimension should be added explicitly. Two AllReduce calls of the same size but different group sizes (e.g., TP group vs DP group) have different optimal configs and must be tracked separately.
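
A hedged sketch of such a key; the log2 binning and the helper name are invented for illustration, and DynamICCL's actual message_size_bin scheme may differ.

```python
# Per-task key combining primitive type, a log2 message-size bin, and the
# communicator's identity (its size and member ranks). Same-size AllReduce
# calls on different groups therefore map to different table entries.
def task_key(coll_type, nbytes, group_ranks):
    size_bin = int(nbytes).bit_length()        # coarse log2 size bin
    ranks = tuple(sorted(group_ranks))
    return (coll_type, size_bin, len(ranks), ranks)

tp_key = task_key("all_reduce", 64 << 20, range(0, 8))       # TP group
dp_key = task_key("all_reduce", 64 << 20, range(0, 64, 8))   # DP group
assert tp_key != dp_key                        # tracked separately
```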

Pattern 3 — Unimodal bandwidth surface enables local, peak-seeking tuning. The β(NC, NT, C) model reveals that the bandwidth surface is unimodal over the resource parameters once the implementation parameters are fixed. DynamICCL's RL action space discretizes nChannels (1-8) and numThreads; the agent can exploit this unimodality by biasing its exploration: once the policy has found a peak, it should restrict exploration to configurations neighboring that peak (one step along each dimension) rather than sampling the full action space at random.
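
A sketch of the neighbor-restricted candidate set, with placeholder value grids (the thread counts are illustrative, not NCCL's exact set).

```python
# Once a peak config is known, restrict exploration to +/-1 steps along each
# resource dimension instead of sampling the whole grid.
NCHANNELS = list(range(1, 9))                 # 1..8, as in DynamICCL's space
NUMTHREADS = [64, 128, 256, 512]              # illustrative discretization

def neighbor_actions(current):
    ci = NCHANNELS.index(current["nchannels"])
    ti = NUMTHREADS.index(current["numthreads"])
    out = []
    for dc in (-1, 0, 1):
        for dt in (-1, 0, 1):
            if 0 <= ci + dc < len(NCHANNELS) and 0 <= ti + dt < len(NUMTHREADS):
                out.append({"nchannels": NCHANNELS[ci + dc],
                            "numthreads": NUMTHREADS[ti + dt]})
    return out                                # <= 9 candidates vs 32 total
```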

Pattern 4 — Computational interference is a first-class tuning signal. AutoCCL's central insight is that NCCL's default cost model ignores SM and bandwidth contention from concurrent GEMM operations. DynamICCL's Agent-1 (LSTM+CUSUM) detects congestion as a binary signal. This should be promoted to a continuous interference level fed into Agent-2's state vector — matching AutoCCL's observation that interference changes the optimal NC, NT, C values significantly (up to 39% bandwidth degradation under heavy interference).

Pattern 5 — Amortize tuning over repeated identical collectives. DNN training repeats the same collective pattern thousands of times per epoch. AutoCCL exploits this by spending the first k<<N iterations on tuning, then benefiting for N-k iterations. DynamICCL should similarly implement a "warm-up exploration" phase per collective key during the first epoch, after which the policy freezes its config selection for that key and relies on CUSUM to trigger re-exploration only when congestion changes.
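
A sketch of the per-key warm-up budget with CUSUM-triggered re-exploration; warmup_steps and the alarm interface are placeholders rather than DynamICCL's existing names.

```python
# Per-collective-key exploration budget: explore during warm-up, then freeze
# the config choice until Agent-1's CUSUM alarm signals a congestion change.
class KeyedExplorer:
    def __init__(self, warmup_steps=50):       # placeholder budget
        self.warmup_steps = warmup_steps
        self.remaining = {}                    # key -> budget left

    def should_explore(self, key, cusum_alarm=False):
        if cusum_alarm:                        # congestion regime changed
            self.remaining[key] = self.warmup_steps
        left = self.remaining.setdefault(key, self.warmup_steps)
        if left > 0:
            self.remaining[key] = left - 1
            return True                        # still exploring this key
        return False                           # frozen: reuse tuned config
```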

Pattern 6 — Separate transport T from resource parameters in the action space. AutoCCL shows that T (P2P vs SHM) is an implementation parameter that must be selected before tuning NC, NT, C — because NC's optimal direction reverses depending on T (more NC is better for SHM but not for P2P at the same scale). DynamICCL's action space should treat the algorithm+protocol+transport triple as a categorical choice (the impl subspace), and nChannels+numThreads as continuous resource choices within that subspace.
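
A sketch of the split action space this pattern argues for; the value sets are illustrative.

```python
# Action space split per Pattern 6: a categorical implementation choice
# (algorithm, protocol, transport) selects the subspace, and the resource
# values (nChannels, numThreads) are chosen within it.
from itertools import product

IMPL_CHOICES = tuple(product(("ring", "tree"),            # algorithm
                             ("LL", "LL128", "Simple"),   # protocol
                             ("P2P", "SHM")))             # transport

def build_action(impl_idx, nchannels, numthreads):
    algorithm, protocol, transport = IMPL_CHOICES[impl_idx]
    return {"algorithm": algorithm, "protocol": protocol,
            "transport": transport,
            "nchannels": nchannels, "numthreads": numthreads}
```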