Architecture & Design Analysis

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

Source: Hu, Shen, Bonato et al., ETH Zürich / NVIDIA / Broadcom, arXiv:2507.04786v2, July 2025


1. System Overview Block Diagram

┌──────────────────────────────────────────────────────────────────────┐
│                    NCCL System Architecture                          │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                   User API Layer                             │   │
│  │  ncclAllReduce  ncclAllGather  ncclReduceScatter             │   │
│  │  ncclBroadcast  ncclReduce     ncclSend / ncclRecv           │   │
│  │  ncclGroupStart / ncclGroupEnd  (batched launch)             │   │
│  └────────────────────────┬─────────────────────────────────────┘   │
│                           │ collective invocation                    │
│                           ▼                                          │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │            Algorithm & Protocol Selection                    │   │
│  │  ┌─────────────────────────┐  ┌────────────────────────────┐│   │
│  │  │  Algorithm selector     │  │  Protocol selector         ││   │
│  │  │  Ring / Tree /          │  │  Simple / LL / LL128       ││   │
│  │  │  CollNet / NVLS / PAT   │  │  (based on msg size,       ││   │
│  │  │  (based on collective   │  │   topology, hw caps)       ││   │
│  │  │   type, topology, size) │  │                            ││   │
│  │  └────────────┬────────────┘  └────────────┬───────────────┘│   │
│  └───────────────┼───────────────────────────┼─────────────────┘   │
│                  │                           │                       │
│                  ▼                           ▼                       │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │            Channel Management Layer                          │   │
│  │  nChannels CUDA blocks ─► each block = 1 communication      │   │
│  │  channel, runs on its own SM, operates on disjoint data      │   │
│  │  chunk partition (workOffset + channelCount elements)        │   │
│  │                                                              │   │
│  │  Within each channel:                                        │   │
│  │  ┌──────────────────────────────────────────────────────┐   │   │
│  │  │  NCCL_STEPS slots (pipeline stages, default 8)       │   │   │
│  │  │  Each slot: ncclConnFifo (mode, offset, size, ptr)   │   │   │
│  │  │  Slots cycle: computing → queued → in-flight → done  │   │   │
│  │  └──────────────────────────────────────────────────────┘   │   │
│  └───────────────────────────────┬──────────────────────────────┘   │
│                                  │                                   │
│                                  ▼                                   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │            Transport Layer (data movement)                   │   │
│  │                                                              │   │
│  │   Intra-Node                    Inter-Node                   │   │
│  │  ┌───────────────────┐        ┌────────────────────────┐    │   │
│  │  │  P2P (NVLink)     │        │  IB Verbs (RDMA)       │    │   │
│  │  │  P2P_DIRECT mode  │        │  - forward QP (bulk)   │    │   │
│  │  │  (no FIFO buffer) │        │  - reverse QP (CTS)    │    │   │
│  │  │                   │        │  GPUDirect RDMA        │    │   │
│  │  │  SHM (host mem    │        │  (if GPU+NIC same PCIe)│    │   │
│  │  │  fallback for     │        │                        │    │   │
│  │  │  poor PCIe P2P)   │        │  Socket (TCP fallback) │    │   │
│  │  └───────────────────┘        └────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘
▲ Fig 1: NCCL four-layer architecture — API → algorithm+protocol
         selection → channel management → transport execution

Interpretation. The layered design separates concerns cleanly: algorithm selection (Ring vs. Tree) is independent of protocol selection (Simple vs. LL vs. LL128), which in turn is independent of the transport (P2P vs. IB vs. SHM). This means DynamICCL's Config Agent can treat these as orthogonal dimensions in its action space — choosing algorithm, protocol, and nChannels independently — because NCCL itself composes them.


2. Key Architecture Diagram — Protocol Comparison & Channel Execution Model

2a. Protocol Characteristics

┌────────────────────────────────────────────────────────────────────┐
│              NCCL Protocol Trade-off Space                         │
│                                                                    │
│  Small messages ◄────────────────────────────► Large messages      │
│                                                                    │
│  ┌─────────────┐   ┌────────────────┐   ┌──────────────────────┐  │
│  │  LL          │   │  LL128         │   │  Simple              │  │
│  │ (Low Latency)│   │ (Low Lat +     │   │ (High Bandwidth)     │  │
│  │              │   │  High BW)      │   │                      │  │
│  │ Payload:     │   │ Payload:       │   │ Payload:             │  │
│  │ 4B data +    │   │ 120B data +    │   │ large chunks         │  │
│  │ 4B flag      │   │ 8B flag        │   │                      │  │
│  │ (8B atomic)  │   │ (128B atomic)  │   │ Sync: memory fences  │  │
│  │              │   │                │   │ (high overhead)      │  │
│  │ Sync: flag   │   │ Sync: flag     │   │                      │  │
│  │ poll (CPU)   │   │ poll (CPU)     │   │ Latency/hop: ~6 µs   │  │
│  │              │   │                │   │                      │  │
│  │ BW: 25-50%   │   │ BW: ~95%       │   │ BW: near peak        │  │
│  │ of peak      │   │ of peak        │   │                      │  │
│  │              │   │                │   │ Supports GPUDirect   │  │
│  │ Latency/hop: │   │ Latency/hop:   │   │ RDMA path            │  │
│  │ ~1 µs        │   │ ~2 µs          │   │                      │  │
│  │              │   │ Requires       │   │ Best for: inter-node │  │
│  │ Forces host  │   │ atomic 128B    │   │ large AllReduce      │  │
│  │ mem buffer   │   │ writes (hw     │   │                      │  │
│  │ (no RDMA)    │   │ constraint)    │   │                      │  │
│  └─────────────┘   └────────────────┘   └──────────────────────┘  │
│        ▲                  ▲                        ▲               │
│   best: intra-node   best: NVLink              best: inter-node    │
│   small messages     all sizes                 large messages      │
└────────────────────────────────────────────────────────────────────┘
▲ Fig 2: NCCL protocol trade-off space — LL/LL128 for small/latency-
         critical, Simple for large bandwidth-critical messages

2b. Channel Execution Model (CUDA Hierarchy Mapping)

  NCCL Kernel Grid: (nChannels, 1, 1) blocks
  │
  ├── Block 0 (Channel 0) → SM 0
  │   ├── Warp 0: load ncclDevComm (communicator metadata)
  │   ├── Warp 1: load ncclDevChannel (channel-specific data)
  │   └── Warps 2..nWarps: communication + computation work
  │       │
  │       └── NCCL_STEPS pipeline slots within channel buffer:
  │           Slot 0 [computing] → Slot 1 [queued] →
  │           Slot 2 [in-flight] → Slot 3 [ready] → recycle
  │
  ├── Block 1 (Channel 1) → SM 1
  │   └── (same structure, different workOffset)
  │
  └── Block N-1 (Channel N-1) → SM N-1

  Data partitioning across channels:
  ┌────────────────────────────────────────────────────────┐
  │  Total count elements                                  │
  │  ├── Channel 0: [workOffset_0, workOffset_0 +          │
  │  │               channelCount_0]                       │
  │  │   └── processed in loopCount outer iterations      │
  │  │       each iteration: chunkCount elements           │
  │  └── Channel 1: [workOffset_1, workOffset_1 +          │
  │                  channelCount_1]                       │
  │      └── (independent of Channel 0)                   │
  └────────────────────────────────────────────────────────┘
▲ Fig 3: CUDA hierarchy mapping — one block per channel, warps
         specialize into roles, slot-based pipeline within channel

3. Control Flow & Data Flow Diagrams

3a. Control Flow — Algorithm and Protocol Selection

  START: ncclAllReduce(sendbuff, recvbuff, count, dtype, op, comm)
    │
    ▼
① [Read communicator topology established at ncclCommInit]
    │
    ▼
② [Evaluate algorithm candidates:]
    │
    ├── Ring?  ─────► supports: AllReduce(all proto),
    │                           AllGather(all proto),
    │                           ReduceScatter(all proto),
    │                           Broadcast(Simple only),
    │                           Reduce(Simple only)
    │
    ├── Tree?  ─────► supports: AllReduce(all proto) only
    │                 (double binary tree: 2 mirrored trees)
    │
    ├── CollNet? ───► requires: NVIDIA SHARP switches
    │                           (hardware-assisted reduction)
    │
    └── NVLS?  ─────► requires: NVLink Switch topology
    │
    ▼
③ [Select protocol based on message size + hw caps:]
    │
    ├── count * sizeof(dtype) < threshold_small
    │   AND hw supports LL → SELECT LL
    │
    ├── count * sizeof(dtype) < threshold_medium
    │   AND hw supports 128B atomic → SELECT LL128
    │
    └── else → SELECT Simple
    │
    ▼
④ [Determine nChannels via internal tuning model:]
    │
    │   (considers: selected algo+proto, msg_size,
    │    available bandwidth, topology)
    │   → smaller msg_size tends to fewer channels
    │     (avoid under-filling the 512KiB FIFO per channel)
    │
    ▼
⑤ [Determine numThreads (nWarps) per channel block]
    │
    ▼
⑥ [Launch kernel: grid=(nChannels,1,1), each block
    executes channel-specific collective primitives]
    │
    ▼
  DONE: result in recvbuff
▲ Fig 4: Control flow for NCCL collective dispatch —
         algorithm → protocol → channels → threads → kernel
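
The dispatch in Fig 4 can be sketched as a small function. This is a hedged sketch of step ③ only: the threshold constants, flag names, and function name are illustrative assumptions, not NCCL's actual tuning-model values.

```python
# Hedged sketch of Fig 4's protocol dispatch (step ③). Threshold
# constants and capability flags are illustrative assumptions.

THRESHOLD_SMALL = 64 * 1024      # assumed LL cutoff, bytes
THRESHOLD_MEDIUM = 1024 * 1024   # assumed LL128 cutoff, bytes

def select_protocol(msg_bytes: int, has_ll: bool = True,
                    has_128b_atomics: bool = True) -> str:
    """Pick a protocol from message size and hardware capabilities."""
    if msg_bytes < THRESHOLD_SMALL and has_ll:
        return "LL"
    if msg_bytes < THRESHOLD_MEDIUM and has_128b_atomics:
        return "LL128"
    return "Simple"
```

The same cascading structure applies to step ②, with the algorithm support matrix acting as the capability check.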

3b. Data Flow — Ring AllReduce (k GPUs, one loop iteration)

  ReduceScatter phase (steps 0 to k-1):
  ─────────────────────────────────────
  GPU 0          GPU 1          GPU 2         GPU k-1
    │              │              │               │
①  ─┼── send ──►  │              │               │
    │ recvReduceSend             │               │
    │ (recv+reduce+fwd) ────────►│               │
    │              │ recvReduceSend ─────────────►│
    │              │              │  recvReduceSend
    │  ...         │  ...         │  ...          │
    │◄─────────────────────────────── recvReduceCopySend
    │  (step k-1: final reduce + copy to output)  │

  AllGather phase (steps k to 2k-2):
  ──────────────────────────────────
  GPU 0          GPU 1          GPU 2         GPU k-1
    │              │              │               │
    ─┼── recvCopySend ──►│        │               │
    │              │── recvCopySend ──►│           │
    │              │              │── recvCopySend►│
    │◄──────────────────────────────── recv (step 2k-2)
    │              │              │               │
  [all GPUs now hold fully reduced result]

  Total steps per loop iteration: 2k - 1
  Bandwidth: each GPU transfers 2(k-1)/k · N bytes in total, so
  effective bandwidth approaches the link bandwidth B as k → ∞
  (bandwidth-optimal)
▲ Fig 5: Ring AllReduce data flow — ReduceScatter then AllGather,
         2k-1 steps, bandwidth-optimal for large messages
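
The bandwidth-optimality claim follows from simple arithmetic: each GPU sends (k-1)/k of the N-byte buffer in each of the two phases, so per-GPU traffic approaches the 2N lower bound as k grows. A minimal sketch (the function name is ours):

```python
def ring_allreduce_traffic(k: int, n_bytes: int) -> float:
    """Per-GPU bytes sent by ring AllReduce: (k-1)/k * N during
    ReduceScatter plus (k-1)/k * N during AllGather."""
    return 2 * (k - 1) / k * n_bytes
```

For k=4 and a 1 KiB buffer, each GPU moves 1536 bytes; as k grows the total approaches 2N regardless of rank count, which is why ring is the large-message workhorse.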

3c. Data Flow — Tree AllReduce (per loop iteration)

  Reduce phase (leaf → root):
  ───────────────────────────
  Leaf GPU ── send ──► Middle GPU
                          │── recvReduceSend ──► Root GPU
                                                    │
                                          recvReduceCopySend
                                          (reduce + copy to outbuf)

  Broadcast phase (root → leaf):
  ───────────────────────────────
  Root GPU ── recvCopySend ──► Middle GPU
                                    │── recvCopySend ──► Leaf GPU
                                                           (recv)

  Phases often run concurrently on partitioned SMs:
  ┌──────────────────────────────────────────────────────┐
  │  SM group A (more threads): Reduce phase toward root │
  │  SM group B (fewer threads): Broadcast from root     │
  │  Asymmetric allocation → more BW to reduction phase  │
  └──────────────────────────────────────────────────────┘

  Total steps per loop iteration: O(log k)
  Bandwidth: sub-optimal (root is bottleneck)
  Best for: small messages, low-latency requirement
▲ Fig 6: Tree AllReduce data flow — reduce up then broadcast down,
         latency-optimal O(log k) but bandwidth-limited at root

3d. Data Flow — Intra-node vs. Inter-node Transport Paths

  INTRA-NODE (GPUs on same physical machine):
  ┌──────────────────────────────────────────────────────────┐
  │  GPU 0 sendBuff                    GPU 1 recvBuff        │
  │      │                                   ▲              │
  │      │  Path 1: NVLink (preferred)        │              │
  │      └─── P2P_DIRECT (no FIFO) ──────────┘              │
  │                                                          │
  │  Path 2: PCIe P2P (if NVLink absent)                    │
  │      GPU 0 → [Intermediate FIFO Buffer] → GPU 1         │
  │                  (DMA Engine)                            │
  │                                                          │
  │  Path 3: SHM (if PCIe P2P degrades)                     │
  │      GPU 0 → [Host Memory segment] → GPU 1              │
  └──────────────────────────────────────────────────────────┘

  INTER-NODE (GPUs on different machines):
  ┌──────────────────────────────────────────────────────────┐
  │  Node 0 GPU 0                            Node 1 GPU 1    │
  │  sendBuff                                recvBuff        │
  │      │                                       ▲           │
  │      ▼                                       │           │
  │  [GPU kernel: copy to intermediate buffer]   │           │
  │      │                                       │           │
  │      ▼  Proxy Thread                         │           │
  │  [Host Memory] ── IB RDMA_WRITE ──► [Host Memory]       │
  │  (or GPU memory if GPUDirect RDMA available)             │
  │                                                          │
  │  Forward QP: bulk data (RDMA_WRITE)                      │
  │  Reverse QP: CTS control message (tiny, low-latency)     │
  │  2 QPs per GPU pair per NIC (NCHANNELS_PER_NET_PEER=2)  │
  └──────────────────────────────────────────────────────────┘
▲ Fig 7: Transport path selection — NVLink > PCIe P2P > SHM
         intra-node; IB RDMA > Socket inter-node

4. Design Trade-off Analysis

Design Decision: Small-message synchronization
  Alternative A: Memory fences (Simple protocol)
  Alternative B (NCCL): Flag-based polling (LL/LL128)
  Winner: B. Memory fences require full cache-coherence round trips
  (~6 µs/hop); flag polling reaches ~1 µs/hop by piggybacking the
  validity flag with the data payload.

Design Decision: Bandwidth efficiency for NVLink
  Alternative A: LL (25-50% of peak BW)
  Alternative B (NCCL): LL128 (~95% of peak BW)
  Winner: B. LL128's 128-byte units fully exploit NVLink's
  cache-line-aligned transfer units, while LL's 8-byte units
  under-utilize link width. LL128 requires hardware-atomic 128B
  writes and falls back to LL on PCIe.

Design Decision: Parallelism mechanism
  Alternative A: Single SM handles the entire collective
  Alternative B (NCCL): Multiple channels, each on its own SM
  Winner: B. A single SM cannot saturate NVLink/IB bandwidth;
  multi-channel parallelism saturates all NIC ports and balances
  load across PCIe lanes via ECMP.

Design Decision: Intra-node P2P path
  Alternative A: Routing through an intermediate FIFO buffer
  Alternative B (NCCL): P2P_DIRECT, a direct source-to-destination
  pointer
  Winner: B. Eliminating the FIFO buffer removes one DMA copy, and
  direct pointer access within the same process eliminates IPC
  handle overhead.

Design Decision: Inter-node RDMA QP layout
  Alternative A: Single bidirectional QP
  Alternative B (NCCL): Separate forward QP (bulk) + reverse QP (CTS)
  Winner: B. Separating CTS control from bulk data eliminates
  head-of-line blocking; small CTS messages never wait behind large
  data transfers.

Design Decision: Tree topology structure
  Alternative A: Single binary tree
  Alternative B (NCCL): Double binary tree (two mirrored trees)
  Winner: B. Doubles bandwidth utilization: a GPU that is a leaf in
  tree 1 is internal in tree 2, so each GPU participates in both
  reduction and broadcast paths simultaneously.

Design Decision: Algorithm selection granularity
  Alternative A: One algorithm per collective type
  Alternative B (NCCL): Runtime selection per (collective, size,
  topology)
  Winner: B. Ring is optimal for large AllReduce and Tree for small
  AllReduce; using Ring for small messages wastes steps, and using
  Tree for large messages wastes bandwidth.

Design Decision: Channel count under ECMP
  Alternative A: Fixed high channel count
  Alternative B (NCCL): Adaptive, reducing nChannels for small
  messages
  Winner: B. Each channel requires a 512 KiB FIFO buffer;
  under-filling buffers with small messages wastes network QP
  bandwidth and ECMP bucket utilization. The calcP2pChunkSize
  heuristic balances channel parallelism against fill ratio.

For DynamICCL, prefer B in all cases because these are precisely the NCCL-internal decisions that DynamICCL's Config Agent must learn to reproduce externally. The agent should encode these same heuristics in its learned policy: LL for small messages, LL128 for NVLink workloads, Simple for large inter-node transfers, and adaptive nChannels based on message size.


5. What to Borrow for DynamICCL

5.1 The Algorithm-Protocol Orthogonality Constraint

NCCL treats algorithm selection (Ring/Tree) and protocol selection (Simple/LL/LL128) as independently composable choices — but with hard constraints (Table III). Tree only supports Simple protocol; CollNet only supports Simple; LL128 requires 128B atomic writes. DynamICCL's Config Agent action space must encode these constraints as hard masks. An action (tree, ll128) is architecturally invalid and should never be selected. Implementing these as action masks in the DQN forward pass eliminates invalid state-action pairs and dramatically reduces the effective action space.

Concrete constraint table for action masking:

  ┌─────────────┬────────┬────────┬──────────┐
  │ Algorithm   │ Simple │ LL     │ LL128    │
  ├─────────────┼────────┼────────┼──────────┤
  │ Ring        │  YES   │  YES   │  YES     │
  │ Tree        │  YES   │  NO    │  NO      │
  │ CollNet     │  YES   │  NO    │  NO      │
  │ NVLS        │  YES   │  NO    │  NO      │
  │ NVLS Tree   │  YES   │  NO    │  NO      │
  └─────────────┴────────┴────────┴──────────┘
▲ Action masking table — invalid (algo, proto) pairs
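
The table can be applied directly as a hard mask over Q-values. A minimal sketch assuming a 5×3 (algorithm × protocol) layout; the names, ordering, and helper function are illustrative, not part of any existing DynamICCL API:

```python
import math

ALGOS = ["ring", "tree", "collnet", "nvls", "nvls_tree"]
PROTOS = ["simple", "ll", "ll128"]

# mask[i][j] == 1 iff (ALGOS[i], PROTOS[j]) is architecturally valid.
ACTION_MASK = [
    [1, 1, 1],   # Ring supports all three protocols
    [1, 0, 0],   # Tree: Simple only
    [1, 0, 0],   # CollNet: Simple only
    [1, 0, 0],   # NVLS: Simple only
    [1, 0, 0],   # NVLS Tree: Simple only
]

def mask_q_values(q):
    """Replace Q-values of invalid (algo, proto) pairs with -inf so
    a greedy argmax can never select them."""
    return [[qij if m else -math.inf for qij, m in zip(qrow, mrow)]
            for qrow, mrow in zip(q, ACTION_MASK)]
```

Applying the mask before argmax in the DQN forward pass shrinks the effective action space from 15 to 7 (algo, proto) pairs.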

5.2 Message Size Bin as Primary State Feature

This paper establishes empirically that protocol performance is fundamentally determined by message size: LL/LL128 dominate below 64 KiB (inter-node) or across all sizes (intra-node on NVLink); Simple dominates above 64 KiB inter-node. DynamICCL's state vector must include log2(message_size_bytes) as a primary feature. The paper's Fig. 6 transition point (~64 KiB for inter-node) should seed the initial discretization boundaries for the message size bin feature in DynamICCL's state space.

State encoding recommendation: Use 8 logarithmically spaced bins: [<4KiB, 4-16KiB, 16-64KiB, 64-256KiB, 256KiB-1MiB, 1-4MiB, 4-16MiB, >16MiB]. These align with the LL→LL128→Simple protocol transition points observed in Fig. 6.
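
The binning is directly computable. A minimal sketch (the helper name and the convention of placing each boundary value in the higher bin are ours):

```python
# Upper edges of the first 7 bins, in KiB; anything >= 16 MiB
# falls in the final bin (index 7).
BIN_EDGES_KIB = [4, 16, 64, 256, 1024, 4096, 16384]

def msg_size_bin(msg_bytes: int) -> int:
    """Return bin index 0..7 for the message-size state feature."""
    kib = msg_bytes / 1024
    for i, edge in enumerate(BIN_EDGES_KIB):
        if kib < edge:
            return i
    return len(BIN_EDGES_KIB)
```

The 16-64 KiB bin (index 2) straddles the inter-node LL/Simple transition, so the agent can learn different protocols on either side of it.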

5.3 Intra-node vs. Inter-node Topology Flag

LL128 outperforms or closely matches Simple for intra-node NVLink at ALL message sizes (at worst ~5% slower at large sizes), but lags behind Simple for inter-node large messages (due to cumulative flag-sync cost at scale). DynamICCL's state must include a binary topology flag: is_intra_node (True if the collective spans only GPUs within a single node). This single bit flips the optimal protocol for large messages from LL128 (intra) to Simple (inter), a decision that cannot be learned without this feature in the state.

Concrete state feature: Add topology_class ∈ {intra_node, inter_node_ib, inter_node_roce} as a categorical state feature, one-hot encoded as 3 dimensions.
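
A minimal one-hot encoder for this feature, using the class names proposed above (the function name is ours):

```python
TOPOLOGY_CLASSES = ["intra_node", "inter_node_ib", "inter_node_roce"]

def encode_topology(topo: str) -> list:
    """One-hot encode the topology_class categorical feature
    into 3 dimensions."""
    if topo not in TOPOLOGY_CLASSES:
        raise ValueError(f"unknown topology class: {topo}")
    return [1.0 if t == topo else 0.0 for t in TOPOLOGY_CLASSES]
```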

5.4 Channel Count Lower Bound for Under-Filling Prevention

NCCL's calcP2pChunkSize heuristic reduces nChannels when per-channel chunk size drops below 512 KiB (the FIFO buffer size). This prevents sending partially-filled buffers that waste NIC throughput. DynamICCL's Config Agent should enforce a soft constraint: nChannels ≤ floor(message_size / 512KiB). For a 1 MiB AllReduce, the maximum useful nChannels is 2; for 4 MiB, it is 8. Actions violating this constraint should receive a negative reward penalty proportional to the under-fill ratio, teaching the agent to avoid excessive channel counts for small messages.

Reward shaping term: penalty = max(0, nChannels - floor(msg_size_bytes / 524288)) * λ_channel
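
The shaping term is directly implementable. A minimal sketch, where the default λ_channel value is an arbitrary placeholder to be tuned:

```python
FIFO_BYTES = 512 * 1024  # per-channel FIFO buffer size (512 KiB)

def channel_underfill_penalty(n_channels: int, msg_bytes: int,
                              lam: float = 1.0) -> float:
    """Penalize channel counts beyond floor(msg_size / 512 KiB),
    proportionally to the number of excess channels."""
    max_useful = msg_bytes // FIFO_BYTES
    return max(0, n_channels - max_useful) * lam
```

For a 1 MiB AllReduce, any choice above 2 channels accrues penalty; below 512 KiB, every channel beyond zero is penalized, pushing the agent toward minimal channel counts for small messages.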

5.5 Proxy Thread Architecture for Config Monitoring

NCCL's inter-node transport uses a CPU proxy thread that coordinates RDMA operations separately from the GPU compute stream. This proxy thread has visibility into buffer fullness, QP state, and transfer completion in real time. DynamICCL's Trigger Agent (LSTM+CUSUM) should be instrumented analogously — as a lightweight CPU-side monitor thread that continuously observes NCCL proxy thread metrics (buffer fill ratios, QP retry counts, completion queue depths) as congestion signals. These are more direct indicators of network stress than end-to-end latency alone.

Instrumentation target: Hook into NCCL's ncclIbSendComm structure (per the paper's description of sendComm handles) to extract QP retry counts and CTS round-trip times as real-time congestion features for the Trigger Agent.
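
The CUSUM half of the Trigger Agent can be sketched independently of any NCCL hooks. This is a generic one-sided CUSUM change detector; the metric it consumes (e.g., a per-interval QP retry count) and all thresholds are assumptions, and the LSTM component is omitted:

```python
class CusumDetector:
    """Minimal one-sided CUSUM over a congestion-metric stream.
    Flags a change once cumulative upward drift exceeds h."""

    def __init__(self, target: float, slack: float, h: float):
        self.target = target  # expected metric level under no congestion
        self.slack = slack    # tolerated drift before accumulating
        self.h = h            # decision threshold
        self.s = 0.0          # running CUSUM statistic

    def update(self, x: float) -> bool:
        # Accumulate only excess above target + slack; clamp at zero.
        self.s = max(0.0, self.s + (x - self.target - self.slack))
        return self.s > self.h
```

In practice one detector would run per monitored metric inside the CPU-side monitor thread, with alarms fed to the reconfiguration policy.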

5.6 Slot-Based Pipelining as a Config Sensitivity Model

NCCL's NCCL_STEPS=8 slot pipeline within each channel means that protocol overhead is amortized across 8 concurrent in-flight chunks. This implies that for very small messages (fewer than 8 chunks per channel), the pipeline is under-utilized and protocol overhead dominates. DynamICCL can use num_pipeline_slots_utilized = ceil(total_elements / (nChannels * chunkCount)) as a feature: if this value is less than 4 (less than half the pipeline full), the agent should prefer LL (lower per-slot overhead) over Simple (higher per-slot setup cost). This is a directly computable feature from the message count and current config.
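
The feature and the resulting preference rule are directly computable from the message count and current config. A minimal sketch (function names are ours; the half-full cutoff of 4 slots follows the text):

```python
import math

def pipeline_slots_utilized(total_elements: int, n_channels: int,
                            chunk_count: int) -> int:
    """Chunks in flight per channel, to compare against the
    NCCL_STEPS=8 pipeline depth."""
    return math.ceil(total_elements / (n_channels * chunk_count))

def prefer_ll(total_elements: int, n_channels: int,
              chunk_count: int) -> bool:
    """If the pipeline is less than half full (< 4 of 8 slots),
    prefer LL's lower per-slot overhead over Simple."""
    return pipeline_slots_utilized(total_elements, n_channels,
                                   chunk_count) < 4
```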

5.7 Double Binary Tree as Bandwidth Doubling Pattern

The paper explains that NCCL's Tree AllReduce uses a double binary tree — each GPU participates in two trees simultaneously, functioning as a leaf in one and an internal node in the other. This doubles effective bandwidth compared to a single tree. For DynamICCL, this is evidence that the Tree algorithm is not simply "latency-optimal but bandwidth-poor" — it achieves good bandwidth at small scales through this structural trick. The state representation should track num_ranks separately, because Tree performs well at small rank counts (4-8 GPUs) where ring latency is high but tree depth is shallow.

State feature: Add log2(num_ranks) as a continuous state feature so the Config Agent can learn that Tree is preferred at low rank counts (< 8) for small messages, while Ring dominates at high rank counts for all message sizes.
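
A minimal sketch combining the rank feature with the policy shape described above. The reference_algo cutoffs (8 ranks, 64 KiB) are illustrative assumptions standing in for what the Config Agent would learn, not values from the paper:

```python
import math

def rank_feature(num_ranks: int) -> float:
    """Continuous log2(num_ranks) state feature."""
    return math.log2(num_ranks)

def reference_algo(num_ranks: int, msg_bytes: int) -> str:
    """Illustrative target policy: Tree for small messages at low
    rank counts (shallow tree depth), Ring everywhere else."""
    if num_ranks < 8 and msg_bytes < 64 * 1024:
        return "tree"
    return "ring"
```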