Detailed Summary: Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

Citation: Zhiyi Hu*, Siyuan Shen*, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, Torsten Hoefler. ETH Zurich / NVIDIA Corporation / Broadcom Inc. arXiv:2507.04786v2, July 23, 2025. (*equal contribution)


Abstract (Paraphrased)

NCCL is the critical collective communication library enabling high-performance GPU-to-GPU communication at scale. Despite being open source with a documented API, its internal design — channel management, protocol selection, intra/inter-node data movement, and collective algorithms — is largely opaque. This paper provides a comprehensive, systematic analysis of NCCL v2.19.1, covering: (1) API structure and channel management, (2) the three communication protocols (Simple, LL, LL128), (3) data-transfer models and transport layers, and (4) all major collective algorithms at the NCCL primitive level. Insights from this analysis were used to build ATLAHS, a network simulation toolchain achieving <5% simulation error on large-scale multi-GPU AI workloads.


Motivation

NCCL is critical infrastructure for distributed AI training, yet its GitHub repository receives frequent questions about ring AllReduce performance discrepancies, broadcast topology logic, and RDMA write behavior — signs that even expert practitioners lack a clear mental model of its internals. Existing empirical studies compare NCCL against other libraries (MPI, Gloo, oneCCL) at a black-box level. This paper takes the first systematic approach to opening the black box: analyzing NCCL source code to explain why performance behaves as it does, not just what it does.

The practical motivation is the ATLAHS simulator: to accurately model NCCL communication patterns in large-scale training workloads, one needs to know exactly which primitives execute in which order, how pipeline stages overlap, and how channels map to CUDA blocks.


Section II: NCCL Overview

API Structure (Four Categories)

  1. Communicator Management: ncclCommInitAll (single process), ncclCommInitRank (multi-process). Each GPU gets a unique rank 0 to n-1. Communicators are destroyed with ncclCommDestroy (graceful) or ncclCommAbort (immediate, for error recovery).

  2. Collective Operations: Five collectives: ncclAllReduce, ncclBroadcast, ncclReduce, ncclAllGather, ncclReduceScatter. The legacy ncclBcast (in-place) is deprecated.

  3. Point-to-Point: ncclSend and ncclRecv for directed transfers.

  4. Group Calls: ncclGroupStart / ncclGroupEnd bracket multiple operations to reduce launch overhead, enabling all-to-one, one-to-all, and all-to-all patterns as a single NCCL launch.
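A minimal sketch of an all-to-all expressed as one group call, using the real ncclGroupStart/ncclGroupEnd, ncclSend, and ncclRecv APIs; communicator setup, buffer allocation, and error checking are elided:

```cuda
#include <nccl.h>

// comm: this rank's communicator; stream: its CUDA stream;
// sendbuf/recvbuf: nRanks slots of `count` floats each.
void allToAll(ncclComm_t comm, cudaStream_t stream,
              const float* sendbuf, float* recvbuf,
              size_t count, int nRanks) {
  ncclGroupStart();  // fuse all sends/recvs into a single NCCL launch
  for (int peer = 0; peer < nRanks; peer++) {
    ncclSend(sendbuf + peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();    // operations are issued and may proceed concurrently
}
```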

Launching Strategies (Three Models)

| Model | Description | Trade-off |
|-------|-------------|-----------|
| One CPU process per GPU | Each GPU in its own process; NUMA-optimal data locality | Best control, highest overhead |
| One CPU thread per GPU | Single process, multiple threads; intra-process memory sharing | Good for intra-node memory sharing |
| One CPU thread for multiple GPUs | Single thread manages all GPUs | Simplest; lowest concurrency |
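A minimal sketch of the third model (one thread driving all local GPUs) using the real ncclCommInitAll API; error checking and the actual collectives are elided, and at most 16 local GPUs are assumed:

```cuda
#include <nccl.h>

int main() {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);

  ncclComm_t comms[16];
  int devs[16];
  for (int i = 0; i < nDev; i++) devs[i] = i;

  ncclCommInitAll(comms, nDev, devs);   // one communicator (rank) per GPU
  // ... issue collectives on per-device streams here ...
  for (int i = 0; i < nDev; i++) ncclCommDestroy(comms[i]);
  return 0;
}
```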

Communication Channels

NCCL subdivides each collective into communication channels: independent CUDA blocks, each running on its own SM and operating on disjoint data chunks. This parallelism lets a collective exploit multiple SMs and helps saturate interconnect bandwidth on large transfers.

Trade-off: Too many channels reduce per-channel chunk size below the 512 KiB NIC FIFO buffer threshold, causing partially-filled packet transmission and degrading PCIe/IB efficiency. NCCL heuristically reduces nChannels for smaller messages (via calcP2pChunkSize in enqueue.cc).

Channel count is set per communicator during initialization. Earlier NCCL versions honored an NCCL_NTHREADS environment-variable override; in recent versions it is ignored or can cause incorrect behavior, so channel management is effectively fully internal.

Logical topologies per channel: each channel maintains its own ring and tree orderings over the participating ranks, and the algorithm selected for a collective determines which topology the channel traverses.


Section III: Communication Protocols

NCCL implements three protocols to trade off bandwidth vs. latency:

Protocol Comparison (Table I)

| Property | Simple | LL | LL128 |
|----------|--------|----|-------|
| Design goal | High bandwidth | Low latency | Low latency + high bandwidth |
| Sync mechanism | Memory fences (high overhead) | Flag-based (8 B unit = 4 B data + 4 B flag) | Flag-based (128 B unit = 120 B data + 8 B flag) |
| Bandwidth utilization | Near peak | ~25-50% of peak | ~95% of peak |
| Latency per hop | ~6 µs | ~1 µs | ~2 µs |
| Buffer location | GPU memory | Host memory (CPU polling) | GPU memory |
| RDMA compatible | Yes | No (CPU polling prevents RDMA) | Yes (with atomic 128-byte write support) |

Simple Protocol

Divides data into large chunks dispatched across channels and uses memory fences to enforce correct ordering: a receiver must wait for a full chunk before accessing it. Fence cost is amortized over large messages but dominates small ones, so Simple achieves near-peak bandwidth at the cost of high per-message latency.

LL (Low Latency) Protocol

Uses flag-based synchronization: each 8-byte unit = 4 bytes of data + 4 bytes of flag, sent atomically. The receiver checks the flag to know the data is valid, without waiting for a full chunk. Forces intermediate buffers into host memory so the CPU proxy thread can poll flags directly (polling GPU memory via PCIe is slow). This sacrifices RDMA compatibility and limits bandwidth to 25-50% of peak due to the 50% overhead from embedding flags inline with data.
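An illustrative sketch of LL-style flag-based synchronization: 4 B of data and a 4 B flag packed into one atomic aligned 8-byte store (function names are hypothetical; this is not NCCL's device code):

```cuda
#include <cstdint>

__device__ void llPost(volatile uint64_t* slot, uint32_t data, uint32_t flag) {
  // One aligned 8-byte store delivers data and flag together, so a reader
  // can never observe the data without its matching flag.
  *slot = ((uint64_t)flag << 32) | data;
}

__device__ uint32_t llWait(volatile uint64_t* slot, uint32_t expectedFlag) {
  uint64_t line;
  do {
    line = *slot;                        // poll the packed line
  } while ((uint32_t)(line >> 32) != expectedFlag);
  return (uint32_t)line;                 // lower 4 B: the valid payload
}
```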

LL128 Protocol

An improved LL variant: 128-byte units = 120 bytes data + 8 bytes flag. Reduces the flag-to-data overhead from 50% to 6.25%, achieving ~95% bandwidth utilization. Uses flag-based synchronization like LL (no memory fences), so it retains low latency. Requires atomic 128-byte writes: the interconnect must not split or reorder these writes. On PCIe-limited systems where this guarantee cannot be met, NCCL disables LL128 to avoid data corruption. This makes LL128 hardware-dependent; it excels on NVLink but is more restricted in PCIe- and RoCE-based environments.

Protocol Selection Logic

NCCL dynamically selects the protocol at runtime based on: user setting (NCCL_PROTO), collective algorithm, system topology, GPU architecture, message size, and available protocol-specific buffer resources. Default behavior: LL/LL128 for small messages (to minimize latency), Simple for large messages (to maximize throughput).
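A deliberately simplified stand-in for this default behavior (thresholds and names are assumptions for illustration; the real tuner also weighs algorithm, topology, GPU architecture, and buffer availability):

```cuda
#include <cstddef>

enum Proto { PROTO_LL, PROTO_LL128, PROTO_SIMPLE };

Proto pickProto(std::size_t bytes, bool ll128Supported) {
  const std::size_t kSmall  = 64 * 1024;   // assumed latency-bound cutoff
  const std::size_t kMedium = 1 << 20;     // assumed mixed-regime cutoff
  if (bytes <= kSmall) return PROTO_LL;               // minimize latency
  if (bytes <= kMedium && ll128Supported) return PROTO_LL128;
  return PROTO_SIMPLE;                                // maximize throughput
}
```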


Section IV: Data-Transfer Methods and Transport Layer

Intra-Node Transports (Table II)

Intra-Node Transport Hierarchy:
+------------------------------------------+
| Priority 1: P2P (p2p.cc)                |
|   - NVLink: GPUDirect P2P (preferred)   |
|   - PCIe: GPUDirect P2P (fallback)      |
|   - P2P_DIRECT mode: bypasses IPC       |
|     handles; uses direct GPU pointers;  |
|     eliminates intermediate FIFO buffer |
|     copy via directSend/directRecv      |
+------------------------------------------+
| Priority 2: SHM (shm.cc)                |
|   - Routes via system memory            |
|   - PCIe-to-memory, memory-to-PCIe      |
|   - Avoids cross-socket P2P degradation |
+------------------------------------------+
| Priority 3: NVLS (nvls.cc)              |
|   - NVLink Switch-based                 |
|   - For NVLS/NVLS Tree algorithms only  |
+------------------------------------------+

P2P_DIRECT optimization: When communicating ranks are in the same process, NCCL uses direct GPU memory pointers (no IPC handles), eliminating an intermediate FIFO buffer copy. Synchronization via atomic head and tail counters in ncclSendMem and ncclRecvMem.
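A minimal sketch of head/tail slot synchronization over a channel FIFO, in the spirit of the counters described above (struct and function names are hypothetical; NCCL's real counters live in ncclSendMem/ncclRecvMem):

```cuda
#include <cstdint>

constexpr int kSteps = 8;      // NCCL_STEPS: pipeline slots per channel

struct Fifo {
  volatile uint64_t head;      // advanced by the consumer after draining a slot
  volatile uint64_t tail;      // advanced by the producer after filling a slot
};

__device__ void producerAcquire(Fifo* f) {
  while (f->tail - f->head >= kSteps) { /* spin: all slots in flight */ }
}

__device__ void producerPublish(Fifo* f) {
  __threadfence_system();      // make slot contents visible before the counter
  f->tail = f->tail + 1;
}

__device__ void consumerWait(Fifo* f, uint64_t consumed) {
  while (f->tail <= consumed) { /* spin until a new slot is filled */ }
}

__device__ void consumerRelease(Fifo* f) {
  f->head = f->head + 1;       // hand the slot back to the producer
}
```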

SHM transport: Used when P2P over PCIe would degrade (e.g., cross-socket paths where the CPU handles P2P packets poorly). Routes traffic through a shared host-memory segment: one GPU's process writes, the other reads.

Inter-Node Transports

Inter-Node Transport Selection:
+------------------------------------------+
| Socket (net_socket.cc)                  |
|   - No RDMA hardware                    |
|   - GPU -> host pinned memory -> NIC    |
|   - NIC -> host buffer -> GPU           |
|   - Rendezvous handshake before transfer |
+------------------------------------------+
| IB Verbs (net_ib.cc)                    |
|   - InfiniBand or RoCE                  |
|   - Two RC QPs per peer per direction   |
|     (forward QP: bulk data via RDMA_WRITE|
|      reverse QP: clear-to-send (CTS))  |
|   - GDRDMA: if NIC and GPU on same PCIe |
|     switch, intermediate buffer in GPU  |
|     memory; NIC DMA directly to GPU    |
|   - Without GDRDMA: buffer in host     |
|     memory; GPU proxy manages copy     |
+------------------------------------------+
| CollNet (coll_net.cc)                   |
|   - NVIDIA SHARP: network-assisted      |
|   - In-network reduction               |
+------------------------------------------+

Key IB optimizations: GDRDMA for zero-copy NIC-to-GPU transfers when the NIC and GPU share a PCIe switch, and separate forward/reverse QPs so bulk RDMA_WRITE traffic is decoupled from clear-to-send (CTS) flow control.


Section V: NCCL Collective Algorithms

Algorithm × Protocol Support (Table III, NCCL v2.19)

| Algorithm | AllReduce | AllGather | ReduceScatter | Broadcast | Reduce |
|-----------|-----------|-----------|---------------|-----------|--------|
| Ring | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 |
| Tree | Simple, LL, LL128 | x | x | x | x |
| CollNet Direct | Simple only | x | x | x | x |
| CollNet Chain | Simple only | x | x | x | x |
| NVLS | Simple only | Simple only | Simple only | x | x |
| NVLS Tree | Simple only | x | x | x | x |

Note: Tree algorithm supports only AllReduce. CollNet and NVLS require specialized hardware (SHARP, NVSwitch).

Communication Primitives (Building Blocks)

NCCL builds all collectives from a small set of low-level device-side primitives:

  - send / recv: move a chunk into / out of a channel's FIFO buffer
  - recvCopySend: receive a chunk, store it locally, and forward it to the next rank
  - recvReduceSend: receive a chunk, reduce it with local data, and forward the partial result
  - recvReduceCopy: receive, reduce, and store the final result without forwarding
  - recvReduceCopySend: receive, reduce, store the final result, and forward it

"Direct" variants (directSend, directRecv) bypass the intermediate FIFO buffer when P2P_DIRECT is active.

CUDA Execution Hierarchy for Collectives

CUDA Grid: (nChannels, 1, 1)
  -> Each block = one communication channel
  -> blockIdx.x maps to channel via channelMask (bitmask population count)

Within each block:
  Warp 0: loads communicator metadata (ncclDevComm) into shared memory
  Warp 1: loads channel-specific data (ncclDevChannel)
  Remaining warps: actual communication + computation

Thread count per block: NCCL_MIN_NTHREADS to NCCL_MAX_NTHREADS
  (set by autotuner, stored in plan->threadPerBlock)

Within each channel's buffer:
  NCCL_STEPS slots (typically 8)
  Each slot = one ncclConnFifo entry (mode, offset, size, data pointer)
  Slots independently advance through pipeline stages:
    [being computed] -> [queued for TX] -> [in-flight] -> [ready for RX]
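A sketch of the blockIdx.x-to-channel mapping via the channelMask population count described above (illustrative; not NCCL's exact device code):

```cuda
#include <cstdint>

__device__ int channelFromBlock(uint64_t channelMask) {
  // The grid launches one block per set bit: gridDim.x == __popcll(channelMask).
  // Block i is assigned the position of the i-th set bit as its channel id.
  int remaining = blockIdx.x;
  for (int ch = 0; ch < 64; ch++) {
    if (channelMask & (1ull << ch)) {
      if (remaining == 0) return ch;
      remaining--;
    }
  }
  return -1;  // unreachable for a well-formed mask
}
```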

Buffer Sizes per Protocol (Table IV)

| Protocol | Total Channel Buffer | Capacity per Slot | Effective Data per Slot |
|----------|----------------------|-------------------|-------------------------|
| Simple | 4 MiB | 512 KiB | 512 KiB |
| LL | 256 KiB | 32 KiB | 16 KiB (50% flag overhead) |
| LL128 | ~4800 KiB | 600 KiB | 562.5 KiB (93.75% data, i.e. 120/128) |

Data Partitioning (Fig. 3)

NCCL partitions input data across channels, each channel handling a contiguous segment of size count / nChannels. Within each channel, data is processed in outer loop iterations, each covering loopCount elements, subdivided into chunkCount-element chunks per elementary step. This multi-level chunking enables pipelining of compute and communication across pipeline slots.
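A worked sketch of this partitioning (names are illustrative, not NCCL's enqueue code; the outer loopCount level is folded away, and each chunk corresponds to one elementary step occupying one pipeline slot):

```cuda
#include <cstddef>
#include <cstdio>

void partition(size_t count, int nChannels, size_t chunkCount) {
  size_t perChannel = count / nChannels;   // contiguous segment per channel
  for (int c = 0; c < nChannels; c++) {
    size_t base = (size_t)c * perChannel;
    for (size_t off = 0; off < perChannel; off += chunkCount) {
      size_t n = (perChannel - off < chunkCount) ? perChannel - off : chunkCount;
      printf("channel %d handles elements [%zu, %zu)\n",
             c, base + off, base + off + n);
    }
  }
}
```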

Ring AllReduce (2k-1 steps per loop, k = GPU count)

Two phases: ReduceScatter followed by AllGather.

Steps in one loop iteration (Table V):

| Step Index | NCCL Primitive |
|------------|----------------|
| 0 | send |
| 1 to k-2 | recvReduceSend |
| k-1 | recvReduceCopySend |
| k to 2k-3 | recvCopySend |
| 2k-2 | recv |
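For concreteness, a small helper that enumerates this schedule for arbitrary k (an illustrative sketch, not NCCL code); for k = 4 it prints send, recvReduceSend ×2, recvReduceCopySend, recvCopySend ×2, recv:

```cuda
#include <cstdio>

void ringAllReduceSchedule(int k) {
  for (int s = 0; s <= 2 * k - 2; s++) {
    const char* prim = (s == 0)         ? "send"
                     : (s <= k - 2)     ? "recvReduceSend"
                     : (s == k - 1)     ? "recvReduceCopySend"
                     : (s <= 2 * k - 3) ? "recvCopySend"
                                        : "recv";
    printf("step %2d: %s\n", s, prim);
  }
}
```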

Ring AllGather (k steps per loop, Table VI)

Step 0: send (each rank's own chunk); steps 1 to k-2: recvCopySend; step k-1: recv.

Ring ReduceScatter (k steps per loop, Table VII)

Step 0: send; steps 1 to k-2: recvReduceSend; step k-1: recvReduceCopy (reduce into the final output).

Tree AllReduce (pipelined, Table VIII)

Two interleaved phases executed concurrently by partitioning SMs asymmetrically: a reduce phase that flows partial sums from the leaves up to the root, and a broadcast phase that streams the final result from the root back down. Chunks from both phases are in flight simultaneously, so the tree stays busy once the pipeline fills.

Ring Broadcast (pipelined, Table IX)

Directed chain from root along the ring topology: the root issues send, intermediate ranks issue recvCopySend, and the last rank issues recv, with successive chunks streamed down the chain.

Ring Reduce (pipelined, Table X)

Directed chain converging to the root: the rank farthest from the root issues send, intermediate ranks issue recvReduceSend, and the root issues recvReduceCopy to produce the final result.

Pipelined vs. Non-Pipelined Classification

Ring AllReduce, AllGather, and ReduceScatter are non-pipelined: every rank is active in every step, and a loop iteration completes on one chunk before the next begins. Tree AllReduce, Ring Broadcast, and Ring Reduce are pipelined: successive chunks stream through the chain or tree concurrently, hiding per-hop latency once the pipeline fills.


Section VI: Benchmarking Results

Experimental setup: Alps supercomputer at CSCS (Swiss National Supercomputing Centre), 16 nodes with NVIDIA GH200 Grace Hopper Superchips. Intra-node: NVLink at 150 GB/s. Inter-node: Cray Slingshot (RoCE) at 25 GB/s per direction. 20 runs per data point, with warmup.

AllReduce Results (Fig. 6)

Inter-node

Three Key Takeaways

  1. LL and LL128 are best for small messages (especially inter-node); Simple consistently wins for large distributed transfers.
  2. Intra-node vs. inter-node matters greatly: different transports (NVLink vs. RoCE) change protocol performance rankings.
  3. Relying on NCCL's autotuning generally beats manual protocol overrides; manual tuning pays off only in targeted scenarios.

Other Collectives (Fig. 7)

Broadcast, Reduce, ReduceScatter, AllGather all follow the same protocol-performance trends as AllReduce.


Section VII: Integration into ATLAHS

ATLAHS is a GPU-cluster network simulator built using the insights from this paper. The NCCL analysis lets it reproduce primitive-level execution order, pipeline-stage overlap, and channel-to-CUDA-block mappings of real workloads, which is what enables its <5% simulation error on large-scale multi-GPU traces.


Limitations

  1. NCCL version specificity: Analysis is for v2.19.1. Internal mechanisms may change across releases (though core concepts are expected to remain stable).
  2. CollNet and NVLS excluded: Both require hardware-specific features (SHARP-enabled switches, NVSwitch) making them less generalizable; their implementations are not analyzed.
  3. PAT algorithm not covered: Parallel Aggregated Trees (introduced in v2.23) is acknowledged but not analyzed.
  4. Qualitative algorithm analysis only: No quantitative complexity bounds are derived due to the many interacting variables (GPU count, topology, node count, message size, protocol) that make closed-form models impractical.
  5. Tuner plugin API not addressed: The paper does not discuss NCCL's external tuner plugin interface or how third-party tuners (such as the one DynamICCL would implement) interact with NCCL's internal algorithm/protocol selection.


RL Formulation Table

This paper does not use reinforcement learning. No RL formulation table applies.


Relevance to DynamICCL

Extremely high relevance — the most directly applicable paper in this reading list to DynamICCL's design.

1. Action Space Ground Truth

DynamICCL's Config Agent selects among algorithms (Ring, Tree, CollNet Direct, CollNet Chain, NVLS, NVLS Tree, PAT), protocols (LL, LL128, Simple), nChannels (1-8), and numThreads. Table III of this paper defines the exact valid (algorithm, protocol, collective) combinations — essential for pruning invalid actions from the Config Agent's action space.

2. Protocol Selection Decision Boundaries

The benchmarks establish concrete decision rules that DynamICCL's RL policy must learn: prefer LL/LL128 below a message-size threshold and Simple above it, with the threshold and the LL-vs-LL128 ranking depending on whether the transfer is intra-node (NVLink) or inter-node (RoCE).

3. nChannels Tuning Rationale

The paper explains that increasing nChannels reduces per-channel chunk size, potentially underutilizing PCIe/IB queue pairs when chunk size drops below the 512 KiB FIFO buffer threshold. DynamICCL's nChannels action dimension (1-8) directly controls this trade-off. The optimal nChannels is message-size dependent — a critical feature for the Config Agent's state space.
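A back-of-the-envelope check of this trade-off (an illustrative sketch; the 512 KiB figure is the Simple-protocol slot size from Table IV):

```cuda
#include <cstdio>

void chunkPerChannel(size_t messageBytes) {
  const size_t kSlot = 512 * 1024;        // NIC FIFO slot threshold
  for (int nCh = 1; nCh <= 8; nCh++) {    // DynamICCL's nChannels range
    size_t perChannel = messageBytes / nCh;
    printf("nChannels=%d -> %zu KiB per channel%s\n", nCh, perChannel >> 10,
           perChannel < kSlot ? " (below 512 KiB; NIC underfilled)" : "");
  }
}
```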

4. numThreads Context

The paper establishes that thread count per block (controlled by the autotuner, stored in plan->threadPerBlock) affects warp-level parallelism within each channel. DynamICCL's numThreads action dimension corresponds to this parameter. The paper confirms that different algorithms allocate warps differently (NVLS subdivides warps by phase; P2P operations split by concurrent transfer count), explaining why numThreads interacts with algorithm choice.

5. State Space Design Guidance

The paper's finding that LL128 performs very differently in intra-node (NVLink) vs. inter-node (RoCE) settings implies DynamICCL's state space must distinguish intra-node vs. inter-node communication context — not just message size and congestion level. Agent-2 should observe whether the collective spans nodes or is confined to a single node.

6. Congestion Impact on Protocols

The paper notes that LL128 can lag behind LL under heavy inter-node contention at large message sizes, because stalls affect larger 128-byte data units. This congestion-protocol interaction is precisely what DynamICCL's Agent-1 (Trigger Agent) detects and Agent-2 must respond to. Under high congestion, DynamICCL should consider switching to LL (smaller units, faster retransmission) or Simple (fewer synchronization events per byte) depending on message size.

7. Simulation Environment

ATLAHS achieves <5% simulation error on NCCL workloads. DynamICCL could use ATLAHS as a high-fidelity offline training environment for the Config Agent, avoiding the cost and measurement noise of live cluster interactions during initial RL training.

8. Channel-to-CUDA-Block Mapping

The 1:1 mapping of communication channels to CUDA blocks (via channelMask) means that nChannels directly determines the number of active SMs used by NCCL. This has implications for DynamICCL: increasing nChannels during congestion may actually compete with the training workload for SM resources, introducing a GPU compute vs. communication parallelism trade-off that DynamICCL's reward function should account for.