Detailed Summary: Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
Citation: Zhiyi Hu*, Siyuan Shen*, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, Torsten Hoefler. ETH Zurich / NVIDIA Corporation / Broadcom Inc. arXiv:2507.04786v2, July 23, 2025. (*equal contribution)
Abstract (Paraphrased)
NCCL is the critical collective communication library enabling high-performance GPU-to-GPU communication at scale. Despite being open source with a documented API, its internal design — channel management, protocol selection, intra/inter-node data movement, and collective algorithms — is largely opaque. This paper provides a comprehensive, systematic analysis of NCCL v2.19.1, covering: (1) API structure and channel management, (2) the three communication protocols (Simple, LL, LL128), (3) data-transfer models and transport layers, and (4) all major collective algorithms at the NCCL primitive level. Insights from this analysis were used to build ATLAHS, a network simulation toolchain achieving <5% simulation error on large-scale multi-GPU AI workloads.
Motivation
NCCL is critical infrastructure for distributed AI training, yet its GitHub repository receives frequent questions about ring AllReduce performance discrepancies, broadcast topology logic, and RDMA write behavior — signs that even expert practitioners lack a clear mental model of its internals. Existing empirical studies compare NCCL against other libraries (MPI, Gloo, oneCCL) at a black-box level. This paper takes the first systematic approach to opening the black box: analyzing NCCL source code to explain why performance behaves as it does, not just what it does.
The practical motivation is the ATLAHS simulator: to accurately model NCCL communication patterns in large-scale training workloads, one needs to know exactly which primitives execute in which order, how pipeline stages overlap, and how channels map to CUDA blocks.
Section II: NCCL Overview
API Structure (Four Categories)
- Communicator Management: ncclCommInitAll (single process) and ncclCommInitRank (multi-process). Each GPU gets a unique rank 0 to n-1. Communicators are destroyed with ncclCommDestroy (graceful) or ncclCommAbort (immediate, for error recovery).
- Collective Operations: Five collectives: ncclAllReduce, ncclBroadcast, ncclReduce, ncclAllGather, ncclReduceScatter. The legacy ncclBcast (in-place) is deprecated.
- Point-to-Point: ncclSend and ncclRecv for directed transfers.
- Group Calls: ncclGroupStart / ncclGroupEnd bracket multiple operations to reduce launch overhead, enabling all-to-one, one-to-all, and all-to-all patterns as a single NCCL launch.
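A minimal usage sketch combining these APIs — a send/recv pair fused into a single launch via a group call (assumes an initialized communicator, stream, and device buffers; NCCL error codes are ignored for brevity):

```cpp
// Sketch: fusing a send/recv pair into one NCCL launch with a group call.
// Assumes comm, stream, and device buffers are already set up; error
// handling is elided.
#include <nccl.h>

void ringExchange(ncclComm_t comm, cudaStream_t stream,
                  const float* sendbuf, float* recvbuf,
                  size_t count, int rank, int nranks) {
  int next = (rank + 1) % nranks;
  int prev = (rank - 1 + nranks) % nranks;
  ncclGroupStart();   // bracket the ops so they launch as one NCCL operation
  ncclSend(sendbuf, count, ncclFloat, next, comm, stream);
  ncclRecv(recvbuf, count, ncclFloat, prev, comm, stream);
  ncclGroupEnd();     // ops are issued here, avoiding per-call launch overhead
}
```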
Launching Strategies (Three Models)
| Model | Description | Trade-off |
|---|---|---|
| One CPU process per GPU | Each GPU in its own process; NUMA-optimal data locality | Best control, highest overhead |
| One CPU thread per GPU | Single process, multiple threads; intra-process memory sharing | Good for intra-node memory sharing |
| One CPU thread for multiple GPUs | Single-threaded; manages all GPUs | Simplest; lowest concurrency |
Communication Channels
NCCL subdivides each collective into communication channels — independent CUDA blocks, each running on its own SM, operating on disjoint data chunks. Benefits:
- Saturates multiple NICs on NVLink platforms (each channel exits via a different NIC).
- Balances load across PCIe, NVLink, and InfiniBand links.
- Enables fine-grained parallelism for large payloads.
Trade-off: Too many channels reduce per-channel chunk size below the
512 KiB NIC FIFO buffer threshold, causing partially-filled packet
transmission and degrading PCIe/IB efficiency. NCCL heuristically
reduces nChannels for smaller messages (via
calcP2pChunkSize in enqueue.cc).
Channel count is set per communicator during initialization. Earlier
NCCL versions allowed NCCL_NTHREADS environment variable
override; in recent versions this setting is ignored or causes incorrect
behavior — channel management is fully internal.
Logical topologies per channel:
- Ring: Each GPU identifies its immediate predecessor and successor → unidirectional ring.
- Tree: Double binary tree (Sanders et al., 2007; Hoefler and Moor, 2014) — no rank is a non-leaf (interior) node in both trees; the second tree is a 1-position-shifted mirror of the first. Both trees are established at communicator init and reused across all collectives.
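For concreteness, an illustrative helper (not NCCL source) showing how each rank derives its ring neighbors; the double-tree construction is more involved and is described in the cited papers:

```cpp
// Illustrative only: ring neighbor computation per rank (hypothetical helper,
// not NCCL code). Each rank receives from 'prev' and forwards to 'next'.
struct RingNeighbors { int prev; int next; };

RingNeighbors ringNeighbors(int rank, int nranks) {
  return { (rank - 1 + nranks) % nranks,   // predecessor: data arrives from here
           (rank + 1) % nranks };          // successor: data is forwarded here
}
```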
Section III: Communication Protocols
NCCL implements three protocols to trade off bandwidth vs. latency:
Protocol Comparison (Table I)
| Property | Simple | LL | LL128 |
|---|---|---|---|
| Design goal | High bandwidth | Low latency | Low latency + high bandwidth |
| Sync mechanism | Memory fences (high overhead) | Flag-based (8-byte = 4B data + 4B flag) | Flag-based (128-byte = 120B data + 8B flag) |
| Bandwidth utilization | Near peak | ~25-50% of peak | ~95% of peak |
| Latency per hop | ~6 us | ~1 us | ~2 us |
| Buffer location | GPU memory | Host memory (CPU polling) | GPU memory |
| RDMA compatible | Yes | No (CPU polling prevents RDMA) | Yes (with atomic 128-byte write support) |
Simple Protocol
Divides data into large chunks dispatched across channels. Uses memory fences to enforce correct ordering — a receiver must wait for the full chunk before accessing it. Memory fences are low-frequency relative to data volume for large messages but dominate for small ones. Achieves near-peak bandwidth but high per-message latency.
LL (Low Latency) Protocol
Uses flag-based synchronization: each 8-byte unit = 4 bytes of data + 4 bytes of flag, sent atomically. The receiver checks the flag to know the data is valid, without waiting for a full chunk. Forces intermediate buffers into host memory so the CPU proxy thread can poll flags directly (polling GPU memory via PCIe is slow). This sacrifices RDMA compatibility and limits bandwidth to 25-50% of peak due to the 50% overhead from embedding flags inline with data.
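A conceptual sketch of the LL unit layout and flag handshake, using simplified types and names (this mirrors the idea, not NCCL's device code):

```cpp
// Conceptual sketch of the LL protocol's 8-byte unit: 4 B payload + 4 B flag,
// written as one atomic 64-bit store so the flag never races ahead of the data.
// Hypothetical names; not NCCL's actual implementation.
#include <cstdint>
#include <atomic>

union LLUnit {
  uint64_t whole;                 // written/read as a single 64-bit word
  struct { uint32_t data; uint32_t flag; } parts;
};

// Sender: publish 4 bytes of data tagged with the current step's flag value.
void llSend(std::atomic<uint64_t>* slot, uint32_t data, uint32_t flag) {
  LLUnit u; u.parts.data = data; u.parts.flag = flag;
  slot->store(u.whole, std::memory_order_relaxed);  // atomic 8-byte write
}

// Receiver: spin until the flag matches; the data is then known valid --
// no memory fence or full-chunk wait required.
uint32_t llRecv(std::atomic<uint64_t>* slot, uint32_t flag) {
  LLUnit u;
  do { u.whole = slot->load(std::memory_order_relaxed); } while (u.parts.flag != flag);
  return u.parts.data;
}
```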
LL128 Protocol
An improved LL variant: 128-byte units = 120 bytes data + 8 bytes flag. Reduces the flag-to-data ratio from 50% to 6.25%, achieving ~95% bandwidth utilization. Uses flag-based synchronization like LL (no memory fences), so it retains low latency. Requires atomic 128-byte writes — the interconnect must not split or reorder these writes. On PCIe-limited systems where this guarantee cannot be met, NCCL disables LL128 to avoid data corruption. This makes LL128 hardware-dependent; it excels on NVLink but is more restricted on PCIe/RoCE environments.
Protocol Selection Logic
NCCL dynamically selects the protocol at runtime based on: user
setting (NCCL_PROTO), collective algorithm, system
topology, GPU architecture, message size, and available
protocol-specific buffer resources. Default behavior: LL/LL128 for small
messages (to minimize latency), Simple for large messages (to maximize
throughput).
Section IV: Data-Transfer Methods and Transport Layer
Intra-Node Transports (Table II)
Intra-Node Transport Hierarchy:
+------------------------------------------+
| Priority 1: P2P (p2p.cc) |
| - NVLink: GPUDirect P2P (preferred) |
| - PCIe: GPUDirect P2P (fallback) |
| - P2P_DIRECT mode: bypasses IPC |
| handles; uses direct GPU pointers; |
| eliminates intermediate FIFO buffer |
| copy via directSend/directRecv |
+------------------------------------------+
| Priority 2: SHM (shm.cc) |
| - Routes via system memory |
| - PCIe-to-memory, memory-to-PCIe |
| - Avoids cross-socket P2P degradation |
+------------------------------------------+
| Priority 3: NVLS (nvls.cc) |
| - NVLink Switch-based |
| - For NVLS/NVLS Tree algorithms only |
+------------------------------------------+
P2P_DIRECT optimization: When communicating ranks
are in the same process, NCCL uses direct GPU memory pointers (no IPC
handles), eliminating an intermediate FIFO buffer copy. Synchronization
via atomic head and tail counters in
ncclSendMem and ncclRecvMem.
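A minimal sketch of this head/tail-counter synchronization pattern over a fixed set of slots; the names and memory orderings here are assumptions in the spirit of ncclSendMem/ncclRecvMem, not NCCL's actual layout:

```cpp
// Minimal sketch of producer/consumer sync via monotonically increasing head
// and tail counters over a fixed number of buffer slots (assumed layout, in
// the spirit of ncclSendMem/ncclRecvMem; not NCCL's actual code).
#include <cstdint>
#include <atomic>

constexpr int kSteps = 8;                       // cf. NCCL_STEPS slots

struct Fifo {
  std::atomic<uint64_t> head{0};                // slots produced so far
  std::atomic<uint64_t> tail{0};                // slots consumed so far
};

// Producer: if a slot is free, fill it, then publish by bumping head.
bool tryProduce(Fifo& f) {
  uint64_t h = f.head.load(std::memory_order_relaxed);
  if (h - f.tail.load(std::memory_order_acquire) >= kSteps) return false; // full
  // ... write payload into slot (h % kSteps) ...
  f.head.store(h + 1, std::memory_order_release);
  return true;
}

// Consumer: if a slot is published, drain it, then bump tail.
bool tryConsume(Fifo& f) {
  uint64_t t = f.tail.load(std::memory_order_relaxed);
  if (f.head.load(std::memory_order_acquire) == t) return false;          // empty
  // ... read payload from slot (t % kSteps) ...
  f.tail.store(t + 1, std::memory_order_release);
  return true;
}
```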
SHM transport: Used when P2P over PCIe degrades (e.g., cross-socket paths where the CPU handles P2P traffic poorly). Routes data through a shared memory segment — one GPU's process writes, the other reads.
Inter-Node Transports
Inter-Node Transport Selection:
+------------------------------------------+
| Socket (net_socket.cc) |
| - No RDMA hardware |
| - GPU -> host pinned memory -> NIC |
| - NIC -> host buffer -> GPU |
| - Rendezvous handshake before transfer |
+------------------------------------------+
| IB Verbs (net_ib.cc) |
| - InfiniBand or RoCE |
| - Two RC QPs per peer per direction |
| (forward QP: bulk data via RDMA_WRITE|
| reverse QP: clear-to-send (CTS)) |
| - GDRDMA: if NIC and GPU on same PCIe |
| switch, intermediate buffer in GPU |
| memory; NIC DMA directly to GPU |
| - Without GDRDMA: buffer in host |
| memory; GPU proxy manages copy |
+------------------------------------------+
| CollNet (coll_net.cc) |
| - NVIDIA SHARP: network-assisted |
| - In-network reduction |
+------------------------------------------+
Key IB optimizations:
- GDRDMA: When NIC and GPU share a PCIe switch, the NIC DMA-writes directly to GPU memory, bypassing host memory entirely.
- Per-peer multi-channel connections: The IB transport instantiates two logical channels per remote GPU per NIC (controlled by NCHANNELS_PER_NET_PEER). The host proxy alternates between two sendComm handles, splitting traffic across QP sets for ECMP-aware path diversity.
- QP layout: The forward QP carries bulk data via RDMA_WRITE and RDMA_WRITE_WITH_IMM. The reverse QP carries a tiny clear-to-send (CTS) message (a single RDMA_WRITE), isolating latency-critical control traffic from bandwidth-hungry data.
- Local flush with loop-back RDMA_READ: After all PCIe writes complete, NCCL issues a dummy RDMA_READ on a "flush" QP connected to itself, ensuring all outstanding PCIe writes have reached GPU memory before the kernel consumes the data.
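For reference, a hedged sketch of posting the forward QP's bulk transfer with standard libibverbs calls (queue pair and memory registrations assumed set up elsewhere; this is not NCCL's net_ib.cc code):

```cpp
// Sketch: posting a bulk-data RDMA write with immediate on the forward QP.
// Standard libibverbs API; qp, laddr/lkey, raddr/rkey come from setup code
// not shown here.
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

int postRdmaWriteImm(struct ibv_qp* qp, void* laddr, uint32_t lkey,
                     uint64_t raddr, uint32_t rkey, uint32_t len,
                     uint32_t imm) {
  struct ibv_sge sge;
  std::memset(&sge, 0, sizeof(sge));
  sge.addr   = (uintptr_t)laddr;            // local (possibly GPU) buffer
  sge.length = len;
  sge.lkey   = lkey;

  struct ibv_send_wr wr, *bad = nullptr;
  std::memset(&wr, 0, sizeof(wr));
  wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;  // data + completion tag
  wr.send_flags          = IBV_SEND_SIGNALED;
  wr.imm_data            = htonl(imm);      // receiver learns step/size from imm
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.wr.rdma.remote_addr = raddr;           // peer's registered buffer
  wr.wr.rdma.rkey        = rkey;
  return ibv_post_send(qp, &wr, &bad);      // 0 on success
}
```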
Section V: NCCL Collective Algorithms
Algorithm × Protocol Support (Table III, NCCL v2.19)
| Algorithm | AllReduce protocols | AllGather | ReduceScatter | Broadcast | Reduce |
|---|---|---|---|---|---|
| Ring | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 |
| Tree | Simple, LL, LL128 | x | x | x | x |
| CollNet Direct | Simple only | x | x | x | x |
| CollNet Chain | Simple only | x | x | x | x |
| NVLS | Simple only | Simple only | Simple only | x | x |
| NVLS Tree | Simple only | x | x | x | x |
Note: Tree algorithm supports only AllReduce. CollNet and NVLS require specialized hardware (SHARP, NVSwitch).
Communication Primitives (Building Blocks)
NCCL decomposes all collectives from these low-level device-side operations:
- send: transmit data to the next GPU
- recv: receive data from the previous GPU
- recvReduceSend: recv + reduce with local data + send forward
- recvCopySend: recv + copy to output buffer + send forward
- recvReduceCopySend: recv + reduce + copy to output + send forward

"Direct" variants (directSend, directRecv) bypass the intermediate FIFO buffer when P2P_DIRECT is active.
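A simplified, scalar view of what a fused primitive such as recvReduceSend does per chunk (NCCL's device code is warp-parallel, vectorized, and pipelined; this only conveys the dataflow):

```cpp
// Conceptual, simplified view of recvReduceSend on one chunk: wait for the
// inbound chunk, reduce it elementwise with local data, publish the result
// downstream. Not NCCL's actual device code.
#include <cstddef>

void recvReduceSend(const float* inbound,   // chunk received from previous GPU
                    const float* local,     // this rank's contribution
                    float* outbound,        // staging buffer for the next GPU
                    size_t n) {
  // (transport marks 'inbound' valid before this runs)
  for (size_t i = 0; i < n; ++i)
    outbound[i] = inbound[i] + local[i];    // sum reduction
  // (transport then marks 'outbound' ready for the next hop)
}
```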
CUDA Execution Hierarchy for Collectives
CUDA Grid: (nChannels, 1, 1)
-> Each block = one communication channel
-> blockIdx.x maps to channel via channelMask (bitmask population count)
Within each block:
Warp 0: loads communicator metadata (ncclDevComm) into shared memory
Warp 1: loads channel-specific data (ncclDevChannel)
Remaining warps: actual communication + computation
Thread count per block: NCCL_MIN_NTHREADS to NCCL_MAX_NTHREADS
(set by autotuner, stored in plan->threadPerBlock)
Within each channel's buffer:
NCCL_STEPS slots (typically 8)
Each slot = one ncclConnFifo entry (mode, offset, size, data pointer)
Slots independently advance through pipeline stages:
[being computed] -> [queued for TX] -> [in-flight] -> [ready for RX]
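An illustrative mapping of blockIdx.x to a channel id via a bitmask population count, echoing the channelMask mechanism above (assumed helper, not NCCL code):

```cpp
// Illustrative: the blockIdx.x-th set bit of the channel bitmask selects the
// channel this CUDA block services (mirrors the channelMask idea; not NCCL
// source).
#include <cstdint>

int blockToChannel(uint64_t channelMask, int blockIdxX) {
  int seen = 0;
  for (int ch = 0; ch < 64; ++ch) {
    if (channelMask & (1ull << ch)) {
      if (seen == blockIdxX) return ch;   // found the blockIdx.x-th set bit
      ++seen;
    }
  }
  return -1;                              // more blocks than set bits: invalid
}
```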
Buffer Sizes per Protocol (Table IV)
| Protocol | Total Channel Buffer | Capacity per Slot | Effective Data per Slot |
|---|---|---|---|
| Simple | 4 MiB | 512 KiB | 512 KiB |
| LL | 256 KiB | 32 KiB | 16 KiB (50% flag overhead) |
| LL128 | ~4800 KiB | 600 KiB | 562.5 KiB (93.75% data) |
Data Partitioning (Fig. 3)
NCCL partitions input data across channels, each channel handling a
contiguous segment of size count / nChannels. Within each
channel, data is processed in outer loop iterations, each covering
loopCount elements, subdivided into
chunkCount-element chunks per elementary step. This
multi-level chunking enables pipelining of compute and communication
across pipeline slots.
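The partitioning arithmetic, as a small sketch with assumed names (ceiling divisions throughout; NCCL's planner handles remainders and alignment more carefully):

```cpp
// Sketch of the multi-level partitioning described above (assumed names;
// NCCL's host-side planner is more involved).
#include <cstddef>

struct Partition {
  size_t perChannel;    // contiguous elements owned by one channel
  size_t outerLoops;    // loop iterations needed per channel
  size_t stepsPerLoop;  // elementary chunk steps per loop iteration
};

Partition planPartition(size_t count, size_t nChannels,
                        size_t loopCount, size_t chunkCount) {
  Partition p;
  p.perChannel   = (count + nChannels - 1) / nChannels;
  p.outerLoops   = (p.perChannel + loopCount - 1) / loopCount;
  p.stepsPerLoop = (loopCount + chunkCount - 1) / chunkCount;
  return p;
}
```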
Ring AllReduce (2k-1 steps per loop, k = GPU count)
Two phases: ReduceScatter followed by AllGather.
Steps in one loop iteration (Table V):

| Step Index | NCCL Primitive |
|---|---|
| 0 | send |
| 1 to k-2 | recvReduceSend |
| k-1 | recvReduceCopySend |
| k to 2k-3 | recvCopySend |
| 2k-2 | recv |
- Step 0: GPU i sends its local data segment to GPU (i+1)%k.
- Steps 1 to k-2: Each GPU receives a partially reduced block, reduces it with its local data, forwards.
- Step k-1: Final ReduceScatter step — receives, reduces, copies to output buffer, sends.
- Steps k to 2k-3: AllGather phase — each GPU receives a fully reduced segment, copies to correct output position, forwards.
- Step 2k-2: Final recv — all GPUs have complete result.
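The step-to-primitive schedule of Table V, restated as an illustrative lookup (k = GPU count):

```cpp
// Sketch: selecting the primitive for each step of one Ring AllReduce loop
// iteration, mirroring Table V. Illustrative only; not NCCL's scheduling code.
enum class Prim { Send, RecvReduceSend, RecvReduceCopySend, RecvCopySend, Recv };

Prim ringAllReduceStep(int step, int k) {
  if (step == 0)         return Prim::Send;                // inject local chunk
  if (step <= k - 2)     return Prim::RecvReduceSend;      // reduce-scatter body
  if (step == k - 1)     return Prim::RecvReduceCopySend;  // last reduce, first gather
  if (step <= 2 * k - 3) return Prim::RecvCopySend;        // all-gather body
  return Prim::Recv;                                       // step 2k-2: final recv
}
```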
Ring AllGather (k steps per loop, Table VI)
- Step 0: Each GPU i prepares its local block (in-place, or via copySend for separate buffers) and sends it to its right neighbor.
- Steps 1 to k-2: Each GPU receives a block, stores it in the correct output segment position, and forwards it (recvCopySend).
- Step k-1: recv — collects the final missing block.
Ring ReduceScatter (k steps per loop, Table VII)
- Step 0: Each GPU i sends one local data block to GPU (i+1)%k.
- Steps 1 to k-2: recvReduceSend — receives a partially reduced block, combines it with the local block, forwards.
- Step k-1: recvReduceCopy — receives the final partial result, applies the final reduction, stores to the output buffer.
Tree AllReduce (pipelined, Table VIII)
Two interleaved phases executed concurrently by partitioning SMs asymmetrically:
- Reduce phase (upward): Leaf GPUs send → middle GPUs recvReduceSend → root GPU recvReduceCopySend.
- Broadcast phase (downward): Root recvCopySend → middle GPUs recvCopySend → leaf GPUs recv.
- NCCL assigns more threads/warps to the Reduce phase (bandwidth-intensive) than to the Broadcast phase.
Ring Broadcast (pipelined, Table IX)
Directed chain from root along ring topology:
- Root GPU: send (in-place) or copySend (separate buffers).
- Middle GPUs: recvCopySend (receive, copy to output, forward).
- Last GPU: recv.
Ring Reduce (pipelined, Table X)
Directed chain converging to root:
- Initiator GPU: send.
- Middle GPUs: recvReduceSend.
- Root GPU: recvReduceCopy.
Pipelined vs. Non-Pipelined Classification
- Non-pipelined: Ring AllReduce, Ring AllGather, Ring ReduceScatter — each GPU must complete all k tasks in one loop iteration before starting the next ring pass.
- Pipelined: Tree AllReduce, Ring Broadcast, Ring Reduce — GPU roles are asymmetric; independent loop iterations can overlap across the directed chain.
Section VI: Benchmarking Results
Experimental setup: Alps supercomputer at CSCS (Swiss National Supercomputing Center), 16 nodes with NVIDIA GH200 (Grace Hopper Superchips). Intra-node: 150 GB/s NVLink high-bandwidth interconnect. Inter-node: 25 GB/s per direction Cray Slingshot (RoCE). 20 runs per data point with warmup.
AllReduce Results (Fig. 6)
Inter-node
- Small messages (<64 KiB): LL and LL128 outperform Simple for both Ring and Tree.
- Large messages (>64 KiB): Simple dominates; LL
and LL128 degrade sharply.
- Cause: The cumulative synchronization overhead of flag-based sync (one sync event per 8 or 128 bytes) becomes significant at gigabyte scale — many small sync operations per large transfer.
- LL128 can lag behind LL at large inter-node sizes due to its larger synchronization units being more sensitive to network stalls under contention.
- Algorithm: Ring excels for large messages; Tree performs better for small messages.
Intra-node (NVLink)
- LL128 dominates: Consistent high performance across all message sizes (near-peak bandwidth, only ~5% below Simple at large messages).
- LL: Best for very small messages (lowest latency).
- Simple: Best for very large messages; high latency for small ones.
Three Key Takeaways
- LL and LL128 are best for small messages (especially inter-node); Simple consistently wins for large distributed transfers.
- Intra-node vs. inter-node matters greatly: different transports (NVLink vs. RoCE) change protocol performance rankings.
- Relying on NCCL's autotuning generally beats manual protocol overrides; manual tuning is useful only for targeted scenarios.
Other Collectives (Fig. 7)
Broadcast, Reduce, ReduceScatter, AllGather all follow the same protocol-performance trends as AllReduce.
Section VII: Integration into ATLAHS
ATLAHS is a GPU-cluster network simulator built using the insights from this paper. Key capabilities enabled by the NCCL analysis:
- Accurately decomposes collective communication into fine-grained send, recv, recvReduceSend, etc. primitives with correct timing.
- Models pipelined vs. non-pipelined concurrency correctly.
- Handles collective channel partitioning and CUDA stream mapping.
- Achieves <5% simulation error on large-scale multi-GPU workloads.
- Outperforms AstraSim 2.0 in runtime prediction accuracy for large-model LLM training.
Limitations
- NCCL version specificity: Analysis is for v2.19.1. Internal mechanisms may change across releases (though core concepts are expected to remain stable).
- CollNet and NVLS excluded: Both require hardware-specific features (SHARP-enabled switches, NVSwitch) making them less generalizable; their implementations are not analyzed.
- PAT algorithm not covered: Parallel Aggregated Trees (introduced in v2.23) is acknowledged but not analyzed.
- Qualitative algorithm analysis only: No quantitative complexity bounds are derived due to the many interacting variables (GPU count, topology, node count, message size, protocol) that make closed-form models impractical.
- Tuner plugin API not addressed: The paper does not discuss NCCL's external tuner plugin interface or how third-party tuners (such as the one DynamICCL would implement) interact with NCCL's internal algorithm/protocol selection.
Related Work
- Lee and Lee (2024): Empirical comparison of NCCL, MPI, Gloo in distributed deep learning — black-box comparison; NCCL best for intra-node GPU-to-GPU.
- Weingram et al. (2023): Survey of collective communication library ecosystem (NCCL, RCCL, oneCCL, Gloo, Blink).
- Blink: Dynamically builds multiple trees using network topology awareness; outperforms NCCL ring and tree for large heterogeneous clusters.
- SCCL/MSCCL: Synthesizes optimal custom collective algorithms for specific hardware (ref. 28 = Cai et al., SCCL; paper 0015 in this reading list).
- AstraSim 2.0: Existing collective communication simulator; outperformed by ATLAHS in runtime prediction accuracy.
RL Formulation Table
This paper does not use reinforcement learning. No RL formulation table applies.
Relevance to DynamICCL
Extremely high relevance — the most directly applicable paper in this reading list to DynamICCL's design.
1. Action Space Ground Truth
DynamICCL's Config Agent selects among algorithms (Ring, Tree, CollNet Direct, CollNet Chain, NVLS, NVLS Tree, PAT), protocols (LL, LL128, Simple), nChannels (1-8), and numThreads. Table III of this paper defines the exact valid (algorithm, protocol, collective) combinations — essential for pruning invalid actions from the Config Agent's action space.
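A hypothetical DynamICCL-side encoding of Table III as an action-validity mask (enum names and structure are assumptions, not from the paper):

```cpp
// Hypothetical DynamICCL helper: prune (algorithm, protocol, collective)
// triples that Table III marks as unsupported. Not part of NCCL or the
// paper's artifacts.
enum class Alg { Ring, Tree, CollNetDirect, CollNetChain, NVLS, NVLSTree };
enum class Proto { Simple, LL, LL128 };
enum class Coll { AllReduce, AllGather, ReduceScatter, Broadcast, Reduce };

bool isValidAction(Alg a, Proto p, Coll c) {
  switch (a) {
    case Alg::Ring:          return true;                    // all protos, all collectives
    case Alg::Tree:          return c == Coll::AllReduce;    // any proto, AllReduce only
    case Alg::CollNetDirect:
    case Alg::CollNetChain:  return c == Coll::AllReduce && p == Proto::Simple;
    case Alg::NVLS:          return p == Proto::Simple &&
                                    (c == Coll::AllReduce || c == Coll::AllGather ||
                                     c == Coll::ReduceScatter);
    case Alg::NVLSTree:      return c == Coll::AllReduce && p == Proto::Simple;
  }
  return false;
}
```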
2. Protocol Selection Decision Boundaries
The benchmarks establish concrete decision rules that DynamICCL's RL policy must learn:
- Small messages (< ~64 KiB): prefer LL or LL128.
- Large messages (> ~64 KiB, inter-node): prefer Simple.
- NVLink-dominant (intra-node): LL128 is the best general-purpose choice.
- Under network congestion (which DynamICCL specifically targets): Simple's larger chunks may be more robust to jitter than LL128's fine-grained synchronization; this is an open question DynamICCL could empirically answer.
3. nChannels Tuning Rationale
The paper explains that increasing nChannels reduces per-channel chunk size, potentially underutilizing PCIe/IB queue pairs when chunk size drops below the 512 KiB FIFO buffer threshold. DynamICCL's nChannels action dimension (1-8) directly controls this trade-off. The optimal nChannels is message-size dependent — a critical feature for the Config Agent's state space.
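A back-of-the-envelope illustration of this threshold effect, assuming a message is split evenly across channels (real NCCL chunking is more involved):

```cpp
// Illustrative arithmetic only: per-channel chunk size vs. the 512 KiB FIFO
// threshold, for a hypothetical 2 MiB message split evenly across channels.
#include <cstdio>

int main() {
  const double kThresholdKiB = 512.0;
  const double msgKiB = 2048.0;                  // example: a 2 MiB message
  for (int nChannels = 1; nChannels <= 8; ++nChannels) {
    double chunkKiB = msgKiB / nChannels;
    std::printf("nChannels=%d -> %7.1f KiB per channel%s\n", nChannels, chunkKiB,
                chunkKiB < kThresholdKiB ? "  (below 512 KiB: NIC underfill)" : "");
  }
  return 0;
}
```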
4. numThreads Context
The paper establishes that thread count per block (controlled by the
autotuner, stored in plan->threadPerBlock) affects
warp-level parallelism within each channel. DynamICCL's numThreads
action dimension corresponds to this parameter. The paper confirms that
different algorithms allocate warps differently (NVLS subdivides warps
by phase; P2P operations split by concurrent transfer count), explaining
why numThreads interacts with algorithm choice.
5. State Space Design Guidance
The paper's finding that LL128 performs very differently in intra-node (NVLink) vs. inter-node (RoCE) settings implies DynamICCL's state space must distinguish intra-node vs. inter-node communication context — not just message size and congestion level. Agent-2 should observe whether the collective spans nodes or is confined to a single node.
6. Congestion Impact on Protocols
The paper notes that LL128 can lag behind LL under heavy inter-node contention at large message sizes, because stalls affect larger 128-byte data units. This congestion-protocol interaction is precisely what DynamICCL's Agent-1 (Trigger Agent) detects and Agent-2 must respond to. Under high congestion, DynamICCL should consider switching to LL (smaller units, faster retransmission) or Simple (fewer synchronization events per byte) depending on message size.
7. Simulation Environment
ATLAHS achieves <5% simulation error on NCCL workloads. DynamICCL could use ATLAHS as a high-fidelity offline training environment for the Config Agent, avoiding the cost and measurement noise of live cluster interactions during initial RL training.
8. Channel-to-CUDA-Block Mapping
The 1:1 mapping of communication channels to CUDA blocks (via
channelMask) means that nChannels directly determines the
number of active SMs used by NCCL. This has implications for DynamICCL:
increasing nChannels during congestion may actually compete with the
training workload for SM resources, introducing a GPU compute vs.
communication parallelism trade-off that DynamICCL's reward function
should account for.