Detailed Summary: Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
Citation: Zhiyi Hu*, Siyuan Shen*, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, Torsten Hoefler. ETH Zurich / NVIDIA Corporation / Broadcom Inc. arXiv:2507.04786v2, July 23, 2025. (*equal contribution)
Abstract (Paraphrased)
NCCL is the critical collective communication library enabling high-performance GPU-to-GPU communication at scale. Despite being open source with a documented API, its internal design — channel management, protocol selection, intra/inter-node data movement, and collective algorithms — is largely opaque. This paper provides a comprehensive, systematic analysis of NCCL v2.19.1, covering: (1) API structure and channel management, (2) the three communication protocols (Simple, LL, LL128), (3) data-transfer models and transport layers, and (4) all major collective algorithms at the NCCL primitive level. Insights from this analysis were used to build ATLAHS, a network simulation toolchain achieving <5% simulation error on large-scale multi-GPU AI workloads.
Motivation
NCCL is critical infrastructure for distributed AI training, yet its GitHub repository receives frequent questions about ring AllReduce performance discrepancies, broadcast topology logic, and RDMA write behavior — signs that even expert practitioners lack a clear mental model of its internals. Existing empirical studies compare NCCL against other libraries (MPI, Gloo, oneCCL) at a black-box level. This paper takes the first systematic approach to opening the black box: analyzing NCCL source code to explain why performance behaves as it does, not just what it does.
The practical motivation is the ATLAHS simulator: to accurately model NCCL communication patterns in large-scale training workloads, one needs to know exactly which primitives execute in which order, how pipeline stages overlap, and how channels map to CUDA blocks.
Section II: NCCL Overview
API Structure (Four Categories)
- Communicator Management: ncclCommInitAll (single process) and ncclCommInitRank (multi-process). Each GPU gets a unique rank 0 to n-1. Communicators are destroyed with ncclCommDestroy (graceful) or ncclCommAbort (immediate, for error recovery).
- Collective Operations: Five collectives: ncclAllReduce, ncclBroadcast, ncclReduce, ncclAllGather, ncclReduceScatter. The legacy ncclBcast (in-place) is deprecated.
- Point-to-Point: ncclSend and ncclRecv for directed transfers.
- Group Calls: ncclGroupStart / ncclGroupEnd bracket multiple operations to reduce launch overhead, enabling all-to-one, one-to-all, and all-to-all patterns as a single NCCL launch.
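A minimal usage sketch combining these APIs — a send/recv pair fused into a single launch via a group call (assumes an initialized communicator, stream, and device buffers; NCCL error codes are ignored for brevity):

```cpp
// Sketch: fusing a send/recv pair into one NCCL launch with a group call.
// Assumes comm, stream, and device buffers are already set up; error
// handling is elided.
#include <nccl.h>

void ringExchange(ncclComm_t comm, cudaStream_t stream,
                  const float* sendbuf, float* recvbuf,
                  size_t count, int rank, int nranks) {
  int next = (rank + 1) % nranks;
  int prev = (rank - 1 + nranks) % nranks;
  ncclGroupStart();   // bracket the ops so they launch as one NCCL operation
  ncclSend(sendbuf, count, ncclFloat, next, comm, stream);
  ncclRecv(recvbuf, count, ncclFloat, prev, comm, stream);
  ncclGroupEnd();     // ops are issued here, avoiding per-call launch overhead
}
```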
Launching Strategies (Three Models)
| Model | Description | Trade-off |
|---|---|---|
| One CPU process per GPU | Each GPU in its own process; NUMA-optimal data locality | Best control, highest overhead |
| One CPU thread per GPU | Single process, multiple threads; intra-process memory sharing | Good for intra-node memory sharing |
| One CPU thread for multiple GPUs | Single-threaded; manages all GPUs | Simplest; lowest concurrency |
Communication Channels
NCCL subdivides each collective into communication channels — independent CUDA blocks, each running on its own SM, operating on disjoint data chunks. Benefits:
- Saturates multiple NICs on NVLink platforms (each channel exits via a different NIC).
- Balances load across PCIe, NVLink, and InfiniBand links.
- Enables fine-grained parallelism for large payloads.
Trade-off: Too many channels reduce per-channel chunk size below the
512 KiB NIC FIFO buffer threshold, causing partially-filled packet
transmission and degrading PCIe/IB efficiency. NCCL heuristically
reduces nChannels for smaller messages (via
calcP2pChunkSize in enqueue.cc).
Channel count is set per communicator during initialization. Earlier
NCCL versions allowed NCCL_NTHREADS environment variable
override; in recent versions this setting is ignored or causes incorrect
behavior — channel management is fully internal.
Logical topologies per channel:
- Ring: Each GPU identifies its immediate predecessor and successor → unidirectional ring.
- Tree: Double binary tree (Sanders et al., 2007; Hoefler and Moor, 2014) — no rank is a non-leaf (interior) node in both trees; the second tree is a 1-position-shifted mirror of the first. Both trees are established at communicator init and reused across all collectives.
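For concreteness, an illustrative helper (not NCCL source) showing how each rank derives its ring neighbors; the double-tree construction is more involved and is described in the cited papers:

```cpp
// Illustrative only: ring neighbor computation per rank (hypothetical helper,
// not NCCL code). Each rank receives from 'prev' and forwards to 'next'.
struct RingNeighbors { int prev; int next; };

RingNeighbors ringNeighbors(int rank, int nranks) {
  return { (rank - 1 + nranks) % nranks,   // predecessor: data arrives from here
           (rank + 1) % nranks };          // successor: data is forwarded here
}
```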
Section III: Communication Protocols
NCCL implements three protocols to trade off bandwidth vs. latency:
Protocol Comparison (Table I)
| Property | Simple | LL | LL128 |
|---|---|---|---|
| Design goal | High bandwidth | Low latency | Low latency + high bandwidth |
| Sync mechanism | Memory fences (high overhead) | Flag-based (8-byte = 4B data + 4B flag) | Flag-based (128-byte = 120B data + 8B flag) |
| Bandwidth utilization | Near peak | ~25-50% of peak | ~95% of peak |
| Latency per hop | ~6 us | ~1 us | ~2 us |
| Buffer location | GPU memory | Host memory (CPU polling) | GPU memory |
| RDMA compatible | Yes | No (CPU polling prevents RDMA) | Yes (with atomic 128-byte write support) |
Simple Protocol
Divides data into large chunks dispatched across channels. Uses memory fences to enforce correct ordering — a receiver must wait for the full chunk before accessing it. Memory fences are low-frequency relative to data volume for large messages but dominate for small ones. Achieves near-peak bandwidth but high per-message latency.
LL (Low Latency) Protocol
Uses flag-based synchronization: each 8-byte unit = 4 bytes of data + 4 bytes of flag, sent atomically. The receiver checks the flag to know the data is valid, without waiting for a full chunk. Forces intermediate buffers into host memory so the CPU proxy thread can poll flags directly (polling GPU memory via PCIe is slow). This sacrifices RDMA compatibility and limits bandwidth to 25-50% of peak due to the 50% overhead from embedding flags inline with data.
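A conceptual sketch of the LL unit layout and flag handshake, using simplified types and names (this mirrors the idea, not NCCL's device code):

```cpp
// Conceptual sketch of the LL protocol's 8-byte unit: 4 B payload + 4 B flag,
// written as one atomic 64-bit store so the flag never races ahead of the data.
// Hypothetical names; not NCCL's actual implementation.
#include <cstdint>
#include <atomic>

union LLUnit {
  uint64_t whole;                 // written/read as a single 64-bit word
  struct { uint32_t data; uint32_t flag; } parts;
};

// Sender: publish 4 bytes of data tagged with the current step's flag value.
void llSend(std::atomic<uint64_t>* slot, uint32_t data, uint32_t flag) {
  LLUnit u; u.parts.data = data; u.parts.flag = flag;
  slot->store(u.whole, std::memory_order_relaxed);  // atomic 8-byte write
}

// Receiver: spin until the flag matches; the data is then known valid --
// no memory fence or full-chunk wait required.
uint32_t llRecv(std::atomic<uint64_t>* slot, uint32_t flag) {
  LLUnit u;
  do { u.whole = slot->load(std::memory_order_relaxed); } while (u.parts.flag != flag);
  return u.parts.data;
}
```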
LL128 Protocol
An improved LL variant: 128-byte units = 120 bytes data + 8 bytes flag. Reduces the flag-to-data ratio from 50% to 6.25%, achieving ~95% bandwidth utilization. Uses flag-based synchronization like LL (no memory fences), so it retains low latency. Requires atomic 128-byte writes — the interconnect must not split or reorder these writes. On PCIe-limited systems where this guarantee cannot be met, NCCL disables LL128 to avoid data corruption. This makes LL128 hardware-dependent; it excels on NVLink but is more restricted on PCIe/RoCE environments.
Protocol Selection Logic
NCCL dynamically selects the protocol at runtime based on: user
setting (NCCL_PROTO), collective algorithm, system
topology, GPU architecture, message size, and available
protocol-specific buffer resources. Default behavior: LL/LL128 for small
messages (to minimize latency), Simple for large messages (to maximize
throughput).
Section IV: Data-Transfer Methods and Transport Layer
Intra-Node Transports (Table II)
Intra-Node Transport Hierarchy:
+------------------------------------------+
| Priority 1: P2P (p2p.cc) |
| - NVLink: GPUDirect P2P (preferred) |
| - PCIe: GPUDirect P2P (fallback) |
| - P2P_DIRECT mode: bypasses IPC |
| handles; uses direct GPU pointers; |
| eliminates intermediate FIFO buffer |
| copy via directSend/directRecv |
+------------------------------------------+
| Priority 2: SHM (shm.cc) |
| - Routes via system memory |
| - PCIe-to-memory, memory-to-PCIe |
| - Avoids cross-socket P2P degradation |
+------------------------------------------+
| Priority 3: NVLS (nvls.cc) |
| - NVLink Switch-based |
| - For NVLS/NVLS Tree algorithms only |
+------------------------------------------+
P2P_DIRECT optimization: When communicating ranks
are in the same process, NCCL uses direct GPU memory pointers (no IPC
handles), eliminating an intermediate FIFO buffer copy. Synchronization
via atomic head and tail counters in
ncclSendMem and ncclRecvMem.
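A minimal sketch of this head/tail-counter synchronization pattern over a fixed set of slots; the names and memory orderings here are assumptions in the spirit of ncclSendMem/ncclRecvMem, not NCCL's actual layout:

```cpp
// Minimal sketch of producer/consumer sync via monotonically increasing head
// and tail counters over a fixed number of buffer slots (assumed layout, in
// the spirit of ncclSendMem/ncclRecvMem; not NCCL's actual code).
#include <cstdint>
#include <atomic>

constexpr int kSteps = 8;                       // cf. NCCL_STEPS slots

struct Fifo {
  std::atomic<uint64_t> head{0};                // slots produced so far
  std::atomic<uint64_t> tail{0};                // slots consumed so far
};

// Producer: if a slot is free, fill it, then publish by bumping head.
bool tryProduce(Fifo& f) {
  uint64_t h = f.head.load(std::memory_order_relaxed);
  if (h - f.tail.load(std::memory_order_acquire) >= kSteps) return false; // full
  // ... write payload into slot (h % kSteps) ...
  f.head.store(h + 1, std::memory_order_release);
  return true;
}

// Consumer: if a slot is published, drain it, then bump tail.
bool tryConsume(Fifo& f) {
  uint64_t t = f.tail.load(std::memory_order_relaxed);
  if (f.head.load(std::memory_order_acquire) == t) return false;          // empty
  // ... read payload from slot (t % kSteps) ...
  f.tail.store(t + 1, std::memory_order_release);
  return true;
}
```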
SHM transport: Used when P2P over PCIe degrades (e.g., cross-socket paths where the CPU handles P2P traffic poorly). Routes data through a shared memory segment — one GPU's process writes, the other reads.
Inter-Node Transports
Inter-Node Transport Selection:
+------------------------------------------+
| Socket (net_socket.cc) |
| - No RDMA hardware |
| - GPU -> host pinned memory -> NIC |
| - NIC -> host buffer -> GPU |
| - Rendezvous handshake before transfer |
+------------------------------------------+
| IB Verbs (net_ib.cc) |
| - InfiniBand or RoCE |
| - Two RC QPs per peer per direction |
| (forward QP: bulk data via RDMA_WRITE|
| reverse QP: clear-to-send (CTS)) |
| - GDRDMA: if NIC and GPU on same PCIe |
| switch, intermediate buffer in GPU |
| memory; NIC DMA directly to GPU |
| - Without GDRDMA: buffer in host |
| memory; GPU proxy manages copy |
+------------------------------------------+
| CollNet (coll_net.cc) |
| - NVIDIA SHARP: network-assisted |
| - In-network reduction |
+------------------------------------------+
Key IB optimizations:
- GDRDMA: When NIC and GPU share a PCIe switch, the NIC DMA-writes directly to GPU memory, bypassing host memory entirely.
- Per-peer multi-channel connections: The IB transport instantiates two logical channels per remote GPU per NIC (controlled by NCHANNELS_PER_NET_PEER). The host proxy alternates between two sendComm handles, splitting traffic across QP sets for ECMP-aware path diversity.
- QP layout: The forward QP carries bulk data via RDMA_WRITE and RDMA_WRITE_WITH_IMM. The reverse QP carries a tiny clear-to-send (CTS) message (a single RDMA_WRITE), isolating latency-critical control traffic from bandwidth-hungry data.
- Local flush with loop-back RDMA_READ: After all PCIe writes complete, NCCL issues a dummy RDMA_READ on a "flush" QP connected to itself, ensuring all outstanding PCIe writes have reached GPU memory before the kernel consumes the data.
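For reference, a hedged sketch of posting the forward QP's bulk transfer with standard libibverbs calls (queue pair and memory registrations assumed set up elsewhere; this is not NCCL's net_ib.cc code):

```cpp
// Sketch: posting a bulk-data RDMA write with immediate on the forward QP.
// Standard libibverbs API; qp, laddr/lkey, raddr/rkey come from setup code
// not shown here.
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

int postRdmaWriteImm(struct ibv_qp* qp, void* laddr, uint32_t lkey,
                     uint64_t raddr, uint32_t rkey, uint32_t len,
                     uint32_t imm) {
  struct ibv_sge sge;
  std::memset(&sge, 0, sizeof(sge));
  sge.addr   = (uintptr_t)laddr;            // local (possibly GPU) buffer
  sge.length = len;
  sge.lkey   = lkey;

  struct ibv_send_wr wr, *bad = nullptr;
  std::memset(&wr, 0, sizeof(wr));
  wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;  // data + completion tag
  wr.send_flags          = IBV_SEND_SIGNALED;
  wr.imm_data            = htonl(imm);      // receiver learns step/size from imm
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.wr.rdma.remote_addr = raddr;           // peer's registered buffer
  wr.wr.rdma.rkey        = rkey;
  return ibv_post_send(qp, &wr, &bad);      // 0 on success
}
```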
Section V: NCCL Collective Algorithms
Algorithm × Protocol Support (Table III, NCCL v2.19)
| Algorithm | AllReduce protocols | AllGather | ReduceScatter | Broadcast | Reduce |
|---|---|---|---|---|---|
| Ring | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 | Simple, LL, LL128 |
| Tree | Simple, LL, LL128 | x | x | x | x |
| CollNet Direct | Simple only | x | x | x | x |
| CollNet Chain | Simple only | x | x | x | x |
| NVLS | Simple only | Simple only | Simple only | x | x |
| NVLS Tree | Simple only | x | x | x | x |
Note: Tree algorithm supports only AllReduce. CollNet and NVLS require specialized hardware (SHARP, NVSwitch).
Communication Primitives (Building Blocks)
NCCL decomposes all collectives from these low-level device-side operations:
- send: transmit data to the next GPU
- recv: receive data from the previous GPU
- recvReduceSend: recv + reduce with local data + send forward
- recvCopySend: recv + copy to output buffer + send forward
- recvReduceCopySend: recv + reduce + copy to output + send forward

"Direct" variants (directSend, directRecv) bypass the intermediate FIFO buffer when P2P_DIRECT is active.
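A simplified, scalar view of what a fused primitive such as recvReduceSend does per chunk (NCCL's device code is warp-parallel, vectorized, and pipelined; this only conveys the dataflow):

```cpp
// Conceptual, simplified view of recvReduceSend on one chunk: wait for the
// inbound chunk, reduce it elementwise with local data, publish the result
// downstream. Not NCCL's actual device code.
#include <cstddef>

void recvReduceSend(const float* inbound,   // chunk received from previous GPU
                    const float* local,     // this rank's contribution
                    float* outbound,        // staging buffer for the next GPU
                    size_t n) {
  // (transport marks 'inbound' valid before this runs)
  for (size_t i = 0; i < n; ++i)
    outbound[i] = inbound[i] + local[i];    // sum reduction
  // (transport then marks 'outbound' ready for the next hop)
}
```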
CUDA Execution Hierarchy for Collectives
CUDA Grid: (nChannels, 1, 1)
-> Each block = one communication channel
-> blockIdx.x maps to channel via channelMask (bitmask population count)
Within each block:
Warp 0: loads communicator metadata (ncclDevComm) into shared memory
Warp 1: loads channel-specific data (ncclDevChannel)
Remaining warps: actual communication + computation
Thread count per block: NCCL_MIN_NTHREADS to NCCL_MAX_NTHREADS
(set by autotuner, stored in plan->threadPerBlock)
Within each channel's buffer:
NCCL_STEPS slots (typically 8)
Each slot = one ncclConnFifo entry (mode, offset, size, data pointer)
Slots independently advance through pipeline stages:
[being computed] -> [queued for TX] -> [in-flight] -> [ready for RX]
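An illustrative mapping of blockIdx.x to a channel id via a bitmask population count, echoing the channelMask mechanism above (assumed helper, not NCCL code):

```cpp
// Illustrative: the blockIdx.x-th set bit of the channel bitmask selects the
// channel this CUDA block services (mirrors the channelMask idea; not NCCL
// source).
#include <cstdint>

int blockToChannel(uint64_t channelMask, int blockIdxX) {
  int seen = 0;
  for (int ch = 0; ch < 64; ++ch) {
    if (channelMask & (1ull << ch)) {
      if (seen == blockIdxX) return ch;   // found the blockIdx.x-th set bit
      ++seen;
    }
  }
  return -1;                              // more blocks than set bits: invalid
}
```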
Buffer Sizes per Protocol (Table IV)
| Protocol | Total Channel Buffer | Capacity per Slot | Effective Data per Slot |
|---|---|---|---|
| Simple | 4 MiB | 512 KiB | 512 KiB |
| LL | 256 KiB | 32 KiB | 16 KiB (50% flag overhead) |
| LL128 | ~4800 KiB | 600 KiB | 562.5 KiB (93.75% data) |
Data Partitioning (Fig. 3)
NCCL partitions input data across channels, each channel handling a
contiguous segment of size count / nChannels. Within each
channel, data is processed in outer loop iterations, each covering
loopCount elements, subdivided into
chunkCount-element chunks per elementary step. This
multi-level chunking enables pipelining of compute and communication
across pipeline slots.
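The partitioning arithmetic, as a small sketch with assumed names (ceiling divisions throughout; NCCL's planner handles remainders and alignment more carefully):

```cpp
// Sketch of the multi-level partitioning described above (assumed names;
// NCCL's host-side planner is more involved).
#include <cstddef>

struct Partition {
  size_t perChannel;    // contiguous elements owned by one channel
  size_t outerLoops;    // loop iterations needed per channel
  size_t stepsPerLoop;  // elementary chunk steps per loop iteration
};

Partition planPartition(size_t count, size_t nChannels,
                        size_t loopCount, size_t chunkCount) {
  Partition p;
  p.perChannel   = (count + nChannels - 1) / nChannels;
  p.outerLoops   = (p.perChannel + loopCount - 1) / loopCount;
  p.stepsPerLoop = (loopCount + chunkCount - 1) / chunkCount;
  return p;
}
```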
Ring AllReduce (2k-1 steps per loop, k = GPU count)
Two phases: ReduceScatter followed by AllGather.
Steps in one loop iteration (Table V):

| Step Index | NCCL Primitive |
|---|---|
| 0 | send |
| 1 to k-2 | recvReduceSend |
| k-1 | recvReduceCopySend |
| k to 2k-3 | recvCopySend |
| 2k-2 | recv |
- Step 0: GPU i sends its local data segment to GPU (i+1)%k.
- Steps 1 to k-2: Each GPU receives a partially reduced block, reduces it with its local data, forwards.
- Step k-1: Final ReduceScatter step — receives, reduces, copies to output buffer, sends.
- Steps k to 2k-3: AllGather phase — each GPU receives a fully reduced segment, copies to correct output position, forwards.
- Step 2k-2: Final recv — all GPUs have complete result.
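The step-to-primitive schedule of Table V, restated as an illustrative lookup (k = GPU count):

```cpp
// Sketch: selecting the primitive for each step of one Ring AllReduce loop
// iteration, mirroring Table V. Illustrative only; not NCCL's scheduling code.
enum class Prim { Send, RecvReduceSend, RecvReduceCopySend, RecvCopySend, Recv };

Prim ringAllReduceStep(int step, int k) {
  if (step == 0)         return Prim::Send;                // inject local chunk
  if (step <= k - 2)     return Prim::RecvReduceSend;      // reduce-scatter body
  if (step == k - 1)     return Prim::RecvReduceCopySend;  // last reduce, first gather
  if (step <= 2 * k - 3) return Prim::RecvCopySend;        // all-gather body
  return Prim::Recv;                                       // step 2k-2: final recv
}
```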
Ring AllGather (k steps per loop, Table VI)
- Step 0: Each GPU i prepares its local block (in-place, or via copySend for separate buffers) and sends it to its right neighbor.
- Steps 1 to k-2: Each GPU receives a block, stores it in the correct output segment position, and forwards it (recvCopySend).
- Step k-1: recv — collects the final missing block.
Ring ReduceScatter (k steps per loop, Table VII)
- Step 0: Each GPU i sends one local data block to GPU (i+1)%k.
- Steps 1 to k-2: recvReduceSend — receives a partially reduced block, combines it with the local block, forwards.
- Step k-1: recvReduceCopy — receives the final partial result, applies the final reduction, stores to the output buffer.
Tree AllReduce (pipelined, Table VIII)
Two interleaved phases executed concurrently by partitioning SMs asymmetrically:
- Reduce phase (upward): Leaf GPUs send → middle GPUs recvReduceSend → root GPU recvReduceCopySend.
- Broadcast phase (downward): Root recvCopySend → middle GPUs recvCopySend → leaf GPUs recv.
- NCCL assigns more threads/warps to the Reduce phase (bandwidth-intensive) than to the Broadcast phase.
Ring Broadcast (pipelined, Table IX)
Directed chain from root along ring topology:
- Root GPU: send (in-place) or copySend (separate buffers).
- Middle GPUs: recvCopySend (receive, copy to output, forward).
- Last GPU: recv.
Ring Reduce (pipelined, Table X)
Directed chain converging to root:
- Initiator GPU: send.
- Middle GPUs: recvReduceSend.
- Root GPU: recvReduceCopy.
Pipelined vs. Non-Pipelined Classification
- Non-pipelined: Ring AllReduce, Ring AllGather, Ring ReduceScatter — each GPU must complete all k tasks in one loop iteration before starting the next ring pass.
- Pipelined: Tree AllReduce, Ring Broadcast, Ring Reduce — GPU roles are asymmetric; independent loop iterations can overlap across the directed chain.
Section VI: Benchmarking Results
Experimental setup: Alps supercomputer at CSCS (Swiss National Supercomputing Center), 16 nodes with NVIDIA GH200 (Grace Hopper Superchips). Intra-node: 150 GB/s NVLink high-bandwidth interconnect. Inter-node: 25 GB/s per direction Cray Slingshot (RoCE). 20 runs per data point with warmup.
AllReduce Results (Fig. 6)
Inter-node
- Small messages (<64 KiB): LL and LL128 outperform Simple for both Ring and Tree.
- Large messages (>64 KiB): Simple dominates; LL
and LL128 degrade sharply.
- Cause: The cumulative synchronization overhead of flag-based sync (one sync event per 8 or 128 bytes) becomes significant at gigabyte scale — many small sync operations per large transfer.
- LL128 can lag behind LL at large inter-node sizes due to its larger synchronization units being more sensitive to network stalls under contention.
- Algorithm: Ring excels for large messages; Tree performs better for small messages.
Intra-node (NVLink)
- LL128 dominates: Consistent high performance across all message sizes (near-peak bandwidth, only ~5% below Simple at large messages).
- LL: Best for very small messages (lowest latency).
- Simple: Best for very large messages; high latency for small ones.
Three Key Takeaways
- LL and LL128 are best for small messages (especially inter-node); Simple consistently wins for large distributed transfers.
- Intra-node vs. inter-node matters greatly: different transports (NVLink vs. RoCE) change protocol performance rankings.
- Relying on NCCL's autotuning generally beats manual protocol overrides; manual tuning is useful only for targeted scenarios.
Other Collectives (Fig. 7)
Broadcast, Reduce, ReduceScatter, AllGather all follow the same protocol-performance trends as AllReduce.
Section VII: Integration into ATLAHS
ATLAHS is a GPU-cluster network simulator built using the insights from this paper. Key capabilities enabled by the NCCL analysis:
- Accurately decomposes collective communication into fine-grained send, recv, recvReduceSend, etc. primitives with correct timing.
- Models pipelined vs. non-pipelined concurrency correctly.
- Handles collective channel partitioning and CUDA stream mapping.
- Achieves <5% simulation error on large-scale multi-GPU workloads.
- Outperforms AstraSim 2.0 in runtime prediction accuracy for large-model LLM training.
Limitations
- NCCL version specificity: Analysis is for v2.19.1. Internal mechanisms may change across releases (though core concepts are expected to remain stable).
- CollNet and NVLS excluded: Both require hardware-specific features (SHARP-enabled switches, NVSwitch) making them less generalizable; their implementations are not analyzed.
- PAT algorithm not covered: Parallel Aggregated Trees (introduced in v2.23) is acknowledged but not analyzed.
- Qualitative algorithm analysis only: No quantitative complexity bounds are derived due to the many interacting variables (GPU count, topology, node count, message size, protocol) that make closed-form models impractical.
- Tuner plugin API not addressed: The paper does not discuss NCCL's external tuner plugin interface or how third-party tuners (such as the one DynamICCL would implement) interact with NCCL's internal algorithm/protocol selection.
Related Work
- Lee and Lee (2024): Empirical comparison of NCCL, MPI, Gloo in distributed deep learning — black-box comparison; NCCL best for intra-node GPU-to-GPU.
- Weingram et al. (2023): Survey of collective communication library ecosystem (NCCL, RCCL, oneCCL, Gloo, Blink).
- Blink: Dynamically builds multiple trees using network topology awareness; outperforms NCCL ring and tree for large heterogeneous clusters.
- SCCL/MSCCL: Synthesizes optimal custom collective algorithms for specific hardware (ref. 28 = Cai et al., SCCL; paper 0015 in this reading list).
- AstraSim 2.0: Existing collective communication simulator; outperformed by ATLAHS in runtime prediction accuracy.
RL Formulation Table
This paper does not use reinforcement learning. No RL formulation table applies.
Relevance to DynamICCL
Extremely high relevance — the most directly applicable paper in this reading list to DynamICCL's design.
1. Action Space Ground Truth
DynamICCL's Config Agent selects among algorithms (Ring, Tree, CollNet Direct, CollNet Chain, NVLS, NVLS Tree, PAT), protocols (LL, LL128, Simple), nChannels (1-8), and numThreads. Table III of this paper defines the exact valid (algorithm, protocol, collective) combinations — essential for pruning invalid actions from the Config Agent's action space.
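A hypothetical DynamICCL-side encoding of Table III as an action-validity mask (enum names and structure are assumptions, not from the paper):

```cpp
// Hypothetical DynamICCL helper: prune (algorithm, protocol, collective)
// triples that Table III marks as unsupported. Not part of NCCL or the
// paper's artifacts.
enum class Alg { Ring, Tree, CollNetDirect, CollNetChain, NVLS, NVLSTree };
enum class Proto { Simple, LL, LL128 };
enum class Coll { AllReduce, AllGather, ReduceScatter, Broadcast, Reduce };

bool isValidAction(Alg a, Proto p, Coll c) {
  switch (a) {
    case Alg::Ring:          return true;                    // all protos, all collectives
    case Alg::Tree:          return c == Coll::AllReduce;    // any proto, AllReduce only
    case Alg::CollNetDirect:
    case Alg::CollNetChain:  return c == Coll::AllReduce && p == Proto::Simple;
    case Alg::NVLS:          return p == Proto::Simple &&
                                    (c == Coll::AllReduce || c == Coll::AllGather ||
                                     c == Coll::ReduceScatter);
    case Alg::NVLSTree:      return c == Coll::AllReduce && p == Proto::Simple;
  }
  return false;
}
```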
2. Protocol Selection Decision Boundaries
The benchmarks establish concrete decision rules that DynamICCL's RL policy must learn:
- Small messages (< ~64 KiB): prefer LL or LL128.
- Large messages (> ~64 KiB, inter-node): prefer Simple.
- NVLink-dominant (intra-node): LL128 is the best general-purpose choice.
- Under network congestion (which DynamICCL specifically targets): Simple's larger chunks may be more robust to jitter than LL128's fine-grained synchronization; this is an open question DynamICCL could empirically answer.
3. nChannels Tuning Rationale
The paper explains that increasing nChannels reduces per-channel chunk size, potentially underutilizing PCIe/IB queue pairs when chunk size drops below the 512 KiB FIFO buffer threshold. DynamICCL's nChannels action dimension (1-8) directly controls this trade-off. The optimal nChannels is message-size dependent — a critical feature for the Config Agent's state space.
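A back-of-the-envelope illustration of this threshold effect, assuming a message is split evenly across channels (real NCCL chunking is more involved):

```cpp
// Illustrative arithmetic only: per-channel chunk size vs. the 512 KiB FIFO
// threshold, for a hypothetical 2 MiB message split evenly across channels.
#include <cstdio>

int main() {
  const double kThresholdKiB = 512.0;
  const double msgKiB = 2048.0;                  // example: a 2 MiB message
  for (int nChannels = 1; nChannels <= 8; ++nChannels) {
    double chunkKiB = msgKiB / nChannels;
    std::printf("nChannels=%d -> %7.1f KiB per channel%s\n", nChannels, chunkKiB,
                chunkKiB < kThresholdKiB ? "  (below 512 KiB: NIC underfill)" : "");
  }
  return 0;
}
```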
4. numThreads Context
The paper establishes that thread count per block (controlled by the
autotuner, stored in plan->threadPerBlock) affects
warp-level parallelism within each channel. DynamICCL's numThreads
action dimension corresponds to this parameter. The paper confirms that
different algorithms allocate warps differently (NVLS subdivides warps
by phase; P2P operations split by concurrent transfer count), explaining
why numThreads interacts with algorithm choice.
5. State Space Design Guidance
The paper's finding that LL128 performs very differently in intra-node (NVLink) vs. inter-node (RoCE) settings implies DynamICCL's state space must distinguish intra-node vs. inter-node communication context — not just message size and congestion level. Agent-2 should observe whether the collective spans nodes or is confined to a single node.
6. Congestion Impact on Protocols
The paper notes that LL128 can lag behind LL under heavy inter-node contention at large message sizes, because stalls affect larger 128-byte data units. This congestion-protocol interaction is precisely what DynamICCL's Agent-1 (Trigger Agent) detects and Agent-2 must respond to. Under high congestion, DynamICCL should consider switching to LL (smaller units, faster retransmission) or Simple (fewer synchronization events per byte) depending on message size.
7. Simulation Environment
ATLAHS achieves <5% simulation error on NCCL workloads. DynamICCL could use ATLAHS as a high-fidelity offline training environment for the Config Agent, avoiding the cost and measurement noise of live cluster interactions during initial RL training.
8. Channel-to-CUDA-Block Mapping
The 1:1 mapping of communication channels to CUDA blocks (via
channelMask) means that nChannels directly determines the
number of active SMs used by NCCL. This has implications for DynamICCL:
increasing nChannels during congestion may actually compete with the
training workload for SM resources, introducing a GPU compute vs.
communication parallelism trade-off that DynamICCL's reward function
should account for.