NCCLX: Collective Communication for 100k+ GPUs

Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, et al. | Meta | arXiv:2510.20171 (2025)


Problem

Training and serving frontier LLMs (Llama4-class) demands collective communication across clusters exceeding 100,000 GPUs. Stock NCCL exhibits four bottlenecks at this scale:

  1. Kernel-driven execution: communication is coupled to GPU SMs, requiring 4 thread blocks (640 threads) per P2P operation; these contend with compute kernels and consume HBM bandwidth.
  2. Copy-based transfers: every message is staged through FIFO buffers, doubling memory traffic.
  3. Initialization cost: bootstrap and connection establishment scale super-linearly (~265 seconds at 96k GPUs), and eager allocation of channels, protocols (LL/LL128/Simple), and algorithm state can waste 10+ GB of HBM per GPU when many communicators coexist.
  4. Fault handling: at 100k+ scale, hardware faults (NICs, GPUs, links) are routine, but stock NCCL lacks fine-grained fault localization and in-place recovery, forcing whole-job restarts.


Core Insight

Replace NCCL's kernel-driven, copy-based, eagerly initialized data plane with a host-driven, zero-copy, lazily initialized custom transport (CTran) that issues RDMA directly between user buffers. On top of it, layer mechanisms specific to large scale (recursive doubling/halving for long-distance hops, fault-tolerant AllReduce, dynamic QP load balancing) behind a unified API surface that still plugs into PyTorch as a drop-in NCCL replacement.


Method

NCCLX exposes three execution modes through the same framework: host-initiated APIs (CPU schedules RDMA), host-initiated with GPU-resident metadata (for dynamic MoE AllToAllv), and device-initiated APIs (in progress).

CTran's transport layer is the key building block.

For training-scale collectives, NCCLX adds:

For inference, NCCLX targets multi-node decode AllReduce / AllToAllv with GPU-resident routing metadata to handle dynamic MoE expert dispatch.

Tooling: a Fault Analyzer driven by CollTrace (per-collective + per-RDMA instrumentation streamed to a remote DB) infers inter-collective dependencies to separate root-cause failures from cascaded hangs.
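
The root-cause versus cascaded-hang distinction can be illustrated with a minimal sketch. It assumes the dependency information has already been inferred and is available as a plain mapping; the function name, the data layout, and the collective names are hypothetical and are not taken from the paper or from CollTrace.

```python
from typing import Dict, List, Set


def split_root_causes(
    hung: Set[str], depends_on: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Classify hung collectives as root causes or cascaded hangs.

    Assumption for this sketch: a hung collective is a root cause when
    none of the collectives it depends on are themselves hung; otherwise
    it is treated as a cascaded hang.
    """
    roots: List[str] = []
    cascaded: List[str] = []
    for name in sorted(hung):
        blocked_by_hung = any(dep in hung for dep in depends_on.get(name, []))
        (cascaded if blocked_by_hung else roots).append(name)
    return {"root_cause": roots, "cascaded": cascaded}


# Hypothetical trace: one AllReduce stalls and two later collectives wait on it.
hung_now = {"allreduce_dp_7", "allgather_tp_8", "alltoall_ep_9"}
deps = {
    "allreduce_dp_7": ["broadcast_0"],      # broadcast_0 completed normally
    "allgather_tp_8": ["allreduce_dp_7"],
    "alltoall_ep_9": ["allgather_tp_8"],
}
report = split_root_causes(hung_now, deps)
print(report)
```

In this toy trace only `allreduce_dp_7` is reported as a root cause; the other two hangs are attributed to it.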

Key tunables exposed: NCCL_LAZY_CONNECT, NCCL_LAZY_SETUP_CHANNEL, NCCL_MEM_USE_SLAB_ALLOCATOR, NCCL_P2P_NET_CHUNKSIZE, NCCL_NCHANNELS_PER_NET_PEER, plus algorithm choice (Ring / Tree / Recursive Doubling / Recursive Halving), protocol choice (zero-copy CTran vs. LL / LL128 / Simple), number of QPs per topology tier, and per-connection max_outstanding_messages and max_segment_size.
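
Of the algorithm choices listed, recursive doubling is the least familiar to readers who know only Ring and Tree. The sketch below simulates the textbook version of a sum-AllReduce on a single process, assuming a power-of-two rank count: in each round every rank exchanges its partial result with the rank whose index differs in one bit, so P ranks finish in log2(P) rounds. It illustrates the general technique only and is not NCCLX's implementation.

```python
from typing import List


def recursive_doubling_allreduce(values: List[int]) -> List[int]:
    """Simulate a sum-AllReduce; values[r] is rank r's initial contribution."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("rank count must be a power of two")
    buf = list(values)
    distance = 1
    while distance < n:
        # Each rank pairs with rank XOR distance; both keep the pairwise sum.
        buf = [buf[rank] + buf[rank ^ distance] for rank in range(n)]
        distance *= 2
    return buf


result = recursive_doubling_allreduce([1, 2, 3, 4])
print(result)
```

With four ranks the partner distance is 1 in the first round and 2 in the second, after which every rank holds the full sum.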


Results

Cluster: 100k+ NVIDIA H100 GPUs on a 3-layer Clos RoCE fabric (Mellanox CX-7 NICs, 1:2.8 oversubscription).


Limitations


Relevance to DynamICCL

NCCLX is highly relevant: it expands NCCL's tunable surface in exactly the dimensions DynamICCL's RL agent operates on, and explicitly identifies the heuristics it uses to make those choices.

  1. Action space expansion: NCCLX adds Recursive Doubling and Recursive Halving as new algorithm choices alongside Ring/Tree, plus zero-copy CTran as a new "protocol" alongside LL/LL128/Simple. DynamICCL's action tuple (algorithm, protocol, nChannels, numThreads) becomes (algorithm in {Ring, Tree, RD, RH}, protocol in {Simple, LL, LL128, CTran}, nChannels, numThreads, nQPsPerTier).
  2. New observables: CollTrace exports per-collective and per-RDMA telemetry (timing, QP state, registration spikes) to a remote DB. This is exactly the kind of state vector an RL agent can consume, similar in role to Pensieve's throughput history.
  3. Static heuristics that an RL agent can replace: the paper explicitly chooses algorithms based on topology tier and message size (e.g., Ring for DP groups, Tree for small AllReduce, RD/RH for long-distance), and chooses QP count and outstanding-message caps per tier. These piecewise rules are precisely the targets DynamICCL displaces with a learned policy.
  4. Reward signal alignment: NCCLX optimizes wall-clock step latency and end-to-end training throughput, matching DynamICCL's collective completion time objective.
  5. Scale-aware features: lazy connection, channel-on-first-use, and slab allocation produce state that DynamICCL can observe (HBM pressure, channel-cache warmth) and indirectly control through algorithm/channel selection.
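
The expanded action tuple from item 1 and the kind of piecewise rule described in item 3 can be sketched together. Everything numeric here is an assumption made for illustration: the candidate values for channels, threads, and QPs, the 64 KiB small-message threshold, and the fixed protocol and channel settings in the baseline are hypothetical and do not come from the paper. The sketch also does not check whether every algorithm/protocol combination is valid in NCCLX.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence

ALGORITHMS = ("Ring", "Tree", "RD", "RH")
PROTOCOLS = ("Simple", "LL", "LL128", "CTran")


@dataclass(frozen=True)
class Action:
    algorithm: str
    protocol: str
    n_channels: int
    num_threads: int
    n_qps_per_tier: int


def action_space(
    channels: Sequence[int] = (1, 2, 4, 8),    # assumed candidate values
    threads: Sequence[int] = (128, 256, 512),  # assumed candidate values
    qps: Sequence[int] = (1, 2, 4),            # assumed candidate values
) -> Iterator[Action]:
    """Enumerate the Cartesian product of all tunable dimensions."""
    for combo in product(ALGORITHMS, PROTOCOLS, channels, threads, qps):
        yield Action(*combo)


def static_baseline(message_bytes: int, long_distance: bool) -> Action:
    """Hypothetical piecewise rule a learned policy would be compared against."""
    small_message = 64 * 1024  # assumed threshold, not from the paper
    if long_distance:
        algorithm = "RD"
    elif message_bytes <= small_message:
        algorithm = "Tree"
    else:
        algorithm = "Ring"
    return Action(algorithm, "CTran", n_channels=4, num_threads=256, n_qps_per_tier=2)


space_size = sum(1 for _ in action_space())
print(space_size, static_baseline(1 << 20, long_distance=False))
```

Even with these small candidate sets the discrete space has several hundred actions, which is one reason a fixed rule table covers it only coarsely.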