NCCLX: Collective Communication for 100k+ GPUs

Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, et al. | Meta | arXiv:2510.20171 (2025)

Problem

Training and serving frontier LLMs (Llama4-class) demand collective communication across clusters exceeding 100,000 GPUs. Stock NCCL exhibits four bottlenecks at this scale. (1) Its kernel-driven execution model couples communication to GPU SMs: each P2P operation occupies 4 thread blocks (640 threads), which contend with compute kernels and consume HBM bandwidth. (2) Its copy-based transfer path stages every message through FIFO buffers, doubling memory traffic. (3) Its bootstrap and connection-establishment phases scale super-linearly: at 96k GPUs, initialization takes ~265 seconds, and eager allocation of channels, protocols (LL/LL128/Simple), and algorithm state can waste 10+ GB of HBM per GPU when many communicators coexist. (4) At 100k+ scale, hardware faults (NICs, GPUs, links) are routine, but stock NCCL offers neither fine-grained fault localization nor in-place recovery, forcing whole-job restarts.

Core Insight

Replace NCCL's kernel-driven, copy-based, eagerly-initialized data plane with a host-driven, zero-copy, lazily-initialized custom transport (CTran) that issues RDMA directly between user buffers. On top of it, layer large-scale-specific mechanisms (recursive doubling/halving for long-distance hops, fault-tolerant AllReduce, dynamic QP load balancing) behind a unified API surface that still plugs into PyTorch as a drop-in NCCL replacement.

Method

NCCLX exposes three execution modes through the same framework: host-initiated APIs (CPU schedules RDMA), host-initiated with GPU-resident metadata (for dynamic MoE AllToAllv), and device-initiated APIs (in progress).

CTran's transport layer is the key building block:

A dedicated CPU progress thread per communicator drives RDMA writes, freeing GPU SMs entirely from communication work.
Zero-copy: IBV_WR_RDMA_WRITE_WITH_IMM pushes data directly from source to destination tensors. The 32-bit immediate field encodes a sequence number (bits 0-23), a fast-path bit (30), and a final-write bit (31) to provide ordered delivery over an out-of-order RDMA fabric.
DQPLB (Dynamic Queue Pair Load Balancing) configures multiple data QPs per peer and caps outstanding messages based on topology tier (Rack / Zone / Datacenter, with relative latencies of roughly 1x / 15x / 30x). The cap keeps enough data in flight to cover each path's bandwidth-delay product, so links stay saturated without congesting the fabric.
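
The immediate-field layout described above can be sketched in a few lines of Python. The bit positions (sequence number in bits 0-23, fast-path bit 30, final-write bit 31) come from the summary; bits 24-29 are not described there and are left as zero here. The `Reassembler` class is a hypothetical illustration of how a receiver could turn out-of-order completions into an in-order "message done" signal, not a description of CTran's actual receive path.

```python
SEQ_BITS = 24
SEQ_MASK = (1 << SEQ_BITS) - 1      # bits 0-23: sequence number
FAST_PATH_BIT = 1 << 30             # bit 30: fast-path flag
FINAL_WRITE_BIT = 1 << 31           # bit 31: last write of the message


def pack_imm(seq, fast_path=False, final_write=False):
    """Build the 32-bit immediate value carried by an RDMA write-with-immediate."""
    imm = seq & SEQ_MASK            # sequence numbers wrap at 2**24
    if fast_path:
        imm |= FAST_PATH_BIT
    if final_write:
        imm |= FINAL_WRITE_BIT
    return imm


def unpack_imm(imm):
    """Return (seq, fast_path, final_write) decoded from an immediate value."""
    return imm & SEQ_MASK, bool(imm & FAST_PATH_BIT), bool(imm & FINAL_WRITE_BIT)


class Reassembler:
    """Hypothetical receiver: reports completion only once every write up to
    and including the final one has arrived, whatever the arrival order."""

    def __init__(self):
        self.next_seq = 0           # next sequence number expected in order
        self.pending = {}           # seq -> final_write flag, for early arrivals
        self.done = False

    def on_completion(self, imm):
        seq, _fast_path, final_write = unpack_imm(imm)
        self.pending[seq] = final_write
        # Drain every write that is now contiguous with what was already seen.
        while self.next_seq in self.pending:
            if self.pending.pop(self.next_seq):
                self.done = True
            self.next_seq = (self.next_seq + 1) & SEQ_MASK
        return self.done
```

With this sketch, a final write that overtakes an earlier one is buffered, and the message is reported complete only after the gap is filled.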

For training-scale collectives, NCCLX adds:

Recursive doubling / halving for AllGather and ReduceScatter over long paths, using a farthest-first peer ordering to overlap long-hop latency.
FTAR (Fault Tolerant AllReduce) runs "shrink/grow" phases under a global coordinator, so a step can still complete when a replica subgroup fails instead of aborting the job.
Lazy connection (NCCL_LAZY_CONNECT=1), lazy channel setup (NCCL_LAZY_SETUP_CHANNEL=1), and a slab allocator (NCCL_MEM_USE_SLAB_ALLOCATOR=1) that packs metadata for 3000+ peers into a single 2 MB GPU page.
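
To make the recursive-doubling item concrete, here is a minimal Python simulation of an AllGather over a power-of-two group. It reflects one reading of "farthest-first": each rank exchanges with the peer at the largest rank distance first, so the longest hops carry the smallest payloads. The sketch assumes that rank distance tracks network distance, which the summary does not state, and the function names are illustrative rather than NCCLX APIs.

```python
def peer_schedule(rank, world_size, farthest_first=True):
    """Peers that `rank` exchanges with, one per recursive-doubling step."""
    if world_size < 1 or world_size & (world_size - 1):
        raise ValueError("this sketch only handles power-of-two group sizes")
    steps = world_size.bit_length() - 1
    distances = [1 << s for s in range(steps)]      # 1, 2, 4, ...
    if farthest_first:
        distances.reverse()                         # ..., 4, 2, 1
    return [rank ^ d for d in distances]


def simulate_allgather(world_size, farthest_first=True):
    """Return, per rank, the set of source blocks held after all steps."""
    held = [{r} for r in range(world_size)]
    schedules = [peer_schedule(r, world_size, farthest_first)
                 for r in range(world_size)]
    for step in range(world_size.bit_length() - 1):
        snapshot = [set(h) for h in held]           # state before this step
        for r in range(world_size):
            peer = schedules[r][step]
            held[r] = snapshot[r] | snapshot[peer]  # pairwise exchange
    return held
```

For example, `peer_schedule(5, 8)` visits peers 1, 7, 4 in that order, and `simulate_allgather(8)` leaves every rank holding all eight blocks. The amount of data exchanged doubles at each step regardless of the order, which is why reordering the peers changes which hop carries the large transfers.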

For inference, NCCLX targets multi-node decode AllReduce / AllToAllv with GPU-resident routing metadata to handle dynamic MoE expert dispatch.

Tooling: a Fault Analyzer driven by CollTrace (per-collective + per-RDMA instrumentation streamed to a remote DB) infers inter-collective dependencies to separate root-cause failures from cascaded hangs.
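
The idea of separating root-cause failures from cascaded hangs can be illustrated with a small dependency walk. The record schema (`coll_id`, `status`, `depends_on`) and the classification rule below are assumptions made for illustration; the summary only says that the Fault Analyzer infers inter-collective dependencies from CollTrace data.

```python
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """Hypothetical per-collective trace entry."""
    coll_id: str
    status: str                                     # "done" or "stalled"
    depends_on: list = field(default_factory=list)  # ids this collective waits on


def classify_stalls(records):
    """Split stalled collectives into likely root causes and cascaded hangs.

    A stalled collective whose dependencies all completed is treated as a
    root cause; one that waits on another stalled collective is cascaded.
    """
    by_id = {r.coll_id: r for r in records}
    stalled = {r.coll_id for r in records if r.status == "stalled"}
    roots, cascaded = [], []
    for coll_id in sorted(stalled):
        blocked_by = [d for d in by_id[coll_id].depends_on if d in stalled]
        (cascaded if blocked_by else roots).append(coll_id)
    return roots, cascaded
```

Given a chain where `ar1` stalls after a completed `ag0`, and `rs2` and `a2a3` wait behind it, this returns `ar1` as the only root cause and the other two as cascaded.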

Key tunables exposed: NCCL_LAZY_CONNECT, NCCL_LAZY_SETUP_CHANNEL, NCCL_MEM_USE_SLAB_ALLOCATOR, NCCL_P2P_NET_CHUNKSIZE, NCCL_NCHANNELS_PER_NET_PEER, plus algorithm choice (Ring / Tree / Recursive Doubling / Recursive Halving), protocol choice (zero-copy CTran vs. LL / LL128 / Simple), number of QPs per topology tier, and per-connection max_outstanding_messages and max_segment_size.
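
As a rough guide to how the per-connection caps relate to topology tier, the sketch below sizes `max_outstanding_messages` from the bandwidth-delay product. The relative latencies (1x / 15x / 30x) are taken from the summary; the base round-trip time, link rate, and segment size are caller-supplied parameters, and the formula itself is a standard sizing rule assumed here for illustration, not NCCLX's documented policy.

```python
import math

# Relative latency per topology tier, as quoted in the summary.
RELATIVE_LATENCY = {"rack": 1, "zone": 15, "datacenter": 30}


def max_outstanding_messages(tier, base_rtt_us, link_gbps, max_segment_size):
    """Segments that must be in flight to cover one bandwidth-delay product.

    base_rtt_us      -- assumed in-rack round-trip time, in microseconds
    link_gbps        -- assumed link rate, in gigabits per second
    max_segment_size -- bytes per RDMA segment
    """
    rtt_s = base_rtt_us * 1e-6 * RELATIVE_LATENCY[tier]
    bdp_bytes = link_gbps * 1e9 / 8 * rtt_s
    return max(1, math.ceil(bdp_bytes / max_segment_size))


# Example inputs are placeholders, not measurements from the paper.
for tier in RELATIVE_LATENCY:
    cap = max_outstanding_messages(tier, base_rtt_us=2, link_gbps=400,
                                   max_segment_size=64 * 1024)
    print(tier, cap)
```

The point of the calculation is the scaling: with a fixed segment size, the number of messages needed to keep a path busy grows in proportion to its latency, which is why a single cap for all tiers either starves long paths or over-queues short ones.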

Results

Cluster: 100k+ NVIDIA H100 GPUs on a 3-layer Clos RoCE fabric (Mellanox CX-7 NICs, 1:2.8 oversubscription).

End-to-end Llama4 training: up to 12% reduction in steady-state step latency.
Initialization: 265 s -> 24 s at 96k GPUs (~11x speedup).
P2P microbenchmarks: 1.09x-2.7x over copy-based NCCL; CTran saturates peak network bandwidth for medium-sized messages.
Llama4 Maverick decoding latency: 15%-80% improvement.
HBM footprint: ~2x reduction in NCCL-attributable memory (~4-5 GB saved per GPU).

Limitations

The device-initiated API path is not yet fully production-ready; only AllToAllv with dynamic routing currently uses GPU-resident metadata. AllGather and AllToAll device primitives are future work.
The system mitigates NVIDIA RDMA driver lock contention (spikes of up to 100 ms on registration) only through lazy registration and a memory-pool mode; the underlying driver issue remains unfixed.
Algorithm and resource decisions (Ring vs. Tree vs. Recursive Doubling, QP count per tier, channel count) remain governed by static topology- and message-size-based heuristics.
Evaluation is on a single Meta production fabric (RoCE Clos); generalization to other interconnects (InfiniBand fat-tree, NVLink-only, slim-fly) is not characterized.

Relevance to DynamICCL

NCCLX is highly relevant: it expands NCCL's tunable surface in exactly the dimensions DynamICCL's RL agent operates on, and explicitly identifies the heuristics it uses to make those choices.

Action space expansion: NCCLX adds Recursive Doubling and Recursive Halving as new algorithm choices alongside Ring/Tree, plus zero-copy CTran as a new "protocol" alongside LL/LL128/Simple. DynamICCL's action tuple (algorithm, protocol, nChannels, numThreads) becomes (algorithm in {Ring, Tree, RD, RH}, protocol in {Simple, LL, LL128, CTran}, nChannels, numThreads, nQPsPerTier).
New observables: CollTrace exports per-collective + per-RDMA telemetry (timing, QP state, registration spikes) to a remote DB. This is exactly the kind of state vector an RL agent can consume - similar in role to Pensieve's throughput history.
Static heuristics that an RL agent can replace: the paper explicitly chooses algorithms based on topology tier and message size (e.g., Ring for DP groups, Tree for small AllReduce, RD/RH for long-distance), and chooses QP count and outstanding-message caps per tier. These piecewise rules are precisely the targets DynamICCL displaces with a learned policy.
Reward signal alignment: NCCLX optimizes wall-clock step latency and end-to-end training throughput, matching DynamICCL's collective completion time objective.
Scale-aware features: lazy connection, channel-on-first-use, and slab allocation are state DynamICCL can observe (HBM pressure, channel-cache warmth) and indirectly control through algorithm/channel selection.
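
The expanded action tuple can be written down as a small enumeration to show how much the search space grows. The algorithm and protocol sets follow the text above; the candidate values for nChannels, numThreads, and nQPsPerTier are hypothetical discretisations chosen only for illustration, and the sketch does not filter out combinations that may be invalid in practice.

```python
from itertools import product
from typing import NamedTuple


class Action(NamedTuple):
    algorithm: str
    protocol: str
    n_channels: int
    num_threads: int
    n_qps_per_tier: int


ALGORITHMS = ("Ring", "Tree", "RD", "RH")        # RD/RH added by NCCLX
PROTOCOLS = ("Simple", "LL", "LL128", "CTran")   # CTran added by NCCLX

# Hypothetical discretisations, not values from the paper.
N_CHANNELS = (1, 2, 4, 8, 16)
NUM_THREADS = (128, 256, 512)
N_QPS_PER_TIER = (1, 2, 4)


def action_space():
    """Enumerate every combination of the five action dimensions."""
    return [Action(*combo) for combo in product(
        ALGORITHMS, PROTOCOLS, N_CHANNELS, NUM_THREADS, N_QPS_PER_TIER)]
```

Under these placeholder grids the space has 720 actions, against 90 for the original (Ring/Tree) x (Simple/LL/LL128) x nChannels x numThreads tuple on the same grids, so the new dimensions multiply the space by 8.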