Brief Summary: GPU-Initiated Networking for NCCL (GIN)
Citation: Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, Manjunath Gorentla Venkata. NVIDIA Corporation. arXiv:2511.15076v2, November 24, 2025.
Problem
Traditional GPU communication follows a host-initiated model: GPU kernels queue communication descriptors, and CPU proxy threads execute network operations via RDMA. This model works well for large-scale collective communication (AllReduce, AllGather) but is suboptimal for workloads requiring tight computation-communication integration. Specifically, Mixture-of-Experts (MoE) architectures require fine-grained, irregular all-to-all token routing with dynamically varying message sizes — patterns that benefit from the GPU directly initiating RDMA operations from within CUDA kernels, without the CPU coordination overhead. Libraries like NVSHMEM provide device-initiated communication but operate as a separate runtime outside NCCL's ecosystem, preventing them from leveraging NCCL's topology-aware algorithms, hierarchical communicators, and fault-tolerance infrastructure.
Core Insight
Extend NCCL with a Device API that supports three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe intra-node, Multimem for NVLink SHARP hardware multicast, and GPU-Initiated Networking (GIN) for inter-node RDMA. GIN allows GPU threads to issue one-sided RDMA operations (put, signal, wait) directly from CUDA kernels, eliminating CPU coordination overhead for fine-grained communication patterns while remaining within NCCL's production infrastructure. A dual-backend architecture (GDAKI for direct GPU-to-NIC via DOCA GPUNetIO; Proxy for CPU-assisted fallback) ensures broad hardware support.
Method
GIN is built on a three-layer architecture:
Layer 1 — NCCL Core (host-side): Extends NCCL's communicator initialization to support GIN contexts. Key additions: ncclDevCommCreate (creates a device communicator with GIN resources) and ncclCommWindowRegister (collectively registers memory buffers across all ranks and returns window handles with remote-access metadata). Each GIN context abstracts a communication channel with queue pairs (QPs) to the NIC; up to 4 contexts are supported per communicator.
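To make the Layer 1 flow concrete, here is a toy, CPU-only Python model of collective window registration. The `Window` and `Communicator` classes, their fields, and the `window_register` method are illustrative stand-ins I invented for this sketch, not NCCL's actual types or signatures:

```python
class Window:
    """A registered buffer plus the remote-access metadata each rank needs."""
    def __init__(self, win_id, bases):
        self.win_id = win_id
        self.bases = bases  # rank -> buffer; stands in for RDMA rkeys/addresses

class Communicator:
    """Toy stand-in for an NCCL communicator with GIN window registration."""
    def __init__(self, nranks):
        self.nranks = nranks
        self.windows = []

    def window_register(self, buffers):
        # Collective call: every rank contributes a buffer. NCCL 2.28
        # requires symmetric sizes across ranks, so this model enforces it.
        assert len(buffers) == self.nranks
        assert len({len(b) for b in buffers}) == 1, "symmetric sizes required"
        win = Window(len(self.windows), dict(enumerate(buffers)))
        self.windows.append(win)
        return win

comm = Communicator(nranks=4)
bufs = [bytearray(1 << 20) for _ in range(4)]  # 1 MiB window per rank
win = comm.window_register(bufs)
# Any rank can now resolve where a one-sided put to peer 2 should land:
assert win.bases[2] is bufs[2]
```

The real ncclCommWindowRegister returns an opaque handle; the point of the sketch is only that registration is collective and the resulting handle carries enough metadata to target any peer's buffer.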
Layer 2 — Device GIN API (GPU-callable): The ncclGin class provides GPU kernel-callable methods: put(team, peer, dstWindow, dstOffset, srcWindow, srcOffset, bytes) for one-sided RDMA writes, signal(peer, signalId) for remote notification, waitSignal(signalId, expected) for receiver-side synchronization, flush() for local completion, and resetSignal/resetCounter for reuse. Ordering guarantee: all put operations to a given peer on the same context are guaranteed visible before a subsequent signal to that peer completes.
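The put/signal/waitSignal pattern and its ordering guarantee can be illustrated with a CPU-only, two-thread Python sketch. The method names mirror ncclGin's, but the implementation (bytearrays plus threading.Event, and a simplified waitSignal without the expected-count argument) is purely illustrative:

```python
import threading

class GinContext:
    """Toy model: windows are per-rank bytearrays, signals are Events.
    (The real waitSignal waits for an expected count; simplified here.)"""
    def __init__(self, nranks, win_bytes, n_signals=16):
        self.windows = [bytearray(win_bytes) for _ in range(nranks)]
        self.signals = [[threading.Event() for _ in range(n_signals)]
                        for _ in range(nranks)]

    def put(self, peer, dst_off, payload):
        # One-sided write into the peer's registered window.
        self.windows[peer][dst_off:dst_off + len(payload)] = payload

    def signal(self, peer, signal_id):
        # Must complete only after prior puts to `peer` are visible;
        # Event.set()'s release semantics model that ordering here.
        self.signals[peer][signal_id].set()

    def wait_signal(self, my_rank, signal_id):
        self.signals[my_rank][signal_id].wait()

ctx = GinContext(nranks=2, win_bytes=64)

def sender():                               # runs "on rank 0"
    ctx.put(peer=1, dst_off=0, payload=b"token batch")
    ctx.signal(peer=1, signal_id=0)

t = threading.Thread(target=sender)
t.start()
ctx.wait_signal(my_rank=1, signal_id=0)     # rank 1 blocks until signaled
assert bytes(ctx.windows[1][:11]) == b"token batch"  # puts already visible
t.join()
```

The design point the sketch captures is that the receiver never polls the data buffer itself; observing the signal is sufficient to know all preceding puts from that peer on that context have landed.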
Layer 3 — Network Plugin (GDAKI or Proxy):
- GDAKI backend: GPU threads directly write RDMA Work Queue Entries (WQEs) to NIC doorbell registers via DOCA GPUNetIO. NIC hardware polls GPU memory, executes RDMA transactions, and updates Completion Queues in GPU-visible memory. Zero CPU involvement. Requires ConnectX-6 Dx or newer and CUDA 12.2+.
- Proxy backend: GPU threads write 64-byte descriptors to lock-free GPU-to-CPU queues. A dedicated CPU proxy thread (pinned near the GPU/NIC NUMA node) polls the queues, calls iput/iput_signal on the network plugin, and updates completion state in GPU-visible memory via GDRCopy. Supports any RDMA NIC and any CUDA version; adds latency but is universal.
Key Results
Evaluated on NVIDIA EOS cluster (576 nodes × 8 H100 80GB, NVLink4, 8×400 Gb/s IB per node). NCCL 2.28, NVSHMEM 3.4.5.
Point-to-point microbenchmark (ping-pong, 4B–4MB):
- NCCL GIN GDAKI: 16.7 µs round-trip for small messages (4–128 bytes)
- NCCL GIN Proxy: 18.0 µs
- NVSHMEM IBRC: 16.0 µs (best baseline)
- NVSHMEM IBGDA: 24.3 µs (worse than GDAKI)
- At large messages: all four converge (bandwidth-limited regime)
DeepEP High-Throughput (HT) kernels (training / prefill, 4096 tokens, 2–8 nodes):
- At 2 nodes (16 GPUs), BF16 dispatch: NCCL GIN 84.36 GB/s vs. NVSHMEM 84.97 GB/s — within 0.7%
- At 8 nodes (64 GPUs): both sustain ~53–54 GB/s RDMA dispatch bandwidth — within 1–2%
- All HT results are within 1–2% across scales, precisions (FP8/BF16), and operations (dispatch/combine)
DeepEP Low-Latency (LL) kernels (inference decode, 1–128 tokens, 1–8 nodes, NVLink+RDMA):
- At 1 node (8 GPUs): NCCL GIN dispatch 185.28 GB/s / 40.62 µs vs. NVSHMEM 182.15 GB/s / 41.43 µs — 1.7% higher bandwidth, 2% lower latency
- At 2 nodes: NCCL GIN 9% lower latency (142.51 µs vs. 157.00 µs) for dispatch
- Combine operations: within 1–3% across all scales
Pure RDMA (NVLink disabled, 1–8 nodes):
- At 1 node: NCCL GIN 47.00 GB/s / 160.82 µs vs. NVSHMEM 46.79 GB/s / 160.67 µs — essentially identical
- At 8 nodes: both ~34–35 GB/s bandwidth, 219–225 µs latency
Limitations
- NCCL 2.28 enforces symmetric window sizes: all ranks must register the same buffer size. Asymmetric capacity is planned for future releases and is needed for disaggregated serving, where prefill and decode nodes require different buffer sizes.
- 4 contexts per communicator is the current limit, requiring multiple communicators for workloads needing many QPs (DeepEP HT needs 24 QPs → 6 communicators; LL needs 8–16 QPs → 2–4 communicators).
- GDAKI requires modern hardware: ConnectX-6 Dx or newer NICs and CUDA 12.2+. Systems with older NICs must use Proxy backend with higher latency.
- Under active development: Batching WQEs and amortizing doorbell costs across multiple operations are planned optimizations that could further reduce GDAKI overhead.
- No comparison to host-initiated NCCL collectives for the same workloads — the paper focuses on GIN vs. NVSHMEM rather than GIN vs. traditional NCCL, making it hard to quantify GIN's benefit over the status quo for standard collective workloads.
- MoE focus: Evaluation is exclusively on MoE communication patterns (DeepEP). General collective communication (AllReduce, AllGather) performance is not evaluated with GIN.
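The communicator counts in the context-limit bullet above follow from simple ceiling division over the 4-context cap; a quick sketch (the helper name `comms_needed` is mine):

```python
CONTEXTS_PER_COMM = 4   # current per-communicator GIN context limit

def comms_needed(qps):
    # Ceiling division: spread the required QPs over enough communicators.
    return -(-qps // CONTEXTS_PER_COMM)

assert comms_needed(24) == 6   # DeepEP HT: 24 QPs -> 6 communicators
assert comms_needed(8) == 2    # DeepEP LL, lower bound
assert comms_needed(16) == 4   # DeepEP LL, upper bound
```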
Relevance to DynamICCL
High relevance as deep NCCL internals context; low direct applicability to current DynamICCL scope.
NCCL 2.28 Device API architecture. GIN is part of NCCL 2.28's Device API, which introduces a fundamentally new execution model alongside the traditional host-initiated collective API. DynamICCL currently targets the NCCL tuner plugin API (for collective parameter selection). GIN represents a different optimization axis: replacing host-initiated execution with device-initiated execution. Understanding GIN's architecture is important for DynamICCL's long-term evolution.
Proxy architecture mirrors NCCL's existing proxy thread model. NCCL traditionally uses CPU proxy threads to orchestrate network operations. GIN Proxy is a lock-free GPU-to-CPU queue variant of this same model. The NCCL internals described here (proxy threads, request queues, network plugin API) are the same internals that DynamICCL's tuner plugin interacts with.
Network plugin extensibility. GIN introduces a dual-backend network plugin architecture (GDAKI and Proxy). The NCCL network plugin API allows external vendors to extend NCCL. DynamICCL's tuner plugin uses a parallel plugin API (the NCCL tuner plugin) to inject algorithm/protocol/channel selection. Understanding how NCCL's plugin architecture works for GIN informs how DynamICCL's tuner plugin hooks into the same infrastructure.
MoE all-to-all as a new NCCL workload. GIN is motivated by MoE inference workloads, which are increasingly common in production LLM deployment (DeepSeek-V3, Mixtral). DynamICCL's current focus is on AllReduce/AllGather/ReduceScatter — but if MoE workloads become a target, DynamICCL's RL agent would need to handle the irregular all-to-all communication pattern that GIN is designed for.
Not directly applicable to DynamICCL's current mechanism. DynamICCL uses the NCCL tuner plugin API to select (algo, proto, nChannels, nThreads) for each collective call. GIN is a new Device API that bypasses the traditional tuner-callable execution path entirely: GIN operations are issued from within CUDA kernels, not orchestrated by the CPU-side tuner, so DynamICCL's RL agent currently has no visibility into GIN-based communication.