Brief Summary: GPU-Initiated Networking for NCCL (GIN)

Citation: Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, Manjunath Gorentla Venkata. NVIDIA Corporation. arXiv:2511.15076v2, November 24, 2025.


Problem

Traditional GPU communication follows a host-initiated model: GPU kernels queue communication descriptors, and CPU proxy threads execute network operations via RDMA. This model works well for large-scale collective communication (AllReduce, AllGather) but is suboptimal for workloads requiring tight computation-communication integration. Specifically, Mixture-of-Experts (MoE) architectures require fine-grained, irregular all-to-all token routing with dynamically varying message sizes — patterns that benefit from the GPU directly initiating RDMA operations from within CUDA kernels, without the CPU coordination overhead. Libraries like NVSHMEM provide device-initiated communication but operate as a separate runtime outside NCCL's ecosystem, preventing them from leveraging NCCL's topology-aware algorithms, hierarchical communicators, and fault-tolerance infrastructure.

Core Insight

Extend NCCL with a Device API that supports three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe intra-node, Multimem for NVLink SHARP hardware multicast, and GPU-Initiated Networking (GIN) for inter-node RDMA. GIN allows GPU threads to issue one-sided RDMA operations (put, signal, wait) directly from CUDA kernels, eliminating CPU coordination overhead for fine-grained communication patterns while remaining within NCCL's production infrastructure. A dual-backend architecture (GDAKI for direct GPU-to-NIC via DOCA GPUNetIO; Proxy for CPU-assisted fallback) ensures broad hardware support.

Method

GIN is built on a three-layer architecture:

Layer 1 — NCCL Core (host-side): Extends NCCL's communicator initialization to support GIN contexts. Key additions: ncclDevCommCreate (creates a device communicator with GIN resources), ncclCommWindowRegister (collectively registers memory buffers across all ranks, returns window handles with remote access metadata). Each GIN context abstracts a communication channel with queue pairs (QPs) to the NIC; a communicator supports 4 contexts.
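A minimal host-side sketch of this flow. Only the function names ncclDevCommCreate and ncclCommWindowRegister come from the summary above; the argument lists, the ncclWindow_t / ncclDevComm / ncclDevCommRequirements type names, and the error-checking macro are illustrative assumptions, not the shipped API.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical error-checking helper for this sketch.
#define NCCL_CHECK(cmd) do {                                   \
    ncclResult_t r_ = (cmd);                                   \
    if (r_ != ncclSuccess) {                                   \
      std::fprintf(stderr, "NCCL error %d at %s:%d\n",         \
                   (int)r_, __FILE__, __LINE__);               \
      std::exit(1);                                            \
    }                                                          \
  } while (0)

// Registers a buffer collectively and builds a device communicator with GIN
// resources. Signatures and type names are assumptions.
void setupGin(ncclComm_t comm, size_t bufBytes) {
  // 1. Allocate the communication buffer on the GPU.
  void* buf = nullptr;
  cudaMalloc(&buf, bufBytes);

  // 2. Collectively register the buffer across all ranks; the returned
  //    window handle carries the remote-access metadata GIN needs.
  ncclWindow_t win;
  NCCL_CHECK(ncclCommWindowRegister(comm, buf, bufBytes, &win, /*winFlags=*/0));

  // 3. Create the device communicator holding the GIN contexts (QPs to the NIC).
  ncclDevCommRequirements reqs = {};   // e.g. how many GIN contexts to create
  ncclDevComm devComm;
  NCCL_CHECK(ncclDevCommCreate(comm, &reqs, &devComm));

  // devComm and win are then passed to CUDA kernels that call the
  // device-side ncclGin methods (Layer 2 below).
}
```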

Layer 2 — Device GIN API (GPU-callable): The ncclGin class provides GPU kernel-callable methods: put(team, peer, dstWindow, dstOffset, srcWindow, srcOffset, bytes) for one-sided RDMA writes, signal(peer, signalId) for remote notification, waitSignal(signalId, expected) for receive-side synchronization, flush() for local completion, and resetSignal/resetCounter for reuse. Ordering guarantee: all put operations to a given peer on the same context are visible before a subsequent signal to that peer completes.
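A device-side sketch of the put + signal / waitSignal pattern from a kernel's point of view. The method names and argument order follow the list above; the ncclGin constructor, the ncclTeam type, the signal IDs, and the launch configuration are assumptions.

```cpp
// Rank exchanges one buffer with `peer` and synchronizes. Only the ncclGin
// method names and argument order come from the summary; everything else
// (constructor, ncclTeam, signal IDs) is illustrative.
__global__ void exchangeKernel(ncclDevComm devComm, ncclTeam team,
                               ncclWindow_t dstWin, ncclWindow_t srcWin,
                               int peer, size_t bytes) {
  // Assumed: one ncclGin handle bound to GIN context 0 of the device comm.
  ncclGin gin(devComm, /*context=*/0);

  if (blockIdx.x == 0 && threadIdx.x == 0) {
    // One-sided RDMA write into the peer's registered window.
    gin.put(team, peer, dstWin, /*dstOffset=*/0, srcWin, /*srcOffset=*/0, bytes);

    // Notify the peer; all prior puts to `peer` on this context are visible
    // before this signal completes (the ordering guarantee above).
    gin.signal(peer, /*signalId=*/0);

    // Wait until the peer's symmetric signal arrives before reading its data.
    gin.waitSignal(/*signalId=*/0, /*expected=*/1);

    // Ensure local completion of outstanding operations, then reset the
    // signal slot so the kernel can be reused next iteration.
    gin.flush();
    gin.resetSignal(/*signalId=*/0);
  }
}
```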

Layer 3 — Network Plugin (GDAKI or Proxy): The GDAKI backend uses DOCA GPUNetIO so that GPU threads drive the NIC directly; the Proxy backend is the CPU-assisted fallback, in which GPU threads enqueue requests onto a lock-free GPU-to-CPU queue that CPU proxy threads drain and post to the network.
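The Proxy backend's general idea (GPU produces work descriptors, a CPU thread consumes and posts them) can be illustrated with a single-producer/single-consumer ring in host-mapped memory. This is a generic sketch of the pattern only, not GIN's actual queue layout; all names and sizes are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical descriptor for one queued RDMA put.
struct PutDesc {
  int      peer;
  uint64_t dstOffset, srcOffset, bytes;
};

// Single-producer (GPU) / single-consumer (CPU) ring in pinned, host-mapped
// memory. Sizes and field names are illustrative.
struct ProxyQueue {
  static const uint32_t kSlots = 1024;
  PutDesc  slots[kSlots];
  volatile uint32_t tail;   // advanced by the GPU producer
  volatile uint32_t head;   // advanced by the CPU proxy thread
};

// GPU side: a single producer thread publishes a descriptor.
__device__ void enqueuePut(ProxyQueue* q, const PutDesc& d) {
  uint32_t t = q->tail;
  while (t - q->head >= ProxyQueue::kSlots) { /* ring full: spin */ }
  q->slots[t % ProxyQueue::kSlots] = d;
  __threadfence_system();   // make the slot visible to the CPU first,
  q->tail = t + 1;          // then publish the new tail
}

// CPU side: the proxy thread drains the ring and posts the RDMA writes
// through the host network path (elided).
void proxyLoop(ProxyQueue* q, volatile bool* stop) {
  while (!*stop) {
    uint32_t t = q->tail;                                 // snapshot producer position
    std::atomic_thread_fence(std::memory_order_acquire);  // slots written before tail
    while (q->head != t) {
      PutDesc d = q->slots[q->head % ProxyQueue::kSlots];
      // ... post a one-sided RDMA write for `d` ...
      q->head = q->head + 1;                              // hand the slot back
    }
  }
}
```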

Key Results

Evaluated on NVIDIA EOS cluster (576 nodes × 8 H100 80GB, NVLink4, 8×400 Gb/s IB per node). NCCL 2.28, NVSHMEM 3.4.5.

Point-to-point microbenchmark (ping-pong, 4B–4MB):

DeepEP High-Throughput (HT) kernels (training / prefill, 4096 tokens, 2–8 nodes):

DeepEP Low-Latency (LL) kernels (inference decode, 1–128 tokens, 1–8 nodes, NVLink+RDMA):

Pure RDMA (NVLink disabled, 1–8 nodes):

Limitations

Relevance to DynamICCL

High relevance as deep NCCL internals context; low direct applicability to current DynamICCL scope.

  1. NCCL 2.28 Device API architecture. GIN is part of NCCL 2.28's Device API, which introduces a fundamentally new execution model alongside the traditional host-initiated collective API. DynamICCL currently targets the NCCL tuner plugin API (for collective parameter selection). GIN represents a different optimization axis: replacing the host-initiated execution model with device-initiated execution. Understanding GIN's architecture is important for DynamICCL's long-term evolution.

  2. Proxy architecture mirrors NCCL's existing proxy thread model. NCCL traditionally uses CPU proxy threads to orchestrate network operations. GIN Proxy is a lock-free GPU-to-CPU queue variant of this same model. The NCCL internals described here (proxy threads, request queues, network plugin API) are the same internals that DynamICCL's tuner plugin interacts with.

  3. Network plugin extensibility. GIN introduces a dual-backend network plugin architecture (GDAKI and Proxy). The NCCL network plugin API allows external vendors to extend NCCL. DynamICCL's tuner plugin uses a parallel plugin API (the NCCL tuner plugin) to inject algorithm/protocol/channel selection. Understanding how NCCL's plugin architecture works for GIN informs how DynamICCL's tuner plugin hooks into the same infrastructure.

  4. MoE all-to-all as a new NCCL workload. GIN is motivated by MoE inference workloads, which are increasingly common in production LLM deployment (DeepSeek-V3, Mixtral). DynamICCL's current focus is on AllReduce/AllGather/ReduceScatter — but if MoE workloads become a target, DynamICCL's RL agent would need to handle the irregular all-to-all communication pattern that GIN is designed for.

  5. Not directly applicable to DynamICCL's current mechanism. DynamICCL uses the NCCL tuner plugin API to select (algo, proto, nChannels, nThreads) for each collective call. GIN is a new Device API that bypasses the traditional tuner-callable execution path entirely. GIN operations are called from within CUDA kernels, not orchestrated by the CPU tuner. DynamICCL's RL agent currently has no visibility into GIN-based communication.