Detailed Summary: GPU-Initiated Networking for NCCL (GIN)
Citation: Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, Manjunath Gorentla Venkata. NVIDIA Corporation. arXiv:2511.15076v2, November 24, 2025.
PDF: 0016_GPU_Initiated_net_NCCL.pdf
Abstract
GIN (GPU-Initiated Networking) is part of NCCL 2.28's new Device API, which extends NCCL's host-initiated collective model with device-side communication primitives. GIN enables GPU threads to issue one-sided RDMA operations (put, signal, wait) directly from CUDA kernels for inter-node communication, eliminating CPU coordination overhead. GIN uses a three-layer architecture: NCCL Core host APIs for resource setup, Device GIN API for kernel-callable primitives, and a dual-backend network plugin (GDAKI via DOCA GPUNetIO for direct GPU-to-NIC; Proxy via CPU-assisted queues for broad compatibility). Validated via integration with DeepEP (a production MoE communication library), GIN achieves 16.7 µs round-trip latency for small messages (GDAKI) and competitive performance with NVSHMEM across high-throughput (HT) and low-latency (LL) MoE kernels on 8–64 NVIDIA H100 GPUs.
1. Motivation
1.1 The Host-Initiated Model's Limitations
Traditional GPU communication follows a host-initiated model dictated by the CUDA runtime:
- GPU kernel executes computation, queues communication descriptors into host-visible buffers.
- CPU proxy threads pick up descriptors and invoke RDMA operations (ibv_post_send, etc.).
- GPU kernel waits for CPU to signal completion before proceeding.
This model requires explicit host-device synchronization (kernel boundaries) and separate kernel launches for each communication call. For large-scale collective operations with millions of bytes, this overhead is negligible relative to transfer time. But modern AI workloads increasingly need:
- Low-latency point-to-point operations for inference token generation (one token at a time, dozens of bytes).
- Custom communication patterns for MoE architectures: dynamic token routing where each token is dispatched to one of N experts on potentially remote GPUs, with token counts per expert varying dynamically. This creates irregular all-to-all patterns where message sizes and destinations are not known until runtime.
- Computation-communication fusion in compiler-generated kernels (JAX, Triton): the kernel itself needs to issue communication operations inline, without breaking out to a host-side collective call.
The CUDA runtime model forces a kernel launch boundary between computation and communication, introducing overhead that dominates small-message latency.
1.2 Why NVSHMEM Alone Is Insufficient
NVSHMEM (NVIDIA OpenSHMEM) already provides device-callable one-sided operations (put, get, atomics from CUDA kernels). However, NVSHMEM operates as a standalone runtime separate from NCCL. This means NVSHMEM applications cannot use:
- NCCL's topology-aware collective algorithms (ring, tree, NVLS)
- NCCL's hierarchical communicator infrastructure (multi-dimensional parallelism)
- NCCL's fault tolerance and elasticity features
- NCCL's production deployment infrastructure used by PyTorch, TensorRT-LLM, vLLM, SGLang
GIN brings device-initiated communication capabilities inside NCCL's framework, enabling applications to use both modalities (host-initiated collectives and device-initiated point-to-point) within a single runtime.
2. Background
2.1 GPUDirect Technologies Evolution
GPUDirect RDMA (2013): RDMA NICs can directly access GPU memory via PCIe BAR mappings, eliminating CPU and host memory from the data path. The NIC's DMA engine performs PCIe peer-to-peer transactions to GPU BARs via the nvidia_p2p kernel module. Limitation: consistency guarantees only at kernel boundaries — GPU's relaxed memory ordering and write-back caching prevent safe concurrent access to RDMA-registered memory from executing kernels.
GPUDirect Async (2016): Partial control-path offload. GPU threads write to NIC doorbell registers mapped into the GPU address space, triggering pre-configured network operations. CPU must pre-construct communication descriptors. Limits GPU autonomy to pre-configured operations.
DOCA GPUNetIO: Production-grade device-side RDMA APIs for IB and RoCE. Exposes device functions enabling GPU kernels to directly program NICs (both GPUDirect RDMA for data movement and GPUDirect Async Kernel-Initiated for GPU NIC control). Forms the foundation for GIN's GDAKI backend.
2.2 NVSHMEM
Extends OpenSHMEM semantics to GPU clusters. Device-callable put, get, atomics invokable from CUDA kernels. Transport backends: IBGDA (InfiniBand with GPUDirect Async) for inter-node, symmetric memory for intra-node. Enables computation-communication interleaving without kernel launch overhead. Standalone runtime — separate from NCCL.
2.3 Hardware Requirements for GPU-Initiated Communication
- InfiniBand: native RDMA, ~130 ns port-to-port latency, credit-based flow control. Adapters expose memory-mapped queue pairs (QPs), completion queues (CQs), and doorbell registers via PCIe BARs for direct GPU access with GPUDirect RDMA.
- RoCE: RDMA over Ethernet. RoCEv2 (UDP/IP), ~400 ns latency, requires PFC + ECN (lossless Ethernet). Shares InfiniBand verbs API.
- Hardware requirement for GIN: NIC support for device-accessible control structures. NVIDIA ConnectX-6 Dx and later; BlueField DPUs.
- System requirement: GPU and NIC co-located on same PCIe root complex for minimal peer-to-peer latency. Multi-socket systems incur inter-socket traversal penalties.
2.4 NCCL Proxy Thread Architecture
NCCL uses CPU proxy threads to orchestrate network operations. GPU kernels write communication descriptors into host-visible queues; CPU threads execute them via network plugins. NCCL's network plugin architecture decouples the core library from specific network implementations: built-in plugins (Socket, InfiniBand) plus external plugins (libnccl-net.so). External plugins dynamically loaded via NCCL_NET_PLUGIN environment variable.
2.5 MoE Communication Patterns
MoE architectures route each input token to one of N expert networks. In distributed MoE inference, experts are distributed across GPUs. Dynamic token routing creates irregular all-to-all communication: each GPU sends tokens to variable numbers of expert GPUs, with message sizes proportional to token count (unknown until runtime). Traditional collective operations (AllToAll with fixed message sizes) are not well-suited. DeepEP and pplx-kernels provide specialized CUDA-optimized libraries for this pattern, using NVSHMEM/IBGDA.
3. NCCL Device API and GIN Architecture
3.1 NCCL 2.28 Device API Overview
NCCL 2.28 introduces the Device API with three operation modes:
NCCL Device API
├── Load/Store Accessible (LSA)
│ └── Intra-node via NVLink/PCIe memory-mapped access
├── Multimem
│ └── NVLink SHARP hardware multicast via multimem PTX instructions
└── GPU-Initiated Networking (GIN) ◄── This paper
└── Inter-node RDMA over InfiniBand/RoCE
Applications can either:
- Use NCCL-provided single-shot collective algorithms implemented with the Device API (operating over collective symmetric memory), or
- Directly invoke Device API primitives from their own GPU kernels to implement custom communication patterns.
This contrasts with traditional NCCL's host-initiated collectives that use pipeline primitives (Simple, LL, LL128) over regular memory.
3.2 GIN Three-Layer Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Layer 1: NCCL Core (Host-Side) │
│ • ncclDevCommCreate — creates device communicator with GIN resources │
│ • ncclCommWindowRegister — collective buffer registration across ranks │
│ • Returns window handles with remote access metadata (RDMA rkeys, etc.)│
└─────────────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────────────────▼───────────────────────────────────┐
│ Layer 2: Device GIN API (GPU-Callable, from CUDA kernels) │
│ ncclGin class: │
│ • put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes, [actions]) │
│ • putValue(team, peer, dstWin, dstOff, value, ...) │
│ • signal(team, peer, signalId) │
│ • flush(coop) — local completion (safe buffer reuse) │
│ • waitSignal(coop, id, expected) — remote completion (data visible) │
│ • readCounter(id) — poll local counter │
│ • resetSignal/resetCounter — for round reuse │
│ ncclGinBarrierSession — global network barrier │
└─────────────────────────────────────┬───────────────────────────────────┘
│ dispatches to backend
┌─────────────────┴──────────────────┐
│ │
┌───────────────────▼──────────────────┐ ┌─────────────▼──────────────────┐
│ GDAKI Backend │ │ Proxy Backend │
│ (GPUDirect Async Kernel-Initiated) │ │ (CPU-assisted) │
│ • GPU threads construct RDMA WQEs │ │ • GPU writes 64-byte descs to │
│ • Direct NIC doorbell register write │ │ lock-free GPU-to-CPU queues │
│ • NIC polls GPU memory for WQEs │ │ • Dedicated CPU proxy thread │
│ • NIC executes RDMA, updates CQ │ │ polls queues, calls iput │
│ • Zero CPU involvement │ │ • CPU updates completion state │
│ • Requires: ConnectX-6 Dx+, CUDA 12 │ │ • Requires: any RDMA NIC │
└──────────────────────────────────────┘ └────────────────────────────────┘
3.3 Key Design Elements
One-sided communication semantics. put
allows GPU threads to write to remote peer memory without receiver
coordination. No handshaking. Initiator unilaterally issues transfers
and independently tracks completion. Eliminates overhead of two-sided
handshaking protocols. Particularly effective for MoE's unpredictable
traffic patterns.
Memory windows (collective registration).
Communication buffers must be collectively registered across all ranks
via ncclCommWindowRegister. Each registration produces a
window handle encapsulating remote access metadata (RDMA rkeys, peer
addresses). All ranks can address each other's registered memory.
Current NCCL 2.28 enforces symmetric buffer sizes (all ranks same size);
asymmetric sizes planned for future releases to support disaggregated
serving (prefill nodes need larger buffers than decode nodes).
GIN contexts. Each context abstracts a channel between the GPU and NIC, encapsulating network resources and QP connections. Multiple contexts per communicator enable parallelism across NICs/ports/QPs. Current limit: 4 contexts per communicator. A single context can address every rank in the communicator. Multiple contexts allow independent concurrent communication streams.
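To illustrate how multiple contexts can carry independent streams, here is a minimal sketch built on the ncclGin constructor from Listing 1; the per-block context-assignment policy and the kernel itself are assumptions for illustration, not from the paper.
// Sketch: spread CUDA blocks across GIN contexts so different blocks drive
// different NIC queues concurrently (hypothetical policy; <= 4 contexts in 2.28).
__global__ void multiContextPut(ncclDevComm devComm,
                                ncclWindow_t dstWin, ncclWindow_t srcWin,
                                size_t bytesPerBlock, int peer, int numContexts) {
  int ctx = blockIdx.x % numContexts;       // each block binds to one context
  ncclGin gin(devComm, ctx);
  size_t off = blockIdx.x * bytesPerBlock;  // disjoint slice per block
  gin.put(ncclTeamWorld(devComm), peer,
          dstWin, off, srcWin, off, bytesPerBlock);
  gin.flush(ncclCoopCta());                 // local completion: safe to reuse srcWin slice
}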
Asynchronous completion tracking — two mechanisms:
- Counters (local): Track completion on the sender side. Per-operation tracking via the counterId parameter allows fine-grained pipelined algorithms. flush() drains the full context; waitCounter() waits for a specific operation's completion.
- Signals (remote symmetric objects): Confirm data arrival and visibility at the destination. ID-based (integer ID, not a memory address). When a signal operation completes at the destination, all preceding put operations to that peer on the same context are guaranteed visible.
Ordering semantics. GIN operations are unordered by
default (maximizes throughput). Exception: all put
operations to a given peer on the same context are guaranteed visible
before a subsequent signal to that peer completes —
lightweight per-peer ordering without global fence. Applications batch
multiple puts and attach signal to final operation.
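The batched put-then-signal pattern reads roughly as follows (a minimal device-side sketch using the Listing 1 API; the helper name and chunk layout are illustrative):
// Sketch: batch nChunks unordered puts to one peer, attaching a signal
// increment only to the last. Per-peer ordering makes all prior puts visible
// at the peer once it observes signal 0.
__device__ void putBatchWithSignal(ncclGin& gin, ncclDevComm devComm, int peer,
                                   ncclWindow_t dstWin, ncclWindow_t srcWin,
                                   size_t chunkBytes, int nChunks) {
  for (int i = 0; i < nChunks - 1; ++i)
    gin.put(ncclTeamWorld(devComm), peer,
            dstWin, i * chunkBytes, srcWin, i * chunkBytes, chunkBytes);
  gin.put(ncclTeamWorld(devComm), peer,
          dstWin, (nChunks - 1) * chunkBytes,
          srcWin, (nChunks - 1) * chunkBytes, chunkBytes,
          ncclGin_SignalInc{0});   // receiver side: gin.waitSignal(coop, 0, 1)
}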
4. Device API Details
4.1 Initialization Workflow
1. Create device communicator: ncclDevCommCreate(comm, flags) — allocates GIN contexts with QP connections to all peer ranks.
2. Register memory: ncclCommWindowRegister(devComm, ptr, size, &windowHandle) — collective registration across all ranks; returns a handle usable in GPU kernels.
3. Launch the kernel with devComm and the window handles.
4. Inside the kernel: instantiate ncclGin(devComm, contextIndex) and issue put/signal operations.
5. Synchronize using waitSignal before consuming received data.
6. Reset signals/counters for the next communication round.
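A host-side sketch of steps 1–3, using the call signatures as they appear in this summary; the output parameter on ncclDevCommCreate, the flags value, and the omission of error checking are assumptions, so the actual NCCL 2.28 headers may differ.
// Host side: create the device communicator, collectively register symmetric
// buffers, then launch the Listing 2 kernel with the device handles.
ncclDevComm devComm;
ncclDevCommCreate(comm, /*flags=*/0, &devComm);               // step 1 (out-param assumed)

void *sendBuf, *recvBuf;
cudaMalloc(&sendBuf, bufBytes);                                // same size on every rank
cudaMalloc(&recvBuf, bufBytes);
ncclWindow_t sendWin, recvWin;
ncclCommWindowRegister(devComm, sendBuf, bufBytes, &sendWin);  // step 2: collective call
ncclCommWindowRegister(devComm, recvBuf, bufBytes, &recvWin);  // step 2: collective call

ringExchange<<<1, 128>>>(devComm, sendWin, recvWin, msgBytes, myRank);  // step 3
cudaDeviceSynchronize();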
4.2 API Summary (from Listing 1)
class ncclGin {
ncclGin(ncclDevComm comm, int contextIndex);
// Data Movement
void put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes, ...);
void putValue(team, peer, dstWin, dstOff, value, ...);
void signal(team, peer, signalId);
// Local completion (safe buffer reuse)
void flush(coop);
uint64_t readCounter(counterId);
void waitCounter(coop, counterId, expected);
void resetCounter(counterId);
// Remote completion (data visible at destination)
uint64_t readSignal(signalId);
void waitSignal(coop, signalId, expected);
void resetSignal(signalId);
};
class ncclGinBarrierSession {
ncclGinBarrierSession(coop, gin, team, handle, index);
void sync(coop); // Global barrier
};
// Completion actions attached to put:
put(..., ncclGin_SignalInc{signalId}); // increment remote signal
put(..., ncclGin_CounterInc{counterId}); // increment local counter
4.3 Usage Example — Unidirectional Ring Exchange (Listing 2)
__global__ void ringExchange(ncclDevComm devComm,
ncclWindow_t sendWin, ncclWindow_t recvWin,
size_t dataSize, int myRank) {
ncclGin gin(devComm, 0); // Use context 0
int peer = (myRank + 1) % devComm.nRanks; // Next rank in ring
// Send data to peer; atomically increment peer's signal 0 on completion
gin.put(ncclTeamWorld(devComm), peer,
recvWin, myRank * dataSize, // dst: peer's recvWin at offset
sendWin, peer * dataSize, dataSize, // src: local sendWin at offset
ncclGin_SignalInc{0}); // attach signal increment action
// Wait for predecessor's data to arrive (local signal 0 == 1)
gin.waitSignal(ncclCoopCta(), 0, 1);
gin.resetSignal(0); // Reset for next round
}
This implements a common ring step: each rank sends to successor and waits for its predecessor.
5. Backend Implementations
5.1 GDAKI Backend — Direct GPU-to-NIC
When put is called in a GDAKI-backend kernel:
- GPU threads construct an RDMA Work Queue Entry (WQE) in device memory.
- GPU threads write directly to the NIC's doorbell register (memory-mapped into GPU address space via PCIe BAR).
- NIC hardware polls GPU memory for new WQEs, executes RDMA transactions over IB/RoCE, updates Completion Queue entries in GPU-visible memory.
- Completion is polled by GPU threads (readCounter / waitSignal).
No CPU involvement. For small messages, eliminates the GPU-CPU-NIC round-trip. Requires: ConnectX-6 Dx or newer, CUDA 12.2+, nv_peer_mem or dmabuf kernel module, properly configured GPUDirect RDMA (co-located GPU-NIC PCIe).
5.2 Proxy Backend — CPU-Assisted
When put is called in a Proxy-backend kernel:
- GPU thread writes a 64-byte descriptor to lock-free queue in CPU-visible memory (fire-and-forget).
- Descriptor contains: source/destination window handles, inline value (optional), offsets, sizes, completion actions.
- Dedicated CPU proxy thread (one per communicator, NUMA-pinned near the GPU and NIC) polls the queue and calls the network plugin's iput/iput_signal interface.
- Plugin maps to standard IB verbs or another RDMA API.
- Proxy thread polls completions via the plugin's test interface and updates completion state in GPU-visible memory (GPU- or CPU-resident, depending on GDRCopy availability).
CPU overhead: one proxy thread per communicator continuously polling. Latency overhead vs. GDAKI: measured as ~18.0 µs vs. 16.7 µs for small messages. Advantage: supports any RDMA NIC, any CUDA version, Volta-or-newer GPUs.
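To make the descriptor flow concrete, here is a purely hypothetical layout sketch of the kind of 64-byte record described above; the field names, widths, and packing are illustrative and not taken from NCCL source.
#include <cstdint>

// Hypothetical 64-byte GPU-to-CPU work descriptor (illustrative only).
struct alignas(64) GinProxyDesc {
  uint32_t opcode;       // put, putValue, signal, ...
  uint32_t peer;         // destination rank
  uint32_t dstWinId;     // destination window handle
  uint32_t srcWinId;     // source window handle (unused for putValue)
  uint64_t dstOffset;    // offset into destination window
  uint64_t srcOffset;    // offset into source window
  uint64_t bytes;        // transfer size
  uint64_t inlineValue;  // optional immediate payload (putValue)
  uint32_t counterId;    // local completion action, if any
  uint32_t signalId;     // remote completion action, if any
  uint64_t seq;          // slot sequence number for the lock-free ring
};
static_assert(sizeof(GinProxyDesc) == 64, "descriptor should fit one cache line");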
5.3 Backend Comparison
| Feature | GDAKI | Proxy |
|---|---|---|
| Communication path | Direct GPU ↔ NIC | GPU → CPU ↔ NIC |
| CPU involvement | None | Required (1 thread/comm) |
| Progress model | NIC polls GPU memory | CPU polls queue, posts to NIC |
| Operation posting | GPU rings NIC doorbell | GPU writes descriptor; CPU posts |
| Hardware req. | ConnectX-6 Dx+, CUDA 12.2+ | Any RDMA NIC, any CUDA |
| Small msg latency | 16.7 µs (H100, HDR IB) | 18.0 µs |
| Debugging | Device-side tools only | Host-side inspection + tracing |
| Portability | GPU-NIC direct access required | Universal |
| Use case | Production HPC/AI clusters | Development, legacy, multi-vendor |
Backend selection: automatic at ncclCommInitRank time
via capability probe for DOCA GPUNetIO support. Override via
NCCL_GIN_BACKEND environment variable.
6. DeepEP Integration
6.1 Integration Requirements
DeepEP (DeepSeek's MoE communication library) uses NVSHMEM IBGDA for device-initiated all-to-all communication. Integrating GIN into DeepEP requires:
- High QP parallelism: HT kernels need 24 QPs; LL kernels need 8–16 QPs.
- Heterogeneous topology: HT uses symmetric rank-to-rank RDMA + NVLink forwarding; LL uses full all-to-all RDMA mesh.
- Fine-grained synchronization: atomic head/tail pointer updates for circular buffer flow control.
- Backend coexistence: NVSHMEM IBGDA and GIN must coexist for user preference flexibility.
6.2 Integration Strategy
- Multi-communicator mapping: GIN provides 4 contexts per communicator. For the 24 QPs needed by HT: ⌈24/4⌉ = 6 communicators, with channel selection comm_id = id/4; ctx_id = id%4 (see the sketch after this list).
- Memory management: registered buffers store device-accessible window handles in GPU memory; kernels translate pointer arithmetic into (window, offset) pairs.
- Synchronization: pre-allocated signal layouts map memory-based atomics to signal primitives (HT: 2 signals/channel for head/tail; LL: 1 signal/expert).
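A small sketch of the channel-to-(communicator, context) mapping described above; the helper name and the array-of-communicators layout are assumptions for illustration.
// Sketch: map a DeepEP channel/QP id onto one of the ceil(nQPs/4) communicators,
// each contributing up to 4 GIN contexts (per the strategy above).
__device__ inline ncclGin ginForChannel(const ncclDevComm* devComms, int id) {
  int commId = id / 4;   // which communicator
  int ctxId  = id % 4;   // which of its (up to) 4 contexts
  return ncclGin(devComms[commId], ctxId);
}
// Example: 24 HT QPs -> ids 0..23 -> communicators 0..5, each using contexts 0..3.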
6.3 NVSHMEM vs. GIN API Mapping (Table II)
| Aspect | NVSHMEM / IBGDA | NCCL GIN |
|---|---|---|
| Memory model | PGAS with symmetric heap; pointer-based | Window-based; offset addressing |
| Data transfer | put_nbi(dst_ptr, src_ptr, count, pe) | put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes) |
| Synchronization | Memory atomics: atomic_add, atomic_fetch on remote memory | Signal atomics: signal(peer, id), readSignal(id) |
| Completion | quiet() per QP; blocks until complete | flush() per context |
| Barrier | barrier(team) / barrier_all() | ncclGinBarrierSession.sync() |
Key semantic translation: zero-byte puts with atomic signals emulate NVSHMEM's release-acquire semantics (all prior data transfers visible before signal completes).
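A minimal sketch of that translation, with a zero-byte put carrying a SignalInc action in place of NVSHMEM's remote atomic; the helper names and signal id scheme are illustrative.
// Sender: a zero-byte put with a SignalInc gives release semantics — all earlier
// puts to `peer` on this context are visible at the destination once the signal is.
__device__ void releaseTo(ncclGin& gin, ncclDevComm devComm, int peer,
                          ncclWindow_t dstWin, ncclWindow_t srcWin, int signalId) {
  gin.put(ncclTeamWorld(devComm), peer,
          dstWin, /*dstOff=*/0, srcWin, /*srcOff=*/0, /*bytes=*/0,
          ncclGin_SignalInc{signalId});
}
// Receiver: waiting on the signal gives acquire-like visibility of the data.
__device__ void acquireFrom(ncclGin& gin, int signalId, uint64_t expected) {
  gin.waitSignal(ncclCoopCta(), signalId, expected);
}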
6.4 High-Throughput Kernel
HT kernels: 4096 tokens, hierarchical RDMA + NVLink. SM specialization: odd-numbered SMs as Senders (transmit to remote ranks) and NVLink Receivers (final destinations); even-numbered SMs as Forwarders (receive RDMA tokens, forward via NVLink). Separate channels for data/tail updates vs. head-pointer updates to reduce contention.
Signal usage: remote tail signals via
gin.signal(SignalAdd, 1); head-pointer flow control via
gin.readSignal(signal_id) polling.
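A hedged sketch of the head/tail flow-control idea using the signal primitives above; the ring constants, signal ids, and function shape are illustrative, not DeepEP's actual code.
constexpr uint64_t RING_SLOTS = 8;        // assumed ring depth
constexpr size_t   SLOT_BYTES = 16384;    // assumed slot size

// Sender-side back-pressure on a remote circular buffer: the consumer publishes
// its head by signaling headSignalId back to us; we produce a slot only when
// (tail - head) < RING_SLOTS.
__device__ void producerPutSlot(ncclGin& gin, ncclDevComm devComm, int peer,
                                ncclWindow_t ringWin, ncclWindow_t srcWin,
                                size_t srcOff, uint64_t tail,
                                int headSignalId, int tailSignalId) {
  while (gin.readSignal(headSignalId) + RING_SLOTS <= tail) {
    // spin: remote consumer has not freed a slot yet
  }
  gin.put(ncclTeamWorld(devComm), peer,
          ringWin, (tail % RING_SLOTS) * SLOT_BYTES,
          srcWin, srcOff, SLOT_BYTES,
          ncclGin_SignalInc{tailSignalId});   // advances the remote tail signal
}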
6.5 Low-Latency Kernel
LL kernels: 1–128 tokens, full all-to-all RDMA mesh. Per-expert
signal allocation. Hybrid NVLink-RDMA: checks NVLink availability via
nccl_get_p2p_ptr; uses warp memory ops for NVLink,
put() for RDMA. Expert allocation: G = ⌈N/S⌉ warp groups
per SM; expert_idx = sm_id × G + warp_group_id. FP8 quantization applied
during token copy into send buffers. Zero-byte puts with SignalAdd
implement release-acquire for count delivery.
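The warp-group-to-expert assignment above as a small sketch; WARPS_PER_GROUP and the one-block-per-SM launch assumption are illustrative.
// G = ceil(numLocalExperts / numSMs) warp groups per SM; warp group g on SM s
// handles expert s * G + g (per the formula above).
constexpr int WARPS_PER_GROUP = 2;                                 // assumed grouping
__device__ inline int myExpertIdx(int numLocalExperts) {
  int smId        = blockIdx.x;                                    // one block per SM
  int warpGroupId = (threadIdx.x / warpSize) / WARPS_PER_GROUP;
  int G = (numLocalExperts + gridDim.x - 1) / gridDim.x;           // ceil(N / S)
  int e = smId * G + warpGroupId;
  return (e < numLocalExperts) ? e : -1;                           // -1: no expert assigned
}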
7. Performance Evaluation
7.1 Cluster
NVIDIA EOS cluster: 576 DGX H100 nodes.
- GPU: 8× H100 80GB HBM3 per node
- GPU memory bandwidth: 3.2 TB/s
- NVLink: 4th generation, 900 GB/s bidirectional, 18 links per GPU
- InfiniBand: 8×400 Gb/s per node (compute), 2×400 Gb/s (storage)
- CPU: Intel Xeon Platinum 8480CL, 112 cores, 2 TB RAM
Baselines: NVSHMEM 3.4.5 (IBGDA for DeepEP kernels), NCCL 2.28. DeepEP version 1.2.1. DeepEP benchmarks: 24 SMs per GPU.
7.2 Point-to-Point Microbenchmarks (Figure 4)
Ping-pong test between 2 H100 GPUs, 4 bytes to 4 MB:
| Backend | Small msg latency (4–128B) |
|---|---|
| NCCL GIN GDAKI | 16.7 µs |
| NCCL GIN Proxy | 18.0 µs |
| NVSHMEM IBRC | 16.0 µs |
| NVSHMEM IBGDA | 24.3 µs |
Notable: NCCL GIN GDAKI (16.7 µs) is comparable to NVSHMEM IBRC (16.0 µs) and better than NVSHMEM IBGDA (24.3 µs). At large messages: all four converge — bandwidth-limited, no algorithmic differentiation.
7.3 HT Kernel Results (Figure 5)
BF16 precision, 2/4/8 nodes (16/32/64 GPUs), 4096 tokens:
| Config | NCCL GIN RDMA BW | NVSHMEM RDMA BW | Delta |
|---|---|---|---|
| 2 nodes, BF16 dispatch | 84.36 GB/s | 84.97 GB/s | -0.7% |
| 8 nodes, BF16 dispatch | ~53–54 GB/s | ~53–54 GB/s | <2% |
| FP8 variants | similar | similar | 1–2% |
All HT results within 1–2% across all scales, precisions, and operations (dispatch/combine). GIN successfully preserves HT throughput while enabling standardization on NCCL infrastructure.
7.4 LL Kernel Results with NVLink (Figures 6, 7)
BF16 precision, 1/2/4/8 nodes (8/16/32/64 GPUs), 1–128 tokens, hidden dim 7168, message size ~14 KB per token transfer:
| Config | NCCL GIN dispatch | NVSHMEM dispatch | BW delta | Latency delta |
|---|---|---|---|---|
| 1 node (8 GPUs) | 185.28 GB/s / 40.62 µs | 182.15 GB/s / 41.43 µs | +1.7% | -2.0% |
| 2 nodes (16 GPUs) | ~160 GB/s / 142.51 µs | ~155 GB/s / 157.00 µs | similar | -9.2% |
| 4 nodes (32 GPUs) | comparable | comparable | within 2% | within 2–5% |
NCCL GIN consistently shows lower latency at 2-node scale (9% better). Combine operations within 1–3% across all scales.
7.5 LL Kernel Results — Pure RDMA (Figures 8, 9)
NVLink disabled; all communication via RDMA only:
| Config | NCCL GIN dispatch | NVSHMEM dispatch |
|---|---|---|
| 1 node | 47.00 GB/s / 160.82 µs | 46.79 GB/s / 160.67 µs |
| 8 nodes | ~34–35 GB/s / 219–225 µs | ~34–35 GB/s / 219–225 µs |
Essentially identical across all configurations in pure RDMA mode. NCCL GIN functional correctness confirmed without NVLink.
8. Related Work
OpenSHMEM: Established PGAS semantics and one-sided communication for HPC. Early GPU extensions were CPU-mediated. NVSHMEM extended OpenSHMEM to CUDA kernels, achieving 60–75% speedups by eliminating kernel launch overhead (cited from related work).
GPUrdma, GIO (2016, 2020): Early device-initiated RDMA prototypes. GPUrdma: GPU-side IB verbs. GIO: correct and efficient intra-kernel networking for discrete GPUs. Both faced GPU-NIC memory consistency challenges. GIO achieved 44% improvement for irregular applications via kernel driver extensions.
DOCA GPUNetIO: Foundation for GIN's GDAKI backend. Production-grade, used by DeepEP's IBGDA transport.
MoE Libraries: DeepSpeed-MoE (hierarchical parallelism), FasterMoE (expert scheduling optimization), Tutel (adaptive MoE), DeepEP and pplx-kernels (low-latency GPU-initiated primitives outside NCCL ecosystem). GIN is the first to bring device-initiated primitives inside NCCL.
NCCL (prior work): Provides topology-aware collective algorithms and production infrastructure but used CPU proxy threads for all network operations prior to NCCL 2.28.
Positioning: GIN uniquely integrates device-initiated network primitives into NCCL's production infrastructure via dual backends. Unlike NVSHMEM (standalone runtime), GIN preserves NCCL ecosystem compatibility (hierarchical communicators, fault tolerance, topology-aware collectives).
9. Limitations
Current NCCL 2.28 symmetric window size constraint. All ranks must register the same buffer size. Disaggregated serving architectures where prefill nodes need larger buffers than decode nodes require the planned asymmetric capability. This is not a fundamental limitation — it's an implementation simplification that will be lifted.
4 contexts per communicator. DeepEP HT kernels require 24 QPs → 6 communicators needed. This creates code complexity for QP management and increases resource usage. Future work may increase the context limit.
GDAKI hardware requirements. ConnectX-6 Dx or newer NICs, CUDA 12.2+. Many existing deployments with older hardware cannot use GDAKI and must fall back to Proxy.
No comparison to host-initiated NCCL. The paper evaluates GIN vs. NVSHMEM for MoE workloads but does not compare GIN to traditional NCCL host-initiated AllToAll for the same workloads. This makes it difficult to quantify GIN's benefit over the status quo for standard workloads.
MoE-centric evaluation. All application-level results are on DeepEP (MoE kernels). Performance for other device-initiated use cases (inference token generation, JAX/Triton fused kernels) is not evaluated.
Under-optimized GDAKI backend. Planned optimizations include: batching WQEs across multiple operations, amortizing doorbell costs, which could further reduce GDAKI overhead at small message sizes.
No non-RDMA network support for GDAKI. GDAKI requires hardware-supported device verbs (currently InfiniBand/RoCE via DOCA GPUNetIO). GPU-to-CPU Ethernet deployments cannot use GDAKI.
10. Section-by-Section Paragraph Summaries
Section I — Introduction
Motivates device-initiated communication for MoE architectures, inference token generation, and compiler-generated fusion kernels. Introduces NCCL 2.28's Device API (LSA, Multimem, GIN). Presents GIN's three-layer architecture and dual backend design. Lists four contributions: GIN design+implementation with host/device APIs and dual backends; DeepEP integration; comprehensive benchmarking; comparative analysis with NVSHMEM.
Section II — Background
Reviews GPUDirect RDMA (2013), GPUDirect Async (2016), DOCA GPUNetIO. Describes NVSHMEM's device-callable one-sided operations and IBGDA transport. Discusses InfiniBand vs. RoCE hardware requirements for GPU-initiated communication. Explains NCCL proxy thread architecture and network plugin API. Introduces MoE communication patterns (dynamic token routing, irregular all-to-all) and existing specialized libraries (DeepEP, pplx-kernels).
Section III-A — Core Principles
One-sided communication semantics eliminating receiver coordination. Symmetric memory windows (MPI RMA model) with current symmetric size enforcement. GIN contexts as network parallelism abstraction (4 per communicator; one context per channel → QPs to NIC). Asynchronous completion tracking (Counters for local, Signals for remote). Ordering semantics (unordered by default; put-then-signal guarantees per-peer ordering without global fence).
Section III-B — Device API
Three-phase workflow: initialization (ncclDevCommCreate + ncclCommWindowRegister), execution (ncclGin instantiation + put/signal), synchronization (waitSignal + reset). API summary (Listing 1) with put, putValue, signal, flush, readCounter, waitCounter, readSignal, waitSignal, reset primitives. Unidirectional ring exchange example (Listing 2): put with SignalInc action, waitSignal, resetSignal.
Section III-C — Backend Implementations
GDAKI: GPU constructs WQE → writes NIC doorbell → NIC polls GPU memory → RDMA execution → CQ update in GPU memory. Proxy: GPU writes 64-byte descriptor → CPU proxy polls → CPU calls iput → CPU updates completion. Comparison table (Table I). Backend selection: automatic probe for DOCA GPUNetIO; NCCL_GIN_BACKEND override.
Section IV — DeepEP Integration
Four integration requirements (QP parallelism, topology support, synchronization, backend coexistence). Multi-communicator mapping for QP requirements. Operation semantic mapping (Table II). HT kernel: odd/even SM specialization (Senders/Forwarders/NVLink Receivers), signal-based atomics for flow control, single-threaded put with syncwarp. LL kernel: per-expert signals, hybrid NVLink-RDMA, FP8 quantization, release-acquire via zero-byte put+SignalAdd.
Section V — Performance Evaluation
Cluster specs (EOS, 576 DGX H100 nodes). Point-to-point: GDAKI 16.7 µs comparable to NVSHMEM IBRC 16.0 µs (both better than NVSHMEM IBGDA 24.3 µs). HT results within 1–2% of NVSHMEM across all scales. LL with NVLink: NCCL GIN 9% lower latency at 2 nodes; otherwise within 1–3%. Pure RDMA: essentially identical. All results confirm GIN as NVSHMEM-equivalent within NCCL infrastructure.
Section VI — Related Work
Device-initiated communication history (OpenSHMEM, GPUrdma, GIO, NVSHMEM, DOCA GPUNetIO). Collective runtimes (NCCL, MPI RMA, UCX, UCC). MoE libraries (DeepSpeed-MoE, FasterMoE, Tutel, DeepEP, pplx-kernels). GIN's positioning: unique integration of device-initiated primitives into NCCL's production infrastructure with dual backends.
Section VII — Conclusions and Future Work
GIN validated as practical for device-initiated networking within NCCL. GDAKI: 16.7 µs small message latency. DeepEP: competitive with NVSHMEM across HT and LL kernels. Value: ecosystem unification (LSA + Multimem + GIN in one runtime), preservation of NCCL production features (hierarchical communicators, fault tolerance, topology-aware optimization). Future: PyTorch distributed, TensorRT-LLM, vLLM, SGLang, JAX/Triton; additional one-sided primitives.
11. Relevance to DynamICCL
High context relevance; low direct applicability in current phase.
1. NCCL 2.28 architecture — the state of the art.
GIN describes NCCL's most recent architecture evolution (Device API in
2.28, released November 2025). DynamICCL targets NCCL's tuner plugin API
for (algo, proto, nChannels, nThreads) selection.
Understanding how NCCL 2.28 extends its architecture is necessary for
ensuring DynamICCL remains compatible with and relevant to current NCCL
versions.
2. NCCL proxy thread model. The Proxy backend for
GIN uses the same CPU proxy thread model that NCCL has always used for
network operations. The paper explains how GPU-to-CPU lock-free queues
work, how the CPU proxy polls and dispatches operations, and how
completion state is returned to the GPU. DynamICCL's tuner plugin
intercepts at the layer above this — the ncclTuner_v3 API —
which decides algorithm and protocol selection before the proxy threads
are engaged. This architecture is complementary.
3. NCCL network plugin extensibility model. GIN introduces dual plugin semantics (GDAKI and Proxy) via NCCL's external plugin architecture (libnccl-net.so). DynamICCL uses the tuner plugin (libnccl-tuner.so). Both are external plugin mechanisms that NCCL loads dynamically. Understanding how GIN's network plugin hooks work clarifies the plugin boundary between DynamICCL's tuner plugin and NCCL's core.
4. MoE as a future DynamICCL workload. If DynamICCL expands beyond standard collective operations (AllReduce, AllGather, ReduceScatter) to MoE workloads, GIN's architecture becomes directly relevant. MoE's irregular all-to-all patterns would require Agent-2 to learn configurations for device-initiated communication rather than host-initiated collectives — a fundamentally different optimization problem.
5. The collective symmetric memory model. GIN's Device API operates over collective symmetric memory (window registration), contrasting with NCCL's pipeline primitives over regular memory. DynamICCL currently targets the pipeline primitive model (algo/proto selection affects pipeline behavior). NCCL 2.28's Device API provides a parallel execution model that DynamICCL's current tuner plugin does not interact with.
6. Benchmark design reference. The paper's performance evaluation methodology (microbenchmarks + application-level integration, multiple scales, multiple precisions, HT vs. LL kernel separation) provides a reference for how DynamICCL's evaluation should be structured: both micro (collective-level latency vs. algo/proto) and macro (end-to-end training step improvement).
7. Future DynamICCL extension direction. NCCL 2.28 introduces a new axis of device-API tuning (which GIN context to use, how many QPs, how to allocate communicators). A future version of DynamICCL's RL agent could potentially learn GIN resource allocation policies for MoE workloads. This is speculative but motivated by the trajectory of NCCL development toward device-initiated primitives.