Detailed Summary: GPU-Initiated Networking for NCCL (GIN)
Citation: Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, Manjunath Gorentla Venkata. NVIDIA Corporation. arXiv:2511.15076v2, November 24, 2025.
PDF: 0016_GPU_Initiated_net_NCCL.pdf
Abstract
GIN (GPU-Initiated Networking) is part of NCCL 2.28's new Device API, which extends NCCL's host-initiated collective model with device-side communication primitives. GIN enables GPU threads to issue one-sided RDMA operations (put, signal, wait) directly from CUDA kernels for inter-node communication, eliminating CPU coordination overhead. GIN uses a three-layer architecture: NCCL Core host APIs for resource setup, Device GIN API for kernel-callable primitives, and a dual-backend network plugin (GDAKI via DOCA GPUNetIO for direct GPU-to-NIC; Proxy via CPU-assisted queues for broad compatibility). Validated via integration with DeepEP (a production MoE communication library), GIN achieves 16.7 µs round-trip latency for small messages (GDAKI) and competitive performance with NVSHMEM across high-throughput (HT) and low-latency (LL) MoE kernels on 8–64 NVIDIA H100 GPUs.
1. Motivation
1.1 The Host-Initiated Model's Limitations
Traditional GPU communication follows a host-initiated model dictated by the CUDA runtime:
- GPU kernel executes computation, queues communication descriptors into host-visible buffers.
- CPU proxy threads pick up descriptors and invoke RDMA operations (ibv_post_send, etc.).
- GPU kernel waits for CPU to signal completion before proceeding.
This model requires explicit host-device synchronization (kernel boundaries) and separate kernel launches for each communication call. For large-scale collective operations with millions of bytes, this overhead is negligible relative to transfer time. But modern AI workloads increasingly need:
- Low-latency point-to-point operations for inference token generation (one token at a time, dozens of bytes).
- Custom communication patterns for MoE architectures: dynamic token routing where each token is dispatched to one of N experts on potentially remote GPUs, with token counts per expert varying dynamically. This creates irregular all-to-all patterns where message sizes and destinations are not known until runtime.
- Computation-communication fusion in compiler-generated kernels (JAX, Triton): the kernel itself needs to issue communication operations inline, without breaking out to a host-side collective call.
The CUDA runtime model forces a kernel launch boundary between computation and communication, introducing overhead that dominates small-message latency.
1.2 Why NVSHMEM Alone Is Insufficient
NVSHMEM (NVIDIA OpenSHMEM) already provides device-callable one-sided operations (put, get, atomics from CUDA kernels). However, NVSHMEM operates as a standalone runtime separate from NCCL. This means NVSHMEM applications cannot use:
- NCCL's topology-aware collective algorithms (ring, tree, NVLS)
- NCCL's hierarchical communicator infrastructure (multi-dimensional parallelism)
- NCCL's fault tolerance and elasticity features
- NCCL's production deployment infrastructure used by PyTorch, TensorRT-LLM, vLLM, SGLang
GIN brings device-initiated communication capabilities inside NCCL's framework, enabling applications to use both modalities (host-initiated collectives and device-initiated point-to-point) within a single runtime.
2. Background
2.1 GPUDirect Technologies Evolution
GPUDirect RDMA (2013): RDMA NICs can directly access GPU memory via PCIe BAR mappings, eliminating CPU and host memory from the data path. The NIC's DMA engine performs PCIe peer-to-peer transactions to GPU BARs via the nvidia_p2p kernel module. Limitation: consistency guarantees only at kernel boundaries — GPU's relaxed memory ordering and write-back caching prevent safe concurrent access to RDMA-registered memory from executing kernels.
GPUDirect Async (2016): Partial control-path offload. GPU threads write to NIC doorbell registers mapped into the GPU address space, triggering pre-configured network operations. CPU must pre-construct communication descriptors. Limits GPU autonomy to pre-configured operations.
DOCA GPUNetIO: Production-grade device-side RDMA APIs for IB and RoCE. Exposes device functions enabling GPU kernels to directly program NICs (both GPUDirect RDMA for data movement and GPUDirect Async Kernel-Initiated for GPU NIC control). Forms the foundation for GIN's GDAKI backend.
2.2 NVSHMEM
Extends OpenSHMEM semantics to GPU clusters. Device-callable put, get, atomics invokable from CUDA kernels. Transport backends: IBGDA (InfiniBand with GPUDirect Async) for inter-node, symmetric memory for intra-node. Enables computation-communication interleaving without kernel launch overhead. Standalone runtime — separate from NCCL.
2.3 Hardware Requirements for GPU-Initiated Communication
- InfiniBand: native RDMA, ~130 ns port-to-port latency, credit-based flow control. Adapters expose memory-mapped queue pairs (QPs), completion queues (CQs), and doorbell registers via PCIe BARs for direct GPU access with GPUDirect RDMA.
- RoCE: RDMA over Ethernet. RoCEv2 (UDP/IP), ~400 ns latency, requires PFC + ECN (lossless Ethernet). Shares InfiniBand verbs API.
- Hardware requirement for GIN: NIC support for device-accessible control structures. NVIDIA ConnectX-6 Dx and later; BlueField DPUs.
- System requirement: GPU and NIC co-located on same PCIe root complex for minimal peer-to-peer latency. Multi-socket systems incur inter-socket traversal penalties.
2.4 NCCL Proxy Thread Architecture
NCCL uses CPU proxy threads to orchestrate network operations. GPU kernels write communication descriptors into host-visible queues; CPU threads execute them via network plugins. NCCL's network plugin architecture decouples the core library from specific network implementations: built-in plugins (Socket, InfiniBand) plus external plugins (libnccl-net.so). External plugins dynamically loaded via NCCL_NET_PLUGIN environment variable.
2.5 MoE Communication Patterns
MoE architectures route each input token to one of N expert networks. In distributed MoE inference, experts are distributed across GPUs. Dynamic token routing creates irregular all-to-all communication: each GPU sends tokens to variable numbers of expert GPUs, with message sizes proportional to token count (unknown until runtime). Traditional collective operations (AllToAll with fixed message sizes) are not well-suited. DeepEP and pplx-kernels provide specialized CUDA-optimized libraries for this pattern, using NVSHMEM/IBGDA.
3. NCCL Device API and GIN Architecture
3.1 NCCL 2.28 Device API Overview
NCCL 2.28 introduces the Device API with three operation modes:
NCCL Device API
├── Load/Store Accessible (LSA)
│ └── Intra-node via NVLink/PCIe memory-mapped access
├── Multimem
│ └── NVLink SHARP hardware multicast via multimem PTX instructions
└── GPU-Initiated Networking (GIN) ◄── This paper
└── Inter-node RDMA over InfiniBand/RoCE
Applications can either:
- Use NCCL-provided single-shot collective algorithms implemented with the Device API (operating over collective symmetric memory), or
- Directly invoke Device API primitives from their own GPU kernels to implement custom communication patterns.
This contrasts with traditional NCCL's host-initiated collectives that use pipeline primitives (Simple, LL, LL128) over regular memory.
3.2 GIN Three-Layer Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Layer 1: NCCL Core (Host-Side) │
│ • ncclDevCommCreate — creates device communicator with GIN resources │
│ • ncclCommWindowRegister — collective buffer registration across ranks │
│ • Returns window handles with remote access metadata (RDMA rkeys, etc.)│
└─────────────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────────────────▼───────────────────────────────────┐
│ Layer 2: Device GIN API (GPU-Callable, from CUDA kernels) │
│ ncclGin class: │
│ • put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes, [actions]) │
│ • putValue(team, peer, dstWin, dstOff, value, ...) │
│ • signal(team, peer, signalId) │
│ • flush(coop) — local completion (safe buffer reuse) │
│ • waitSignal(coop, id, expected) — remote completion (data visible) │
│ • readCounter(id) — poll local counter │
│ • resetSignal/resetCounter — for round reuse │
│ ncclGinBarrierSession — global network barrier │
└─────────────────────────────────────┬───────────────────────────────────┘
│ dispatches to backend
┌─────────────────┴──────────────────┐
│ │
┌───────────────────▼──────────────────┐ ┌─────────────▼──────────────────┐
│ GDAKI Backend │ │ Proxy Backend │
│ (GPUDirect Async Kernel-Initiated) │ │ (CPU-assisted) │
│ • GPU threads construct RDMA WQEs │ │ • GPU writes 64-byte descs to │
│ • Direct NIC doorbell register write │ │ lock-free GPU-to-CPU queues │
│ • NIC polls GPU memory for WQEs │ │ • Dedicated CPU proxy thread │
│ • NIC executes RDMA, updates CQ │ │ polls queues, calls iput │
│ • Zero CPU involvement │ │ • CPU updates completion state │
│ • Requires: ConnectX-6 Dx+, CUDA 12 │ │ • Requires: any RDMA NIC │
└──────────────────────────────────────┘ └────────────────────────────────┘
3.3 Key Design Elements
One-sided communication semantics. put
allows GPU threads to write to remote peer memory without receiver
coordination. No handshaking. Initiator unilaterally issues transfers
and independently tracks completion. Eliminates overhead of two-sided
handshaking protocols. Particularly effective for MoE's unpredictable
traffic patterns.
Memory windows (collective registration).
Communication buffers must be collectively registered across all ranks
via ncclCommWindowRegister. Each registration produces a
window handle encapsulating remote access metadata (RDMA rkeys, peer
addresses). All ranks can address each other's registered memory.
Current NCCL 2.28 enforces symmetric buffer sizes (all ranks same size);
asymmetric sizes planned for future releases to support disaggregated
serving (prefill nodes need larger buffers than decode nodes).
GIN contexts. Each context abstracts a channel between the GPU and NIC, encapsulating network resources and QP connections. Multiple contexts per communicator enable parallelism across NICs/ports/QPs. Current limit: 4 contexts per communicator. A single context can address every rank in the communicator. Multiple contexts allow independent concurrent communication streams.
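To illustrate how multiple contexts can carry independent streams, here is a minimal sketch built on the ncclGin constructor from Listing 1; the per-block context-assignment policy and the kernel itself are assumptions for illustration, not from the paper.
// Sketch: spread CUDA blocks across GIN contexts so different blocks drive
// different NIC queues concurrently (hypothetical policy; <= 4 contexts in 2.28).
__global__ void multiContextPut(ncclDevComm devComm,
                                ncclWindow_t dstWin, ncclWindow_t srcWin,
                                size_t bytesPerBlock, int peer, int numContexts) {
  int ctx = blockIdx.x % numContexts;       // each block binds to one context
  ncclGin gin(devComm, ctx);
  size_t off = blockIdx.x * bytesPerBlock;  // disjoint slice per block
  gin.put(ncclTeamWorld(devComm), peer,
          dstWin, off, srcWin, off, bytesPerBlock);
  gin.flush(ncclCoopCta());                 // local completion: safe to reuse srcWin slice
}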
Asynchronous completion tracking — two mechanisms:
- Counters (local): Track completion on the sender side. Per-operation tracking via the counterId parameter allows fine-grained pipelined algorithms. flush() drains the full context; waitCounter() waits for a specific operation's completion.
- Signals (remote symmetric objects): Confirm data arrival and visibility at the destination. ID-based (integer ID, not a memory address). When a signal operation completes at the destination, all preceding put operations to that peer on the same context are guaranteed visible.
Ordering semantics. GIN operations are unordered by
default (maximizes throughput). Exception: all put
operations to a given peer on the same context are guaranteed visible
before a subsequent signal to that peer completes —
lightweight per-peer ordering without global fence. Applications batch
multiple puts and attach signal to final operation.
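The batched put-then-signal pattern reads roughly as follows (a minimal device-side sketch using the Listing 1 API; the helper name and chunk layout are illustrative):
// Sketch: batch nChunks unordered puts to one peer, attaching a signal
// increment only to the last. Per-peer ordering makes all prior puts visible
// at the peer once it observes signal 0.
__device__ void putBatchWithSignal(ncclGin& gin, ncclDevComm devComm, int peer,
                                   ncclWindow_t dstWin, ncclWindow_t srcWin,
                                   size_t chunkBytes, int nChunks) {
  for (int i = 0; i < nChunks - 1; ++i)
    gin.put(ncclTeamWorld(devComm), peer,
            dstWin, i * chunkBytes, srcWin, i * chunkBytes, chunkBytes);
  gin.put(ncclTeamWorld(devComm), peer,
          dstWin, (nChunks - 1) * chunkBytes,
          srcWin, (nChunks - 1) * chunkBytes, chunkBytes,
          ncclGin_SignalInc{0});   // receiver side: gin.waitSignal(coop, 0, 1)
}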
4. Device API Details
4.1 Initialization Workflow
1. Create device communicator: ncclDevCommCreate(comm, flags) — allocates GIN contexts with QP connections to all peer ranks.
2. Register memory: ncclCommWindowRegister(devComm, ptr, size, &windowHandle) — collective registration across all ranks; returns a handle usable in GPU kernels.
3. Launch the kernel with devComm and the window handles.
4. Inside the kernel: instantiate ncclGin(devComm, contextIndex) and issue put/signal operations.
5. Synchronize using waitSignal before consuming received data.
6. Reset signals/counters for the next communication round.
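A host-side sketch of steps 1–3, using the call signatures as they appear in this summary; the output parameter on ncclDevCommCreate, the flags value, and the omission of error checking are assumptions, so the actual NCCL 2.28 headers may differ.
// Host side: create the device communicator, collectively register symmetric
// buffers, then launch the Listing 2 kernel with the device handles.
ncclDevComm devComm;
ncclDevCommCreate(comm, /*flags=*/0, &devComm);               // step 1 (out-param assumed)

void *sendBuf, *recvBuf;
cudaMalloc(&sendBuf, bufBytes);                                // same size on every rank
cudaMalloc(&recvBuf, bufBytes);
ncclWindow_t sendWin, recvWin;
ncclCommWindowRegister(devComm, sendBuf, bufBytes, &sendWin);  // step 2: collective call
ncclCommWindowRegister(devComm, recvBuf, bufBytes, &recvWin);  // step 2: collective call

ringExchange<<<1, 128>>>(devComm, sendWin, recvWin, msgBytes, myRank);  // step 3
cudaDeviceSynchronize();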
4.2 API Summary (from Listing 1)
class ncclGin {
ncclGin(ncclDevComm comm, int contextIndex);
// Data Movement
void put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes, ...);
void putValue(team, peer, dstWin, dstOff, value, ...);
void signal(team, peer, signalId);
// Local completion (safe buffer reuse)
void flush(coop);
uint64_t readCounter(counterId);
void waitCounter(coop, counterId, expected);
void resetCounter(counterId);
// Remote completion (data visible at destination)
uint64_t readSignal(signalId);
void waitSignal(coop, signalId, expected);
void resetSignal(signalId);
};
class ncclGinBarrierSession {
ncclGinBarrierSession(coop, gin, team, handle, index);
void sync(coop); // Global barrier
};
// Completion actions attached to put:
put(..., ncclGin_SignalInc{signalId}); // increment remote signal
put(..., ncclGin_CounterInc{counterId}); // increment local counter
4.3 Usage Example — Unidirectional Ring Exchange (Listing 2)
__global__ void ringExchange(ncclDevComm devComm,
ncclWindow_t sendWin, ncclWindow_t recvWin,
size_t dataSize, int myRank) {
ncclGin gin(devComm, 0); // Use context 0
int peer = (myRank + 1) % devComm.nRanks; // Next rank in ring
// Send data to peer; atomically increment peer's signal 0 on completion
gin.put(ncclTeamWorld(devComm), peer,
recvWin, myRank * dataSize, // dst: peer's recvWin at offset
sendWin, peer * dataSize, dataSize, // src: local sendWin at offset
ncclGin_SignalInc{0}); // attach signal increment action
// Wait for predecessor's data to arrive (local signal 0 == 1)
gin.waitSignal(ncclCoopCta(), 0, 1);
gin.resetSignal(0); // Reset for next round
}
This implements a common ring step: each rank sends to successor and waits for its predecessor.
5. Backend Implementations
5.1 GDAKI Backend — Direct GPU-to-NIC
When put is called in a GDAKI-backend kernel:
- GPU threads construct an RDMA Work Queue Entry (WQE) in device memory.
- GPU threads write directly to the NIC's doorbell register (memory-mapped into GPU address space via PCIe BAR).
- NIC hardware polls GPU memory for new WQEs, executes RDMA transactions over IB/RoCE, updates Completion Queue entries in GPU-visible memory.
- Completion is polled by GPU threads (readCounter / waitSignal).
No CPU involvement. For small messages, eliminates the GPU-CPU-NIC round-trip. Requires: ConnectX-6 Dx or newer, CUDA 12.2+, nv_peer_mem or dmabuf kernel module, properly configured GPUDirect RDMA (co-located GPU-NIC PCIe).
5.2 Proxy Backend — CPU-Assisted
When put is called in a Proxy-backend kernel:
- GPU thread writes a 64-byte descriptor to lock-free queue in CPU-visible memory (fire-and-forget).
- Descriptor contains: source/destination window handles, inline value (optional), offsets, sizes, completion actions.
- Dedicated CPU proxy thread (one per communicator, NUMA-pinned near the GPU and NIC) polls the queue and calls the network plugin's iput/iput_signal interface.
- Plugin maps to standard IB verbs or another RDMA API.
- Proxy thread polls completions via the plugin's test interface and updates completion state in GPU-visible memory (GPU- or CPU-resident, depending on GDRCopy availability).
CPU overhead: one proxy thread per communicator continuously polling. Latency overhead vs. GDAKI: measured as ~18.0 µs vs. 16.7 µs for small messages. Advantage: supports any RDMA NIC, any CUDA version, Volta-or-newer GPUs.
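To make the descriptor flow concrete, here is a purely hypothetical layout sketch of the kind of 64-byte record described above; the field names, widths, and packing are illustrative and not taken from NCCL source.
#include <cstdint>

// Hypothetical 64-byte GPU-to-CPU work descriptor (illustrative only).
struct alignas(64) GinProxyDesc {
  uint32_t opcode;       // put, putValue, signal, ...
  uint32_t peer;         // destination rank
  uint32_t dstWinId;     // destination window handle
  uint32_t srcWinId;     // source window handle (unused for putValue)
  uint64_t dstOffset;    // offset into destination window
  uint64_t srcOffset;    // offset into source window
  uint64_t bytes;        // transfer size
  uint64_t inlineValue;  // optional immediate payload (putValue)
  uint32_t counterId;    // local completion action, if any
  uint32_t signalId;     // remote completion action, if any
  uint64_t seq;          // slot sequence number for the lock-free ring
};
static_assert(sizeof(GinProxyDesc) == 64, "descriptor should fit one cache line");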
5.3 Backend Comparison
| Feature | GDAKI | Proxy |
|---|---|---|
| Communication path | Direct GPU ↔ NIC | GPU → CPU ↔ NIC |
| CPU involvement | None | Required (1 thread/comm) |
| Progress model | NIC polls GPU memory | CPU polls queue, posts to NIC |
| Operation posting | GPU rings NIC doorbell | GPU writes descriptor; CPU posts |
| Hardware req. | ConnectX-6 Dx+, CUDA 12.2+ | Any RDMA NIC, any CUDA |
| Small msg latency | 16.7 µs (H100, HDR IB) | 18.0 µs |
| Debugging | Device-side tools only | Host-side inspection + tracing |
| Portability | GPU-NIC direct access required | Universal |
| Use case | Production HPC/AI clusters | Development, legacy, multi-vendor |
Backend selection: automatic at ncclCommInitRank time
via capability probe for DOCA GPUNetIO support. Override via
NCCL_GIN_BACKEND environment variable.
6. DeepEP Integration
6.1 Integration Requirements
DeepEP (DeepSeek's MoE communication library) uses NVSHMEM IBGDA for device-initiated all-to-all communication. Integrating GIN into DeepEP requires:
- High QP parallelism: HT kernels need 24 QPs; LL kernels need 8–16 QPs.
- Heterogeneous topology: HT uses symmetric rank-to-rank RDMA + NVLink forwarding; LL uses full all-to-all RDMA mesh.
- Fine-grained synchronization: atomic head/tail pointer updates for circular buffer flow control.
- Backend coexistence: NVSHMEM IBGDA and GIN must coexist for user preference flexibility.
6.2 Integration Strategy
- Multi-communicator mapping: GIN provides 4 contexts per communicator. For the 24 QPs needed by HT: ⌈24/4⌉ = 6 communicators, with channel selection comm_id = id/4; ctx_id = id%4 (see the sketch after this list).
- Memory management: registered buffers store device-accessible window handles in GPU memory; kernels translate pointer arithmetic into (window, offset) pairs.
- Synchronization: pre-allocated signal layouts map memory-based atomics to signal primitives (HT: 2 signals/channel for head/tail; LL: 1 signal/expert).
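A small sketch of the channel-to-(communicator, context) mapping described above; the helper name and the array-of-communicators layout are assumptions for illustration.
// Sketch: map a DeepEP channel/QP id onto one of the ceil(nQPs/4) communicators,
// each contributing up to 4 GIN contexts (per the strategy above).
__device__ inline ncclGin ginForChannel(const ncclDevComm* devComms, int id) {
  int commId = id / 4;   // which communicator
  int ctxId  = id % 4;   // which of its (up to) 4 contexts
  return ncclGin(devComms[commId], ctxId);
}
// Example: 24 HT QPs -> ids 0..23 -> communicators 0..5, each using contexts 0..3.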
6.3 NVSHMEM vs. GIN API Mapping (Table II)
| Aspect | NVSHMEM / IBGDA | NCCL GIN |
|---|---|---|
| Memory model | PGAS with symmetric heap; pointer-based | Window-based; offset addressing |
| Data transfer | put_nbi(dst_ptr, src_ptr, count, pe) | put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes) |
| Synchronization | Memory atomics: atomic_add, atomic_fetch on remote memory | Signal atomics: signal(peer, id), readSignal(id) |
| Completion | quiet() per QP; blocks until complete | flush() per context |
| Barrier | barrier(team) / barrier_all() | ncclGinBarrierSession.sync() |
Key semantic translation: zero-byte puts with atomic signals emulate NVSHMEM's release-acquire semantics (all prior data transfers visible before signal completes).
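A minimal sketch of that translation, with a zero-byte put carrying a SignalInc action in place of NVSHMEM's remote atomic; the helper names and signal id scheme are illustrative.
// Sender: a zero-byte put with a SignalInc gives release semantics — all earlier
// puts to `peer` on this context are visible at the destination once the signal is.
__device__ void releaseTo(ncclGin& gin, ncclDevComm devComm, int peer,
                          ncclWindow_t dstWin, ncclWindow_t srcWin, int signalId) {
  gin.put(ncclTeamWorld(devComm), peer,
          dstWin, /*dstOff=*/0, srcWin, /*srcOff=*/0, /*bytes=*/0,
          ncclGin_SignalInc{signalId});
}
// Receiver: waiting on the signal gives acquire-like visibility of the data.
__device__ void acquireFrom(ncclGin& gin, int signalId, uint64_t expected) {
  gin.waitSignal(ncclCoopCta(), signalId, expected);
}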
6.4 High-Throughput Kernel
HT kernels: 4096 tokens, hierarchical RDMA + NVLink. SM specialization: odd-numbered SMs as Senders (transmit to remote ranks) and NVLink Receivers (final destinations); even-numbered SMs as Forwarders (receive RDMA tokens, forward via NVLink). Separate channels for data/tail updates vs. head-pointer updates to reduce contention.
Signal usage: remote tail signals via
gin.signal(SignalAdd, 1); head-pointer flow control via
gin.readSignal(signal_id) polling.
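A hedged sketch of the head/tail flow-control idea using the signal primitives above; the ring constants, signal ids, and function shape are illustrative, not DeepEP's actual code.
constexpr uint64_t RING_SLOTS = 8;        // assumed ring depth
constexpr size_t   SLOT_BYTES = 16384;    // assumed slot size

// Sender-side back-pressure on a remote circular buffer: the consumer publishes
// its head by signaling headSignalId back to us; we produce a slot only when
// (tail - head) < RING_SLOTS.
__device__ void producerPutSlot(ncclGin& gin, ncclDevComm devComm, int peer,
                                ncclWindow_t ringWin, ncclWindow_t srcWin,
                                size_t srcOff, uint64_t tail,
                                int headSignalId, int tailSignalId) {
  while (gin.readSignal(headSignalId) + RING_SLOTS <= tail) {
    // spin: remote consumer has not freed a slot yet
  }
  gin.put(ncclTeamWorld(devComm), peer,
          ringWin, (tail % RING_SLOTS) * SLOT_BYTES,
          srcWin, srcOff, SLOT_BYTES,
          ncclGin_SignalInc{tailSignalId});   // advances the remote tail signal
}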
6.5 Low-Latency Kernel
LL kernels: 1–128 tokens, full all-to-all RDMA mesh. Per-expert
signal allocation. Hybrid NVLink-RDMA: checks NVLink availability via
nccl_get_p2p_ptr; uses warp memory ops for NVLink,
put() for RDMA. Expert allocation: G = ⌈N/S⌉ warp groups
per SM; expert_idx = sm_id × G + warp_group_id. FP8 quantization applied
during token copy into send buffers. Zero-byte puts with SignalAdd
implement release-acquire for count delivery.
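The warp-group-to-expert assignment above as a small sketch; WARPS_PER_GROUP and the one-block-per-SM launch assumption are illustrative.
// G = ceil(numLocalExperts / numSMs) warp groups per SM; warp group g on SM s
// handles expert s * G + g (per the formula above).
constexpr int WARPS_PER_GROUP = 2;                                 // assumed grouping
__device__ inline int myExpertIdx(int numLocalExperts) {
  int smId        = blockIdx.x;                                    // one block per SM
  int warpGroupId = (threadIdx.x / warpSize) / WARPS_PER_GROUP;
  int G = (numLocalExperts + gridDim.x - 1) / gridDim.x;           // ceil(N / S)
  int e = smId * G + warpGroupId;
  return (e < numLocalExperts) ? e : -1;                           // -1: no expert assigned
}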
7. Performance Evaluation
7.1 Cluster
NVIDIA EOS cluster: 576 DGX H100 nodes.
- GPU: 8× H100 80GB HBM3 per node
- GPU memory bandwidth: 3.2 TB/s
- NVLink: 4th generation, 900 GB/s bidirectional, 18 links per GPU
- InfiniBand: 8×400 Gb/s per node (compute), 2×400 Gb/s (storage)
- CPU: Intel Xeon Platinum 8480CL, 112 cores, 2 TB RAM
Baselines: NVSHMEM 3.4.5 (IBGDA for DeepEP kernels), NCCL 2.28. DeepEP version 1.2.1. DeepEP benchmarks: 24 SMs per GPU.
7.2 Point-to-Point Microbenchmarks (Figure 4)
Ping-pong test between 2 H100 GPUs, 4 bytes to 4 MB:
| Backend | Small msg latency (4–128B) |
|---|---|
| NCCL GIN GDAKI | 16.7 µs |
| NCCL GIN Proxy | 18.0 µs |
| NVSHMEM IBRC | 16.0 µs |
| NVSHMEM IBGDA | 24.3 µs |
Notable: NCCL GIN GDAKI (16.7 µs) is comparable to NVSHMEM IBRC (16.0 µs) and better than NVSHMEM IBGDA (24.3 µs). At large messages: all four converge — bandwidth-limited, no algorithmic differentiation.
7.3 HT Kernel Results (Figure 5)
BF16 precision, 2/4/8 nodes (16/32/64 GPUs), 4096 tokens:
| Config | NCCL GIN RDMA BW | NVSHMEM RDMA BW | Delta |
|---|---|---|---|
| 2 nodes, BF16 dispatch | 84.36 GB/s | 84.97 GB/s | -0.7% |
| 8 nodes, BF16 dispatch | ~53–54 GB/s | ~53–54 GB/s | <2% |
| FP8 variants | similar | similar | 1–2% |
All HT results within 1–2% across all scales, precisions, and operations (dispatch/combine). GIN successfully preserves HT throughput while enabling standardization on NCCL infrastructure.
7.4 LL Kernel Results with NVLink (Figures 6, 7)
BF16 precision, 1/2/4/8 nodes (8/16/32/64 GPUs), 1–128 tokens, hidden dim 7168, message size ~14 KB per token transfer:
| Config | NCCL GIN dispatch | NVSHMEM dispatch | BW delta | Latency delta |
|---|---|---|---|---|
| 1 node (8 GPUs) | 185.28 GB/s / 40.62 µs | 182.15 GB/s / 41.43 µs | +1.7% | -2.0% |
| 2 nodes (16 GPUs) | ~160 GB/s / 142.51 µs | ~155 GB/s / 157.00 µs | similar | -9.2% |
| 4 nodes (32 GPUs) | comparable | comparable | within 2% | within 2–5% |
NCCL GIN consistently shows lower latency at 2-node scale (9% better). Combine operations within 1–3% across all scales.
7.5 LL Kernel Results — Pure RDMA (Figures 8, 9)
NVLink disabled; all communication via RDMA only:
| Config | NCCL GIN dispatch | NVSHMEM dispatch |
|---|---|---|
| 1 node | 47.00 GB/s / 160.82 µs | 46.79 GB/s / 160.67 µs |
| 8 nodes | ~34–35 GB/s / 219–225 µs | ~34–35 GB/s / 219–225 µs |
Essentially identical across all configurations in pure RDMA mode. NCCL GIN functional correctness confirmed without NVLink.
8. Related Work
OpenSHMEM: Established PGAS semantics and one-sided communication for HPC. Early GPU extensions were CPU-mediated. NVSHMEM extended OpenSHMEM to CUDA kernels, achieving 60–75% speedups by eliminating kernel launch overhead (cited from related work).
GPUrdma, GIO (2016, 2020): Early device-initiated RDMA prototypes. GPUrdma: GPU-side IB verbs. GIO: correct and efficient intra-kernel networking for discrete GPUs. Both faced GPU-NIC memory consistency challenges. GIO achieved 44% improvement for irregular applications via kernel driver extensions.
DOCA GPUNetIO: Foundation for GIN's GDAKI backend. Production-grade, used by DeepEP's IBGDA transport.
MoE Libraries: DeepSpeed-MoE (hierarchical parallelism), FasterMoE (expert scheduling optimization), Tutel (adaptive MoE), DeepEP and pplx-kernels (low-latency GPU-initiated primitives outside NCCL ecosystem). GIN is the first to bring device-initiated primitives inside NCCL.
NCCL (prior work): Provides topology-aware collective algorithms and production infrastructure but used CPU proxy threads for all network operations prior to NCCL 2.28.
Positioning: GIN uniquely integrates device-initiated network primitives into NCCL's production infrastructure via dual backends. Unlike NVSHMEM (standalone runtime), GIN preserves NCCL ecosystem compatibility (hierarchical communicators, fault tolerance, topology-aware collectives).
9. Limitations
Current NCCL 2.28 symmetric window size constraint. All ranks must register the same buffer size. Disaggregated serving architectures where prefill nodes need larger buffers than decode nodes require the planned asymmetric capability. This is not a fundamental limitation — it's an implementation simplification that will be lifted.
4 contexts per communicator. DeepEP HT kernels require 24 QPs → 6 communicators needed. This creates code complexity for QP management and increases resource usage. Future work may increase the context limit.
GDAKI hardware requirements. ConnectX-6 Dx or newer NICs, CUDA 12.2+. Many existing deployments with older hardware cannot use GDAKI and must fall back to Proxy.
No comparison to host-initiated NCCL. The paper evaluates GIN vs. NVSHMEM for MoE workloads but does not compare GIN to traditional NCCL host-initiated AllToAll for the same workloads. This makes it difficult to quantify GIN's benefit over the status quo for standard workloads.
MoE-centric evaluation. All application-level results are on DeepEP (MoE kernels). Performance for other device-initiated use cases (inference token generation, JAX/Triton fused kernels) is not evaluated.
Under-optimized GDAKI backend. Planned optimizations include: batching WQEs across multiple operations, amortizing doorbell costs, which could further reduce GDAKI overhead at small message sizes.
No non-RDMA network support for GDAKI. GDAKI requires hardware-supported device verbs (currently InfiniBand/RoCE via DOCA GPUNetIO). GPU-to-CPU Ethernet deployments cannot use GDAKI.
10. Section-by-Section Paragraph Summaries
Section I — Introduction
Motivates device-initiated communication for MoE architectures, inference token generation, and compiler-generated fusion kernels. Introduces NCCL 2.28's Device API (LSA, Multimem, GIN). Presents GIN's three-layer architecture and dual backend design. Lists four contributions: GIN design+implementation with host/device APIs and dual backends; DeepEP integration; comprehensive benchmarking; comparative analysis with NVSHMEM.
Section II — Background
Reviews GPUDirect RDMA (2013), GPUDirect Async (2016), DOCA GPUNetIO. Describes NVSHMEM's device-callable one-sided operations and IBGDA transport. Discusses InfiniBand vs. RoCE hardware requirements for GPU-initiated communication. Explains NCCL proxy thread architecture and network plugin API. Introduces MoE communication patterns (dynamic token routing, irregular all-to-all) and existing specialized libraries (DeepEP, pplx-kernels).
Section III-A — Core Principles
One-sided communication semantics eliminating receiver coordination. Symmetric memory windows (MPI RMA model) with current symmetric size enforcement. GIN contexts as network parallelism abstraction (4 per communicator; one context per channel → QPs to NIC). Asynchronous completion tracking (Counters for local, Signals for remote). Ordering semantics (unordered by default; put-then-signal guarantees per-peer ordering without global fence).
Section III-B — Device API
Three-phase workflow: initialization (ncclDevCommCreate + ncclCommWindowRegister), execution (ncclGin instantiation + put/signal), synchronization (waitSignal + reset). API summary (Listing 1) with put, putValue, signal, flush, readCounter, waitCounter, readSignal, waitSignal, reset primitives. Unidirectional ring exchange example (Listing 2): put with SignalInc action, waitSignal, resetSignal.
Section III-C — Backend Implementations
GDAKI: GPU constructs WQE → writes NIC doorbell → NIC polls GPU memory → RDMA execution → CQ update in GPU memory. Proxy: GPU writes 64-byte descriptor → CPU proxy polls → CPU calls iput → CPU updates completion. Comparison table (Table I). Backend selection: automatic probe for DOCA GPUNetIO; NCCL_GIN_BACKEND override.
Section IV — DeepEP Integration
Four integration requirements (QP parallelism, topology support, synchronization, backend coexistence). Multi-communicator mapping for QP requirements. Operation semantic mapping (Table II). HT kernel: odd/even SM specialization (Senders/Forwarders/NVLink Receivers), signal-based atomics for flow control, single-threaded put with syncwarp. LL kernel: per-expert signals, hybrid NVLink-RDMA, FP8 quantization, release-acquire via zero-byte put+SignalAdd.
Section V — Performance Evaluation
Cluster specs (EOS, 576 DGX H100 nodes). Point-to-point: GDAKI 16.7 µs comparable to NVSHMEM IBRC 16.0 µs (both better than NVSHMEM IBGDA 24.3 µs). HT results within 1–2% of NVSHMEM across all scales. LL with NVLink: NCCL GIN 9% lower latency at 2 nodes; otherwise within 1–3%. Pure RDMA: essentially identical. All results confirm GIN as NVSHMEM-equivalent within NCCL infrastructure.
Section VI — Related Work
Device-initiated communication history (OpenSHMEM, GPUrdma, GIO, NVSHMEM, DOCA GPUNetIO). Collective runtimes (NCCL, MPI RMA, UCX, UCC). MoE libraries (DeepSpeed-MoE, FasterMoE, Tutel, DeepEP, pplx-kernels). GIN's positioning: unique integration of device-initiated primitives into NCCL's production infrastructure with dual backends.
Section VII — Conclusions and Future Work
GIN validated as practical for device-initiated networking within NCCL. GDAKI: 16.7 µs small message latency. DeepEP: competitive with NVSHMEM across HT and LL kernels. Value: ecosystem unification (LSA + Multimem + GIN in one runtime), preservation of NCCL production features (hierarchical communicators, fault tolerance, topology-aware optimization). Future: PyTorch distributed, TensorRT-LLM, vLLM, SGLang, JAX/Triton; additional one-sided primitives.
11. Relevance to DynamICCL
High context relevance; low direct applicability in current phase.
1. NCCL 2.28 architecture — the state of the art.
GIN describes NCCL's most recent architecture evolution (Device API in
2.28, released November 2025). DynamICCL targets NCCL's tuner plugin API
for (algo, proto, nChannels, nThreads) selection.
Understanding how NCCL 2.28 extends its architecture is necessary for
ensuring DynamICCL remains compatible with and relevant to current NCCL
versions.
2. NCCL proxy thread model. The Proxy backend for
GIN uses the same CPU proxy thread model that NCCL has always used for
network operations. The paper explains how GPU-to-CPU lock-free queues
work, how the CPU proxy polls and dispatches operations, and how
completion state is returned to the GPU. DynamICCL's tuner plugin
intercepts at the layer above this — the ncclTuner_v3 API —
which decides algorithm and protocol selection before the proxy threads
are engaged. This architecture is complementary.
3. NCCL network plugin extensibility model. GIN introduces dual plugin semantics (GDAKI and Proxy) via NCCL's external plugin architecture (libnccl-net.so). DynamICCL uses the tuner plugin (libnccl-tuner.so). Both are external plugin mechanisms that NCCL loads dynamically. Understanding how GIN's network plugin hooks work clarifies the plugin boundary between DynamICCL's tuner plugin and NCCL's core.
4. MoE as a future DynamICCL workload. If DynamICCL expands beyond standard collective operations (AllReduce, AllGather, ReduceScatter) to MoE workloads, GIN's architecture becomes directly relevant. MoE's irregular all-to-all patterns would require Agent-2 to learn configurations for device-initiated communication rather than host-initiated collectives — a fundamentally different optimization problem.
5. The collective symmetric memory model. GIN's Device API operates over collective symmetric memory (window registration), contrasting with NCCL's pipeline primitives over regular memory. DynamICCL currently targets the pipeline primitive model (algo/proto selection affects pipeline behavior). NCCL 2.28's Device API provides a parallel execution model that DynamICCL's current tuner plugin does not interact with.
6. Benchmark design reference. The paper's performance evaluation methodology (microbenchmarks + application-level integration, multiple scales, multiple precisions, HT vs. LL kernel separation) provides a reference for how DynamICCL's evaluation should be structured: both micro (collective-level latency vs. algo/proto) and macro (end-to-end training step improvement).
7. Future DynamICCL extension direction. NCCL 2.28 introduces a new axis of device-API tuning (which GIN context to use, how many QPs, how to allocate communicators). A future version of DynamICCL's RL agent could potentially learn GIN resource allocation policies for MoE workloads. This is speculative but motivated by the trajectory of NCCL development toward device-initiated primitives.