GPU-Initiated Networking for NCCL (GIN) — Block Diagram Analysis
Paper: "GPU-Initiated Networking for NCCL" Authors: Hamidouche, Bachan, Markthub, Gootzen, Agostini, Jeaugey, Shafi, Theodorakis, Venkata — NVIDIA Corporation Venue: arXiv 2511.15076v2, Nov 2025 (NCCL 2.28 Device API)
Fig 1: System Overview — GIN Three-Layer Architecture
┌──────────────────────────────────────────────────────────────────┐
│ User Application (PyTorch, vLLM, SGLang, DeepEP) │
│ calls collectives OR invokes Device API directly │
└──────────────┬──────────────────────────────┬────────────────────┘
│ collective call │ device API call
▼ ▼
┌──────────────────────────────────────────────────────────────────┐
│ NCCL Core (Layer 1) │
│ ┌──────────────────┐ ┌─────────────────┐ ┌────────────────┐ │
│ │ Device │ │ Memory Window │ │ Two-sided │ │
│ │ Communicator │ │ Registration │ │ Collectives │ │
│ │ (ncclDevComm) │ │ (ncclCommWindow │ │ API (Ring, │ │
│ │ │ │ Register) │ │ Tree etc.) │ │
│ └──────────┬───────┘ └────────┬─────────┘ └───────┬────────┘ │
│ │ host-side setup │ window handles │ │
│ └────────────┬──────┘ │ │
│ ▼ │ │
│ Device GIN API (Layer 2, callable from CUDA kernels) │
│ put / putValue / signal / flush / readCounter / │
│ waitCounter / readSignal / waitSignal / resetSignal │
└──────────────────────────────┬───────────────────────────────────┘
│ dispatch
┌───────────────┴────────────────┐
▼ ▼
┌──────────────────────────┐ ┌───────────────────────────────┐
│ GIN Network Plugin │ │ Existing Net Plugin │
│ (Layer 3 — GinNet) │ │ (host-initiated collectives)│
│ │ │ │
│ ┌─────────┐ ┌─────────┐ │ │ ┌─────────────────────────┐ │
│ │ GDAKI │ │ Proxy │ │ │ │ RDMA Ops (IB/RoCE/NVL) │ │
│ │ backend │ │ backend │ │ │ │ (pipeline prims: │ │
│ │ │ │ │ │ │ │ Simple, LL, LL128) │ │
│ └────┬────┘ └────┬────┘ │ │ └─────────────────────────┘ │
│ │ │ │ └───────────────────────────────┘
│ ┌────▼──────┐ ┌──▼─────┐ │
│ │DOCA │ │CPU │ │
│ │GPUNetIO │ │Proxy │ │
│ │(direct │ │Thread │ │
│ │GPU→NIC) │ │(lock- │ │
│ │ │ │free Q) │ │
│ └───────────┘ └────────┘ │
│ Network: IB/RoCE/Spectrum-X │
└───────────────────────────┘
▲ Fig 1: GIN three-layer architecture — NCCL Core manages setup,
Device GIN API exposes per-kernel primitives, GIN Network Plugin
implements two backends (GDAKI and Proxy) for hardware coverage.
GIN's three-layer structure is a deliberate separation of concerns. The host-side NCCL Core layer handles all resource management and window registration (the "slow path"), while the Device GIN API is the "hot path", invoked thousands of times per training iteration entirely from GPU threads. The GIN Network Plugin decouples hardware capabilities from the device API: applications see identical semantics whether GDAKI or Proxy handles the transport.
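A minimal host-side sketch of the slow path, assuming the call names shown in Figs 1 and 5 (ncclDevCommCreate, ncclCommWindowRegister); the header name, requirements struct, window flag, and allocation calls are assumptions, since the paper shows only the call names:

```cuda
#include <nccl.h>
#include <nccl_device.h>   // device-API header name: assumption

__global__ void fusedKernel(ncclDevComm, ncclWindow_t, ncclWindow_t, size_t);

// Slow path (host, once per communicator): create a GIN-capable device
// communicator and register symmetric memory windows. Argument shapes
// are illustrative; only the function names appear in the paper.
void setupGin(ncclComm_t comm, size_t nBytes) {
  ncclDevComm devComm;
  ncclDevCommRequirements reqs = {};   // hypothetical: request GIN here
  ncclDevCommCreate(comm, &reqs, &devComm);

  void *sendBuf, *recvBuf;
  ncclMemAlloc(&sendBuf, nBytes);      // symmetric allocation on every rank
  ncclMemAlloc(&recvBuf, nBytes);
  ncclWindow_t sendWin, recvWin;
  ncclCommWindowRegister(comm, sendBuf, nBytes, &sendWin, NCCL_WIN_COLL_SYMMETRIC);
  ncclCommWindowRegister(comm, recvBuf, nBytes, &recvWin, NCCL_WIN_COLL_SYMMETRIC);

  // Hot path (device, thousands of calls per iteration): kernels issue
  // put/signal directly -- see the kernel sketch after Fig 3.
  fusedKernel<<<108, 256>>>(devComm, sendWin, recvWin, nBytes);
}
```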
Fig 2: Key Architecture Diagram — GDAKI vs Proxy Backend Internals
GDAKI Backend (Direct GPU-to-NIC)
─────────────────────────────────
GPU Kernel Thread
│
│ invokes put()
▼
┌────────────────────────────────────────┐
│ CUDA kernel (device code) │
│ Construct RDMA Work Queue Entry (WQE) │
│ Populate: src/dst addr, size, keys │
│ Write WQE directly to │
│ NIC Doorbell Register (BAR mapping) │
└──────────────────┬─────────────────────┘
│ PCIe BAR write
▼
┌──────────────────────────────────────────┐
│ NIC (ConnectX-6 Dx or newer) │
│ Polls GPU memory for new WQEs │
│ Executes RDMA transaction autonomously │
│ Posts completion to GPU-visible CQ │
└──────────────────────────────────────────┘
Latency: 16.7 µs RTT (small msgs)
CPU involvement: ZERO
Requires: DOCA GPUNetIO, CUDA 12.x, ConnectX-6 Dx+
Proxy Backend (CPU-Assisted, Universal)
──────────────────────────────────────
GPU Kernel Thread
│
│ invokes put()
▼
┌────────────────────────────────────────┐
│ CUDA kernel (device code) │
│ Write 64-byte descriptor into │
│ lock-free GPU-to-CPU queue │
│ (fire-and-forget store, no sync) │
└───────────────────┬────────────────────┘
│ PCIe queue poll
▼
┌──────────────────────────────────────────┐
│ CPU Proxy Thread (1 per communicator) │
│ pinned to NUMA node near GPU + NIC │
│ Reads descriptor from queue │
│ Calls network plugin iput/iput_signal │
│ Maps to standard IB verbs or RoCE │
└──────────────────┬───────────────────────┘
│ standard RDMA path
▼
┌──────────────────────────────────────────┐
│ Any RDMA-capable NIC (IB, RoCE, iWARP) │
│ CPU proxy polls completion, writes │
│ counter/signal to GPU-visible memory │
└──────────────────────────────────────────┘
Latency: 18.0 µs RTT (small msgs, ~1.3 µs overhead vs GDAKI)
CPU involvement: 1 dedicated thread per communicator
Portability: any RDMA NIC, any CUDA version
▲ Fig 2: GDAKI eliminates CPU entirely via NIC BAR writes; Proxy
adds a CPU intermediary for universal hardware compatibility.
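A conceptual sketch of the Proxy handoff described above: the kernel publishes a 64-byte descriptor into a host-visible ring with a fire-and-forget store, and the pinned proxy thread polls it. Every type and field name is illustrative, not the plugin's actual layout:

```cuda
#include <cstdint>

// Illustrative 64-byte work descriptor (field layout is an assumption).
struct alignas(64) GinProxyDesc {
  uint64_t dstOffset;   // offset into the peer's registered window
  uint64_t srcOffset;   // offset into the local window
  uint32_t bytes;       // transfer size
  uint32_t peer;        // destination rank
  uint32_t signalId;    // optional signal delivered after the put
  uint32_t opcode;      // PUT, PUT_SIGNAL, SIGNAL, ...
  uint64_t seq;         // publication flag: nonzero means "slot ready"
  uint8_t  pad[24];     // pad out to exactly one cache line
};

// Device side: fire-and-forget enqueue (ring-full handling omitted).
__device__ void ginProxyEnqueue(GinProxyDesc *ring, unsigned long long *head,
                                int ringSize, GinProxyDesc d) {
  unsigned long long slot = atomicAdd(head, 1ULL);
  GinProxyDesc *e = &ring[slot % ringSize];
  d.seq = 0;                // keep the slot unpublished while copying
  *e = d;                   // write the payload ...
  __threadfence_system();   // ... make it visible across PCIe ...
  e->seq = slot + 1;        // ... then publish; the CPU spins on seq
}
// Host side: the proxy thread polls ring[tail].seq, calls the net
// plugin's iput/iput_signal, and posts standard IB verbs (Fig 2).
```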
Fig 3: Control Flow — GPU Kernel Initiates a One-Sided put with Signal
START: CUDA kernel executing on GPU (e.g., dispatch in MoE)
│
▼
① [Kernel instantiates ncclGin object]
│ ncclGin gin(devComm, contextIndex)
│ — selects comm context (one of 4 per communicator)
│ — backend chosen at init time (GDAKI or Proxy)
▼
② [Identify peer (e.g., (myRank + 1) % nRanks)]
│
▼
③ [Issue put with remote signal]
│ gin.put(recvWin, peer, peer * dataSize,
│ sendWin, srcOffset, dataSize,
│ ncclGin_SignalInc(signalId=0))
│
├── GDAKI path:
│ GPU constructs WQE → writes NIC doorbell register
│ NIC DMA-transfers data GPU→remote GPU memory
│ NIC posts completion → GPU-visible CQ entry
│
└── Proxy path:
GPU writes 64-byte descriptor → lock-free queue
CPU proxy reads descriptor → posts IB verb
NIC executes RDMA → CPU polls CQ → writes
signal to GPU-visible memory
▼
④ [Remote GPU: waitSignal]
│ gin.waitSignal(coop, signalId=0, expectedValue=1)
│ — polls signal until incremented by sender
│ — ordering guarantee: all prior puts to same peer
│ on same context are visible when signal arrives
▼
⑤ [Sender: reset signal for next round]
│ gin.resetSignal(signalId=0)
▼
⑥ [Computation proceeds with received data — fused kernel]
│
▼
END: no separate kernel launch, no CPU synchronization required
▲ Fig 3: Control flow for one GPU-initiated put+signal — the entire
sequence executes within a single CUDA kernel without CPU involvement.
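Steps ① through ⑥ condensed into one device-side sketch. The ncclGin call shapes are copied from the figure; treat the exact signatures, including the coop handle, as assumptions about the released headers:

```cuda
// One ring-neighbor exchange fused into a single kernel (sketch).
__global__ void ringExchange(ncclDevComm devComm, ncclWindow_t sendWin,
                             ncclWindow_t recvWin, size_t dataSize,
                             int myRank, int nRanks) {
  ncclGin gin(devComm, /*contextIndex=*/0);   // ① one of 4 comm contexts
  int peer = (myRank + 1) % nRanks;           // ② ring neighbor

  // ③ one-sided put: the receiver's signal 0 increments only after the
  //    payload is visible in its window (release semantics).
  gin.put(recvWin, peer, /*dstOffset=*/myRank * dataSize,
          sendWin, /*srcOffset=*/0, dataSize,
          ncclGin_SignalInc(/*signalId=*/0));

  auto coop = ncclCoopCta();                  // cooperating-thread handle;
                                              // name is an assumption
  gin.waitSignal(coop, /*signalId=*/0, /*expectedValue=*/1);  // ④
  gin.resetSignal(/*signalId=*/0);            // ⑤ arm the next round

  // ⑥ compute on recvWin data here -- same kernel, no relaunch.
}
```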
Fig 4: Data Flow — MoE Dispatch Kernel (DeepEP HT Mode)
GPU Node A GPU Node B
(576-node cluster, H100 80GB HBM3)
SM Role assignment (per-kernel):
Odd SMs → Senders Even SMs → Forwarders
┌─────────────────────────────────────────────────────────────┐
│ RDMA Send Buffer (symmetric, window-registered) │
│ Tokens: FP8 quantized expert data │
└────────────────────────┬────────────────────────────────────┘
│ NVLink (intra-node)
┌────────────────────────▼────────────────────────────────────┐
│ Forwarder SM (even-numbered) │
│ Receives NVLink tokens from Senders │
│ put() → RDMA to destination GPU node │
│ signal() → zero-byte put with SignalAdd to receiver │
└────────────────────────╫────────────────────────────────────┘
║ RDMA (inter-node, IB/RoCE)
║ 8×400 Gbit/s InfiniBand
▼
┌────────────────────────────────────────────────────────────────┐
│ GPU Node B — Receiver SM (any SM) │
│ waitSignal(signalId) — polls until token count arrives │
│ Reads tokens from RDMA receive buffer │
│ Applies top-k weights, FP32 reduction → BF16 output │
└────────────────────────────────────────────────────────────────┘
Key data objects:
══ FP8 token data ══► put() (bulk, no signal)
══ zero-byte signal ══► signal() with SignalAdd (after bulk put)
◄══ expert count ══ signal() on reverse path (combine)
Circular buffer flow control:
Head pointer ── readSignal(head_signal) ──► Sender
Tail pointer ── signal(tail_signal) ──► Receiver
▲ Fig 4: MoE HT dispatch data flow — NVLink for intra-node hops,
RDMA for inter-node; specialized SM roles minimize NVLink
traffic while saturating InfiniBand bandwidth.
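The head/tail handshake above, sketched with the primitives from Fig 1's API list (readSignal for the head, a zero-byte SignalAdd for the tail). Slot geometry, signal ids, and argument shapes are illustrative:

```cuda
constexpr int HEAD_SIG = 0, TAIL_SIG = 1;   // signal ids: assumptions

// Sender side of the circular buffer (sketch).
__device__ void sendToken(ncclGin &gin, ncclWindow_t rwin, ncclWindow_t swin,
                          int peer, uint64_t tail, uint64_t slots,
                          size_t tokBytes) {
  // Flow control: the receiver bumps our HEAD_SIG as it drains slots;
  // stall while the ring is full (tail - head == slots).
  while (tail - gin.readSignal(HEAD_SIG) >= slots)
    ;  // spin; a production kernel would back off or do other work

  // Bulk FP8 token payload: plain put, no signal attached (cheap).
  gin.put(rwin, peer, (tail % slots) * tokBytes, swin, 0, tokBytes);

  // Publish the new tail: zero-byte put carrying SignalAdd. The receiver's
  // waitSignal unblocks only after the bulk put above is visible.
  gin.signal(peer, ncclGin_SignalAdd(TAIL_SIG, 1));
}
```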
Fig 5: State Machine — GIN Communicator Lifecycle
ncclDevCommCreate() with GIN flag
[UNINIT] ───────────────────────────────► [INITIALIZING]
│
probe DOCA GPUNetIO
capability query
┌─────┴─────┐
DOCA ok│ │DOCA absent
▼ ▼
[GDAKI mode] [Proxy mode]
│ │
└─────┬──────┘
│
ncclCommWindowRegister()
(register symmetric memory windows)
│
▼
[ACTIVE]
│ ▲
kernel │ │ resetSignal /
put/signal │ │ resetCounter
ops │ │ for reuse
▼ │
[OP IN FLIGHT]
│
waitSignal / waitCounter
│
▼
[OP COMPLETE]
│
next round ─┘
ncclCommDestroy()
[ACTIVE] ───────────────────────────────► [DESTROYED]
▲ Fig 5: GIN communicator state machine — backend selected at init
based on capability probe; signal/counter resets enable reuse
across communication rounds without re-initialization.
Fig 6: Layered Stack — GIN Software Abstraction Levels
┌──────────────────────────────────────────────────────────────┐
│ Application layer │
│ PyTorch DDP / MegatronLM (collective path) │
│ DeepEP / pplx-kernels / vLLM (device API path) │
├──────────────────────────────────────────────────────────────┤
│ NCCL 2.28 Device API │
│ Unified interface: LSA (NVLink/PCIe) + Multimem (SHARP) │
│ + GIN (network RDMA) — one runtime, three modes │
├──────────────────────────────────────────────────────────────┤
│ GIN Device-Side API (callable from CUDA kernels) │
│ ncclGin: put / putValue / signal / flush │
│ readCounter / waitCounter / readSignal / waitSignal│
│ ncclGinBarrierSession::sync() │
├──────────────────────────────────────────────────────────────┤
│ GIN Network Plugin (GinNet shared library) │
│ ┌──────────────────────┐ ┌──────────────────────────────┐ │
│ │ GDAKI backend │ │ Proxy backend │ │
│ │ (DOCA GPUNetIO) │ │ (CPU proxy thread) │ │
│ │ Direct GPU→NIC WQE │ │ lock-free descriptor queue │ │
│ └──────────────────────┘ └──────────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Hardware transport │
│ InfiniBand (IBGDA, credit-based flow, 130ns native latency) │
│ RoCEv2 (ECN+PFC, 400ns, standard Ethernet) │
│ NVLink 4th gen (900 GB/s, intra-node, used in HT kernels) │
└──────────────────────────────────────────────────────────────┘
▲ Fig 6: Layered GIN stack — NCCL 2.28 Device API unifies three
communication modes; GIN plugs in as a network plugin, sharing
all collective infrastructure above it.
Fig 7: Sequence Diagram — Low-Latency (LL) MoE Inference Decode Round
GPU Rank 0 (Sender) GPU Rank 1 (Forwarder) GPU Rank 2 (Receiver)
│ │ │
① ────┼── put(tokens, peer=1) ──►│ │
│ (NVLink warp-level) │ │
│ ② ──────┼── put(tokens, peer=2) ──►│
│ │ RDMA via GIN │
│ │ │
③ ────┼── signal(tail, peer=1) ─►│ │
│ zero-byte + SignalAdd │ │
│ ④ ──────┼── signal(tail, peer=2) ──►│
│ │ (all prior puts visible)│
│ │ │
│ │ ⑤ waitSignal│
│ │ (poll) │
│ │ │◄ poll
│ │ │
│ │ ⑥ compute (top-k │
│ │ reduction) │
│ │ │
⑦ reset signals for next round ─────────────────────────────┤
│ │ │
▼ ▼ ▼
[done] [done] [done]
▲ Fig 7: LL kernel sequence — NVLink for intra-node forwarding,
GIN put+signal for inter-node RDMA; computation fused within
the same CUDA kernel, no kernel launch boundaries.
Fig 8: GDAKI vs Proxy Backend Comparison Table
| Characteristic | GDAKI Backend | Proxy Backend | Winner for DynamICCL |
|---|---|---|---|
| Communication path | Direct GPU→NIC | GPU→CPU→NIC | GDAKI (production HPC) |
| CPU involvement | Zero (fully device-driven) | 1 dedicated thread per communicator | GDAKI for latency; Proxy for portability |
| Progress model | NIC autonomously polls GPU memory | CPU thread polls queue, posts to NIC | GDAKI (no scheduling jitter) |
| Operation posting | GPU writes NIC doorbell directly | GPU writes 64B descriptor to lock-free queue | GDAKI (eliminates PCIe round-trip) |
| Hardware requirements | ConnectX-6 Dx+, CUDA 12.2+, DOCA GPUNetIO | Any RDMA NIC, any CUDA version | Proxy for lab/dev clusters |
| Small-msg RTT | 16.7 µs | 18.0 µs (1.3 µs overhead) | GDAKI |
| Debugging support | Device-side tools only | Host-side inspection + tracing | Proxy for development |
| Portability | Production HPC only | Universal (all RDMA vendors) | Proxy for multi-cluster |
| Backend selection | Automatic (capability probe at init) | Automatic fallback | Both — runtime selects |
For DynamICCL, prefer GDAKI on production HPC clusters: the Proxy backend's 1.3 µs overhead, while small in absolute terms, is significant relative to small-message collective latency (1.3/16.7 ≈ 8% on every fine-grained point-to-point operation that DynamICCL's RL agent might select for low-latency MoE communication). On development/testbed clusters, Proxy provides identical semantics without ConnectX-6 Dx hardware.
Fig 9: Design Trade-off Analysis
| Decision | Alternative A | Alternative B (GIN) | Winner | Why |
|---|---|---|---|---|
| Communication initiation | Host-initiated (CPU orchestrates all ops) | Device-initiated (GPU kernels issue ops) | B | Eliminates host-device sync overhead (~kernel launch latency); enables computation-communication fusion within a single kernel |
| Synchronization model | Two-sided (sender+receiver coordinate) | One-sided RDMA (put/signal) | B | MoE token routing has irregular, unpredictable destinations; receiver coordination adds latency and requires barrier synchronization |
| Signal addressing | Address-based (pointer) like NVSHMEM | ID-based (integer signal ID) | B | ID-based simplifies resource management across window registrations; automatic fence via signal ordering without explicit memory fences |
| Completion tracking | Global flush (all ops complete) | Per-context Counter + remote Signal | B | Counter tracks local buffer reuse; Signal tracks remote visibility separately — enables pipeline overlap without stalling |
| Backend selection | Static (compile-time choice) | Dynamic (runtime capability probe) | B | Same application binary runs on both GDAKI and Proxy hardware; portability without code changes |
| Memory windows | PGAS flat address space (NVSHMEM) | Collective window registration | B | Window handles enable zero-copy with backend-specific RDMA descriptors; asymmetric buffer sizes for prefill vs decode ranks |
| Thread granularity | Warp-collective (all threads) | Thread-level or warp-level (flexible) | B | MoE dispatch assigns specific experts to specific SMs/warps; thread-level puts let individual warps transfer independently |
| Inter-node protocol for LL | Pure RDMA mesh | Hybrid NVLink+RDMA | B | NVLink for intra-node forwarding reduces IB traffic; NVLink BW (900 GB/s) vs IB (400 Gbit/s) means intra-node bandwidth is 9x higher |
For DynamICCL, prefer B across all rows. DynamICCL's RL agent currently selects NCCL configurations for host-initiated collectives; GIN opens a fundamentally new optimization axis: whether to use a collective at all, versus custom device-initiated point-to-point primitives for irregular communication patterns. That axis matters most for the MoE workloads DynamICCL will encounter as LLM serving becomes a dominant HPC workload.
What to Borrow for DynamICCL
Pattern 1 — Device-initiated config selection as a new action dimension. DynamICCL's current action space is (algo, proto, nChannels, numThreads) for host-initiated collectives. GIN introduces a binary meta-action: "use collective (NCCL)" vs "use device-initiated point-to-point (GIN)". For AllGather and ReduceScatter in MoE contexts where message sizes are small and irregular, GIN's 16.7 µs RTT is competitive with or better than NCCL's LL protocol for the same message size. DynamICCL should extend its action space to include GIN as an alternative to collective algorithms for qualifying workloads.
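A hypothetical sketch of that extended action space; every DynamICCL-side name below is invented for illustration, and the 64 KiB gate is a placeholder rather than a measured crossover:

```cuda
// Hypothetical meta-action ahead of the existing tuning tuple.
enum class Transport { HostCollective, GinPointToPoint };

struct Action {
  Transport transport;  // new binary meta-action
  int algo;             // Ring / Tree / ... (ignored when GIN is chosen)
  int proto;            // Simple / LL / LL128
  int nChannels;        // reinterpreted as GIN context count under GIN
  int numThreads;
};

// Gate the GIN arm on the workload features the paper identifies:
// small, irregular messages (MoE dispatch) favor device-initiated puts.
bool ginQualifies(size_t msgBytes, bool irregularPeers) {
  return irregularPeers && msgBytes <= 64 * 1024;  // threshold: placeholder
}
```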
Pattern 2 — Asynchronous completion tracking with signals and counters. GIN's distinction between local completion (Counter — buffer reuse safe) and remote completion (Signal — data visible at destination) is a precise model for two types of latency that DynamICCL's LSTM currently conflates. Agent-1 (CUSUM) measures end-to-end collective completion time. By separately tracking buffer-release latency (when the next collective can start) and data-visibility latency (when the receiver can compute), DynamICCL can reward configurations that minimize the critical-path latency, not just total wall time.
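A sketch of the two completion events, using the counter/signal primitives from Fig 1's API list; the signatures, and the assumption that the counter advances once per completed operation, are ours:

```cuda
// Local vs remote completion (sketch). Counter = source buffer reusable;
// Signal = data visible at the destination.
template <typename Coop>
__device__ void timedPut(ncclGin &gin, ncclWindow_t rwin, ncclWindow_t swin,
                         int peer, size_t bytes, Coop coop) {
  uint64_t c0 = gin.readCounter();
  long long t0 = clock64();

  gin.put(rwin, peer, 0, swin, 0, bytes, ncclGin_SignalInc(0));

  gin.waitCounter(coop, c0 + 1);   // buffer-release point: from here the
  long long tLocal = clock64();    // swin slot may be safely overwritten

  // The receiver's gin.waitSignal(coop, 0, 1) marks data visibility.
  // Reporting (tLocal - t0) and the remote visibility latency as separate
  // reward features gives DynamICCL the split this pattern argues for.
}
```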
Pattern 3 — Context-based network parallelism for nChannels tuning. GIN uses multiple contexts per communicator to exploit multiple NICs, ports, and Queue Pairs simultaneously — each context is an independent network channel. This is isomorphic to NCCL's nChannels parameter: more contexts = more parallelism, but with diminishing returns due to congestion (the γ congestion factor from AutoCCL's model). DynamICCL's Agent-2 already tunes nChannels; GIN's context model suggests that the saturation point (optimal nChannels) is determined by the number of physical NICs per node, not just message size. This topology feature should be added to the state vector.
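A sketch of context-per-block spreading, analogous to nChannels; Fig 3 establishes one-of-four contexts per communicator, and the round-robin mapping policy below is ours:

```cuda
// Spread blocks across GIN contexts the way NCCL spreads collectives
// across channels: each context is an independent NIC/port/QP path.
__global__ void multiContextPut(ncclDevComm devComm, ncclWindow_t rwin,
                                ncclWindow_t swin, int peer,
                                size_t bytesPerBlock, int numContexts) {
  ncclGin gin(devComm, blockIdx.x % numContexts);  // round-robin contexts
  size_t off = blockIdx.x * bytesPerBlock;
  gin.put(rwin, peer, off, swin, off, bytesPerBlock);
}
```

Per the pattern above, the useful value of numContexts should saturate near the number of physical NICs per node rather than scale with message size.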
Pattern 4 — Backend auto-detection as a congestion signal. GIN probes DOCA GPUNetIO capability at communicator initialization and falls back to Proxy when unavailable. More importantly, the Proxy backend's 1.3 µs overhead vs GDAKI's direct path is detectable by Agent-1's LSTM as a latency anomaly. If DynamICCL observes that collective latency is consistently higher than the GDAKI-predicted baseline, this signals that the cluster is running in Proxy mode — a condition that changes the optimal nChannels (fewer channels reduce CPU proxy thread contention) and should trigger a policy switch.
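A hypothetical detector built from the paper's own numbers (16.7 µs GDAKI baseline, ~1.3 µs Proxy gap); the EWMA input and the half-gap threshold are invented for illustration:

```cuda
// Hypothetical proxy-mode heuristic for Agent-1 (names invented).
bool likelyProxyMode(double ewmaSmallMsgRttUs) {
  constexpr double kGdakiRttUs = 16.7;  // Fig 2: GDAKI small-message RTT
  constexpr double kProxyGapUs = 1.3;   // Fig 2: Proxy overhead vs GDAKI
  // Flag Proxy mode once observed RTT sits more than half the known gap
  // above the GDAKI baseline -- a persistent, not transient, shift.
  return ewmaSmallMsgRttUs > kGdakiRttUs + 0.5 * kProxyGapUs;
}
```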
Pattern 5 — SM role specialization as an architecture pattern for tuning. GIN's HT kernel assigns Sender, Forwarder, and Receiver roles to different SMs to minimize NVLink traffic and maximize RDMA bandwidth. This SM specialization pattern directly informs DynamICCL's numThreads tuning: for NVLink-RDMA hybrid topologies, the optimal numThreads is not uniform across all channels — NVLink channels benefit from more threads (higher intra-node bandwidth) while RDMA channels are NIC-bound. A topology-aware numThreads policy (different thread counts per channel type) is a concrete extension to DynamICCL's action space.
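A minimal sketch of that topology-aware policy; the channel classification and thread counts are placeholders, not tuned values:

```cuda
// Hypothetical per-channel-type thread policy (numbers are placeholders).
enum class ChannelType { NvLink, Rdma };

int threadsFor(ChannelType t) {
  switch (t) {
    case ChannelType::NvLink: return 512;  // copy-heavy: more threads help
    case ChannelType::Rdma:   return 128;  // NIC-bound: threads mostly post
  }                                        //   descriptors and poll CQs
  return 256;
}
```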
Pattern 6 — Zero-byte put+signal for release-acquire ordering without fences. GIN implements release-acquire semantics through a zero-byte put() combined with SignalAdd: all prior puts to the same peer on the same context are guaranteed complete before the signal arrives, without explicit CUDA memory fences. DynamICCL's reward function should account for this: under the LL protocol, ordering is handled in-band by NCCL's inline flags, but when DynamICCL selects the Simple protocol for large messages, ordering guarantees require fence operations that add latency. This trade-off should be captured in the protocol-selection component of Agent-2's reward signal.
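The release side of this pattern, sketched with the zero-byte put described above; the receiver's waitSignal acts as the matching acquire. Signatures follow Fig 3 and are assumptions:

```cuda
// Release without an explicit fence (sketch): GIN orders the zero-byte
// SignalAdd put after all earlier puts to the same peer on this context.
__device__ void releaseTo(ncclGin &gin, ncclWindow_t rwin, ncclWindow_t swin,
                          int peer, size_t bytes) {
  gin.put(rwin, peer, 0, swin, 0, bytes);        // data put(s) ...
  gin.put(rwin, peer, 0, swin, 0, /*bytes=*/0,   // ... then the release:
          ncclGin_SignalAdd(/*signalId=*/0, /*delta=*/1));
}
// Matching acquire: the receiver's gin.waitSignal(coop, 0, expected)
// returns only after every put covered by the release is visible, so
// neither side needs __threadfence_system().
```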