Brief Summary: MSCCL++: Rethinking GPU Communication Abstractions for Cutting-Edge AI Applications
Citation: Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang. Microsoft Research / Microsoft Azure. arXiv:2504.09014v3, August 21, 2025. (Published at a systems venue; in production at Microsoft Azure.)
Code: https://github.com/microsoft/mscclpp
Problem
Cutting-edge AI applications run on fast-evolving, heterogeneous hardware (A100, H100, MI300X, etc.). General-purpose collective communication libraries like NCCL are slow to optimize for each new hardware generation and for workload-specific scenarios, so practitioners write custom communication stacks from scratch (e.g., TensorRT-LLM's custom AllReduce outperforms NCCL for small messages). This creates massive redundant development effort, produces non-portable code, and fragments the software ecosystem. The root cause: NCCL's abstractions hide hardware capabilities to simplify programming, and that hiding blocks the expert-level optimizations that are hardware- and workload-specific.
Core Insight
Separate hardware primitives from high-level collective algorithms via a two-level architecture. A minimal Primitive Interface (put, signal, wait, flush) exposes hardware capabilities directly from within GPU kernels, making it quick to support new hardware and easy to apply fine-grained optimizations. Higher-level interfaces (a DSL API for specifying custom algorithms; a Collective API as an NCCL-compatible drop-in replacement) are built on top of this primitive layer. The separation of concerns provides both portability (algorithms run across GPUs by swapping primitive implementations) and performance (users can bypass the high-level interfaces when needed).
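To make the Primitive Interface concrete, here is a minimal sketch of a channel-based exchange from inside a GPU kernel, assuming thread-copy semantics over peer-mapped memory. The `Channel` struct and the kernel below are illustrative stand-ins written for this summary; the actual MSCCL++ device-handle types and method signatures may differ. `flush` (which ensures outstanding puts have completed, and matters mainly for DMA-style channels) is omitted.

```cuda
#include <cstdint>

// Illustrative stand-in for an MSCCL++ device channel handle (assumption:
// the real type and member names differ). put/signal/wait mirror the
// paper's primitive verbs.
struct Channel {
  char* dstBase;                 // peer's buffer, mapped into our address space
  char* srcBase;                 // local source buffer
  volatile uint32_t* peerFlag;   // semaphore in peer memory that we raise
  volatile uint32_t* localFlag;  // semaphore in local memory the peer raises

  // Thread-copy put: cooperating threads write bytes directly into peer memory.
  __device__ void put(uint64_t dstOff, uint64_t srcOff, uint64_t bytes,
                      uint32_t tid, uint32_t nthreads) {
    for (uint64_t i = tid; i < bytes; i += nthreads)
      dstBase[dstOff + i] = srcBase[srcOff + i];
  }
  __device__ void signal() { __threadfence_system(); *peerFlag = 1u; }
  __device__ void wait() { while (*localFlag == 0u) {} __threadfence_system(); }
};

// One-block exchange with a single peer: push a chunk, signal, then wait for
// the peer's chunk to land before reading it.
__global__ void exchangeKernel(Channel chan, uint64_t bytes) {
  chan.put(0, 0, bytes, threadIdx.x, blockDim.x);
  __syncthreads();
  if (threadIdx.x == 0) {
    chan.signal();  // our data is now visible to the peer
    chan.wait();    // block until the peer's data is visible to us
  }
  __syncthreads();
  // The peer's chunk can now be read from the local receive buffer.
}
```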
Method
MSCCL++ defines three communication channel types, each corresponding to a hardware data-transfer mode:
- MemoryChannel: Thread-copy mode (GPU threads write directly into peer GPU memory over NVLink/PCIe/xGMI). Two protocols: LL (low-latency; interleaves synchronization flags with the data so receivers poll at fine granularity) and HB (high-bandwidth; bulk copies with separate signal/wait synchronization per chunk).
- PortChannel: DMA-copy mode (the CPU initiates DMA or RDMA transfers on the GPU's behalf). Uses a lock-free GPU-to-CPU request queue: the GPU writes transfer requests and a dedicated CPU proxy thread issues them via ibv_post_send or cudaMemcpy; the queue mechanism is sketched after this list.
- SwitchChannel: Switch-based aggregation (NVLink SHARP / NVSwitch multimem instructions). Exposes hardware reduce and broadcast at the NVSwitch level.
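The PortChannel request queue is the most intricate of these mechanisms, so here is a hedged sketch of how a lock-free GPU-to-CPU queue of this shape can work. All struct and field names are hypothetical rather than the actual MSCCL++ proxy FIFO layout; the sketch assumes a single-producer ring in host-pinned, GPU-mapped memory.

```cuda
#include <cstdint>

// Hypothetical transfer-request descriptor (not the real mscclpp layout).
struct ProxyRequest {
  uint64_t srcOff, dstOff, bytes;  // what to transfer
  uint32_t connId;                 // which connection (RDMA QP / peer) to use
  uint32_t ready;                  // written last; marks the slot consumable
};

struct ProxyFifo {
  ProxyRequest* slots;      // ring buffer in host-pinned, GPU-mapped memory
  uint64_t* head;           // next slot to fill (GPU-owned)
  volatile uint64_t* tail;  // slots consumed so far (CPU-owned, GPU-visible)
  uint64_t capacity;

  // GPU side: push one DMA/RDMA request without locks. Lock-freedom comes
  // from the producer/consumer split: only the GPU advances head and only
  // the CPU proxy advances tail.
  __device__ void push(uint64_t srcOff, uint64_t dstOff, uint64_t bytes,
                       uint32_t connId) {
    uint64_t slot =
        atomicAdd(reinterpret_cast<unsigned long long*>(head), 1ULL);
    while (slot - *tail >= capacity) {}  // back-pressure: wait for a free slot
    ProxyRequest* r = &slots[slot % capacity];
    r->srcOff = srcOff; r->dstOff = dstOff; r->bytes = bytes; r->connId = connId;
    __threadfence_system();  // publish the payload before the ready flag
    r->ready = 1u;
  }
};

// CPU side (outline): a dedicated proxy thread polls slots[consumed % capacity]
// until ready == 1, issues ibv_post_send (RDMA) or cudaMemcpyAsync (PCIe DMA)
// for the request, clears ready, and advances *tail so the slot can be reused.
```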
Built-in collective algorithms in the MSCCL++ Collective API:
- One-phase All-pairs (1PA): All GPUs simultaneously send all of their local data to all others. Best for tiny messages (a few KB or smaller), where latency dominates.
- Two-phase All-pairs (2PA): A ReduceScatter phase (each GPU reduces 1/N of the data) followed by an AllGather phase; the skeleton is sketched after this list. More bandwidth-efficient than 1PA; comes in PortChannel, MemoryChannel (LL or HB), and SwitchChannel variants. Used on a single node for sizes up to a few MB.
- Two-phase Hierarchical (2PH): Minimizes cross-node traffic via node-local ReduceScatter/AllGather. Two versions: a small-message LL-protocol version (more redundant data, fewer synchronization steps) and a large-message HB-protocol version (pipelines the local and cross-node phases). Used for multi-node runs at any message size.
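To show the two-phase structure referenced above, here is a hedged skeleton of a 2PA AllReduce, reusing the illustrative `Channel` type from the Primitive Interface sketch. Assumptions beyond the source: a single thread block; channels pre-wired so `rsChans[p]` writes into peer p's scratch buffer and `agChans[p]` into peer p's data buffer; element count divisible by rank count. Real MSCCL++ kernels spread peers across thread blocks and pipeline the phases.

```cuda
// Two-phase all-pairs (2PA) AllReduce skeleton (illustrative only).
// rsChans[p] and agChans[p] both target peer (rank + 1 + p) % nRanks.
__global__ void allreduce2PA(Channel* rsChans,  // dst = peer scratch, src = my data
                             Channel* agChans,  // dst = peer data,    src = my data
                             float* data,       // n floats; shard r belongs to rank r
                             float* scratch,    // nRanks shard slots, indexed by sender
                             int rank, int nRanks, int n) {
  const int shard = n / nRanks;  // assumes n % nRanks == 0
  const uint64_t shardBytes = uint64_t(shard) * sizeof(float);

  // Phase 1a (ReduceScatter, scatter): send my copy of each peer's shard into
  // that peer's scratch slot reserved for my rank.
  for (int p = 0; p < nRanks - 1; ++p) {
    int peer = (rank + 1 + p) % nRanks;
    rsChans[p].put(uint64_t(rank) * shardBytes, uint64_t(peer) * shardBytes,
                   shardBytes, threadIdx.x, blockDim.x);
  }
  __syncthreads();
  if (threadIdx.x == 0) {  // signal everyone first, then wait, to avoid deadlock
    for (int p = 0; p < nRanks - 1; ++p) rsChans[p].signal();
    for (int p = 0; p < nRanks - 1; ++p) rsChans[p].wait();
  }
  __syncthreads();

  // Phase 1b (ReduceScatter, reduce): fold the nRanks-1 received copies of my
  // shard into my own shard.
  for (int i = threadIdx.x; i < shard; i += blockDim.x)
    for (int p = 0; p < nRanks - 1; ++p) {
      int peer = (rank + 1 + p) % nRanks;
      data[rank * shard + i] += scratch[peer * shard + i];
    }
  __syncthreads();

  // Phase 2 (AllGather): broadcast my reduced shard into every peer's data
  // buffer at the same offset, then synchronize once more.
  for (int p = 0; p < nRanks - 1; ++p)
    agChans[p].put(uint64_t(rank) * shardBytes, uint64_t(rank) * shardBytes,
                   shardBytes, threadIdx.x, blockDim.x);
  __syncthreads();
  if (threadIdx.x == 0) {
    for (int p = 0; p < nRanks - 1; ++p) agChans[p].signal();
    for (int p = 0; p < nRanks - 1; ++p) agChans[p].wait();
  }
  __syncthreads();
}
```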
Key Results
Evaluated on A100-40G (8 GPUs/node, NVLink3, 200 Gb/s HDR InfiniBand), A100-80G, H100 (NVLink4), and AMD MI300X. Comparison baselines: NCCL 2.26.2, RCCL 2.20.5, MSCCL 2.23.
AllReduce on A100-40G:
- Small messages (≤ 1 MB): MSCCL++ is up to 4.2× faster than NCCL and 3.1× faster than MSCCL (which uses the same algorithms but is built on NCCL's primitives).
- Large messages (≥ 1 MB): MSCCL++ up to 1.8× faster than both NCCL and MSCCL.
AllGather on A100-40G:
- Small messages: MSCCL++ up to 5.4× faster than NCCL; 2.3× faster than MSCCL.
- Large messages: up to 1.8× faster than both.
H100 (single-node, NVLink4):
- AllReduce: up to 2.8× faster than NCCL overall, and up to 2.4× faster for large messages.
- SwitchChannel (NVSwitch multimem) delivers up to 56% higher bandwidth than equivalent MemoryChannel on H100.
AMD MI300X (single-node, Infinity Fabric):
- AllReduce: up to 3.8× faster than RCCL; 2.2× faster than MSCCL.
- AMD-specific code in MSCCL++ amounts to fewer than 10 lines (excluding Makefiles and algorithm implementations), demonstrating portability.
LLM Inference (vLLM + Llama-2-70B, single node, 8×A100-80G):
- Decode speedup: 4%–15% faster than NCCL AllReduce across batch-size and sequence-length configurations.
- Prefill speedup: Comparable to NCCL or up to 6% faster (communication is smaller relative to compute in prefill).
Development efficiency:
- AMD MI300X port: 7 weeks for one developer (3 weeks for basic support plus 4 weeks for new algorithms that outperform RCCL across 1 KB–1 GB).
- NVSwitch SwitchChannel interface: 8 weeks for two developers.
Limitations
- NCCL ecosystem parity: MSCCL++ reimplements the NCCL API as a compatibility layer, but this layer inherits NCCL's algorithm-selection limitations: for workloads outside the manually tuned set, it makes the same suboptimal algorithm choices as NCCL.
- DSL performance overhead: The DSL API introduces a 3% average runtime overhead (up to 18% in corner cases) versus direct Primitive API use, because the DSL compiles algorithms into instructions that a DSL Executor kernel interprets at runtime; this interpreter pattern is sketched after this list.
- Single-node tested most thoroughly: Multi-node evaluation (4 nodes, 32 GPUs) uses the DSL API only; Primitive API validation is limited to two nodes.
- Manual algorithm development: Writing efficient custom algorithms currently requires expert knowledge; there is no automated synthesis path for the Primitive API (unlike MSCCL, which uses MSCCLang synthesis).
- No intra-node DMA-copy in NCCL: NCCL uses thread-copy (the MemoryChannel equivalent) within a single node even when DMA-copy could achieve 15.8% higher bandwidth; MSCCL++ exposes this choice to the user.
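As referenced in the DSL-overhead item above, here is a hedged sketch of the interpreter pattern that makes a DSL executor slower than straight-line compiled kernels: a dispatch loop walks a list of instructions instead of executing fused, specialized code. The instruction names and layout are hypothetical, and the sketch reuses the illustrative `Channel` type from earlier; it is not the actual MSCCL++ DSL Executor.

```cuda
// Hypothetical instruction set for an executor kernel (illustrative only).
enum class Op : uint32_t { Put, Signal, Wait, ReduceAdd };
struct Instr { Op op; uint32_t chan; uint64_t dstOff, srcOff, bytes; };

__global__ void dslExecutor(const Instr* prog, int nInstr, Channel* chans,
                            float* data, float* scratch) {
  for (int i = 0; i < nInstr; ++i) {  // the dispatch loop is the interpretive cost
    const Instr& in = prog[i];
    switch (in.op) {
      case Op::Put:
        chans[in.chan].put(in.dstOff, in.srcOff, in.bytes,
                           threadIdx.x, blockDim.x);
        break;
      case Op::Signal:
        if (threadIdx.x == 0) chans[in.chan].signal();
        break;
      case Op::Wait:
        if (threadIdx.x == 0) chans[in.chan].wait();
        break;
      case Op::ReduceAdd:  // elementwise add from scratch into data
        for (uint64_t e = threadIdx.x; e < in.bytes / sizeof(float);
             e += blockDim.x)
          data[in.dstOff / sizeof(float) + e] +=
              scratch[in.srcOff / sizeof(float) + e];
        break;
    }
    __syncthreads();  // conservative barrier between interpreted instructions
  }
}
```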
Relevance to DynamICCL
Highly relevant as background context; moderate direct applicability.
NCCL limitations are fully articulated. MSCCL++ provides the most detailed publicly available analysis of NCCL's performance limitations: wasted GPU cycles in send/recv, inflexible synchronization, inability to use DMA-copy intra-node, static thread-group abstraction that doesn't adapt to different message sizes or algorithms. These are the exact limitations that DynamICCL's RL agent works around by selecting among NCCL's available algorithms and protocols at runtime.
Algorithm-performance relationship. The paper quantifies how algorithm choice (all-pairs vs. ring vs. hierarchical) dramatically affects latency/bandwidth across message sizes. This is precisely the structure of DynamICCL's action space (algo ∈ {ring, tree, collnet_direct, collnet_chain, nvls, nvls_tree, pat}). MSCCL++ data shows that for small messages, all-pairs (not ring) is optimal — consistent with why NCCL's default ring underperforms for small messages.
SwitchChannel and NVLS. MSCCL++'s SwitchChannel achieves up to 56% higher bandwidth than an equivalent MemoryChannel on H100 by leveraging NVSwitch multimem instructions, the same hardware capability behind NCCL's NVLS algorithm. DynamICCL's action space includes the nvls and nvls_tree algorithms; the H100 results (up to 2.8× over NCCL, with the additional 56% SwitchChannel bandwidth gain) confirm the importance of having the RL agent learn when NVLS is beneficial.
Hardware-specific optimization. MSCCL++ shows that MI300X requires reversed loop ordering (write to all peers simultaneously) compared with NVIDIA GPUs (write to each peer sequentially). DynamICCL operating on NVIDIA clusters would not need this AMD-specific insight, but the general principle (hardware topology determines optimal algorithm structure) applies directly to DynamICCL's cluster-aware configuration.
Not a direct competitor. MSCCL++ requires expert developers to write custom kernels using the Primitive API or DSL. DynamICCL uses RL to select among existing NCCL algorithms and parameters without requiring custom kernel development. DynamICCL is more accessible to practitioners but operates within a smaller optimization space than MSCCL++.