Brief Summary: MSCCL++: Rethinking GPU Communication Abstractions for Cutting-Edge AI Applications

Citation: Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang. Microsoft Research / Microsoft Azure. arXiv:2504.09014v3, August 21, 2025. (Published at a systems venue; in production at Microsoft Azure.)

Code: https://github.com/microsoft/mscclpp


Problem

Cutting-edge AI applications run on fast-evolving, heterogeneous hardware (A100, H100, MI300x, etc.). General-purpose collective communication libraries such as NCCL are slow to optimize for each new hardware generation and each workload-specific scenario, so practitioners write custom communication stacks from scratch (e.g., TensorRT-LLM's custom AllReduce outperforms NCCL for small messages). This creates massive redundant development effort, produces non-portable code, and fragments the software ecosystem. The root cause: NCCL's abstractions hide hardware capabilities to simplify programming, and that hiding blocks the expert-level optimizations that are hardware- and workload-specific.

Core Insight

Separate hardware primitives from high-level collective algorithms via a two-level architecture. A minimal Primitive Interface (put, signal, wait, flush) exposes hardware capabilities directly inside GPU kernels, which makes it quick to support new hardware and enables fine-grained, workload-specific optimization. Higher-level interfaces (a DSL API for specifying custom algorithms; a Collective API that serves as an NCCL-compatible drop-in replacement) are built on top of this primitive layer. This separation of concerns provides both portability (an algorithm runs across GPUs by swapping primitive implementations) and performance (users can drop down past the high-level interface when needed).
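The one-sided put/signal/wait semantics can be illustrated with a toy CPU analogue, using Python threads to stand in for two GPUs. ToyChannel and its fields are hypothetical names for illustration only, not the MSCCL++ device API:

```python
import threading

# Toy model of the one-sided primitives: "remote_buf" is a peer-visible
# memory window the sender writes directly (put), and a semaphore carries
# the signal/wait synchronization. The method names mirror the paper's
# Primitive Interface; the implementation is an illustrative CPU analogue.
class ToyChannel:
    def __init__(self, size):
        self.remote_buf = [0] * size       # peer-visible memory window
        self._sem = threading.Semaphore(0)

    def put(self, offset, data):
        # One-sided write into the peer's buffer; no receiver involvement.
        self.remote_buf[offset:offset + len(data)] = data

    def signal(self):
        # Tell the peer that prior puts are now visible.
        self._sem.release()

    def wait(self):
        # Peer blocks until a signal arrives.
        self._sem.acquire()

def sender(ch):
    ch.put(0, [1, 2, 3, 4])
    ch.signal()

def receiver(ch, out):
    ch.wait()                  # do not read before the sender signals
    out.extend(ch.remote_buf[:4])

ch = ToyChannel(8)
out = []
t_recv = threading.Thread(target=receiver, args=(ch, out))
t_send = threading.Thread(target=sender, args=(ch,))
t_recv.start(); t_send.start(); t_send.join(); t_recv.join()
print(out)  # [1, 2, 3, 4]
```

The point of the pattern is that the receiver spends no cycles on data movement; it only synchronizes, which is what lets real implementations overlap communication with compute.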

Method

MSCCL++ defines three communication channel types, each corresponding to a hardware data-transfer mode:

  1. MemoryChannel — memory-mapped peer access: GPU threads read/write remote GPU memory directly via load/store (e.g., over NVLink); lowest overhead for small transfers.

  2. PortChannel — port-mapped transfers (e.g., InfiniBand or DMA copy engines) driven by a host-side proxy, freeing GPU threads from copying.

  3. SwitchChannel — switch-accelerated access (e.g., NVSwitch multimem instructions) that operates on multiple GPUs' memory in a single operation.
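A toy sketch of the proxy-driven (port/DMA-style) transfer mode, in which the compute side only enqueues transfer descriptors and a host-side proxy thread performs the copies. ToyProxyChannel is a hypothetical illustration, not the real MSCCL++ API:

```python
import queue
import threading

# Toy model of a proxy-driven channel: the "GPU" side enqueues transfer
# descriptors (fire-and-forget put), and a host-side proxy thread executes
# them, standing in for a DMA engine, so no compute threads copy data.
class ToyProxyChannel:
    def __init__(self, remote_buf):
        self.remote_buf = remote_buf
        self.requests = queue.Queue()
        self.done = threading.Semaphore(0)
        self.inflight = 0
        threading.Thread(target=self._proxy, daemon=True).start()

    def _proxy(self):
        # Proxy loop: performs the actual copies and signals completion.
        while True:
            offset, data = self.requests.get()
            self.remote_buf[offset:offset + len(data)] = data
            self.done.release()

    def put(self, offset, data):
        # Fire-and-forget: just enqueue a descriptor, no copying here.
        self.inflight += 1
        self.requests.put((offset, data))

    def flush(self):
        # Block until every outstanding put has completed.
        while self.inflight:
            self.done.acquire()
            self.inflight -= 1

peer_mem = [0] * 8
ch = ToyProxyChannel(peer_mem)
ch.put(0, [1, 2])
ch.put(2, [3, 4])
ch.flush()  # after flush, all transfers are visible in peer memory
print(peer_mem[:4])  # [1, 2, 3, 4]
```

This is the structural reason a DMA/port-style channel avoids the wasted GPU cycles that the paper criticizes in NCCL's send/recv: issuing a transfer costs only an enqueue.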

Built-in collective algorithms in the MSCCL++ Collective API include all-pairs, ring, and hierarchical variants (see Key Results), which is what makes it usable as an NCCL drop-in:

Key Results

Evaluated on A100-40G (8 GPUs/node, NVLink3, 200 Gb/s HDR IB), A100-80G, H100 (NVLink4), and AMD MI300x. Comparison baselines: NCCL 2.26.2, RCCL 2.20.5 (on AMD), and MSCCL 2.23.

AllReduce on A100-40G:

AllGather on A100-40G:

H100 (single-node, NVLink4):

AMD MI300x (single-node, Infinity Fabric):

LLM Inference (vLLM + Llama2-70b, single-node 8×A100-80GB):

Development efficiency:

Limitations

Relevance to DynamICCL

Highly relevant as background context; moderate direct applicability.

  1. NCCL limitations are fully articulated. MSCCL++ provides the most detailed publicly available analysis of NCCL's performance limitations: wasted GPU cycles in send/recv, inflexible synchronization, inability to use DMA-copy intra-node, static thread-group abstraction that doesn't adapt to different message sizes or algorithms. These are the exact limitations that DynamICCL's RL agent works around by selecting among NCCL's available algorithms and protocols at runtime.

  2. Algorithm-performance relationship. The paper quantifies how algorithm choice (all-pairs vs. ring vs. hierarchical) dramatically affects latency/bandwidth across message sizes. This is precisely the structure of DynamICCL's action space (algo ∈ {ring, tree, collnet_direct, collnet_chain, nvls, nvls_tree, pat}). MSCCL++ data shows that for small messages, all-pairs (not ring) is optimal — consistent with why NCCL's default ring underperforms for small messages.

  3. SwitchChannel and NVLS. MSCCL++ SwitchChannel achieves 56% higher bandwidth than MemoryChannel on H100 by leveraging NVSwitch multimem instructions, the same mechanism that underlies NCCL's NVLS algorithm. DynamICCL's action space includes the nvls and nvls_tree algorithms; the MSCCL++ data showing 2.2–2.8× speedup with SwitchChannel confirms the importance of having the RL agent learn when NVLS is beneficial.

  4. Hardware-specific optimization. MSCCL++ shows MI300x requires reversed loop ordering (write to all peers simultaneously) vs. NVIDIA GPUs (write to each peer sequentially). DynamICCL operating on NVIDIA clusters would not need this AMD-specific insight, but the general principle (hardware topology determines optimal algorithm structure) applies directly to DynamICCL's cluster-aware configuration.

  5. Not a direct competitor. MSCCL++ requires expert developers to write custom kernels using the Primitive API or DSL. DynamICCL uses RL to select among existing NCCL algorithms and parameters without requiring custom kernel development. DynamICCL is more accessible to practitioners but operates within a smaller optimization space than MSCCL++.
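The algorithm/message-size relationship in point 2 can be sketched with the standard alpha-beta cost model (alpha = per-step latency, beta = per-byte transfer time). The constants below are illustrative, not measured, and the model is the classic latency/bandwidth analysis rather than the paper's own:

```python
# Alpha-beta cost sketch: why all-pairs beats ring for small messages
# (latency-bound) while ring wins for large messages (bandwidth-bound).
def ring_allreduce_cost(n_bytes, p, alpha, beta):
    # Ring AllReduce: 2*(p-1) dependent steps, each moving n/p bytes.
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)

def allpairs_allreduce_cost(n_bytes, p, alpha, beta):
    # One-shot all-pairs: each GPU exchanges with all peers in parallel
    # and reduces locally -> one latency term, (p-1)*n bytes received.
    return alpha + (p - 1) * n_bytes * beta

P = 8                  # GPUs per node
ALPHA = 3e-6           # per-step latency (s), illustrative
BETA = 1 / 20e9        # per-byte time at 20 GB/s per link, illustrative

small, large = 8 * 1024, 64 * 1024 * 1024
print(allpairs_allreduce_cost(small, P, ALPHA, BETA)
      < ring_allreduce_cost(small, P, ALPHA, BETA))    # True: all-pairs wins
print(ring_allreduce_cost(large, P, ALPHA, BETA)
      < allpairs_allreduce_cost(large, P, ALPHA, BETA))  # True: ring wins
```

The crossover between the two regimes depends on alpha, beta, and p, which is exactly why a fixed default (NCCL's ring) loses on small messages and why a runtime selector such as DynamICCL's RL agent has something to learn.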