MSCCLang: Microsoft Collective Communication Language — Detailed Summary

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research, Beijing) | ASPLOS '23 (28th ACM Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Vol. 2) | DOI: 10.1145/3575693.3575724

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction

Background and motivation:

Motivating workloads:

Gap left by prior work:

MSCCLang's contribution:


2. Background

2.1 Collective Operations

2.2 NCCL Algorithms

2.3 Hardware

Platform GPUs / Node Intra-node Interconnect Inter-node
Azure ND A100 v4 (NDv4) 8x A100-80GB 12x 3rd-gen NVLink to 6 NVSwitches; 600 GB/s bi-dir 8x HDR IB NICs at 25 GB/s each (1 NIC per GPU pair)
NVIDIA DGX-2 16x V100-32GB 6x 2nd-gen NVLink to 6 NVSwitches 1x HDR IB NIC per pair of GPUs

2.4 NCCL P2P Primitives


3. MSCCLang DSL Design

3.1 Chunk-Oriented Semantics

3.2 Core Primitives

DSL primitive Meaning
chunk(rank, buffer, index, count) Returns a reference to count contiguous chunks starting at index of buffer on rank.
c.copy(dst_rank, dst_buffer, dst_index, ch) Send chunks c to (dst_rank, dst_buffer, dst_index) over channel ch. Returns a reference to the destination chunks for further chaining.
c.reduce(c2, ch) Element-wise reduction of two chunk references using the channel's reduction op (typically sum).
parallelize(N) Run a code fragment as N parallel instances, each over 1/N of the data.

3.3 Channels

3.4 Safety Properties


4. Compiler / IR / Lowering

4.1 Compilation Pipeline

DSL Program (Python)
  |
  | tracing
  v
Chunk DAG  (per-chunk producer/consumer graph; exposes natural parallelism)
  |
  | lowering
  v
Instruction DAG  (nodes: send, recv, reduce, copy)
  |
  | fusion + aggregation + threadblock allocation + scheduling
  v
MSCCL-IR (XML, tree-shaped, per-rank)
  |
  | runtime load
  v
Cooperative single-kernel interpreter

4.2 Chunk DAG

4.3 Instruction DAG and Fusion Passes

The Instruction DAG has four base node types: send, recv, reduce, copy. The compiler runs peephole fusion passes over patterns of adjacent nodes:

Fusion Pattern Effect
rcs (receive-copy-send) recv + copy + send on same chunk Single fused kernel; chunk forwarded without exiting registers.
rrc (receive-reduce-copy) recv + reduce + copy local Combines incoming partial sum with local data and writes.
rrcs (receive-reduce-copy-send) recv + reduce + copy + send The full Ring-AllReduce inner step; matches NCCL's hand-tuned fused kernel.
rrs (receive-reduce-send) recv + reduce + send (no local copy) Special case: result forwarded but not retained, freeing registers.

These passes restore the register-resident dataflow that hand-tuned NCCL kernels rely on; without them, IR-emitted code would round-trip through global memory between every operation.

4.4 Aggregation

4.5 Threadblock Allocation

4.6 Scheduling and Critical-Path Priority

4.7 Cross-Threadblock Synchronization

4.8 Output: MSCCL-IR


5. Runtime / Execution Model

5.1 Single-Kernel Cooperative Launch

5.2 Tile Execution and Pipelining

5.3 NCCL API Compatibility


6. Expressed Algorithms

6.1 Ring AllReduce

6.2 All Pairs AllReduce (Novel Low-Latency)

6.3 Hierarchical AllReduce (4-phase)

6.4 Two-Step AllToAll

6.5 AllToNext (Custom)


7. Evaluation Setup

7.1 Clusters

Cluster GPUs / Node Memory Intra Inter
Azure ND A100 v4 8x A100 80 GB NVLink 3.0 / 6 NVSwitches (600 GB/s bi-dir) 8x HDR IB NICs (25 GB/s each)
NVIDIA DGX-2 16x V100 32 GB NVLink 2.0 / 6 NVSwitches 1x HDR IB NIC per GPU pair

7.2 Software Stack

7.3 Baselines

7.4 Workloads

7.5 Metrics


8. Experimental Results

8.1 AllReduce Microbenchmark

Setup Algorithm Buffer regime Speedup vs. NCCL
1-node NDv4 (8x A100) All Pairs / Hierarchical 32 KB - 3 MB up to 1.9x
1-node DGX-2 (16x V100) All Pairs small (sub-MB) up to 3.0x
Multi-node NDv4 Hierarchical 4-phase medium-to-large matches or beats NCCL

8.2 AllToAll Microbenchmark

Setup Buffer vs. hand-CUDA vs. NCCL
16-node, 256x A100 > 512 MB 1.3x ~1.20x (20%)

8.3 AllToNext Microbenchmark

Setup Speedup vs. naive CUDA
3-node, 24x A100 up to 14.5x

8.4 End-to-End Training

Workload Scale Speedup
Azure OpenAI Copilot production 20% GPU-time reduction
Large MoE 256x A100 1.10x - 1.89x depending on architecture

8.5 Expressiveness

Algorithm MSCCLang LOC Hand-CUDA LOC
Two-Step AllToAll 15 70

8.6 Comparison to SCCL


9. Limitations


System Position vs. MSCCLang
NCCL / RCCL Foundation MSCCLang builds on; provides Ring/Tree algorithms, P2P transports, channels. MSCCLang adds programmability above.
SCCL Synthesizer for intra-node collective algorithms; emits MSCCL-IR XML. Lacks the fusion / aggregation / scheduling passes that MSCCLang's compiler adds.
TACCL Multi-node synthesizer using sketches + MILP; emits MSCCL-IR XML. Same gap as SCCL re: execution-level optimization.
BLINK Generates fast collectives by traversing spanning trees of the topology; requires manual implementation of routing decisions. MSCCLang provides the implementation layer Blink lacks.
Horovod / BytePS Framework-level orchestration (Wait-Free Backprop, tensor fusion, partition); operate above NCCL. MSCCLang is orthogonal — it improves the kernel below.
MSCCL (predecessor) Same XML IR but no compiler — algorithms must be hand-authored XML. MSCCLang is the missing high-level language and optimizer for MSCCL.
DSL/compiler analogs The paper positions MSCCLang in the same lineage as Halide / TVM (DSL + IR + lowering passes) but specialized for collective communication rather than tensor computation.

11. Conclusion and Future Work


12. Key Equations and Cost Models

Model Formula Used for
Alpha-beta link cost T = alpha + S * beta Aggregation pass — bundles small chunks to amortize alpha.
Pipelining Buffer split into s FIFO slots; s defaults to 8 Tile execution — overlap intra-node and inter-node phases.
Critical-path priority priority = depth + reverse_depth over the Instruction DAG Scheduler — orders instructions within each threadblock.

No solver-based cost model (no MILP / SMT) — MSCCLang's compiler is heuristic-driven and fast.


13. Named Methods, DSL Primitives, Compiler Passes

Term One-line definition
Chunk Abstract fixed-size data unit; the routing primitive of the DSL.
Chunk-oriented programming Style where the author specifies how chunks move, not how threads execute.
Chunk DAG First IR; per-chunk producer/consumer graph used to expose parallelism.
Instruction DAG Second IR; operations are send, recv, reduce, copy.
MSCCL-IR Final XML representation; a per-rank tree of opcodes consumed by the runtime.
rcs / rrc / rrcs / rrs Peephole fusion patterns for receive(-reduce)(-copy)(-send).
Aggregation Compiler pass that bundles contiguous chunks to amortize alpha.
Threadblock allocation Greedy assignment of instructions to TBs by (send-peer, recv-peer, channel) tuple.
Cooperative launch All TBs guaranteed to run concurrently, enabling cross-TB semaphores.
Tile execution Runtime loop dividing chunks into FIFO-slot-sized tiles for pipelining.
All Pairs algorithm 2-step low-latency AllReduce expressed in MSCCLang; wins at small buffers.
Two-Step AllToAll Bundle-then-bulk-send AllToAll; wins on multi-node large buffers.
AllToNext Custom collective (rank i -> rank i+1) demonstrating extensibility.
parallelize(N) DSL modifier to instantiate N parallel copies of a code fragment.
Channel Multiple parallel NCCL connections between the same GPU pair, exposed as the ch parameter.
Single-writer chunk Safety invariant — each chunk reference may be written exactly once.

14. Cross-Cutting Empirical Take-Aways

Take-away Derived from
Latency-optimal AllReduce (All Pairs) wins at small sizes by 3.0x on 16x V100 Sec. 8.1
Bandwidth-optimal Hierarchical AllReduce wins at multi-node large sizes Sec. 8.1
Two-Step AllToAll converts per-GPU-pair alphas to per-node-pair alphas (1.3x at 256 GPUs) Sec. 8.2
NIC-fanout matters as much as algorithm for non-standard collectives (14.5x AllToNext) Sec. 8.3
End-to-end training picks up 20% on Copilot from collective-level wins Sec. 8.4
4.7x LOC reduction (15 vs. 70 lines) without performance loss Sec. 8.5

15. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that, via the NCCL tuner-plugin API, selects per-collective algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock on HPC GPU clusters. MSCCLang is a DSL + compiler + runtime that sits below DynamICCL's selection layer — it is the substrate that produces the algorithms DynamICCL chooses among. Each MSCCLang finding maps to a specific DynamICCL design implication:

Direct mappings:

MSCCLang finding DynamICCL design implication
Each compiled MSCCL-IR is a discrete program loaded at runtime The algorithm action enum must be extended dynamically: when MSCCL is loaded, every MSCCL-IR variant (All Pairs, Hierarchical, Two-Step AllToAll, AllToNext, ...) becomes a new categorical action.
All Pairs AllReduce 3.0x at small sizes; Ring/Hierarchical at large sizes Confirms message-size log-binning as primary state feature; bias the action prior toward latency-optimal (Tree-like / All Pairs) at < ~1 MB and bandwidth-optimal (Ring / Hierarchical) at > ~16 MB.
Tile execution with s = 8 FIFO slots DynamICCL's chunkSize and numPipeOps actions are the NCCL-side equivalents; explore this axis explicitly rather than holding fixed.
Critical-path priority depth + reverse_depth Cannot be observed directly, but a coarse "is this collective on the step's critical path?" feature can be derived from the recent-collective LSTM window already in DynamICCL's state.
AllToNext 14.5x via 8-NIC fanout Topology fingerprint must capture per-GPU NIC count, not just NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet — the existing 4-class fingerprint loses the load-balanced-NIC regime.
Hierarchical 4-phase (intra-RS, inter-RS, inter-AG, intra-AG) When DynamICCL observes a (NVLink + IB) topology, action prior should weight CollNet / NVLS heavily — the NCCL-side analog of hierarchical decomposition.
Two-Step AllToAll bundles cross-node sends At multi-node large AllToAll calls, bias toward larger chunkSize and Simple protocol — bundle to amortize alpha.
Two-Step AllToAll in 15 LOC vs. 70 in CUDA Authoring cost of new MSCCL-IR variants is low enough that the DynamICCL action set can grow as new collectives are added — supports a catalog-extensible RL design rather than a fixed-cardinality one.
Per-collective wall-clock is the headline metric (microbench latency, training throughput) DynamICCL's reward r = -collective_wall_clock_us matches; sign and unit consistent with paper's evaluation.
20% GPU-time reduction at OpenAI Copilot scale A real-world floor on the value of better algorithm selection — sets DynamICCL's expected end-to-end gain envelope when picking from a richer catalog.

Specific design priors for the RL agent:

  1. Action-space expansion as a function of loaded plugin. The action space is dynamic — algorithm in {Ring, Tree, CollNet, NVLS} when only NCCL is loaded; algorithm in {Ring, Tree, CollNet, NVLS, MSCCL_AllPairs_AR, MSCCL_Hierarchical_AR, MSCCL_TwoStep_A2A, MSCCL_AllToNext, ...} when MSCCL runtime is loaded. The policy should accept a plugin-capability vector as input so a single trained network generalizes across deployments.

  2. State features motivated by MSCCLang's compiler decisions.

    • Message size (already log-binned) — drives All Pairs vs. Ring crossover.
    • Tile / FIFO slot context — derived from NCCL's chunkSize / numPipeOps; lets policy learn pipeline-depth interactions.
    • Per-GPU NIC count — captures AllToNext-style fanout regimes.
    • NVSwitch presence flag — distinguishes DGX-2 (16 GPU NVSwitch) from NDv4 (8 GPU NVSwitch); changes hierarchical-phase ratios.
    • Recent-collective LSTM context (already present) — proxy for critical-path-ness.
  3. Exploration prior.

    • Small messages (< ~256 KB): bias toward latency-optimal MSCCL variants (All Pairs analogs) or NCCL Tree + LL/LL128 + nChannels=1-2.
    • Medium messages (256 KB - 16 MB): main exploration zone — let RL learn the crossover.
    • Large messages (> 16 MB): bias toward bandwidth-optimal Hierarchical / Ring + Simple + nChannels=4-8.
    • Multi-node + AllToAll: prior toward Two-Step / bundled variants.
    • Multi-NIC topology + custom collectives: prior toward NIC-fanout-aware variants.
  4. Reward shaping.

    • Primary: r = -collective_wall_clock_us (matches MSCCLang's reported metric).
    • Optional secondary: penalize p99 to catch tile-pipelining tail outliers (consistent with MSCCLang's tile-execution model where bad slot interleaving creates spikes).
  5. Research positioning — generation vs. selection. MSCCLang generates and executes algorithms; DynamICCL selects among them online. The two systems compose cleanly:

    SCCL / TACCL synthesize  ->  MSCCLang lowers + executes
                                           |
                                           v
                             Pre-compiled MSCCL-IR catalog
                                           |
                                           v
                                   DynamICCL selects
                                           |
                                           v
                               NCCL invokes the chosen kernel

    This positions DynamICCL as the missing online selector that MSCCLang's "static / per-topology compilation" limitation explicitly leaves open.

  6. Open-problem alignment.

    • MSCCLang's future-work item "size-adaptive code generation" maps onto DynamICCL's discrete-action selection problem: rather than JIT-recompile, choose the right pre-compiled IR.
    • MSCCLang's future-work item "compute-comm scheduling" suggests DynamICCL's reward could optionally include a step-time bonus when the chosen collective covers the backward-pass compute window — an extension once MSCCLang grows compute primitives.
    • MSCCLang's future-work item "cross-collective fusion" hints at a longer-horizon DynamICCL formulation where the MDP horizon extends across consecutive collectives in a step rather than treating each in isolation.
  7. Exploration budget — prefer offline-precomputed IRs to live recompilation. MSCCLang compilation is offline and fast (heuristic-driven, no MILP/SMT). DynamICCL must amortize exploration against a fixed pre-compiled action set; the policy is discrete-categorical over loaded MSCCL-IR variants plus NCCL built-ins, not parametric-continuous over any synthesized space.

  8. Topology embedding refinement. MSCCLang's per-platform results (NDv4 vs. DGX-2) are sensitive to NVLink generation, NVSwitch count, and IB-NIC density. The four-class topology fingerprint is too coarse — DynamICCL should encode (NVLink-bandwidth-class, NVSwitch-count, NICs-per-GPU, IB-bandwidth-class) as a small embedding rather than a one-hot.

  9. Catalog versioning. MSCCL-IR variants are stable artifacts; DynamICCL can fingerprint each loaded IR (by hash of XML) and treat them as a versioned action set. When the operator adds or removes IR variants, the policy can either (a) retrain the categorical head or (b) feed an IR-feature embedding (collective type, latency-vs-bandwidth lean) so a single network handles a growing catalog.

  10. Quantitative anchor for expected gains. MSCCLang reports microbench wins of 1.9x (AllReduce small), 1.3x (AllToAll large), 14.5x (AllToNext), and end-to-end 1.10-1.89x (MoE) and 20% (Copilot). These bound the upper end of what DynamICCL can recover by algorithm selection alone — when DynamICCL picks the right MSCCL-IR variant per call. They also set realistic stretch goals: a learned online selector that achieves >80% of MSCCLang's offline-best across regimes is a defensible contribution.