MSCCLang: Microsoft Collective Communication Language

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research, Beijing) | ASPLOS '23 (28th ACM Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Vol. 2) | DOI: 10.1145/3575693.3575724

Problem

Modern ML training depends on collective communication (AllReduce, AllGather, AllToAll) executed across hundreds of GPUs. As models cross the billion-parameter scale, communication becomes the dominant cost — the paper cites workloads where DeepLight spends ~79% of step time in collectives and where AllReduce on medium buffers blocks transformer steps. NCCL/RCCL ship a tiny set of fixed algorithm templates (Ring, double-binary Tree) that were hand-tuned for a few canonical topologies; on Azure NDv4 (8x A100 + 6 NVSwitches + HDR IB) and DGX-2 (16x V100 + 6 NVSwitches), these templates leave large gaps for irregular GPU/NIC ratios, novel collectives (AllToAll, AllToNext), and small/medium message sizes. Synthesis tools (SCCL, TACCL, Blink) have shown that better algorithms exist, but they emit XML schedules without the low-level execution machinery — instruction fusion, pipelining, threadblock scheduling — that make NCCL kernels fast, and naive translation through multiple kernel launches squanders the algorithmic gains. The result is a productivity gap: researchers can describe better algorithms but cannot implement them at hand-written-CUDA performance.

Core Insight

A chunk-oriented domain-specific language combined with an optimizing compiler that lowers algorithms through a Chunk DAG and an Instruction DAG — fusing recv/reduce/copy/send into single peephole kernels and scheduling them onto threadblocks via a priority-driven heuristic — and executed by a single-kernel interpreter (MSCCL runtime) lets a 15-line DSL program match or exceed 70-line hand-tuned CUDA implementations, delivering up to 1.9x speedup on AllReduce and 1.3x on AllToAll over hand-optimized baselines, and turning custom collectives into a one-day exercise rather than a multi-week kernel-engineering effort.

Method

MSCCLang is a three-layer system: a Python-embedded DSL, an optimizing compiler that produces MSCCL-IR, and an interpreter-based runtime embedded in NCCL.

+-------------------------------------------------------------+
|  DSL (Python-embedded, chunk-oriented)                      |
|   - chunk(rank, buf, idx, count)                            |
|   - c.copy(dst_rank, dst_buf, dst_idx, ch)                  |
|   - c.reduce(c2, ch)                                        |
|   - parallelize(N), channels                                |
+-------------------------------+-----------------------------+
                                v
+-------------------------------------------------------------+
|  Compiler                                                   |
|   1. Tracing       -> Chunk DAG  (global data movement)     |
|   2. Lowering      -> Instruction DAG (recv/reduce/copy/send)|
|   3. Fusion passes:  rcs, rrc, rrcs, rrs                    |
|   4. Aggregation:    contiguous chunks -> one transfer      |
|   5. Threadblock allocation (greedy on (peer-pair, ch))     |
|   6. Scheduling:     priority = depth + reverse_depth       |
|   7. Cross-TB sync:  inserts global semaphores              |
|                                                             |
|  Output: MSCCL-IR (XML, tree-shaped per-rank program)       |
+-------------------------------+-----------------------------+
                                v
+-------------------------------------------------------------+
|  MSCCL Runtime (in NCCL 2.8.4-1)                            |
|   - Single cooperative-launch kernel interprets MSCCL-IR    |
|   - Tile execution: chunk -> s FIFO slots (~8) for pipeln   |
|   - Falls back to NCCL Ring/Tree for unsupported sizes      |
+-------------------------------------------------------------+

The DSL enforces single-writer chunk semantics — using a stale chunk reference is a compile-time error — eliminating data races by construction. Authors express Ring AllReduce, an All Pairs low-latency AllReduce, a 4-phase Hierarchical AllReduce, a Two-Step AllToAll that bundles cross-node sends, and a custom AllToNext (rank i -> rank i+1) that fans across all 8 IB NICs per NDv4 node.

Experimental Setup

Component	Value
Platform A	Azure ND A100 v4 (NDv4): 8x A100-80GB / node
Platform A links	12x 3rd-gen NVLink to 6 NVSwitches (600 GB/s bi-dir intra-node); 8x HDR IB NICs at 25 GB/s each (1 NIC per GPU pair)
Platform B	NVIDIA DGX-2: 16x V100-32GB / node
Platform B links	6x 2nd-gen NVLink to 6 NVSwitches; 1x HDR IB NIC per GPU pair
Multi-node sweeps	1 node, 2 nodes (16 GPUs), 3 nodes (24 GPUs), up to 32 nodes (256 GPUs)
Software	NCCL 2.8.4-1, CUDA 11.x
MSCCLang DSL	Python-embedded
Solver	none (heuristic compiler, no MILP/SMT)
Baselines	NCCL Ring/Tree, hand-optimized CUDA, SCCL-emitted MSCCL-IR
Microbenchmarks	AllReduce, AllToAll, AllToNext sweeping buffer size (KB to GB)
Workloads	BERT, GPT, Mixture-of-Experts (MoE), Azure OpenAI Copilot training
Metrics	Microbench latency (us); end-to-end training throughput; LOC (expressiveness); compilation time

Headline Quantitative Results

AllReduce microbenchmark (vs. NCCL baseline):

Platform	Algorithm	Buffer regime	Speedup
1-node NDv4 (8x A100)	All Pairs / Hierarchical	32 KB - 3 MB	up to 1.9x
1-node DGX-2 (16x V100)	All Pairs	small (< few MB)	up to 3.0x
Multi-node NDv4	Hierarchical 4-phase	medium-large	matches or beats NCCL

AllToAll microbenchmark (vs. baselines):

Platform	Buffer	Speedup vs. hand-CUDA	Speedup vs. NCCL
16-node, 256x A100	> 512 MB	1.3x	1.20x (~20%)

AllToNext custom collective (vs. naive CUDA):

Platform	Speedup
3-node, 24x A100	up to 14.5x (utilizes all 8 IB links per node vs. 1)

End-to-end training (vs. NCCL):

Workload	Scale	Speedup
Azure OpenAI Copilot	production cluster	20% GPU-time reduction
Large MoE	256x A100	1.10x - 1.89x depending on model arch

Expressiveness (lines-of-code):

Algorithm	MSCCLang	Hand-optimized CUDA
Two-Step AllToAll	15 lines	70 lines

Cost-model usage: the compiler aggregates contiguous chunks driven by the classic T = alpha + S * beta model (amortize alpha at small S), and uses priority = depth + reverse_depth over the Instruction DAG to surface critical-path instructions to the scheduler.

Limitations

Static / per-topology compilation. Each MSCCL-IR is generated for a specific (collective, rank-count, topology) triple; size-dependent algorithm switching is delegated to the runtime's table of pre-compiled IRs rather than dynamic recompilation.
No cross-collective fusion. Adjacent collectives in a training step (e.g., gradient AllReduce followed by parameter Broadcast) are compiled independently; cross-collective optimization is named as future work.
No compute-comm fusion. The DSL describes communication only; reduction is the only compute primitive. Overlap with backward-pass compute relies on existing framework-level scheduling (Wait-Free Backprop, etc.).
NVIDIA-centric. All measurements use CUDA 11 + NCCL 2.8.4-1 on V100 / A100; AMD ROCm/RCCL is not evaluated.
No automatic algorithm synthesis. Algorithms must be authored by a human — the system is an executor of designs, not a discoverer (in contrast to SCCL/TACCL which synthesize).
NCCL-version coupled. The runtime is embedded in a specific NCCL fork; upgrading to NCCL 2.18+ requires re-porting the interpreter.

Open Problems Called Out

Computation-aware DSL. Future work explicitly extends the DSL to express compute scheduling alongside communication so backward-pass gradient production and inter-rank reduction can be fused.
Cross-collective fusion. A program-level optimizer that fuses adjacent collectives in a step (e.g., AllToAll-AllToAll back-to-back in MoE) is unimplemented.
Dynamic / size-adaptive code generation. Currently a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
Automated authoring. SCCL / TACCL synthesize algorithms but emit weaker execution; coupling synthesis to MSCCLang's compiler so discovered algorithms inherit fusion + pipelining is the natural integration.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that, via the NCCL tuner-plugin API, selects per-collective algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock on HPC GPU clusters. MSCCLang sits at a different layer — it is a DSL + compiler + runtime that adds new algorithms (and the machinery to execute them well) into NCCL — but its design choices map directly onto DynamICCL's state vector, action space, and exploration priors:

Action-space expansion: each MSCCLang program is a discrete action. When the MSCCL runtime is loaded, every compiled MSCCL-IR (Ring AR, All Pairs AR, Hierarchical AR, Two-Step AllToAll, AllToNext, ...) is a new entry in DynamICCL's algorithm enum. The 4-way Ring/Tree/ CollNet/NVLS choice expands to a topology-conditioned bag of pre-compiled IRs. The agent's job is to learn which pre-compiled IR wins for each (size, topology) regime — exactly the gap MSCCLang itself does not close.
State feature: tile / FIFO-slot count s. MSCCLang's pipelining model splits buffers into s FIFO slots (default 8). This is the same family as NCCL's numPipeOps and chunkSize. DynamICCL's state should bin observed buffer size against the FIFO capacity so the policy learns when extra pipeline depth helps versus when alpha-cost amortization wins.
State feature: critical-path slack proxy. MSCCLang's scheduler computes priority = depth + reverse_depth over the Instruction DAG. DynamICCL cannot observe this DAG directly, but a coarse proxy — "is this collective on the critical path of the current step?" — could be derived from the recent-collective LSTM window already in the state vector and used to bias toward latency-optimal vs. bandwidth-optimal actions.
Action prior: small messages favor All Pairs-like algorithms; large messages favor Ring / Hierarchical. MSCCLang's headline result is 3.0x AllReduce speedup at small buffers via All Pairs on DGX-2, confirming that the latency-optimal Pareto corner is distinct from the bandwidth-optimal corner. DynamICCL's exploration prior should bias toward (Tree, LL/LL128, nChannels=1-2) at small message sizes — the equivalent in NCCL's catalog of the All Pairs low-latency regime — and toward (Ring, Simple, nChannels=4-8) at large sizes.
Action prior: hierarchical decomposition for multi-node. MSCCLang's 4-phase Hierarchical AllReduce (intra-RS -> inter-RS -> inter-AG -> intra-AG) maps onto NCCL's hierarchical algorithm family. When DynamICCL observes a PCIe+IB or NVLink+IB topology fingerprint, the action prior should weight CollNet / NVLS heavily — the NCCL-side equivalent of MSCCLang's hierarchical pattern.
State feature: per-link-class capability vector. MSCCLang's AllToNext speedup of 14.5x comes purely from utilizing all 8 IB NICs per NDv4 node instead of one. DynamICCL's topology fingerprint must capture the per-GPU IB-NIC count (and intra-node NVLink/NVSwitch count), not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) — this is the load-balanced-NIC regime.
Reward shaping: collective wall-clock matches MSCCLang's metric. MSCCLang reports microbench latency (us) and end-to-end training speedup; both are direct functions of collective wall-clock. DynamICCL's r = -collective_wall_clock_us is the right primary signal. The 20% Copilot GPU-time reduction shows that collective- level wins translate to step-level wins without further normalization.
Research positioning: MSCCLang generates, DynamICCL selects. MSCCLang authors (and synthesizers like SCCL/TACCL feeding into MSCCL-IR) populate the algorithm catalog; DynamICCL learns which catalog entry to invoke per call. A joint stack — SCCL/TACCL synthesize, MSCCLang lowers and executes, DynamICCL picks at runtime — is the natural composition and cleanly positions DynamICCL's contribution as the missing online selector that MSCCLang's "static / per-topology compilation" limitation leaves open.
Exploration budget: prefer offline-precomputed IRs to live recompilation. MSCCLang compilation is offline (no published compile-time table, but lowering is heuristic-driven, not MILP- solved). DynamICCL must amortize exploration against a fixed, pre-compiled action set; the policy should be discrete-categorical over loaded MSCCL-IR variants, not parametric-continuous over any synthesized space.
Open-problem alignment: dynamic size-adaptive selection. MSCCLang's open problem #3 (size-adaptive code generation) is precisely DynamICCL's contribution at the selection layer: given a fixed library of MSCCL-IR variants, learn the size / topology / load mapping to the best one. DynamICCL closes this gap without requiring JIT compilation.