MSCCLang: Microsoft Collective Communication Language
Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research, Beijing) | ASPLOS '23 (28th ACM Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Vol. 2) | DOI: 10.1145/3575693.3575724
Problem
Modern ML training depends on collective communication (AllReduce, AllGather, AllToAll) executed across hundreds of GPUs. As models cross the billion-parameter scale, communication becomes the dominant cost — the paper cites workloads where DeepLight spends ~79% of step time in collectives and where AllReduce on medium buffers blocks transformer steps. NCCL/RCCL ship a tiny set of fixed algorithm templates (Ring, double-binary Tree) that were hand-tuned for a few canonical topologies; on Azure NDv4 (8x A100 + 6 NVSwitches + HDR IB) and DGX-2 (16x V100 + 6 NVSwitches), these templates leave large gaps for irregular GPU/NIC ratios, novel collectives (AllToAll, AllToNext), and small/medium message sizes. Synthesis tools (SCCL, TACCL, Blink) have shown that better algorithms exist, but they emit XML schedules without the low-level execution machinery — instruction fusion, pipelining, threadblock scheduling — that make NCCL kernels fast, and naive translation through multiple kernel launches squanders the algorithmic gains. The result is a productivity gap: researchers can describe better algorithms but cannot implement them at hand-written-CUDA performance.
Core Insight
A chunk-oriented domain-specific language combined with an optimizing
compiler that lowers algorithms through a Chunk DAG and an Instruction
DAG — fusing
recv/reduce/copy/send
into single peephole kernels and scheduling them onto threadblocks via a
priority-driven heuristic — and executed by a single-kernel interpreter
(MSCCL runtime) lets a 15-line DSL program match or exceed 70-line
hand-tuned CUDA implementations, delivering up to 1.9x speedup on
AllReduce and 1.3x on AllToAll over hand-optimized baselines, and
turning custom collectives into a one-day exercise rather than a
multi-week kernel-engineering effort.
Method
MSCCLang is a three-layer system: a Python-embedded DSL, an optimizing compiler that produces MSCCL-IR, and an interpreter-based runtime embedded in NCCL.
+-------------------------------------------------------------+
| DSL (Python-embedded, chunk-oriented) |
| - chunk(rank, buf, idx, count) |
| - c.copy(dst_rank, dst_buf, dst_idx, ch) |
| - c.reduce(c2, ch) |
| - parallelize(N), channels |
+-------------------------------+-----------------------------+
v
+-------------------------------------------------------------+
| Compiler |
| 1. Tracing -> Chunk DAG (global data movement) |
| 2. Lowering -> Instruction DAG (recv/reduce/copy/send)|
| 3. Fusion passes: rcs, rrc, rrcs, rrs |
| 4. Aggregation: contiguous chunks -> one transfer |
| 5. Threadblock allocation (greedy on (peer-pair, ch)) |
| 6. Scheduling: priority = depth + reverse_depth |
| 7. Cross-TB sync: inserts global semaphores |
| |
| Output: MSCCL-IR (XML, tree-shaped per-rank program) |
+-------------------------------+-----------------------------+
v
+-------------------------------------------------------------+
| MSCCL Runtime (in NCCL 2.8.4-1) |
| - Single cooperative-launch kernel interprets MSCCL-IR |
| - Tile execution: chunk -> s FIFO slots (~8) for pipeln |
| - Falls back to NCCL Ring/Tree for unsupported sizes |
+-------------------------------------------------------------+
The DSL enforces single-writer chunk semantics — using a stale chunk
reference is a compile-time error — eliminating data races by
construction. Authors express Ring AllReduce, an All Pairs
low-latency AllReduce, a 4-phase Hierarchical AllReduce, a Two-Step
AllToAll that bundles cross-node sends, and a custom AllToNext (rank
i -> rank i+1) that fans across all 8 IB
NICs per NDv4 node.
Experimental Setup
| Component | Value |
|---|---|
| Platform A | Azure ND A100 v4 (NDv4): 8x A100-80GB / node |
| Platform A links | 12x 3rd-gen NVLink to 6 NVSwitches (600 GB/s bi-dir intra-node); 8x HDR IB NICs at 25 GB/s each (1 NIC per GPU pair) |
| Platform B | NVIDIA DGX-2: 16x V100-32GB / node |
| Platform B links | 6x 2nd-gen NVLink to 6 NVSwitches; 1x HDR IB NIC per GPU pair |
| Multi-node sweeps | 1 node, 2 nodes (16 GPUs), 3 nodes (24 GPUs), up to 32 nodes (256 GPUs) |
| Software | NCCL 2.8.4-1, CUDA 11.x |
| MSCCLang DSL | Python-embedded |
| Solver | none (heuristic compiler, no MILP/SMT) |
| Baselines | NCCL Ring/Tree, hand-optimized CUDA, SCCL-emitted MSCCL-IR |
| Microbenchmarks | AllReduce, AllToAll, AllToNext sweeping buffer size (KB to GB) |
| Workloads | BERT, GPT, Mixture-of-Experts (MoE), Azure OpenAI Copilot training |
| Metrics | Microbench latency (us); end-to-end training throughput; LOC (expressiveness); compilation time |
Headline Quantitative Results
AllReduce microbenchmark (vs. NCCL baseline):
| Platform | Algorithm | Buffer regime | Speedup |
|---|---|---|---|
| 1-node NDv4 (8x A100) | All Pairs / Hierarchical | 32 KB - 3 MB | up to 1.9x |
| 1-node DGX-2 (16x V100) | All Pairs | small (< few MB) | up to 3.0x |
| Multi-node NDv4 | Hierarchical 4-phase | medium-large | matches or beats NCCL |
AllToAll microbenchmark (vs. baselines):
| Platform | Buffer | Speedup vs. hand-CUDA | Speedup vs. NCCL |
|---|---|---|---|
| 16-node, 256x A100 | > 512 MB | 1.3x | 1.20x (~20%) |
AllToNext custom collective (vs. naive CUDA):
| Platform | Speedup |
|---|---|
| 3-node, 24x A100 | up to 14.5x (utilizes all 8 IB links per node vs. 1) |
End-to-end training (vs. NCCL):
| Workload | Scale | Speedup |
|---|---|---|
| Azure OpenAI Copilot | production cluster | 20% GPU-time reduction |
| Large MoE | 256x A100 | 1.10x - 1.89x depending on model arch |
Expressiveness (lines-of-code):
| Algorithm | MSCCLang | Hand-optimized CUDA |
|---|---|---|
| Two-Step AllToAll | 15 lines | 70 lines |
Cost-model usage: the compiler aggregates contiguous
chunks driven by the classic T = alpha + S * beta model
(amortize alpha at small S), and uses
priority = depth + reverse_depth over the Instruction DAG
to surface critical-path instructions to the scheduler.
Limitations
- Static / per-topology compilation. Each MSCCL-IR is generated for a specific (collective, rank-count, topology) triple; size-dependent algorithm switching is delegated to the runtime's table of pre-compiled IRs rather than dynamic recompilation.
- No cross-collective fusion. Adjacent collectives in a training step (e.g., gradient AllReduce followed by parameter Broadcast) are compiled independently; cross-collective optimization is named as future work.
- No compute-comm fusion. The DSL describes communication only; reduction is the only compute primitive. Overlap with backward-pass compute relies on existing framework-level scheduling (Wait-Free Backprop, etc.).
- NVIDIA-centric. All measurements use CUDA 11 + NCCL 2.8.4-1 on V100 / A100; AMD ROCm/RCCL is not evaluated.
- No automatic algorithm synthesis. Algorithms must be authored by a human — the system is an executor of designs, not a discoverer (in contrast to SCCL/TACCL which synthesize).
- NCCL-version coupled. The runtime is embedded in a specific NCCL fork; upgrading to NCCL 2.18+ requires re-porting the interpreter.
Open Problems Called Out
- Computation-aware DSL. Future work explicitly extends the DSL to express compute scheduling alongside communication so backward-pass gradient production and inter-rank reduction can be fused.
- Cross-collective fusion. A program-level optimizer that fuses adjacent collectives in a step (e.g., AllToAll-AllToAll back-to-back in MoE) is unimplemented.
- Dynamic / size-adaptive code generation. Currently a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
- Automated authoring. SCCL / TACCL synthesize algorithms but emit weaker execution; coupling synthesis to MSCCLang's compiler so discovered algorithms inherit fusion + pipelining is the natural integration.
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that, via the
NCCL tuner-plugin API, selects per-collective algorithm
(Ring / Tree / CollNet / NVLS), protocol (LL / LL128 /
Simple), nChannels, numThreads, and
chunkSize to minimize collective wall-clock on HPC GPU
clusters. MSCCLang sits at a different layer — it is a DSL + compiler +
runtime that adds new algorithms (and the machinery to execute
them well) into NCCL — but its design choices map directly onto
DynamICCL's state vector, action space, and exploration priors:
Action-space expansion: each MSCCLang program is a discrete action. When the MSCCL runtime is loaded, every compiled MSCCL-IR (Ring AR, All Pairs AR, Hierarchical AR, Two-Step AllToAll, AllToNext, ...) is a new entry in DynamICCL's
algorithmenum. The 4-way Ring/Tree/ CollNet/NVLS choice expands to a topology-conditioned bag of pre-compiled IRs. The agent's job is to learn which pre-compiled IR wins for each (size, topology) regime — exactly the gap MSCCLang itself does not close.State feature: tile / FIFO-slot count
s. MSCCLang's pipelining model splits buffers intosFIFO slots (default 8). This is the same family as NCCL'snumPipeOpsandchunkSize. DynamICCL's state should bin observed buffer size against the FIFO capacity so the policy learns when extra pipeline depth helps versus when alpha-cost amortization wins.State feature: critical-path slack proxy. MSCCLang's scheduler computes
priority = depth + reverse_depthover the Instruction DAG. DynamICCL cannot observe this DAG directly, but a coarse proxy — "is this collective on the critical path of the current step?" — could be derived from the recent-collective LSTM window already in the state vector and used to bias toward latency-optimal vs. bandwidth-optimal actions.Action prior: small messages favor
All Pairs-like algorithms; large messages favorRing/Hierarchical. MSCCLang's headline result is 3.0x AllReduce speedup at small buffers via All Pairs on DGX-2, confirming that the latency-optimal Pareto corner is distinct from the bandwidth-optimal corner. DynamICCL's exploration prior should bias toward (Tree, LL/LL128, nChannels=1-2) at small message sizes — the equivalent in NCCL's catalog of theAll Pairslow-latency regime — and toward (Ring, Simple, nChannels=4-8) at large sizes.Action prior: hierarchical decomposition for multi-node. MSCCLang's 4-phase Hierarchical AllReduce (intra-RS -> inter-RS -> inter-AG -> intra-AG) maps onto NCCL's hierarchical algorithm family. When DynamICCL observes a PCIe+IB or NVLink+IB topology fingerprint, the action prior should weight CollNet / NVLS heavily — the NCCL-side equivalent of MSCCLang's hierarchical pattern.
State feature: per-link-class capability vector. MSCCLang's AllToNext speedup of 14.5x comes purely from utilizing all 8 IB NICs per NDv4 node instead of one. DynamICCL's topology fingerprint must capture the per-GPU IB-NIC count (and intra-node NVLink/NVSwitch count), not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) — this is the load-balanced-NIC regime.
Reward shaping: collective wall-clock matches MSCCLang's metric. MSCCLang reports microbench latency (us) and end-to-end training speedup; both are direct functions of collective wall-clock. DynamICCL's
r = -collective_wall_clock_usis the right primary signal. The 20% Copilot GPU-time reduction shows that collective- level wins translate to step-level wins without further normalization.Research positioning: MSCCLang generates, DynamICCL selects. MSCCLang authors (and synthesizers like SCCL/TACCL feeding into MSCCL-IR) populate the algorithm catalog; DynamICCL learns which catalog entry to invoke per call. A joint stack — SCCL/TACCL synthesize, MSCCLang lowers and executes, DynamICCL picks at runtime — is the natural composition and cleanly positions DynamICCL's contribution as the missing online selector that MSCCLang's "static / per-topology compilation" limitation leaves open.
Exploration budget: prefer offline-precomputed IRs to live recompilation. MSCCLang compilation is offline (no published compile-time table, but lowering is heuristic-driven, not MILP- solved). DynamICCL must amortize exploration against a fixed, pre-compiled action set; the policy should be discrete-categorical over loaded MSCCL-IR variants, not parametric-continuous over any synthesized space.
Open-problem alignment: dynamic size-adaptive selection. MSCCLang's open problem #3 (size-adaptive code generation) is precisely DynamICCL's contribution at the selection layer: given a fixed library of MSCCL-IR variants, learn the size / topology / load mapping to the best one. DynamICCL closes this gap without requiring JIT compilation.