MSCCLang: Microsoft Collective Communication Language — Detailed Summary
Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research, Beijing) | ASPLOS '23 (28th ACM Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Vol. 2) | DOI: 10.1145/3575693.3575724
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- ML models with millions to billions of parameters increasingly train on large multi-GPU systems; collective communication becomes the scaling bottleneck.
- Custom collective algorithms tuned to particular network topologies and application-specific communication patterns can alleviate the bottleneck, but writing them correctly and efficiently is hard.
- The paper introduces MSCCLang: a domain-specific language (DSL) for collective communication, an optimizing compiler that lowers DSL programs to an executable IR, and an interpreter-based runtime that executes the IR efficiently and flexibly.
- Headline empirical claim: MSCCLang-authored AllReduce and AllToAll algorithms run up to 1.9x and 1.3x faster than hand-optimized implementations, respectively.
1. Introduction
Background and motivation:
- Distributed training is necessary for modern large models; data, model, and pipeline parallelism all rely on frequent inter-GPU communication.
- Communication consumes a dominant share of step time. Cited example: DeepLight spends ~79% of execution in collectives.
- NCCL/RCCL Ring and Tree algorithms cover the common cases but are suboptimal for small and medium messages and on irregular fabrics (NVSwitch with multi-NIC IB, asymmetric GPU-to-NIC ratios).
Motivating workloads:
- ResNet-50 (data parallel)
- DeepLight (recommendation, communication-heavy)
- BERT and GPT-3 (transformer pre-training; AllReduce-bound)
- Mixture-of-Experts (MoE) (AllToAll-bound, ratio of compute to comm flips per expert routing pattern)
Gap left by prior work:
- NCCL/RCCL are "black boxes" — limited algorithm choice and no programmable extension point at the per-collective level.
- Algorithm-synthesis systems (SCCL, TACCL, Blink) discover better schedules but emit XML/CUDA without execution-level optimizations: no instruction fusion, no pipelining, no threadblock-aware scheduling.
- Naive composition of multiple kernels per step adds launch overhead and prevents cross-kernel optimization.
MSCCLang's contribution:
- A unified system: chunk-oriented DSL + optimizing compiler + interpreter-based runtime embedded in NCCL.
- Lets ML researchers express custom collectives at high level and obtain hand-tuned-CUDA performance.
- Demonstrated on novel algorithms (All Pairs AllReduce, Two-Step AllToAll, AllToNext) achieving up to 1.9x AllReduce, 1.3x AllToAll, 14.5x AllToNext speedups over hand-optimized baselines.
2. Background
2.1 Collective Operations
- Standard MPI-style collectives: AllReduce, AllGather, ReduceScatter, AllToAll, Broadcast, Reduce.
- Each collective specifies a (pre-condition, post-condition) on data layout across ranks; algorithms differ in how they realize this.
2.2 NCCL Algorithms
- Ring AllReduce: bandwidth-optimal at large sizes; cost scales with `(n-1)` steps.
- Double-binary Tree AllReduce: logarithmic-depth, latency-optimal at small sizes.
- NCCL chooses between Ring and Tree at runtime based on size and topology heuristics — but the choice is coarse and the algorithms themselves are fixed.
2.3 Hardware
| Platform | GPUs / Node | Intra-node Interconnect | Inter-node |
|---|---|---|---|
| Azure ND A100 v4 (NDv4) | 8x A100-80GB | 12x 3rd-gen NVLink to 6 NVSwitches; 600 GB/s bi-dir | 8x HDR IB NICs at 25 GB/s each (1 NIC per GPU) |
| NVIDIA DGX-2 | 16x V100-32GB | 6x 2nd-gen NVLink to 6 NVSwitches | 1x HDR IB NIC per pair of GPUs |
2.4 NCCL P2P Primitives
- MSCCLang reuses NCCL 2.8.4-1's connection setup, channel abstraction (NVLink, IB, TCP transports), and FIFO-slot-based send/recv buffers.
- The MSCCL runtime is layered into NCCL as an alternate kernel that executes MSCCL-IR rather than the canned Ring/Tree kernel.
3. MSCCLang DSL Design
3.1 Chunk-Oriented Semantics
- Programs are written by describing how chunks (abstract, fixed-size units of data) move between ranks.
- The DSL is embedded in Python — algorithm authors get loops, conditionals, comprehensions for free.
- Each rank exposes three named buffers: `Input`, `Output`, `Scratch`.
3.2 Core Primitives
| DSL primitive | Meaning |
|---|---|
| `chunk(rank, buffer, index, count)` | Returns a reference to `count` contiguous chunks starting at `index` of `buffer` on `rank`. |
| `c.copy(dst_rank, dst_buffer, dst_index, ch)` | Sends chunks `c` to `(dst_rank, dst_buffer, dst_index)` over channel `ch`. Returns a reference to the destination chunks for further chaining. |
| `c.reduce(c2, ch)` | Element-wise reduction of two chunk references using the channel's reduction op (typically sum). |
| `parallelize(N)` | Runs a code fragment as `N` parallel instances, each over 1/N of the data. |
3.3 Channels
- The optional `ch` parameter on each operation lets the author distinguish parallel connections between the same GPU pair (multiple NCCL channels). Channels are how the author expresses parallelism at the wire level.
3.4 Safety Properties
- DSL enforces single-writer chunk semantics: a chunk reference may be written exactly once. Using a stale (overwritten) reference is a compile-time error.
- This eliminates entire classes of data-race / use-after-overwrite bugs that plague raw CUDA collective code.
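The single-writer rule can be illustrated with a toy model. This is a minimal sketch and not the MSCCLang implementation: `ToyProgram`, `ToyChunkRef`, and `StaleChunkError` are hypothetical names, and the check is done with a per-slot version counter, which is only one plausible way to detect stale references.

```python
class StaleChunkError(RuntimeError):
    """Raised when a reference to an overwritten chunk is used."""


class ToyProgram:
    """Tracks a version counter for every (rank, buffer, index) slot."""

    def __init__(self):
        self.versions = {}

    def chunk(self, rank, buffer, index):
        slot = (rank, buffer, index)
        return ToyChunkRef(self, slot, self.versions.get(slot, 0))


class ToyChunkRef:
    """A reference that remembers which version of the slot it points at."""

    def __init__(self, program, slot, version):
        self.program = program
        self.slot = slot
        self.version = version

    def _check_fresh(self):
        current = self.program.versions.get(self.slot, 0)
        if current != self.version:
            raise StaleChunkError(f"stale reference to {self.slot}")

    def copy(self, dst_rank, dst_buffer, dst_index):
        # Reading through a stale reference is rejected before any write.
        self._check_fresh()
        dst = (dst_rank, dst_buffer, dst_index)
        new_version = self.program.versions.get(dst, 0) + 1
        self.program.versions[dst] = new_version
        # The returned reference is the only live handle to the new data.
        return ToyChunkRef(self.program, dst, new_version)


prog = ToyProgram()
src = prog.chunk(0, "Input", 0)
old = prog.chunk(1, "Scratch", 0)    # taken before the overwrite below
fresh = src.copy(1, "Scratch", 0)    # overwrites rank 1's Scratch[0]
fresh.copy(2, "Output", 0)           # fine: `fresh` is the live reference

try:
    old.copy(2, "Output", 1)         # `old` now points at overwritten data
    stale_detected = False
except StaleChunkError:
    stale_detected = True
```

In this toy the error surfaces when the Python program is traced, which mirrors the idea of rejecting the program before any GPU code runs.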
4. Compiler / IR / Lowering
4.1 Compilation Pipeline
DSL Program (Python)
|
| tracing
v
Chunk DAG (per-chunk producer/consumer graph; exposes natural parallelism)
|
| lowering
v
Instruction DAG (nodes: send, recv, reduce, copy)
|
| fusion + aggregation + threadblock allocation + scheduling
v
MSCCL-IR (XML, tree-shaped, per-rank)
|
| runtime load
v
Cooperative single-kernel interpreter
4.2 Chunk DAG
- Captures global chunk movement: which rank produces what, which consumes what.
- Natural target for symmetry / parallelism extraction (the `parallelize(N)` modifier expands here).
4.3 Instruction DAG and Fusion Passes
The Instruction DAG has four base node types: send,
recv, reduce, copy. The compiler
runs peephole fusion passes over patterns of adjacent nodes:
| Fusion | Pattern | Effect |
|---|---|---|
| `rcs` (receive-copy-send) | recv + copy + send on same chunk | Single fused instruction; chunk forwarded without exiting registers. |
| `rrc` (receive-reduce-copy) | recv + reduce + copy local | Combines incoming partial sum with local data and writes the result. |
| `rrcs` (receive-reduce-copy-send) | recv + reduce + copy + send | The full Ring-AllReduce inner step; matches NCCL's hand-tuned fused kernel. |
| `rrs` (receive-reduce-send) | recv + reduce + send (no local copy) | Special case: result forwarded but not retained, freeing registers. |
These passes restore the register-resident dataflow that hand-tuned NCCL kernels rely on; without them, IR-emitted code would round-trip through global memory between every operation.
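The fusion table can be read as a longest-match rewrite over adjacent operations. The sketch below applies that idea to a flat list of opcodes for one chunk on one rank. It is an illustration only: the real compiler works on the Instruction DAG and must check dependencies, and `FUSION_PATTERNS` and `fuse` are hypothetical names.

```python
# Longest patterns first so that rrcs wins over its prefix rrc.
FUSION_PATTERNS = [
    (("recv", "reduce", "copy", "send"), "rrcs"),
    (("recv", "reduce", "copy"), "rrc"),
    (("recv", "reduce", "send"), "rrs"),
    (("recv", "copy", "send"), "rcs"),
]


def fuse(ops):
    """Greedily replace adjacent base opcodes with fused opcodes."""
    fused = []
    i = 0
    while i < len(ops):
        for pattern, name in FUSION_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(name)
                i += len(pattern)
                break
        else:
            # No pattern starts here: keep the base opcode unchanged.
            fused.append(ops[i])
            i += 1
    return fused


ring_step = fuse(["recv", "reduce", "copy", "send"])
forwarding = fuse(["recv", "copy", "send", "recv", "reduce", "send"])
partial = fuse(["send", "recv", "reduce", "copy"])
```

Here `ring_step` collapses to a single `rrcs`, `forwarding` becomes `rcs` followed by `rrs`, and `partial` keeps its leading `send` before an `rrc`.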
4.4 Aggregation
- Contiguous chunks destined for the same `(peer, channel)` are bundled into a single network transfer.
- Driven by the alpha-beta cost model `T = alpha + S * beta`: amortizing alpha at small `S` is the dominant win when many small chunks share a path.
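A short calculation shows why aggregation pays off under `T = alpha + S * beta`. The numbers are assumptions chosen for illustration: `ALPHA` is a made-up 5 microsecond per-transfer latency, and `BETA` is derived from the 25 GB/s per-NIC figure quoted in the hardware table.

```python
def transfer_time(size_bytes, alpha, beta):
    """Alpha-beta cost of one transfer: fixed latency plus per-byte cost."""
    return alpha + size_bytes * beta


def separate_time(chunk_sizes, alpha, beta):
    """Every chunk is sent as its own transfer and pays its own alpha."""
    return sum(transfer_time(s, alpha, beta) for s in chunk_sizes)


def aggregated_time(chunk_sizes, alpha, beta):
    """All chunks are bundled into one transfer that pays alpha once."""
    return transfer_time(sum(chunk_sizes), alpha, beta)


ALPHA = 5e-6        # assumed per-transfer latency in seconds
BETA = 1 / 25e9     # seconds per byte on a 25 GB/s link
chunks = [4096] * 8 # eight small contiguous chunks to the same peer

t_separate = separate_time(chunks, ALPHA, BETA)
t_aggregated = aggregated_time(chunks, ALPHA, BETA)
saved = t_separate - t_aggregated  # equals (len(chunks) - 1) * ALPHA
```

The bandwidth term is identical in both cases, so the saving is exactly seven alphas for eight chunks; the smaller the chunks, the larger the share of total time this represents.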
4.5 Threadblock Allocation
- Greedy heuristic assigns instructions to threadblocks based on unique `(send-peer, receive-peer, channel)` tuples.
- Each threadblock then handles a single concurrent network conversation, which minimizes intra-TB synchronization.
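The grouping rule can be sketched as a dictionary keyed by the tuple. This is a simplification under stated assumptions: the instruction records, the field names, and `allocate_threadblocks` are hypothetical, and the sketch only groups exact tuple matches, whereas the compiler's greedy heuristic may merge or balance threadblocks in ways not described in this summary.

```python
from collections import defaultdict


def allocate_threadblocks(instructions):
    """Assign one threadblock id per unique (send_peer, recv_peer, channel)."""
    tb_ids = {}
    threadblocks = defaultdict(list)
    for instr in instructions:
        key = (instr["send_peer"], instr["recv_peer"], instr["channel"])
        if key not in tb_ids:
            # First time this conversation is seen: open a new threadblock.
            tb_ids[key] = len(tb_ids)
        threadblocks[tb_ids[key]].append(instr["name"])
    return dict(threadblocks), tb_ids


# Hypothetical instructions for one rank; None means "no peer on that side".
instructions = [
    {"name": "s0", "send_peer": 1, "recv_peer": None, "channel": 0},
    {"name": "rrcs0", "send_peer": 1, "recv_peer": 3, "channel": 0},
    {"name": "s1", "send_peer": 1, "recv_peer": None, "channel": 0},
    {"name": "rrcs1", "send_peer": 1, "recv_peer": 3, "channel": 1},
]
threadblocks, tb_ids = allocate_threadblocks(instructions)
```

The two plain sends share a threadblock, while the two fused instructions land in different threadblocks because they use different channels.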
4.6 Scheduling and Critical-Path Priority
- Within each threadblock, instructions are ordered by `priority = depth + reverse_depth` over the Instruction DAG.
- This is the canonical critical-path heuristic: nodes with high combined upstream + downstream depth must run early to avoid becoming the schedule bottleneck.
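The priority formula can be sketched with two longest-path computations followed by list scheduling. This is an illustrative reading of `priority = depth + reverse_depth`, not the compiler's code: the DAG, the node names, and the function names are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
import heapq


def longest_path_lengths(nodes, edges_from):
    """Longest number of edges reachable from each node along edges_from."""
    memo = {}

    def visit(node):
        if node not in memo:
            memo[node] = max((visit(m) + 1 for m in edges_from[node]), default=0)
        return memo[node]

    return {node: visit(node) for node in nodes}


def schedule(successors):
    """Order nodes by critical-path priority while respecting dependencies."""
    nodes = list(successors)
    predecessors = {n: [] for n in nodes}
    for n, succs in successors.items():
        for s in succs:
            predecessors[s].append(n)

    depth = longest_path_lengths(nodes, predecessors)        # upstream
    reverse_depth = longest_path_lengths(nodes, successors)  # downstream
    priority = {n: depth[n] + reverse_depth[n] for n in nodes}

    # List scheduling: among ready nodes, run the highest priority first.
    remaining = {n: len(predecessors[n]) for n in nodes}
    ready = [(-priority[n], n) for n in nodes if remaining[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for s in successors[n]:
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return priority, order


dag = {
    "recv_a": ["reduce_a"],
    "reduce_a": ["send_a"],
    "recv_c": ["send_a"],
    "send_a": [],
    "copy_b": [],
}
priority, order = schedule(dag)
```

Every node on the longest chain gets the same priority (the chain length), so the chain `recv_a`, `reduce_a`, `send_a` is favored over the independent `copy_b`, which is scheduled last.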
4.7 Cross-Threadblock Synchronization
- When an instruction in TB A depends on the output of an instruction in TB B, the compiler inserts an explicit semaphore in global memory (write-then-read with `__threadfence_system()`).
- Avoids relying on grid-wide barriers, which would defeat the cooperative-launch single-kernel design.
4.8 Output: MSCCL-IR
- A tree-shaped, per-rank XML program.
- Each rank's tree describes its sequence of `(opcode, src, dst, channel, dependency)` records.
- Compatible with the prior MSCCL XML format, so SCCL/TACCL outputs can also feed the MSCCL runtime.
5. Runtime / Execution Model
5.1 Single-Kernel Cooperative Launch
- The MSCCL runtime launches one CUDA kernel per collective that interprets the MSCCL-IR for the local rank.
- Cooperative launch ensures all threadblocks run concurrently (necessary because cross-TB semaphores would deadlock with serialized scheduling).
- Interpreter-style execution avoids per-step kernel launch overhead.
5.2 Tile Execution and Pipelining
- The outermost loop divides each chunk into tiles sized to fit in NCCL's FIFO-slot abstraction (~512 KB to 5 MB).
- The runtime maintains `s` FIFO slots (default `s = 8`).
- While slot `i` is performing an inter-node IB transfer, slot `i+1` performs an intra-node NVLink transfer; pipelining hides IB latency behind NVLink throughput.
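The benefit of overlapping the two phases can be estimated with the textbook two-stage pipeline formula. This is a back-of-the-envelope sketch under assumptions: every tile takes the same time in each stage, there are enough FIFO slots to keep both stages busy, and the per-tile times below are invented for illustration.

```python
def serial_time(num_tiles, t_intra, t_inter):
    """No overlap: each tile finishes both phases before the next starts."""
    return num_tiles * (t_intra + t_inter)


def pipelined_time(num_tiles, t_intra, t_inter):
    """Two-stage pipeline: fill once, then the slower stage sets the pace."""
    if num_tiles == 0:
        return 0
    return t_intra + t_inter + (num_tiles - 1) * max(t_intra, t_inter)


NUM_TILES = 8   # illustrative tile count
T_INTRA = 10    # assumed NVLink time per tile, in microseconds
T_INTER = 40    # assumed IB time per tile, in microseconds

t_serial = serial_time(NUM_TILES, T_INTRA, T_INTER)        # 400
t_pipelined = pipelined_time(NUM_TILES, T_INTRA, T_INTER)  # 330
```

With these numbers the NVLink phase is almost entirely hidden behind the IB phase: the pipelined total is one NVLink tile time plus eight IB tile times.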
5.3 NCCL API Compatibility
- MSCCL runtime is invoked via the standard NCCL collective API.
- For unsupported (collective, size, topology) triples, the runtime falls back to the canned NCCL Ring/Tree kernel.
- Existing PyTorch / Horovod / DeepSpeed integrations require no source change — only library swap.
6. Expressed Algorithms
6.1 Ring AllReduce
- Implemented as `ReduceScatter` followed by `AllGather`, both in MSCCLang.
- Demonstrates that classical bandwidth-optimal algorithms compose naturally from chunk operations.
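The ReduceScatter-then-AllGather composition can be checked with a small pure-Python simulation. This is a sketch of the textbook ring schedule, not MSCCLang code: `ring_allreduce` is a hypothetical name, and each chunk is modeled as a single number so that reduction is plain addition.

```python
def ring_allreduce(inputs):
    """Simulate Ring AllReduce on n ranks with n chunks per rank."""
    n = len(inputs)
    assert all(len(buf) == n for buf in inputs), "need one chunk per rank"
    data = [list(buf) for buf in inputs]

    # ReduceScatter: at step s, rank r sends chunk (r - s) % n to rank r + 1,
    # which adds it to its own copy. Sends are snapshotted first so that all
    # ranks act on the values from the start of the step.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, idx, value in sends:
            data[(r + 1) % n][idx] += value

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # AllGather: at step s, rank r forwards chunk (r + 1 - s) % n to rank r + 1,
    # which overwrites its partial copy with the final value.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, value in sends:
            data[(r + 1) % n][idx] = value

    return data


inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_allreduce(inputs)
```

Each phase takes `n - 1` steps, which is where the `2(n-1)` total step count of Ring AllReduce comes from.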
6.2 All Pairs AllReduce (Novel Low-Latency)
- 2-step algorithm for small buffers.
- Step 1: each rank gathers a chunk from every other rank (effectively AllGather of partial chunks).
- Step 2: each rank reduces its incoming chunks and broadcasts the result back.
- Trades bandwidth for latency: two communication steps instead of the `2(n-1)` sequential steps of Ring; wins decisively at small sizes (3.0x speedup on 16x V100 at small buffers, see Sec. 8).
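The two steps can be simulated in a few lines and compared against Ring's step count. This is a sketch under assumptions: `all_pairs_allreduce` and `step_counts` are hypothetical names, each chunk is a single number, and rank `r` is assumed to own chunk `r`.

```python
def all_pairs_allreduce(inputs):
    """Simulate the 2-step All Pairs AllReduce on n ranks, n chunks each."""
    n = len(inputs)
    assert all(len(buf) == n for buf in inputs), "need one chunk per rank"

    # Step 1: every rank sends chunk r directly to rank r, which reduces
    # the n contributions it now holds.
    reduced = [sum(inputs[src][r] for src in range(n)) for r in range(n)]

    # Step 2: rank r sends its reduced chunk r directly to every other rank.
    return [[reduced[idx] for idx in range(n)] for _ in range(n)]


def step_counts(n):
    """Sequential communication steps: Ring needs 2(n-1), All Pairs needs 2."""
    return {"ring": 2 * (n - 1), "all_pairs": 2}


result = all_pairs_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
dgx2_steps = step_counts(16)  # 16 GPUs, as on the DGX-2
```

On 16 ranks the step count drops from 30 to 2, at the price of every rank talking to all peers at once, which is why the approach suits small buffers on switched fabrics.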
6.3 Hierarchical AllReduce (4-phase)
- Phase 1: intra-node ReduceScatter (NVLink/NVSwitch).
- Phase 2: inter-node ReduceScatter (IB).
- Phase 3: inter-node AllGather (IB).
- Phase 4: intra-node AllGather (NVLink/NVSwitch).
- Crucial for Azure NDv4: separates the cheap intra-NVSwitch domain from the expensive IB domain and pipelines them via tile execution.
6.4 Two-Step AllToAll
- Bundles many small cross-node AllToAll messages into a single large IB transfer per node-pair.
- Step 1: locally gather chunks destined for each remote node.
- Step 2: large bulk send across IB.
- Wins by a factor of `(GPUs/node)` in IB efficiency on large-buffer AllToAll (1.3x over hand-optimized CUDA at 256 GPUs).
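The `(GPUs/node)` factor can be reproduced by counting IB messages from one GPU's point of view. This is a counting sketch based on the description above, not the paper's implementation: it assumes a direct AllToAll sends one message per remote GPU, while the two-step variant sends one bundled message per remote node. The function names are hypothetical.

```python
def ib_messages_per_gpu(num_nodes, gpus_per_node, two_step):
    """Cross-node messages one GPU sends during a single AllToAll."""
    remote_nodes = num_nodes - 1
    if two_step:
        return remote_nodes                  # one bundle per remote node
    return remote_nodes * gpus_per_node      # one message per remote GPU


def chunks_per_ib_message(gpus_per_node, two_step):
    """Chunks carried by each cross-node message."""
    return gpus_per_node if two_step else 1


NODES, GPUS = 32, 8  # 256 GPUs on 8-GPU NDv4 nodes

direct = ib_messages_per_gpu(NODES, GPUS, two_step=False)    # 248
bundled = ib_messages_per_gpu(NODES, GPUS, two_step=True)    # 31
direct_volume = direct * chunks_per_ib_message(GPUS, two_step=False)
bundled_volume = bundled * chunks_per_ib_message(GPUS, two_step=True)
```

The data volume crossing IB is unchanged; only the number of messages, and therefore the number of alphas paid, shrinks by the GPUs-per-node factor.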
6.5 AllToNext (Custom)
- Not an MPI-standard collective. Rank `i` sends its entire buffer to rank `i+1`.
- Naive CUDA implementation uses one IB NIC; MSCCLang authors a scattered-send variant that fans out across all 8 NICs of an NDv4 node before shipping cross-node.
- Demonstrates extensibility: a 15-line MSCCLang program implements a collective NCCL has no template for, and beats naive CUDA by 14.5x via NIC parallelism.
7. Evaluation Setup
7.1 Clusters
| Cluster | GPUs / Node | Memory | Intra | Inter |
|---|---|---|---|---|
| Azure ND A100 v4 | 8x A100 | 80 GB | NVLink 3.0 / 6 NVSwitches (600 GB/s bi-dir) | 8x HDR IB NICs (25 GB/s each) |
| NVIDIA DGX-2 | 16x V100 | 32 GB | NVLink 2.0 / 6 NVSwitches | 1x HDR IB NIC per GPU pair |
7.2 Software Stack
- NCCL 2.8.4-1 (forked to host MSCCL runtime)
- CUDA 11.x
- MSCCLang DSL (Python-embedded)
- Compiler is Python; emits MSCCL-IR XML.
7.3 Baselines
- NCCL Ring/Tree — production library default.
- Hand-optimized CUDA — bespoke kernels written by Microsoft engineers for specific Azure workloads (e.g., the AllToAll used in OpenAI Copilot training).
- SCCL — synthesizer that emits MSCCL-IR XML; comparison isolates the value of MSCCLang's compiler relative to a synthesizer-only approach.
7.4 Workloads
- Microbenchmarks: AllReduce, AllToAll, AllToNext over buffer sizes from a few KB to 1 GB, sweeping 1-32 nodes (8 to 256 GPUs).
- End-to-end: BERT, GPT, MoE, the Azure OpenAI Copilot training pipeline.
7.5 Metrics
- Microbench: latency in microseconds; speedup factor.
- End-to-end: training throughput / step time.
- Expressiveness: lines of code (LOC).
- (No explicit compilation-time table, but lowering is described as fast — heuristic, not solver-bound.)
8. Experimental Results
8.1 AllReduce Microbenchmark
| Setup | Algorithm | Buffer regime | Speedup vs. NCCL |
|---|---|---|---|
| 1-node NDv4 (8x A100) | All Pairs / Hierarchical | 32 KB - 3 MB | up to 1.9x |
| 1-node DGX-2 (16x V100) | All Pairs | small (sub-MB) | up to 3.0x |
| Multi-node NDv4 | Hierarchical 4-phase | medium-to-large | matches or beats NCCL |
- At small sizes the All Pairs algorithm wins because it converts many short Ring steps into two longer phases — directly exploiting the alpha-beta cost asymmetry.
- At large sizes the Hierarchical algorithm wins on multi-node configs by pipelining intra-NVLink and inter-IB phases via tile execution.
8.2 AllToAll Microbenchmark
| Setup | Buffer | vs. hand-CUDA | vs. NCCL |
|---|---|---|---|
| 32-node, 256x A100 | > 512 MB | 1.3x | ~1.20x (20%) |
- Two-Step AllToAll pays a single alpha per remote-node pair instead of one per remote-GPU pair; saving compounds at high GPU-count.
8.3 AllToNext Microbenchmark
| Setup | Speedup vs. naive CUDA |
|---|---|
| 3-node, 24x A100 | up to 14.5x |
- Fans across all 8 NDv4 IB NICs vs. naive 1-NIC implementation.
- Demonstrates that MSCCLang's productivity advantage is most visible in the long-tail of non-standard collectives.
8.4 End-to-End Training
| Workload | Scale | Speedup |
|---|---|---|
| Azure OpenAI Copilot | production | 20% GPU-time reduction |
| Large MoE | 256x A100 | 1.10x - 1.89x depending on architecture |
- Copilot's improvement comes from replacing the training-step AllToAll with a Two-Step variant.
- MoE range reflects per-architecture sensitivity to AllToAll efficiency: deeper / wider experts hit the 1.89x end.
8.5 Expressiveness
| Algorithm | MSCCLang LOC | Hand-CUDA LOC |
|---|---|---|
| Two-Step AllToAll | 15 | 70 |
- ~4.7x reduction in code volume; correctness comes for free from the DSL's chunk-write-once invariant.
8.6 Comparison to SCCL
- The paper compares MSCCL-IR programs authored in MSCCLang against MSCCL-IR programs emitted by SCCL (the synthesizer).
- MSCCLang programs are faster because the compiler runs fusion + aggregation passes that SCCL's emit-step skips. (The paper does not quote a single percentage here — it positions MSCCLang as a compiler that completes the SCCL pipeline.)
9. Limitations
- Static, per-topology compilation. Each MSCCL-IR is generated for a fixed `(collective, ranks, topology)`; algorithm switching at runtime is delegated to a table of pre-compiled IRs rather than online recompilation.
- No cross-collective fusion. Adjacent collectives in a step (e.g., AllToAll-AllToAll in MoE forward + backward) are compiled independently; cross-collective optimization is named as future work.
- No compute-comm fusion in the DSL. The DSL describes communication only (with reduction the sole compute primitive). Overlap with backward-pass compute relies on framework-level scheduling outside MSCCLang.
- NVIDIA-centric. All measurements use V100 / A100 + CUDA 11 + NCCL 2.8.4-1. ROCm / RCCL not evaluated; portability claim relies on inherited NCCL P2P primitives.
- Author burden. Algorithms must be hand-written; MSCCLang doesn't synthesize. SCCL/TACCL fill this gap upstream by emitting MSCCL-IR directly, but they bypass MSCCLang's compiler optimizations.
- NCCL-version coupling. The runtime is embedded in a NCCL fork; upgrading to NCCL 2.18+ requires re-porting the interpreter and cooperative-launch glue.
10. Related Work
| System | Position vs. MSCCLang |
|---|---|
| NCCL / RCCL | Foundation MSCCLang builds on; provides Ring/Tree algorithms, P2P transports, channels. MSCCLang adds programmability above. |
| SCCL | Synthesizer for intra-node collective algorithms; emits MSCCL-IR XML. Lacks the fusion / aggregation / scheduling passes that MSCCLang's compiler adds. |
| TACCL | Multi-node synthesizer using sketches + MILP; emits MSCCL-IR XML. Same gap as SCCL re: execution-level optimization. |
| Blink | Generates fast collectives by traversing spanning trees of the topology; requires manual implementation of routing decisions. MSCCLang provides the implementation layer Blink lacks. |
| Horovod / BytePS | Framework-level orchestration (Wait-Free Backprop, tensor fusion, partition); operate above NCCL. MSCCLang is orthogonal — it improves the kernel below. |
| MSCCL (predecessor) | Same XML IR but no compiler — algorithms must be hand-authored XML. MSCCLang is the missing high-level language and optimizer for MSCCL. |
| DSL/compiler analogs | The paper positions MSCCLang in the same lineage as Halide / TVM (DSL + IR + lowering passes) but specialized for collective communication rather than tensor computation. |
11. Conclusion and Future Work
- The chunk-oriented DSL combined with an optimizing compiler and an interpreter-based runtime achieves hand-tuned-CUDA performance from high-level descriptions.
- Authors report this lowers the barrier to algorithm exploration enough that ML researchers (not just CUDA experts) can prototype custom collectives.
- Future work explicitly mentioned:
- Extend the DSL to express compute scheduling alongside communication for compute-comm overlap.
- Cross-collective optimization within a training step.
- Better integration with synthesis tools (SCCL/TACCL) so synthesizer outputs inherit MSCCLang's fusion / pipelining.
12. Key Equations and Cost Models
| Model | Formula | Used for |
|---|---|---|
| Alpha-beta link cost | `T = alpha + S * beta` | Aggregation pass: bundles small chunks to amortize alpha. |
| Pipelining | Buffer split into `s` FIFO slots; `s` defaults to 8 | Tile execution: overlap intra-node and inter-node phases. |
| Critical-path priority | `priority = depth + reverse_depth` over the Instruction DAG | Scheduler: orders instructions within each threadblock. |
No solver-based cost model (no MILP / SMT) — MSCCLang's compiler is heuristic-driven and fast.
13. Named Methods, DSL Primitives, Compiler Passes
| Term | One-line definition |
|---|---|
| Chunk | Abstract fixed-size data unit; the routing primitive of the DSL. |
| Chunk-oriented programming | Style where the author specifies how chunks move, not how threads execute. |
| Chunk DAG | First IR; per-chunk producer/consumer graph used to expose parallelism. |
| Instruction DAG | Second IR; operations are `send`, `recv`, `reduce`, `copy`. |
| MSCCL-IR | Final XML representation; a per-rank tree of opcodes consumed by the runtime. |
| rcs / rrc / rrcs / rrs | Peephole fusion patterns for receive(-reduce)(-copy)(-send). |
| Aggregation | Compiler pass that bundles contiguous chunks to amortize alpha. |
| Threadblock allocation | Greedy assignment of instructions to TBs by `(send-peer, recv-peer, channel)` tuple. |
| Cooperative launch | All TBs guaranteed to run concurrently, enabling cross-TB semaphores. |
| Tile execution | Runtime loop dividing chunks into FIFO-slot-sized tiles for pipelining. |
| All Pairs algorithm | 2-step low-latency AllReduce expressed in MSCCLang; wins at small buffers. |
| Two-Step AllToAll | Bundle-then-bulk-send AllToAll; wins on multi-node large buffers. |
| AllToNext | Custom collective (rank `i` -> rank `i+1`) demonstrating extensibility. |
| `parallelize(N)` | DSL modifier to instantiate `N` parallel copies of a code fragment. |
| Channel | Multiple parallel NCCL connections between the same GPU pair, exposed as the `ch` parameter. |
| Single-writer chunk | Safety invariant — each chunk reference may be written exactly once. |
14. Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| Latency-optimal AllReduce (All Pairs) wins at small sizes by 3.0x on 16x V100 | Sec. 8.1 |
| Bandwidth-optimal Hierarchical AllReduce wins at multi-node large sizes | Sec. 8.1 |
| Two-Step AllToAll converts per-GPU-pair alphas to per-node-pair alphas (1.3x at 256 GPUs) | Sec. 8.2 |
| NIC-fanout matters as much as algorithm for non-standard collectives (14.5x AllToNext) | Sec. 8.3 |
| End-to-end training picks up 20% on Copilot from collective-level wins | Sec. 8.4 |
| 4.7x LOC reduction (15 vs. 70 lines) without performance loss | Sec. 8.5 |
15. Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that, via the
NCCL tuner-plugin API, selects per-collective algorithm
(Ring / Tree / CollNet / NVLS), protocol (LL / LL128 /
Simple), nChannels, numThreads, and
chunkSize to minimize collective wall-clock on HPC GPU
clusters. MSCCLang is a DSL + compiler + runtime that sits
below DynamICCL's selection layer — it is the substrate that
produces the algorithms DynamICCL chooses among. Each MSCCLang finding
maps to a specific DynamICCL design implication:
Direct mappings:
| MSCCLang finding | DynamICCL design implication |
|---|---|
| Each compiled MSCCL-IR is a discrete program loaded at runtime | The `algorithm` action enum must be extended dynamically: when MSCCL is loaded, every MSCCL-IR variant (All Pairs, Hierarchical, Two-Step AllToAll, AllToNext, ...) becomes a new categorical action. |
| All Pairs AllReduce 3.0x at small sizes; Ring/Hierarchical at large sizes | Confirms message-size log-binning as primary state feature; bias the action prior toward latency-optimal (Tree-like / All Pairs) at < ~1 MB and bandwidth-optimal (Ring / Hierarchical) at > ~16 MB. |
| Tile execution with `s = 8` FIFO slots | DynamICCL's `chunkSize` and `numPipeOps` actions are the NCCL-side equivalents; explore this axis explicitly rather than holding fixed. |
| Critical-path priority `depth + reverse_depth` | Cannot be observed directly, but a coarse "is this collective on the step's critical path?" feature can be derived from the recent-collective LSTM window already in DynamICCL's state. |
| AllToNext 14.5x via 8-NIC fanout | Topology fingerprint must capture per-GPU NIC count, not just NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet — the existing 4-class fingerprint loses the load-balanced-NIC regime. |
| Hierarchical 4-phase (intra-RS, inter-RS, inter-AG, intra-AG) | When DynamICCL observes a (NVLink + IB) topology, action prior should weight CollNet / NVLS heavily — the NCCL-side analog of hierarchical decomposition. |
| Two-Step AllToAll bundles cross-node sends | At multi-node large AllToAll calls, bias toward larger `chunkSize` and Simple protocol; bundle to amortize alpha. |
| Two-Step AllToAll in 15 LOC vs. 70 in CUDA | Authoring cost of new MSCCL-IR variants is low enough that the DynamICCL action set can grow as new collectives are added — supports a catalog-extensible RL design rather than a fixed-cardinality one. |
| Per-collective wall-clock is the headline metric (microbench latency, training throughput) | DynamICCL's reward `r = -collective_wall_clock_us` matches; sign and unit consistent with paper's evaluation. |
| 20% GPU-time reduction at OpenAI Copilot scale | A real-world floor on the value of better algorithm selection — sets DynamICCL's expected end-to-end gain envelope when picking from a richer catalog. |
Specific design priors for the RL agent:
Action-space expansion as a function of loaded plugin. The action space is dynamic: `algorithm in {Ring, Tree, CollNet, NVLS}` when only NCCL is loaded; `algorithm in {Ring, Tree, CollNet, NVLS, MSCCL_AllPairs_AR, MSCCL_Hierarchical_AR, MSCCL_TwoStep_A2A, MSCCL_AllToNext, ...}` when the MSCCL runtime is loaded. The policy should accept a plugin-capability vector as input so a single trained network generalizes across deployments.
State features motivated by MSCCLang's compiler decisions.
- Message size (already log-binned) — drives All Pairs vs. Ring crossover.
- Tile / FIFO slot context: derived from NCCL's `chunkSize / numPipeOps`; lets policy learn pipeline-depth interactions.
- Per-GPU NIC count: captures AllToNext-style fanout regimes.
- NVSwitch presence flag — distinguishes DGX-2 (16 GPU NVSwitch) from NDv4 (8 GPU NVSwitch); changes hierarchical-phase ratios.
- Recent-collective LSTM context (already present) — proxy for critical-path-ness.
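The dynamic action space, the plugin-capability vector, and the log-binned message size can be sketched as follows. This is a minimal illustration and not DynamICCL's implementation: the function names are hypothetical, the variant identifiers are the ones listed above, and the binning uses a plain floor of log2, which is one assumed choice among several.

```python
NCCL_ALGORITHMS = ["Ring", "Tree", "CollNet", "NVLS"]

# Variants the policy knows about; the loaded subset differs per deployment.
KNOWN_MSCCL_VARIANTS = [
    "MSCCL_AllPairs_AR",
    "MSCCL_Hierarchical_AR",
    "MSCCL_TwoStep_A2A",
    "MSCCL_AllToNext",
]


def build_action_space(loaded_variants):
    """NCCL built-ins plus whichever MSCCL-IR variants are loaded."""
    return NCCL_ALGORITHMS + sorted(loaded_variants)


def capability_vector(loaded_variants):
    """One bit per known variant, fed to the policy as an input feature."""
    return [1 if v in loaded_variants else 0 for v in KNOWN_MSCCL_VARIANTS]


def log2_size_bin(message_bytes):
    """Floor of log2(message size); exact because it uses integer bits."""
    if message_bytes <= 0:
        raise ValueError("message size must be positive")
    return message_bytes.bit_length() - 1


loaded = {"MSCCL_TwoStep_A2A", "MSCCL_AllPairs_AR"}
actions = build_action_space(loaded)
capabilities = capability_vector(loaded)
size_bin = log2_size_bin(1 << 20)  # 1 MiB falls in bin 20
```

Keeping the capability vector at a fixed length lets the same network be reused when a deployment loads only some of the known variants.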
Exploration prior.
- Small messages (< ~256 KB): bias toward latency-optimal MSCCL variants (All Pairs analogs) or NCCL Tree + LL/LL128 + nChannels=1-2.
- Medium messages (256 KB - 16 MB): main exploration zone — let RL learn the crossover.
- Large messages (> 16 MB): bias toward bandwidth-optimal Hierarchical / Ring + Simple + nChannels=4-8.
- Multi-node + AllToAll: prior toward Two-Step / bundled variants.
- Multi-NIC topology + custom collectives: prior toward NIC-fanout-aware variants.
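The size-based part of this prior can be written as a lookup. The sketch below is illustrative only: `exploration_prior` and its dictionary layout are hypothetical, and the thresholds are interpreted in binary units (1 KB = 1024 bytes), which the text does not specify.

```python
KB = 1024          # assumption: binary units
MB = 1024 * KB


def exploration_prior(message_bytes):
    """Map a message size to the regime and bias described in the list."""
    if message_bytes < 256 * KB:
        return {
            "regime": "small",
            "lean": "latency",
            "nccl_algorithm": "Tree",
            "protocols": ["LL", "LL128"],
            "n_channels_range": (1, 2),
        }
    if message_bytes <= 16 * MB:
        # Main exploration zone: no bias, the agent learns the crossover.
        return {"regime": "medium", "lean": "explore"}
    return {
        "regime": "large",
        "lean": "bandwidth",
        "nccl_algorithm": "Ring",
        "protocols": ["Simple"],
        "n_channels_range": (4, 8),
    }


small = exploration_prior(32 * KB)
medium = exploration_prior(4 * MB)
large = exploration_prior(64 * MB)
```

The topology-dependent items in the list (multi-node AllToAll, multi-NIC fanout) would be layered on top of this lookup and are left out of the sketch.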
Reward shaping.
- Primary: `r = -collective_wall_clock_us` (matches MSCCLang's reported metric).
- Optional secondary: penalize p99 to catch tile-pipelining tail outliers (consistent with MSCCLang's tile-execution model where bad slot interleaving creates spikes).
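One way to combine the two reward terms is a weighted sum. This is a sketch under assumptions: the text only says that p99 should be penalized, so the linear form, the `p99_weight` parameter, and the function name `reward` are all hypothetical choices.

```python
def reward(wall_clock_us, p99_us=None, p99_weight=0.0):
    """Negative wall-clock time, optionally with a linear p99 penalty."""
    r = -wall_clock_us
    if p99_us is not None:
        # Assumed form of the secondary term: subtract a weighted p99.
        r -= p99_weight * p99_us
    return r


primary_only = reward(100.0)                                 # -100.0
with_tail = reward(100.0, p99_us=400.0, p99_weight=0.25)     # -200.0
```

With `p99_weight` left at zero the function reduces to the primary reward, so the secondary term can be switched on without changing the interface.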
Research positioning: generation vs. selection. MSCCLang generates and executes algorithms; DynamICCL selects among them online. The two systems compose cleanly:
SCCL / TACCL synthesize -> MSCCLang lowers + executes
|
v
Pre-compiled MSCCL-IR catalog
|
v
DynamICCL selects
|
v
NCCL invokes the chosen kernel
This positions DynamICCL as the missing online selector that MSCCLang's "static / per-topology compilation" limitation explicitly leaves open.
Open-problem alignment.
- MSCCLang's static, per-topology compilation limitation (Sec. 9) maps onto DynamICCL's discrete-action selection problem: rather than JIT-recompile, choose the right pre-compiled IR.
- MSCCLang's future-work item "compute-comm scheduling" suggests DynamICCL's reward could optionally include a step-time bonus when the chosen collective covers the backward-pass compute window — an extension once MSCCLang grows compute primitives.
- MSCCLang's future-work item "cross-collective fusion" hints at a longer-horizon DynamICCL formulation where the MDP horizon extends across consecutive collectives in a step rather than treating each in isolation.
Exploration budget — prefer offline-precomputed IRs to live recompilation. MSCCLang compilation is offline and fast (heuristic-driven, no MILP/SMT). DynamICCL must amortize exploration against a fixed pre-compiled action set; the policy is discrete-categorical over loaded MSCCL-IR variants plus NCCL built-ins, not parametric-continuous over any synthesized space.
Topology embedding refinement. MSCCLang's per-platform results (NDv4 vs. DGX-2) are sensitive to NVLink generation, NVSwitch count, and IB-NIC density. The four-class topology fingerprint is too coarse — DynamICCL should encode (NVLink-bandwidth-class, NVSwitch-count, NICs-per-GPU, IB-bandwidth-class) as a small embedding rather than a one-hot.
Catalog versioning. MSCCL-IR variants are stable artifacts; DynamICCL can fingerprint each loaded IR (by hash of XML) and treat them as a versioned action set. When the operator adds or removes IR variants, the policy can either (a) retrain the categorical head or (b) feed an IR-feature embedding (collective type, latency-vs-bandwidth lean) so a single network handles a growing catalog.
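Fingerprinting by hash of the XML can be sketched with the standard library. This is an illustration only: `fingerprint_ir` and `build_catalog` are hypothetical names, SHA-256 is an assumed choice of hash, and the XML strings are placeholders that do not follow the real MSCCL-IR schema.

```python
import hashlib


def fingerprint_ir(xml_text):
    """Stable identifier for one IR file: SHA-256 of its UTF-8 bytes."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def build_catalog(named_irs):
    """Map fingerprint -> variant name for every loaded IR."""
    return {fingerprint_ir(xml): name for name, xml in named_irs.items()}


# Placeholder documents standing in for real MSCCL-IR files.
loaded_irs = {
    "MSCCL_AllPairs_AR": "<placeholder variant='allpairs' version='1'/>",
    "MSCCL_TwoStep_A2A": "<placeholder variant='twostep' version='1'/>",
}
catalog = build_catalog(loaded_irs)

# Any edit to the file produces a different fingerprint.
edited = fingerprint_ir("<placeholder variant='allpairs' version='2'/>")
is_new_version = edited not in catalog
```

Hashing the raw text treats whitespace-only edits as new versions; canonicalizing the XML before hashing would avoid that, at the cost of an extra parsing step.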
Quantitative anchor for expected gains. MSCCLang reports microbench wins of 1.9x (AllReduce small), 1.3x (AllToAll large), 14.5x (AllToNext), and end-to-end 1.10-1.89x (MoE) and 20% (Copilot). These bound the upper end of what DynamICCL can recover by algorithm selection alone — when DynamICCL picks the right MSCCL-IR variant per call. They also set realistic stretch goals: a learned online selector that achieves >80% of MSCCLang's offline-best across regimes is a defensible contribution.