GC3: An Optimizing Compiler for GPU Collective Communication — Detailed Summary
Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research Asia) | arXiv:2201.11840v3 (19 Jul 2022) — pre-print of the ASPLOS '23 MSCCLang paper | Code: github.com/microsoft/msccl-tools, github.com/microsoft/msccl
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them. Note: the paper calls the system GC3; post-publication it was rebranded to MSCCLang for ASPLOS '23. Same authors, same compiler, same runtime, same headline numbers. Cross-references to the 0033 MSCCLang summary are included where the two papers' content overlaps.
Abstract
"This paper introduces GC3, a system for programmable GPU communication. GC3 provides a domain specific language for writing collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly in an interpreter based runtime. We used GC3 to write novel collective algorithms for AllReduce and AllToAll that are up to 1.9x and 1.3x faster than hand-optimized implementations, respectively."
- The paper targets the communication bottleneck in large-scale ML training (data, model, pipeline, and expert parallelism).
- It contributes (i) a chunk-oriented Python-embedded DSL, (ii) an optimizing compiler producing GC3-IR, (iii) an interpreter-based runtime layered into NCCL.
- Headline empirical claim: 1.9x AllReduce / 1.3x AllToAll over hand-optimized baselines, plus a 14.5x AllToNext speedup from exploiting multi-NIC bandwidth that NCCL's primitives do not.
1. Introduction
Communication wall:
- Distributed training is now the norm; communication is increasingly the dominant cost. The paper cites DeepLight (~2 GB params) at 79% of step time in collectives versus ResNet-50 (~100 MB) at 3%.
- Vendor libraries (NCCL, RCCL) provide only a small fixed catalog of collectives (Ring, double-binary Tree). They are tuned for canonical NVLink-symmetric topologies and break down on irregular fabrics (NVSwitch islands, asymmetric GPU/NIC ratios) and at small/medium message sizes.
Productivity gap:
- ML researchers can describe better algorithms but cannot reach hand-tuned-kernel performance without weeks of CUDA work.
- Algorithm-synthesis systems (SCCL, TACCL, BLINK) discover better schedules but emit XML/CUDA without execution-level optimizations (no instruction fusion, no pipelining, no threadblock-aware scheduling). Naive composition of multiple kernel launches squanders the algorithmic gains.
GC3's contribution:
- A unified system: chunk-oriented DSL + optimizing compiler + interpreter-based runtime embedded in NCCL.
- Lets researchers express custom collectives at high level and obtain hand-tuned-CUDA performance.
- Demonstrated on novel algorithms: All Pairs AllReduce, 4-phase Hierarchical AllReduce, Two-Step AllToAll, AllToNext.
Naming note: The paper uses GC3
throughout. The publicly released artifact is hosted at
github.com/microsoft/msccl-tools and
github.com/microsoft/msccl; the system was later rebranded
to MSCCLang at ASPLOS '23. Authors and core technical
content are identical between the two.
2. GC3 Example: Hierarchical AllReduce
- The introductory example is a 4-phase Hierarchical
AllReduce:
- Intra-node ReduceScatter
- Inter-node ReduceScatter
- Inter-node AllGather
- Intra-node AllGather
- This decomposition exploits NVLink intra-node bandwidth and IB inter-node bandwidth distinctly — a single Ring AllReduce cannot.
- DSL constructs introduced:
chunk(rank, buf, idx, count)— declare a chunk referencec1.copy(dst_rank, dst_buf, dst_idx, ch=...)— point-to-point copyc1.reduce(c2, ch=...)— reduce two chunksparallelize(N)— split a transfer across N parallel channelsaggregation— concat contiguous chunks
- Quantitative motivation given here: GC3's hierarchical AR achieves 1.22-1.29x language-model-inference speedup on 8x A100 and 1.10-1.89x MoE-training speedup on 256x A100 versus the NCCL Ring/Tree fallback (Section 7.5).
3. GC3 DSL
3.1 Buffers
- Three buffer classes:
input(read-only initial state),output(final destination),scratch(temporary). - A collective specifies a postcondition mapping ranks/indices to expected reduced values; the compiler validates the algorithm produces the postcondition.
3.2 Operations
c1.copy(dst_rank, dst_buf, dst_idx, ch=...)— declarative movec1.reduce(c2, ch=...)— produces a new chunk equal toc1 op c2ch=argument explicitly assigns the operation to a NCCL channel
3.3 Safety by Construction
- Chunk references obey single-writer semantics. Once
a chunk is written (by
copyorreduce), the prior reference becomes stale; attempting to use a stale reference is a compile-time error. - This eliminates data races by construction — no explicit synchronization in user code.
3.4 DSL Code Snippets in the Paper
- Fig. 3a/3b: DSL code for
HierarchicalAllReducewith helper functions forReduceScatterandAllGather. - Fig. 8: DSL code for the Two-Step
alltoall(N, G)algorithm (~16 lines) that bundles cross-node sends.
4. Lowering GC3 Programs
4.1 Tracing -> Chunk DAG
- The DSL is traced to produce a Chunk DAG that captures global data movement.
- Nodes:
copyandreduceoperations. - Edges:
- True dependencies: chunk
c2depends on the value ofc1. - False dependencies: two operations reuse the same buffer index (write-after-write or write-after-read).
- True dependencies: chunk
4.2 Lowering -> Instruction DAG
- The Chunk DAG is lowered to an Instruction DAG of
low-level primitives:
send,recv(point-to-point)reduce,copy(local)
- Edges:
- Communication edges: synchronization between sender
and receiver of a
send/recvpair. - Processing edges: execution order within a single rank.
- Communication edges: synchronization between sender
and receiver of a
4.3 Fusion / Peephole Optimization
The compiler fuses adjacent instruction patterns into single primitives that keep intermediates register-resident, avoiding global-memory traffic:
| Fused name | Expansion | Effect |
|---|---|---|
rrc (recvReduceCopy) |
recv -> reduce -> copy | one global write |
rcs (recvCopySend) |
recv -> copy -> send | back-to-back forwarding |
rrcs (recvReduceCopySend) |
recv -> reduce -> copy -> send | full pipeline-stage fusion |
rrs (recvReduceSend) |
recv -> reduce -> send (no local save) | pure intermediate |
- Fused instructions reduce DRAM bandwidth pressure and shorten the per-tile critical path inside the interpreter loop.
4.4 Aggregation
- The compiler merges contiguous chunks into a single transfer to
amortize the alpha (start-up) cost of the canonical
T = alpha + S * betamodel (Section 5.1). - Hardware-specific alpha/beta numbers are not reported; the optimization is qualitative in the paper.
5. Scheduling GC3 Programs
5.1 Cost Model
- Canonical alpha-beta model:
T = alpha + S * beta. - alpha = per-message start-up cost; beta = per-byte cost.
- Aggregation amortizes alpha by sending contiguous chunks together.
5.2 Scheduling Algorithm
The compiler uses a 5-step greedy heuristic:
- Create thread blocks. Scan all instructions per
GPU; create one TB for every unique
(send-peer, recv-peer, channel)tuple. - Compute dependency depth. Number of hops a chunk has already traversed; lower depth = scheduled earlier.
- Compute reverse dependency depth. Hops remaining; higher reverse depth = closer to the critical path.
- Sort. Global topological order via a heap with
priority
(dep_depth ascending, reverse_dep_depth descending). - Assign. Process instructions in sorted order and append each to the matching threadblock.
The priority depth + reverse_depth is the same family
used by classic critical-path schedulers and shows up in the MSCCLang
ASPLOS '23 paper (0033) as well.
5.3 Threadblock Constraints
- Each TB is restricted to at most one send-peer and one recv-peer. This avoids serialization and matches NCCL's per-channel transport model.
- Multiple TBs per channel-pair are spawned via
parallelize(N)(Section 5.4).
5.4 Parallelization
(parallelize(N))
- One A100 threadblock cannot saturate a 600 GB/s NVLink link.
parallelize(N)(also calledrin the paper) splits a transfer acrossNparallel TBs to drive bandwidth. - Optimal
ris workload- and platform-dependent (Section 7.4) and is currently manually tuned.
5.5 Deadlock Avoidance
- Cooperative-launch kernels run all TBs concurrently; cycles in the inter-TB dependency graph would deadlock.
- The compiler enforces a global topological order that respects both communication edges and processing edges. Because TBs execute their assigned instruction sequences in this order, no cycles can arise.
6. GC3 Runtime
6.1 NCCL Integration
- The runtime is a fork of NCCL 2.8.4-1.
- It reuses NCCL's connection setup, channel abstraction (NVLink, IB, PCIe, TCP transports), and FIFO-slot send/recv buffers.
- Replaces NCCL's canned Ring/Tree kernels with a single GC3-IR-interpreting kernel.
6.2 Cooperative Kernel Launch
- Uses
cudaLaunchCooperativeKernelso all TBs are simultaneously resident on the GPU. Required for inter-TB dependencies. - Number of TBs is bounded by the GPU's SM count (a hard ceiling, Section 8 / Limitations).
6.3 Interpreter Pseudocode (Fig. 5)
for (t = 0; t < chunkSize; t += tileSize) {
for (s = 0; s < N; s++) {
auto instr = instrs[s];
if (tid < D)
wait(semaphore[instr.depBid[tid]], instr.depStep[tid]);
switch (instr.opCode) {
case SEND: send(instr.srcPtr + srcOff, count * tileSize); break;
case RECV: recv(instr.dstPtr + dstOff, count * tileSize); break;
case REDUCE: reduce(...); break;
case COPY: copy(...); break;
// and the fused: RRC, RCS, RRCS, RRS
}
if (instr.hasDep && tid == 0)
set(semaphore[bid], s);
}
}
- Outer loop: tile pipeline over
chunkSize / tileSizetiles. - Inner loop: instruction sequence per TB.
- Synchronization: per-step semaphores in global
memory;
waitblocks until predecessor TBs complete the corresponding step.
6.4 Protocols
- GC3 inherits NCCL's three protocols: Simple, LL, LL128.
- LL is used for small chunks to reduce latency; the choice is expressed at the IR level.
7. Evaluation
7.1 Methodology
- Platform A: 8x NVIDIA A100 (40 GB), 12x 3rd-gen NVLink to 6 NVSwitches (600 GB/s bi-dir intra-node), 2x HDR IB NICs at 25 GB/s each.
- Platform B: 16x NVIDIA V100 (32 GB), 6x 2nd-gen NVLink to 6 NVSwitches, 1x HDR IB NIC per pair of GPUs.
- Multi-node sweeps: 1, 2, 3, ..., up to 32 nodes (256x A100).
- Software: NCCL 2.8.4-1 base, CUDA 11.x.
- Iterations: 50 timed iterations after 20 warmup rounds.
- Baselines: NCCL Ring/Tree, hand-optimized CUDA (Two-Step AllToAll), naive composed kernels.
7.2 AllReduce
The paper evaluates three GC3 algorithms against NCCL Ring/Tree:
| Algorithm | Buffer regime | Platform | Speedup vs. NCCL |
|---|---|---|---|
| Ring AllReduce (GC3) | 32 KB - 3 MB | 8x A100 | up to 1.9x |
| All Pairs AllReduce | 1 KB - 1 MB | 8x A100 | up to 1.8x |
| All Pairs AllReduce | small (< few MB) | 16x V100 (DGX-2) | up to 3.0x |
| Hierarchical 4-phase | medium-large | multi-node A100 | matches/beats NCCL |
- Why GC3's Ring beats NCCL's Ring at 32 KB-3 MB:
instruction fusion (
rrcs) keeps reduced values in registers, eliminating per-step DRAM round-trips that NCCL's classic Ring incurs. - All Pairs AllReduce is a latency-optimal algorithm: every rank sends its chunk directly to every other rank, then reduces locally. Wins at small sizes where bandwidth is irrelevant.
7.3 AllToAll
| Buffer | Platform | vs. hand-CUDA | vs. NCCL |
|---|---|---|---|
| > 512 MB | 16-node, 256x A100 | 1.3x | ~20% (~1.20x) |
- The Two-Step AllToAll algorithm bundles cross-node
sends: intra-node AllToAll, then inter-node AllToAll using bundled
buffers. This reduces the inter-node message count by a factor of
G(GPUs per node) and amortizes per-message alpha cost. - DSL implementation: ~15-16 lines (Fig. 8) versus ~70 lines for the hand-tuned CUDA equivalent.
7.4 AllToNext (Custom Collective)
- Pattern: rank
i-> ranki + 1(a ring shift). Used in pipeline parallelism and some MoE routing patterns. - Result: up to 14.5x over a hand-written CUDA baseline at large buffers (3-node, 24x A100).
- Why such a large gain: A100 nodes have multiple IB NICs; the baseline uses one. GC3's AllToNext divides the buffer across all available IB NICs, so all NICs work in parallel. Demonstrates that GC3's value extends well beyond what NCCL's standard collective catalog can express.
- Ablation note:
parallelize(r=16)is used in AllToNext (Fig. 7g). Increasingrshows speedup until bandwidth saturates.
7.5 End-to-End Workloads
| Workload | Scale | Speedup |
|---|---|---|
| LM inference (GPT-3 class) | 8x A100 | 1.22x - 1.29x |
| MoE training | 256x A100 | 1.10x - 1.89x |
- The 1.10x - 1.89x range for MoE is workload-architecture-dependent (number of experts, activation size, batch size).
7.6 Ablations Reported
- Parallelize(N) factor: speedup grows with
runtil bandwidth saturates (no closed-form recipe given). - Aggregation: described qualitatively as alpha amortization; no isolated speedup table.
- Protocol comparison: Simple / LL / LL128 are compared in Fig. 7; LL128 is reported as best for Two-Step AllToAll at certain sizes.
- Compilation time: not explicitly quantified; programs take "15 minutes to an hour to write and manually optimize" (Section 7).
8. Limitations
- SM-count ceiling. Cooperative launch requires all TBs resident concurrently; the number of TBs is bounded by the GPU's SM count.
- Manual tuning of
rand protocol. Optimal parallelization factor and protocol choice still require human exploration (Section 7.4). - Static / per-topology compilation. Each GC3-IR is generated for a specific (collective, rank-count, topology) triple. Size-dependent algorithm switching is delegated to a runtime table of pre-compiled IRs.
- No compute-comm fusion. Reduction is the only compute primitive in the DSL; overlap with backward-pass compute relies on existing framework-level scheduling.
- NVIDIA-centric. All measurements use CUDA 11 + NCCL 2.8.4-1 on V100 / A100; AMD ROCm/RCCL is not evaluated.
- No automatic synthesis. Algorithms are human-authored (cf. SCCL/TACCL which synthesize via MILP/SMT).
- NCCL-version coupled. The runtime is forked from NCCL 2.8.4-1; upgrading requires re-porting the interpreter.
9. Open Problems / Future Work
The authors call out the following directions:
- Compute-aware DSL. Extend the language to express compute scheduling alongside communication, enabling fused gradient-production + inter-rank reduction.
- Automated tuning of
rand tile size. Replace manual exploration with heuristics or learned policies. - Cross-collective fusion. Optimize back-to-back collectives (e.g., AllToAll-AllToAll in MoE) as a unit.
- Dynamic / size-adaptive code generation. Today's design uses a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
- Synthesis-to-execution coupling. Couple SCCL/TACCL synthesis to GC3's compiler so discovered algorithms inherit fusion + pipelining.
10. Related Work
| System | Type | GC3's delta |
|---|---|---|
| NCCL / RCCL | Vendor library, fixed Ring/Tree | GC3 keeps NCCL's transport plumbing but replaces the kernel with a programmable interpreter. |
| SCCL [Cai et al.] | MILP synthesis of collective algorithms | GC3 builds on SCCL's algorithmic ideas but adds the low-level optimizations (fusion, parallelization, scheduling) that SCCL omits. |
| TACCL | Topology-Aware Collective Communication Library | Similar relationship to SCCL — TACCL discovers; GC3 executes. |
| BlueConnect [6, 7] / Blink [42] | Composed-kernel collectives | GC3 fuses multiple steps into a single cooperative kernel,
eliminating launch overhead and enabling cross-step
recvReduceSend fusion. |
| MSCCL++ [Microsoft] | DPDK-style fine-grained P2P primitives | GC3 is the higher-level DSL+compiler; MSCCL++ is the lower-level transport. The two are complementary. |
| SHArP / SwitchML | In-network reduction (hardware) | GC3 is software-only on commodity NVIDIA hardware; SHArP requires Mellanox switches. |
| Horovod / BytePS | Framework-level data-parallel schedulers | Operate above NCCL; GC3 is the NCCL-replacement layer. Composable. |
GC3 explicitly cites SCCL's MILP synthesizer as [4] (Cai et al., "Synthesizing optimal collective algorithms") and positions itself as the missing execution layer for that line of work.
11. Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| Instruction fusion (rrcs, rrc, rcs, rrs) keeps intermediates register-resident | Section 4.3, Section 6.3 |
Single TB cannot saturate NVLink — parallelize(r) is
essential |
Section 5.4, AllToNext (Sec 7.4) |
| Hierarchical 4-phase AllReduce wins for multi-node by exploiting NVLink + IB separately | Section 2, Section 7.2 |
| All Pairs AllReduce wins at small sizes (latency-bound) | Section 7.2, esp. DGX-2 (3.0x) |
| Bundling sends amortizes alpha (Two-Step AllToAll) | Section 5.1 cost model + Section 7.3 |
| Multi-NIC bandwidth is left on the floor by NCCL's AllToAll-derived AllToNext | Section 7.4 (14.5x) |
| LL/LL128/Simple choice is preserved at IR level; LL for small | Section 6.4 |
| A 15-line DSL program matches a 70-line hand-tuned CUDA kernel | Fig. 8, Section 7 |
12. Discussion of NCCL
- GC3 is layered into NCCL 2.8.4-1; it inherits NCCL's connection setup, channels, and transports (NVLink, NVSwitch, IB, PCIe, TCP) unchanged.
- The 3-way protocol family (Simple / LL / LL128) is preserved at the IR level; LL is selected for small chunks to reduce latency.
- The major delta vs. stock NCCL: a single-cooperative-kernel interpreter replaces the canned Ring/Tree kernels, enabling cross-step fusion and threadblock-aware scheduling.
- Programs are pre-compiled offline (15-60 min per algorithm) rather than synthesized at runtime — DynamICCL's online selection problem remains open.
13. Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects,
via the NCCL tuner-plugin API, per-collective algorithm
(Ring / Tree / CollNet / NVLS), protocol (LL / LL128 /
Simple), nChannels, numThreads, and
chunkSize to minimize collective wall-clock time on HPC GPU
clusters. GC3 adds new algorithms beneath the NCCL API, but its design
choices and limitations map directly onto DynamICCL's state vector,
action space, and exploration priors.
GC3 is the arXiv pre-print of MSCCLang (0033). The implications below
build on the MSCCLang relevance section but emphasize what GC3 surfaces
more concretely (in particular: parallelize(r), alpha-beta
cost model usage, and per-link-class bandwidth exploitation).
| GC3 finding | DynamICCL design implication |
|---|---|
| Each compiled GC3-IR is a discrete pre-compiled algorithm | Action-space expansion: when GC3/MSCCL is loaded, every IR variant
becomes a new entry in DynamICCL's algorithm enum. The
4-way Ring/Tree/CollNet/NVLS choice becomes a topology-conditioned bag
of pre-compiled IRs. |
parallelize(r): one A100 TB cannot saturate NVLink |
DynamICCL's nChannels action axis is real and
load-dependent. Bias toward nChannels=1-2 at small messages
(alpha-bound) and nChannels=4-8 at large messages
(beta-bound). |
| Protocol catalog (Simple/LL/LL128) preserved; LL for small chunks | Strong prior: LL for < ~16 KiB, LL128 for ~16 KiB-1 MiB, Simple for >= 1 MiB. Use this as the exploration anchor. |
| All Pairs wins at small sizes; Ring/Hierarchical at large | Action prior on algorithm should be size-conditioned;
on multi-node hierarchical fabrics, prefer
Hierarchical/CollNet/NVLS. |
| AllToNext exploits all IB NICs (14.5x) | Topology fingerprint must include per-GPU IB-NIC count, not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet). |
priority = depth + reverse_depth schedules
instructions |
Coarse proxy for DynamICCL: "is this collective on the critical path of the current step?" — derivable from the recent-collective LSTM window. |
Aggregation amortizes alpha (T = alpha + S * beta) |
DynamICCL's chunkSize action axis is the duals of GC3's
Aggregation. Fine chunks at large sizes (saturate beta), coarse at small
sizes (amortize alpha). |
| 1.22-1.29x LM inference, 1.10-1.89x MoE training, 14.5x AllToNext | Reward shaping: collective wall-clock translates to step-level wins
without further normalization.
r = -collective_wall_clock_us is the right primary
signal. |
r and protocol still manually tuned (Limitations) |
Open problem aligned with DynamICCL — RL-based
r/protocol selection is exactly the gap. |
Specific design priors:
Exploration prior on (algorithm, protocol):
- msg < 16 KiB: Tree / All-Pairs-style + LL,
nChannels=1,numThreads=128 - 16 KiB - 1 MiB: Ring or hierarchical + LL128 (RL explores)
- msg >= 1 MiB: Ring / Hierarchical + Simple,
nChannels=4-8,numThreads=512
- msg < 16 KiB: Tree / All-Pairs-style + LL,
Exploration prior on (nChannels, numThreads):
- GC3's
parallelize(r)confirms one TB cannot saturate NVLink at large sizes; maprdirectly tonChannelsexploration. - At small sizes, extra channels add only setup overhead; cap at 1-2.
- GC3's
Reward shaping:
- Primary:
r = -collective_wall_clock_us - Optional secondary: penalize p99 latency to catch tail outliers.
- GC3's end-to-end results (1.22-1.29x LM inference, 1.10-1.89x MoE) confirm collective-level wins translate to step-level wins linearly.
- Primary:
State features (per GC3's structural insights):
- Message size (log-binned)
- Model intensity I = C/D (workload identity feature)
- Topology fingerprint: NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet — but augmented with per-GPU IB-NIC count and intra-node NVSwitch count (motivated by GC3's AllToNext).
- Recent-collective timing window (LSTM-encoded)
- "Loaded GC3-IR variants" identifier vector (which custom IRs are available on this cluster)
Action-space duality (parallelize <-> nChannels; aggregation <-> chunkSize):
- GC3's
parallelize(N)and Aggregation are the same dimensions DynamICCL exposes asnChannelsandchunkSize. The RL agent should learn coarse chunks for small messages (effectively aggregation) and fine chunks for large messages (effectively pipelining).
- GC3's
Research positioning:
- GC3 pre-compiles algorithms and selects from a static IR table at runtime. DynamICCL inverts the locus: it learns which pre-compiled IR (and which NCCL config knobs) to invoke per call, online. The SCCL/TACCL -> GC3 -> DynamICCL stack is natural: synthesize, lower+execute, select.
Open-problem alignment:
- GC3 open problem #2 (automated tuning of
rand tile size) and #4 (dynamic size-adaptive code generation) are precisely DynamICCL's contribution at the selection layer. Given a fixed library of GC3-IR variants, DynamICCL learns the (size x topology x load) -> best-IR mapping without requiring JIT compilation. - GC3 open problem #5 (synthesis-to-execution coupling) hints at a unified pipeline: SCCL/TACCL synthesize MILP-optimal algorithms, GC3 lowers and runs them, DynamICCL selects per call. DynamICCL's contribution makes this pipeline closed-loop.
- GC3 open problem #2 (automated tuning of
Exploration budget:
- GC3 compilation is offline (15-60 min). DynamICCL must amortize exploration against a fixed, pre-compiled action set — discrete categorical policy over loaded GC3-IR variants, not parametric continuous over an infinite synthesized space.
- The agent's exploration budget is therefore bounded by the loaded library size; this is much smaller than NCCL-tuner priors initially suggest, making PPO/DQN tractable.