GC3: An Optimizing Compiler for GPU Collective Communication — Detailed Summary

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research Asia) | arXiv:2201.11840v3 (19 Jul 2022) — pre-print of the ASPLOS '23 MSCCLang paper | Code: github.com/microsoft/msccl-tools, github.com/microsoft/msccl

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them. Note: the paper calls the system GC3; post-publication it was rebranded to MSCCLang for ASPLOS '23. Same authors, same compiler, same runtime, same headline numbers. Cross-references to the 0033 MSCCLang summary are included where the two papers' content overlaps.


Abstract

"This paper introduces GC3, a system for programmable GPU communication. GC3 provides a domain specific language for writing collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly in an interpreter based runtime. We used GC3 to write novel collective algorithms for AllReduce and AllToAll that are up to 1.9x and 1.3x faster than hand-optimized implementations, respectively."


1. Introduction

Communication wall:

Productivity gap:

GC3's contribution:

Naming note: The paper uses GC3 throughout. The publicly released artifact is hosted at github.com/microsoft/msccl-tools and github.com/microsoft/msccl; the system was later rebranded to MSCCLang at ASPLOS '23. Authors and core technical content are identical between the two.


2. GC3 Example: Hierarchical AllReduce


3. GC3 DSL

3.1 Buffers

3.2 Operations

3.3 Safety by Construction

3.4 DSL Code Snippets in the Paper


4. Lowering GC3 Programs

4.1 Tracing -> Chunk DAG

4.2 Lowering -> Instruction DAG

4.3 Fusion / Peephole Optimization

The compiler fuses adjacent instruction patterns into single primitives that keep intermediates register-resident, avoiding global-memory traffic:

Fused name Expansion Effect
rrc (recvReduceCopy) recv -> reduce -> copy one global write
rcs (recvCopySend) recv -> copy -> send back-to-back forwarding
rrcs (recvReduceCopySend) recv -> reduce -> copy -> send full pipeline-stage fusion
rrs (recvReduceSend) recv -> reduce -> send (no local save) pure intermediate

4.4 Aggregation


5. Scheduling GC3 Programs

5.1 Cost Model

5.2 Scheduling Algorithm

The compiler uses a 5-step greedy heuristic:

  1. Create thread blocks. Scan all instructions per GPU; create one TB for every unique (send-peer, recv-peer, channel) tuple.
  2. Compute dependency depth. Number of hops a chunk has already traversed; lower depth = scheduled earlier.
  3. Compute reverse dependency depth. Hops remaining; higher reverse depth = closer to the critical path.
  4. Sort. Global topological order via a heap with priority (dep_depth ascending, reverse_dep_depth descending).
  5. Assign. Process instructions in sorted order and append each to the matching threadblock.

The priority depth + reverse_depth is the same family used by classic critical-path schedulers and shows up in the MSCCLang ASPLOS '23 paper (0033) as well.

5.3 Threadblock Constraints

5.4 Parallelization (parallelize(N))

5.5 Deadlock Avoidance


6. GC3 Runtime

6.1 NCCL Integration

6.2 Cooperative Kernel Launch

6.3 Interpreter Pseudocode (Fig. 5)

for (t = 0; t < chunkSize; t += tileSize) {
  for (s = 0; s < N; s++) {
    auto instr = instrs[s];
    if (tid < D)
      wait(semaphore[instr.depBid[tid]], instr.depStep[tid]);
    switch (instr.opCode) {
      case SEND: send(instr.srcPtr + srcOff, count * tileSize); break;
      case RECV: recv(instr.dstPtr + dstOff, count * tileSize); break;
      case REDUCE: reduce(...); break;
      case COPY:   copy(...);   break;
      // and the fused: RRC, RCS, RRCS, RRS
    }
    if (instr.hasDep && tid == 0)
      set(semaphore[bid], s);
  }
}

6.4 Protocols


7. Evaluation

7.1 Methodology

7.2 AllReduce

The paper evaluates three GC3 algorithms against NCCL Ring/Tree:

Algorithm Buffer regime Platform Speedup vs. NCCL
Ring AllReduce (GC3) 32 KB - 3 MB 8x A100 up to 1.9x
All Pairs AllReduce 1 KB - 1 MB 8x A100 up to 1.8x
All Pairs AllReduce small (< few MB) 16x V100 (DGX-2) up to 3.0x
Hierarchical 4-phase medium-large multi-node A100 matches/beats NCCL

7.3 AllToAll

Buffer Platform vs. hand-CUDA vs. NCCL
> 512 MB 16-node, 256x A100 1.3x ~20% (~1.20x)

7.4 AllToNext (Custom Collective)

7.5 End-to-End Workloads

Workload Scale Speedup
LM inference (GPT-3 class) 8x A100 1.22x - 1.29x
MoE training 256x A100 1.10x - 1.89x

7.6 Ablations Reported


8. Limitations


9. Open Problems / Future Work

The authors call out the following directions:

  1. Compute-aware DSL. Extend the language to express compute scheduling alongside communication, enabling fused gradient-production + inter-rank reduction.
  2. Automated tuning of r and tile size. Replace manual exploration with heuristics or learned policies.
  3. Cross-collective fusion. Optimize back-to-back collectives (e.g., AllToAll-AllToAll in MoE) as a unit.
  4. Dynamic / size-adaptive code generation. Today's design uses a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
  5. Synthesis-to-execution coupling. Couple SCCL/TACCL synthesis to GC3's compiler so discovered algorithms inherit fusion + pipelining.

System Type GC3's delta
NCCL / RCCL Vendor library, fixed Ring/Tree GC3 keeps NCCL's transport plumbing but replaces the kernel with a programmable interpreter.
SCCL [Cai et al.] MILP synthesis of collective algorithms GC3 builds on SCCL's algorithmic ideas but adds the low-level optimizations (fusion, parallelization, scheduling) that SCCL omits.
TACCL Topology-Aware Collective Communication Library Similar relationship to SCCL — TACCL discovers; GC3 executes.
BlueConnect [6, 7] / Blink [42] Composed-kernel collectives GC3 fuses multiple steps into a single cooperative kernel, eliminating launch overhead and enabling cross-step recvReduceSend fusion.
MSCCL++ [Microsoft] DPDK-style fine-grained P2P primitives GC3 is the higher-level DSL+compiler; MSCCL++ is the lower-level transport. The two are complementary.
SHArP / SwitchML In-network reduction (hardware) GC3 is software-only on commodity NVIDIA hardware; SHArP requires Mellanox switches.
Horovod / BytePS Framework-level data-parallel schedulers Operate above NCCL; GC3 is the NCCL-replacement layer. Composable.

GC3 explicitly cites SCCL's MILP synthesizer as [4] (Cai et al., "Synthesizing optimal collective algorithms") and positions itself as the missing execution layer for that line of work.


11. Cross-Cutting Empirical Take-Aways

Take-away Derived from
Instruction fusion (rrcs, rrc, rcs, rrs) keeps intermediates register-resident Section 4.3, Section 6.3
Single TB cannot saturate NVLink — parallelize(r) is essential Section 5.4, AllToNext (Sec 7.4)
Hierarchical 4-phase AllReduce wins for multi-node by exploiting NVLink + IB separately Section 2, Section 7.2
All Pairs AllReduce wins at small sizes (latency-bound) Section 7.2, esp. DGX-2 (3.0x)
Bundling sends amortizes alpha (Two-Step AllToAll) Section 5.1 cost model + Section 7.3
Multi-NIC bandwidth is left on the floor by NCCL's AllToAll-derived AllToNext Section 7.4 (14.5x)
LL/LL128/Simple choice is preserved at IR level; LL for small Section 6.4
A 15-line DSL program matches a 70-line hand-tuned CUDA kernel Fig. 8, Section 7

12. Discussion of NCCL


13. Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects, via the NCCL tuner-plugin API, per-collective algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time on HPC GPU clusters. GC3 adds new algorithms beneath the NCCL API, but its design choices and limitations map directly onto DynamICCL's state vector, action space, and exploration priors.

GC3 is the arXiv pre-print of MSCCLang (0033). The implications below build on the MSCCLang relevance section but emphasize what GC3 surfaces more concretely (in particular: parallelize(r), alpha-beta cost model usage, and per-link-class bandwidth exploitation).

GC3 finding DynamICCL design implication
Each compiled GC3-IR is a discrete pre-compiled algorithm Action-space expansion: when GC3/MSCCL is loaded, every IR variant becomes a new entry in DynamICCL's algorithm enum. The 4-way Ring/Tree/CollNet/NVLS choice becomes a topology-conditioned bag of pre-compiled IRs.
parallelize(r): one A100 TB cannot saturate NVLink DynamICCL's nChannels action axis is real and load-dependent. Bias toward nChannels=1-2 at small messages (alpha-bound) and nChannels=4-8 at large messages (beta-bound).
Protocol catalog (Simple/LL/LL128) preserved; LL for small chunks Strong prior: LL for < ~16 KiB, LL128 for ~16 KiB-1 MiB, Simple for >= 1 MiB. Use this as the exploration anchor.
All Pairs wins at small sizes; Ring/Hierarchical at large Action prior on algorithm should be size-conditioned; on multi-node hierarchical fabrics, prefer Hierarchical/CollNet/NVLS.
AllToNext exploits all IB NICs (14.5x) Topology fingerprint must include per-GPU IB-NIC count, not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet).
priority = depth + reverse_depth schedules instructions Coarse proxy for DynamICCL: "is this collective on the critical path of the current step?" — derivable from the recent-collective LSTM window.
Aggregation amortizes alpha (T = alpha + S * beta) DynamICCL's chunkSize action axis is the duals of GC3's Aggregation. Fine chunks at large sizes (saturate beta), coarse at small sizes (amortize alpha).
1.22-1.29x LM inference, 1.10-1.89x MoE training, 14.5x AllToNext Reward shaping: collective wall-clock translates to step-level wins without further normalization. r = -collective_wall_clock_us is the right primary signal.
r and protocol still manually tuned (Limitations) Open problem aligned with DynamICCL — RL-based r/protocol selection is exactly the gap.

Specific design priors:

  1. Exploration prior on (algorithm, protocol):

    • msg < 16 KiB: Tree / All-Pairs-style + LL, nChannels=1, numThreads=128
    • 16 KiB - 1 MiB: Ring or hierarchical + LL128 (RL explores)
    • msg >= 1 MiB: Ring / Hierarchical + Simple, nChannels=4-8, numThreads=512
  2. Exploration prior on (nChannels, numThreads):

    • GC3's parallelize(r) confirms one TB cannot saturate NVLink at large sizes; map r directly to nChannels exploration.
    • At small sizes, extra channels add only setup overhead; cap at 1-2.
  3. Reward shaping:

    • Primary: r = -collective_wall_clock_us
    • Optional secondary: penalize p99 latency to catch tail outliers.
    • GC3's end-to-end results (1.22-1.29x LM inference, 1.10-1.89x MoE) confirm collective-level wins translate to step-level wins linearly.
  4. State features (per GC3's structural insights):

    • Message size (log-binned)
    • Model intensity I = C/D (workload identity feature)
    • Topology fingerprint: NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet — but augmented with per-GPU IB-NIC count and intra-node NVSwitch count (motivated by GC3's AllToNext).
    • Recent-collective timing window (LSTM-encoded)
    • "Loaded GC3-IR variants" identifier vector (which custom IRs are available on this cluster)
  5. Action-space duality (parallelize <-> nChannels; aggregation <-> chunkSize):

    • GC3's parallelize(N) and Aggregation are the same dimensions DynamICCL exposes as nChannels and chunkSize. The RL agent should learn coarse chunks for small messages (effectively aggregation) and fine chunks for large messages (effectively pipelining).
  6. Research positioning:

    • GC3 pre-compiles algorithms and selects from a static IR table at runtime. DynamICCL inverts the locus: it learns which pre-compiled IR (and which NCCL config knobs) to invoke per call, online. The SCCL/TACCL -> GC3 -> DynamICCL stack is natural: synthesize, lower+execute, select.
  7. Open-problem alignment:

    • GC3 open problem #2 (automated tuning of r and tile size) and #4 (dynamic size-adaptive code generation) are precisely DynamICCL's contribution at the selection layer. Given a fixed library of GC3-IR variants, DynamICCL learns the (size x topology x load) -> best-IR mapping without requiring JIT compilation.
    • GC3 open problem #5 (synthesis-to-execution coupling) hints at a unified pipeline: SCCL/TACCL synthesize MILP-optimal algorithms, GC3 lowers and runs them, DynamICCL selects per call. DynamICCL's contribution makes this pipeline closed-loop.
  8. Exploration budget:

    • GC3 compilation is offline (15-60 min). DynamICCL must amortize exploration against a fixed, pre-compiled action set — discrete categorical policy over loaded GC3-IR variants, not parametric continuous over an infinite synthesized space.
    • The agent's exploration budget is therefore bounded by the loaded library size; this is much smaller than NCCL-tuner priors initially suggest, making PPO/DQN tractable.