GC3: An Optimizing Compiler for GPU Collective Communication

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research Asia) | arXiv:2201.11840v3 (Jul 2022) — pre-print of the ASPLOS '23 MSCCLang paper | Code: github.com/microsoft/msccl-tools, github.com/microsoft/msccl

Problem

Communication is now the dominant cost of large-scale ML training. The paper cites DeepLight (~2 GB params) spending 79% of step time in collectives, versus only 3% for ResNet-50 (~100 MB). Vendor libraries (NCCL, RCCL) ship a small handful of fixed algorithm templates — Ring, double-binary Tree — that were hand-tuned for canonical NVLink-symmetric topologies and break down on irregular fabrics: NVSwitch islands with multi-NIC InfiniBand, asymmetric GPU-to-NIC ratios (e.g., 8 GPUs / 8 NICs on Azure NDv4), small-medium message sizes, and novel collective patterns like AllToNext that arise in expert-parallel and pipeline-parallel training. Prior synthesis tools (SCCL, TACCL, Blink) discover better algorithms but emit XML/CUDA without execution-level machinery — instruction fusion, tile-pipelining, threadblock-aware scheduling — so naive translation through multiple kernel launches squanders the algorithmic gains. Practitioners therefore face a productivity cliff: either accept NCCL's fixed templates, or invest weeks of CUDA work to match hand-written-kernel performance for each new (collective, topology, size) triple.

Core Insight

A chunk-oriented Python-embedded DSL plus an optimizing compiler that lowers programs through a Chunk DAG into an Instruction DAG — peephole-fusing point-to-point primitives (recvReduceCopy, recvReduceCopySend, recvCopySend, recvReduceSend) so intermediates stay register-resident, then greedy-scheduling onto threadblocks with priority depth + reverse_depth and a single cooperative-kernel interpreter — lets a 15-line DSL program match or exceed a 70-line hand-tuned CUDA implementation, achieving up to 1.9x AllReduce, 1.3x AllToAll, and 14.5x AllToNext speedups while preserving NCCL's API.

Method

GC3 is a three-layer system: a Python-embedded chunk-routing DSL, an optimizing compiler that produces GC3-IR, and an interpreter-based runtime layered into NCCL 2.8.4-1.

+---------------------------------------------------------------+
|  DSL (Python; chunk-oriented; postcondition-validated)        |
|   chunk(rank, buf, idx, count)                                |
|   c1.copy(dst_rank, dst_buf, dst_idx, ch=...)                 |
|   c1.reduce(c2, ch=...)                                       |
|   parallelize(N)  |  aggregation  |  ch=                      |
+---------------------------------+-----------------------------+
                                  v
+---------------------------------------------------------------+
|  Compiler                                                     |
|   1. Tracing       -> Chunk DAG (true + false deps)           |
|   2. Lowering      -> Instruction DAG (send/recv/reduce/copy) |
|   3. Fusion        -> rrc, rcs, rrcs, rrs (register-resident) |
|   4. Aggregation   -> contiguous chunks merged (amortize a)   |
|   5. Threadblock   -> one TB per (send-peer, recv-peer, ch)   |
|   6. Schedule      -> priority = dep_depth + rev_dep_depth    |
|   7. Sync          -> global-memory semaphores wait/set       |
|                                                               |
|  Output: GC3-IR (per-rank tree of TBs and instructions)       |
+---------------------------------+-----------------------------+
                                  v
+---------------------------------------------------------------+
|  GC3 Runtime (in NCCL 2.8.4-1)                                |
|   Single cudaLaunchCooperativeKernel; tiled outer loop over   |
|   chunkSize, inner loop over the instruction sequence         |
|   Inherits NCCL's Simple / LL / LL128 protocols and P2P       |
|   transports (NVLink, NVSwitch, IB, PCIe)                     |
+---------------------------------------------------------------+

The DSL enforces single-writer chunk semantics through reference invalidation; using a stale chunk reference is a compile-time error. Postconditions on collectives are auto-checked by the compiler. Authors express in GC3 a Ring AllReduce, an All Pairs low-latency AllReduce, a 4-phase Hierarchical AllReduce (intra-RS / inter-RS / inter-AG / intra-AG), a Two-Step AllToAll, and a custom AllToNext (rank i -> rank i+1) that fans across all available IB NICs.

Experimental Setup

Component	Value
Platform A	Azure-class node: 8x NVIDIA A100 (40 GB)
Platform A intra-node	12x 3rd-gen NVLink to 6 NVSwitches; 600 GB/s bi-dir
Platform A inter-node	2x HDR IB NICs at 25 GB/s each (per node)
Platform B	NVIDIA DGX-2: 16x V100 (32 GB)
Platform B intra-node	6x 2nd-gen NVLink to 6 NVSwitches
Platform B inter-node	1x HDR IB NIC per pair of GPUs
Multi-node sweeps	1 / 2 / 3 nodes up to 32 nodes (256x A100)
Software	NCCL 2.8.4-1 base, CUDA 11.x, single-kernel cooperative launch
Solver	none (heuristic compiler; no MILP/SMT)
Iterations	50 timed iterations after 20 warmup rounds
Baselines	NCCL Ring/Tree, hand-optimized CUDA (Two-Step AllToAll), naive composed kernels
Microbenchmarks	AllReduce, AllToAll, AllToNext sweeping buffer size from 1 KB to >512 MB
Workloads	GPT-3-class language model inference, MoE training, BERT
Metrics	Microbench latency (us); end-to-end GPU-time speedup; LOC

Headline Quantitative Results

AllReduce microbenchmark (vs. NCCL):

Algorithm	Buffer regime	Platform	Speedup
Ring AllReduce (GC3)	32 KB - 3 MB	8x A100	up to 1.9x
All Pairs AllReduce	1 KB - 1 MB	8x A100	up to 1.8x
All Pairs AllReduce	small (< few MB)	16x V100 (DGX-2)	up to 3.0x
Hierarchical 4-phase	medium-large	multi-node A100	matches/beats NCCL

AllToAll microbenchmark:

Buffer	Platform	vs. hand-CUDA	vs. NCCL
> 512 MB	16-node, 256x A100	1.3x	~20% (i.e., 1.20x)

AllToNext custom collective:

Platform	vs. hand-CUDA	Source of win
3-node, 24x A100	up to 14.5x	uses all available IB NICs vs. baseline's 1

End-to-end:

Workload	Scale	Speedup
LM inference (GPT-3 class)	8x A100	1.22x - 1.29x
MoE training	256x A100	1.10x - 1.89x

Expressiveness (lines of code):

Algorithm	GC3 DSL	Hand-optimized CUDA
Two-Step AllToAll	15-16 lines (Fig. 8)	~70 lines

Authoring time: 15 to 60 minutes per new algorithm (Section 7).

Cost model: the paper uses the canonical T = alpha + S * beta. Aggregation amortizes alpha by merging contiguous chunks; parallelize(N) splits transfers across N parallel threadblocks to drive bandwidth beta of high-bandwidth NVLink. No hardware-specific alpha/beta numbers are reported.

Limitations

SM-count ceiling. Cooperative-launch kernels require all threadblocks resident concurrently; the number of active TBs is capped by the GPU's SM count. Algorithms with extreme TB demand are infeasible.
Manual tuning of r and protocol. The parallelization factor r and protocol selection (Simple / LL / LL128) still require human exploration; Section 7.4 notes optimal r is workload-dependent.
Static / per-topology compilation. Each GC3-IR is generated for a specific (collective, rank-count, topology) triple. Size-dependent algorithm switching is delegated to a runtime table of pre-compiled IRs, not dynamic recompilation.
No compute-comm fusion. The DSL describes communication only; reduction is the only compute primitive. Overlap with backward-pass compute relies on existing framework-level scheduling (WFBP, etc.).
NVIDIA-centric. All measurements use CUDA 11 + NCCL 2.8.4-1 on V100 / A100; AMD ROCm/RCCL is not evaluated.
No automatic algorithm synthesis. Algorithms are authored by humans — GC3 is an executor of designs, not a discoverer (in contrast to SCCL and TACCL which synthesize MILP-optimal schedules).
NCCL-version coupled. The runtime is forked from NCCL 2.8.4-1; upgrading to NCCL 2.18+ requires re-porting the interpreter.

Open Problems Called Out

Compute-aware extensions. Future work explicitly extends the DSL to express compute scheduling alongside communication so that gradient production and inter-rank reduction can be co-scheduled.
Automated tuning of r and tiling. Heuristics or learned policies for the parallelization factor and tile size remain open.
Cross-collective fusion. Adjacent collectives in a step (e.g., AllToAll-AllToAll in MoE) are compiled independently; program-level fusion is unimplemented.
Dynamic / size-adaptive code generation. Today's design uses a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
Synthesis-to-execution coupling. Coupling SCCL/TACCL synthesis to GC3's compiler so discovered algorithms inherit fusion + pipelining is the natural integration.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects, via the NCCL tuner-plugin API, per-collective algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time on HPC GPU clusters. GC3 sits at a different layer — it is the DSL+compiler+runtime that adds new algorithms into NCCL — but its design choices map directly onto DynamICCL's state vector, action space, and exploration priors. Note: GC3 is the arXiv pre-print of the ASPLOS '23 MSCCLang paper (same authors, same headline numbers), so many of the implications carry over from the 0033 summary; the emphasis below is on what GC3 makes more explicit.

Action-space expansion: each compiled GC3-IR is a discrete action. When GC3 is loaded on the cluster, every pre-compiled IR (Ring AR, All Pairs AR, Hierarchical AR, Two-Step AllToAll, AllToNext, ...) becomes a new entry in DynamICCL's algorithm enum. The 4-way Ring/Tree/CollNet/NVLS choice expands to a topology-conditioned bag of pre-compiled IRs; the agent's job is to learn which entry wins per (size, topology, load) regime — exactly the gap GC3 leaves open.
Action prior on parallelize(N) <-> nChannels / TB count. GC3 explicitly notes that one A100 thread block cannot saturate an NVLink and motivates parallelize(N) to spawn N parallel threadblock pairs. This is the same physical phenomenon as NCCL's nChannels. DynamICCL should bias toward small nChannels (1-2) at small message sizes (latency-bound, alpha-dominated) and toward nChannels ~= 4-8 at large message sizes (bandwidth-bound, beta-dominated), matching GC3's measured r sweep.
Action prior on protocol from buffer size. GC3 confirms NCCL's 3-way protocol family (Simple / LL / LL128) and uses LL to reduce latency for small chunks. DynamICCL's exploration prior: LL for < ~16 KiB, LL128 for ~16 KiB-1 MiB, Simple for >= 1 MiB.
State feature: critical-path proxy. GC3's scheduler computes priority = depth + reverse_depth over the Instruction DAG. DynamICCL cannot observe the IR-level DAG, but a coarse proxy — "is this collective on the critical path of the current step?" — can be derived from the recent-collective LSTM window and used to bias toward latency-optimal vs. bandwidth-optimal actions.
State feature: per-link-class capability vector. GC3's 14.5x AllToNext speedup comes purely from utilizing all available IB NICs per node instead of one. DynamICCL's topology fingerprint must capture per-GPU IB-NIC count (and intra-node NVLink/NVSwitch count), not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) — this is the load-balanced-NIC regime that single-NIC defaults systematically miss.
Reward shaping: collective wall-clock matches GC3's metric. GC3 reports microbench latency (us) and end-to-end training speedup; both are direct functions of collective wall-clock. The 1.22-1.29x LM-inference and 1.10-1.89x MoE-training wins show that collective-level improvements translate to step-level improvements without further normalization, supporting r = -collective_wall_clock_us as the primary signal.
Exploration budget: prefer offline-precomputed IRs to live recompilation. GC3 compilation is offline (15-60 min per algorithm; not a per-call decision). DynamICCL must amortize exploration against a fixed, pre-compiled action set; the policy should be discrete-categorical over loaded GC3-IR variants, not parametric-continuous over any synthesized space.
Research positioning: GC3 generates, DynamICCL selects. GC3 authors (and synthesizers like SCCL/TACCL feeding GC3-IR) populate the algorithm catalog; DynamICCL learns which catalog entry to invoke per call. A joint stack — SCCL/TACCL synthesize, GC3 lowers and executes, DynamICCL picks at runtime — is the natural composition and cleanly positions DynamICCL's contribution as the missing online selector that GC3's "static / per-topology compilation" limitation explicitly leaves open (Limitations, Open Problem #4).
Open-problem alignment: dynamic size-adaptive selection and automated r/tiling tuning. GC3's open problems #2 and #4 (automated tuning of r/tile-size and dynamic size-adaptive code generation) are precisely DynamICCL's contribution at the selection layer. Given a fixed library of GC3-IR variants, DynamICCL learns the (size x topology x load) -> best-IR mapping without requiring JIT compilation.