GC3: An Optimizing Compiler for GPU Collective Communication
Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research Asia) | arXiv:2201.11840v3 (Jul 2022) — pre-print of the ASPLOS '23 MSCCLang paper | Code: github.com/microsoft/msccl-tools, github.com/microsoft/msccl
Problem
Communication is now the dominant cost of large-scale ML training. The paper cites DeepLight (~2 GB params) spending 79% of step time in collectives, versus only 3% for ResNet-50 (~100 MB). Vendor libraries (NCCL, RCCL) ship a small handful of fixed algorithm templates — Ring, double-binary Tree — that were hand-tuned for canonical NVLink-symmetric topologies and break down on irregular fabrics: NVSwitch islands with multi-NIC InfiniBand, asymmetric GPU-to-NIC ratios (e.g., 8 GPUs / 8 NICs on Azure NDv4), small-medium message sizes, and novel collective patterns like AllToNext that arise in expert-parallel and pipeline-parallel training. Prior synthesis tools (SCCL, TACCL, Blink) discover better algorithms but emit XML/CUDA without execution-level machinery — instruction fusion, tile-pipelining, threadblock-aware scheduling — so naive translation through multiple kernel launches squanders the algorithmic gains. Practitioners therefore face a productivity cliff: either accept NCCL's fixed templates, or invest weeks of CUDA work to match hand-written-kernel performance for each new (collective, topology, size) triple.
Core Insight
A chunk-oriented Python-embedded DSL plus an optimizing compiler that
lowers programs through a Chunk DAG into an Instruction DAG —
peephole-fusing point-to-point primitives (recvReduceCopy,
recvReduceCopySend, recvCopySend,
recvReduceSend) so intermediates stay register-resident,
then greedy-scheduling onto threadblocks with priority
depth + reverse_depth and a single cooperative-kernel
interpreter — lets a 15-line DSL program match or exceed a 70-line
hand-tuned CUDA implementation, achieving up to 1.9x AllReduce, 1.3x
AllToAll, and 14.5x AllToNext speedups while preserving NCCL's API.
Method
GC3 is a three-layer system: a Python-embedded chunk-routing DSL, an optimizing compiler that produces GC3-IR, and an interpreter-based runtime layered into NCCL 2.8.4-1.
+---------------------------------------------------------------+
| DSL (Python; chunk-oriented; postcondition-validated) |
| chunk(rank, buf, idx, count) |
| c1.copy(dst_rank, dst_buf, dst_idx, ch=...) |
| c1.reduce(c2, ch=...) |
| parallelize(N) | aggregation | ch= |
+---------------------------------+-----------------------------+
v
+---------------------------------------------------------------+
| Compiler |
| 1. Tracing -> Chunk DAG (true + false deps) |
| 2. Lowering -> Instruction DAG (send/recv/reduce/copy) |
| 3. Fusion -> rrc, rcs, rrcs, rrs (register-resident) |
| 4. Aggregation -> contiguous chunks merged (amortize a) |
| 5. Threadblock -> one TB per (send-peer, recv-peer, ch) |
| 6. Schedule -> priority = dep_depth + rev_dep_depth |
| 7. Sync -> global-memory semaphores wait/set |
| |
| Output: GC3-IR (per-rank tree of TBs and instructions) |
+---------------------------------+-----------------------------+
v
+---------------------------------------------------------------+
| GC3 Runtime (in NCCL 2.8.4-1) |
| Single cudaLaunchCooperativeKernel; tiled outer loop over |
| chunkSize, inner loop over the instruction sequence |
| Inherits NCCL's Simple / LL / LL128 protocols and P2P |
| transports (NVLink, NVSwitch, IB, PCIe) |
+---------------------------------------------------------------+
The DSL enforces single-writer chunk semantics through reference
invalidation; using a stale chunk reference is a compile-time error.
Postconditions on collectives are auto-checked by the compiler. Authors
express in GC3 a Ring AllReduce, an All Pairs low-latency
AllReduce, a 4-phase Hierarchical AllReduce (intra-RS / inter-RS /
inter-AG / intra-AG), a Two-Step AllToAll, and a custom AllToNext (rank
i -> rank i+1) that fans across all
available IB NICs.
Experimental Setup
| Component | Value |
|---|---|
| Platform A | Azure-class node: 8x NVIDIA A100 (40 GB) |
| Platform A intra-node | 12x 3rd-gen NVLink to 6 NVSwitches; 600 GB/s bi-dir |
| Platform A inter-node | 2x HDR IB NICs at 25 GB/s each (per node) |
| Platform B | NVIDIA DGX-2: 16x V100 (32 GB) |
| Platform B intra-node | 6x 2nd-gen NVLink to 6 NVSwitches |
| Platform B inter-node | 1x HDR IB NIC per pair of GPUs |
| Multi-node sweeps | 1 / 2 / 3 nodes up to 32 nodes (256x A100) |
| Software | NCCL 2.8.4-1 base, CUDA 11.x, single-kernel cooperative launch |
| Solver | none (heuristic compiler; no MILP/SMT) |
| Iterations | 50 timed iterations after 20 warmup rounds |
| Baselines | NCCL Ring/Tree, hand-optimized CUDA (Two-Step AllToAll), naive composed kernels |
| Microbenchmarks | AllReduce, AllToAll, AllToNext sweeping buffer size from 1 KB to >512 MB |
| Workloads | GPT-3-class language model inference, MoE training, BERT |
| Metrics | Microbench latency (us); end-to-end GPU-time speedup; LOC |
Headline Quantitative Results
AllReduce microbenchmark (vs. NCCL):
| Algorithm | Buffer regime | Platform | Speedup |
|---|---|---|---|
| Ring AllReduce (GC3) | 32 KB - 3 MB | 8x A100 | up to 1.9x |
| All Pairs AllReduce | 1 KB - 1 MB | 8x A100 | up to 1.8x |
| All Pairs AllReduce | small (< few MB) | 16x V100 (DGX-2) | up to 3.0x |
| Hierarchical 4-phase | medium-large | multi-node A100 | matches/beats NCCL |
AllToAll microbenchmark:
| Buffer | Platform | vs. hand-CUDA | vs. NCCL |
|---|---|---|---|
| > 512 MB | 16-node, 256x A100 | 1.3x | ~20% (i.e., 1.20x) |
AllToNext custom collective:
| Platform | vs. hand-CUDA | Source of win |
|---|---|---|
| 3-node, 24x A100 | up to 14.5x | uses all available IB NICs vs. baseline's 1 |
End-to-end:
| Workload | Scale | Speedup |
|---|---|---|
| LM inference (GPT-3 class) | 8x A100 | 1.22x - 1.29x |
| MoE training | 256x A100 | 1.10x - 1.89x |
Expressiveness (lines of code):
| Algorithm | GC3 DSL | Hand-optimized CUDA |
|---|---|---|
| Two-Step AllToAll | 15-16 lines (Fig. 8) | ~70 lines |
Authoring time: 15 to 60 minutes per new algorithm (Section 7).
Cost model: the paper uses the canonical
T = alpha + S * beta. Aggregation amortizes alpha by
merging contiguous chunks; parallelize(N) splits transfers across
N parallel threadblocks to drive bandwidth
beta of high-bandwidth NVLink. No hardware-specific
alpha/beta numbers are reported.
Limitations
- SM-count ceiling. Cooperative-launch kernels require all threadblocks resident concurrently; the number of active TBs is capped by the GPU's SM count. Algorithms with extreme TB demand are infeasible.
- Manual tuning of
rand protocol. The parallelization factorrand protocol selection (Simple / LL / LL128) still require human exploration; Section 7.4 notes optimalris workload-dependent. - Static / per-topology compilation. Each GC3-IR is generated for a specific (collective, rank-count, topology) triple. Size-dependent algorithm switching is delegated to a runtime table of pre-compiled IRs, not dynamic recompilation.
- No compute-comm fusion. The DSL describes communication only; reduction is the only compute primitive. Overlap with backward-pass compute relies on existing framework-level scheduling (WFBP, etc.).
- NVIDIA-centric. All measurements use CUDA 11 + NCCL 2.8.4-1 on V100 / A100; AMD ROCm/RCCL is not evaluated.
- No automatic algorithm synthesis. Algorithms are authored by humans — GC3 is an executor of designs, not a discoverer (in contrast to SCCL and TACCL which synthesize MILP-optimal schedules).
- NCCL-version coupled. The runtime is forked from NCCL 2.8.4-1; upgrading to NCCL 2.18+ requires re-porting the interpreter.
Open Problems Called Out
- Compute-aware extensions. Future work explicitly extends the DSL to express compute scheduling alongside communication so that gradient production and inter-rank reduction can be co-scheduled.
- Automated tuning of
rand tiling. Heuristics or learned policies for the parallelization factor and tile size remain open. - Cross-collective fusion. Adjacent collectives in a step (e.g., AllToAll-AllToAll in MoE) are compiled independently; program-level fusion is unimplemented.
- Dynamic / size-adaptive code generation. Today's design uses a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
- Synthesis-to-execution coupling. Coupling SCCL/TACCL synthesis to GC3's compiler so discovered algorithms inherit fusion + pipelining is the natural integration.
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects,
via the NCCL tuner-plugin API, per-collective algorithm
(Ring / Tree / CollNet / NVLS), protocol (LL / LL128 /
Simple), nChannels, numThreads, and
chunkSize to minimize collective wall-clock time on HPC GPU
clusters. GC3 sits at a different layer — it is the DSL+compiler+runtime
that adds new algorithms into NCCL — but its design choices map
directly onto DynamICCL's state vector, action space, and exploration
priors. Note: GC3 is the arXiv pre-print of the ASPLOS '23 MSCCLang
paper (same authors, same headline numbers), so many of the implications
carry over from the 0033 summary; the emphasis below is on what GC3
makes more explicit.
Action-space expansion: each compiled GC3-IR is a discrete action. When GC3 is loaded on the cluster, every pre-compiled IR (Ring AR, All Pairs AR, Hierarchical AR, Two-Step AllToAll, AllToNext, ...) becomes a new entry in DynamICCL's
algorithmenum. The 4-way Ring/Tree/CollNet/NVLS choice expands to a topology-conditioned bag of pre-compiled IRs; the agent's job is to learn which entry wins per (size, topology, load) regime — exactly the gap GC3 leaves open.Action prior on
parallelize(N)<->nChannels/ TB count. GC3 explicitly notes that one A100 thread block cannot saturate an NVLink and motivatesparallelize(N)to spawnNparallel threadblock pairs. This is the same physical phenomenon as NCCL'snChannels. DynamICCL should bias toward smallnChannels(1-2) at small message sizes (latency-bound, alpha-dominated) and towardnChannels~= 4-8 at large message sizes (bandwidth-bound, beta-dominated), matching GC3's measuredrsweep.Action prior on protocol from buffer size. GC3 confirms NCCL's 3-way protocol family (Simple / LL / LL128) and uses LL to reduce latency for small chunks. DynamICCL's exploration prior: LL for < ~16 KiB, LL128 for ~16 KiB-1 MiB, Simple for >= 1 MiB.
State feature: critical-path proxy. GC3's scheduler computes
priority = depth + reverse_depthover the Instruction DAG. DynamICCL cannot observe the IR-level DAG, but a coarse proxy — "is this collective on the critical path of the current step?" — can be derived from the recent-collective LSTM window and used to bias toward latency-optimal vs. bandwidth-optimal actions.State feature: per-link-class capability vector. GC3's 14.5x AllToNext speedup comes purely from utilizing all available IB NICs per node instead of one. DynamICCL's topology fingerprint must capture per-GPU IB-NIC count (and intra-node NVLink/NVSwitch count), not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) — this is the load-balanced-NIC regime that single-NIC defaults systematically miss.
Reward shaping: collective wall-clock matches GC3's metric. GC3 reports microbench latency (us) and end-to-end training speedup; both are direct functions of collective wall-clock. The 1.22-1.29x LM-inference and 1.10-1.89x MoE-training wins show that collective-level improvements translate to step-level improvements without further normalization, supporting
r = -collective_wall_clock_usas the primary signal.Exploration budget: prefer offline-precomputed IRs to live recompilation. GC3 compilation is offline (15-60 min per algorithm; not a per-call decision). DynamICCL must amortize exploration against a fixed, pre-compiled action set; the policy should be discrete-categorical over loaded GC3-IR variants, not parametric-continuous over any synthesized space.
Research positioning: GC3 generates, DynamICCL selects. GC3 authors (and synthesizers like SCCL/TACCL feeding GC3-IR) populate the algorithm catalog; DynamICCL learns which catalog entry to invoke per call. A joint stack — SCCL/TACCL synthesize, GC3 lowers and executes, DynamICCL picks at runtime — is the natural composition and cleanly positions DynamICCL's contribution as the missing online selector that GC3's "static / per-topology compilation" limitation explicitly leaves open (Limitations, Open Problem #4).
Open-problem alignment: dynamic size-adaptive selection and automated
r/tiling tuning. GC3's open problems #2 and #4 (automated tuning ofr/tile-size and dynamic size-adaptive code generation) are precisely DynamICCL's contribution at the selection layer. Given a fixed library of GC3-IR variants, DynamICCL learns the (size x topology x load) -> best-IR mapping without requiring JIT compilation.