GC3: An Optimizing Compiler for GPU Collective Communication

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi (Microsoft Research, Redmond), Yifan Xiong (Microsoft Research Asia) | arXiv:2201.11840v3 (Jul 2022) — pre-print of the ASPLOS '23 MSCCLang paper | Code: github.com/microsoft/msccl-tools, github.com/microsoft/msccl


Problem

Communication is now the dominant cost of large-scale ML training. The paper cites DeepLight (~2 GB params) spending 79% of step time in collectives, versus only 3% for ResNet-50 (~100 MB). Vendor libraries (NCCL, RCCL) ship a small handful of fixed algorithm templates — Ring, double-binary Tree — that were hand-tuned for canonical NVLink-symmetric topologies and break down on irregular fabrics: NVSwitch islands with multi-NIC InfiniBand, asymmetric GPU-to-NIC ratios (e.g., 8 GPUs / 8 NICs on Azure NDv4), small-medium message sizes, and novel collective patterns like AllToNext that arise in expert-parallel and pipeline-parallel training. Prior synthesis tools (SCCL, TACCL, Blink) discover better algorithms but emit XML/CUDA without execution-level machinery — instruction fusion, tile-pipelining, threadblock-aware scheduling — so naive translation through multiple kernel launches squanders the algorithmic gains. Practitioners therefore face a productivity cliff: either accept NCCL's fixed templates, or invest weeks of CUDA work to match hand-written-kernel performance for each new (collective, topology, size) triple.


Core Insight

A chunk-oriented Python-embedded DSL plus an optimizing compiler that lowers programs through a Chunk DAG into an Instruction DAG — peephole-fusing point-to-point primitives (recvReduceCopy, recvReduceCopySend, recvCopySend, recvReduceSend) so intermediates stay register-resident, then greedy-scheduling onto threadblocks with priority depth + reverse_depth and a single cooperative-kernel interpreter — lets a 15-line DSL program match or exceed a 70-line hand-tuned CUDA implementation, achieving up to 1.9x AllReduce, 1.3x AllToAll, and 14.5x AllToNext speedups while preserving NCCL's API.


Method

GC3 is a three-layer system: a Python-embedded chunk-routing DSL, an optimizing compiler that produces GC3-IR, and an interpreter-based runtime layered into NCCL 2.8.4-1.

+---------------------------------------------------------------+
|  DSL (Python; chunk-oriented; postcondition-validated)        |
|   chunk(rank, buf, idx, count)                                |
|   c1.copy(dst_rank, dst_buf, dst_idx, ch=...)                 |
|   c1.reduce(c2, ch=...)                                       |
|   parallelize(N)  |  aggregation  |  ch=                      |
+---------------------------------+-----------------------------+
                                  v
+---------------------------------------------------------------+
|  Compiler                                                     |
|   1. Tracing       -> Chunk DAG (true + false deps)           |
|   2. Lowering      -> Instruction DAG (send/recv/reduce/copy) |
|   3. Fusion        -> rrc, rcs, rrcs, rrs (register-resident) |
|   4. Aggregation   -> contiguous chunks merged (amortize a)   |
|   5. Threadblock   -> one TB per (send-peer, recv-peer, ch)   |
|   6. Schedule      -> priority = dep_depth + rev_dep_depth    |
|   7. Sync          -> global-memory semaphores wait/set       |
|                                                               |
|  Output: GC3-IR (per-rank tree of TBs and instructions)       |
+---------------------------------+-----------------------------+
                                  v
+---------------------------------------------------------------+
|  GC3 Runtime (in NCCL 2.8.4-1)                                |
|   Single cudaLaunchCooperativeKernel; tiled outer loop over   |
|   chunkSize, inner loop over the instruction sequence         |
|   Inherits NCCL's Simple / LL / LL128 protocols and P2P       |
|   transports (NVLink, NVSwitch, IB, PCIe)                     |
+---------------------------------------------------------------+

The DSL enforces single-writer chunk semantics through reference invalidation; using a stale chunk reference is a compile-time error. Postconditions on collectives are auto-checked by the compiler. Authors express in GC3 a Ring AllReduce, an All Pairs low-latency AllReduce, a 4-phase Hierarchical AllReduce (intra-RS / inter-RS / inter-AG / intra-AG), a Two-Step AllToAll, and a custom AllToNext (rank i -> rank i+1) that fans across all available IB NICs.


Experimental Setup

Component Value
Platform A Azure-class node: 8x NVIDIA A100 (40 GB)
Platform A intra-node 12x 3rd-gen NVLink to 6 NVSwitches; 600 GB/s bi-dir
Platform A inter-node 2x HDR IB NICs at 25 GB/s each (per node)
Platform B NVIDIA DGX-2: 16x V100 (32 GB)
Platform B intra-node 6x 2nd-gen NVLink to 6 NVSwitches
Platform B inter-node 1x HDR IB NIC per pair of GPUs
Multi-node sweeps 1 / 2 / 3 nodes up to 32 nodes (256x A100)
Software NCCL 2.8.4-1 base, CUDA 11.x, single-kernel cooperative launch
Solver none (heuristic compiler; no MILP/SMT)
Iterations 50 timed iterations after 20 warmup rounds
Baselines NCCL Ring/Tree, hand-optimized CUDA (Two-Step AllToAll), naive composed kernels
Microbenchmarks AllReduce, AllToAll, AllToNext sweeping buffer size from 1 KB to >512 MB
Workloads GPT-3-class language model inference, MoE training, BERT
Metrics Microbench latency (us); end-to-end GPU-time speedup; LOC

Headline Quantitative Results

AllReduce microbenchmark (vs. NCCL):

Algorithm Buffer regime Platform Speedup
Ring AllReduce (GC3) 32 KB - 3 MB 8x A100 up to 1.9x
All Pairs AllReduce 1 KB - 1 MB 8x A100 up to 1.8x
All Pairs AllReduce small (< few MB) 16x V100 (DGX-2) up to 3.0x
Hierarchical 4-phase medium-large multi-node A100 matches/beats NCCL

AllToAll microbenchmark:

Buffer Platform vs. hand-CUDA vs. NCCL
> 512 MB 16-node, 256x A100 1.3x ~20% (i.e., 1.20x)

AllToNext custom collective:

Platform vs. hand-CUDA Source of win
3-node, 24x A100 up to 14.5x uses all available IB NICs vs. baseline's 1

End-to-end:

Workload Scale Speedup
LM inference (GPT-3 class) 8x A100 1.22x - 1.29x
MoE training 256x A100 1.10x - 1.89x

Expressiveness (lines of code):

Algorithm GC3 DSL Hand-optimized CUDA
Two-Step AllToAll 15-16 lines (Fig. 8) ~70 lines

Authoring time: 15 to 60 minutes per new algorithm (Section 7).

Cost model: the paper uses the canonical T = alpha + S * beta. Aggregation amortizes alpha by merging contiguous chunks; parallelize(N) splits transfers across N parallel threadblocks to drive bandwidth beta of high-bandwidth NVLink. No hardware-specific alpha/beta numbers are reported.


Limitations


Open Problems Called Out

  1. Compute-aware extensions. Future work explicitly extends the DSL to express compute scheduling alongside communication so that gradient production and inter-rank reduction can be co-scheduled.
  2. Automated tuning of r and tiling. Heuristics or learned policies for the parallelization factor and tile size remain open.
  3. Cross-collective fusion. Adjacent collectives in a step (e.g., AllToAll-AllToAll in MoE) are compiled independently; program-level fusion is unimplemented.
  4. Dynamic / size-adaptive code generation. Today's design uses a discrete table of pre-compiled IRs; an online generator that JIT-specializes for observed message-size distributions is open.
  5. Synthesis-to-execution coupling. Coupling SCCL/TACCL synthesis to GC3's compiler so discovered algorithms inherit fusion + pipelining is the natural integration.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects, via the NCCL tuner-plugin API, per-collective algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time on HPC GPU clusters. GC3 sits at a different layer — it is the DSL+compiler+runtime that adds new algorithms into NCCL — but its design choices map directly onto DynamICCL's state vector, action space, and exploration priors. Note: GC3 is the arXiv pre-print of the ASPLOS '23 MSCCLang paper (same authors, same headline numbers), so many of the implications carry over from the 0033 summary; the emphasis below is on what GC3 makes more explicit.

  1. Action-space expansion: each compiled GC3-IR is a discrete action. When GC3 is loaded on the cluster, every pre-compiled IR (Ring AR, All Pairs AR, Hierarchical AR, Two-Step AllToAll, AllToNext, ...) becomes a new entry in DynamICCL's algorithm enum. The 4-way Ring/Tree/CollNet/NVLS choice expands to a topology-conditioned bag of pre-compiled IRs; the agent's job is to learn which entry wins per (size, topology, load) regime — exactly the gap GC3 leaves open.

  2. Action prior on parallelize(N) <-> nChannels / TB count. GC3 explicitly notes that one A100 thread block cannot saturate an NVLink and motivates parallelize(N) to spawn N parallel threadblock pairs. This is the same physical phenomenon as NCCL's nChannels. DynamICCL should bias toward small nChannels (1-2) at small message sizes (latency-bound, alpha-dominated) and toward nChannels ~= 4-8 at large message sizes (bandwidth-bound, beta-dominated), matching GC3's measured r sweep.

  3. Action prior on protocol from buffer size. GC3 confirms NCCL's 3-way protocol family (Simple / LL / LL128) and uses LL to reduce latency for small chunks. DynamICCL's exploration prior: LL for < ~16 KiB, LL128 for ~16 KiB-1 MiB, Simple for >= 1 MiB.

  4. State feature: critical-path proxy. GC3's scheduler computes priority = depth + reverse_depth over the Instruction DAG. DynamICCL cannot observe the IR-level DAG, but a coarse proxy — "is this collective on the critical path of the current step?" — can be derived from the recent-collective LSTM window and used to bias toward latency-optimal vs. bandwidth-optimal actions.

  5. State feature: per-link-class capability vector. GC3's 14.5x AllToNext speedup comes purely from utilizing all available IB NICs per node instead of one. DynamICCL's topology fingerprint must capture per-GPU IB-NIC count (and intra-node NVLink/NVSwitch count), not just the four coarse classes (NVLink-only / NVLink+PCIe / PCIe+IB / Ethernet) — this is the load-balanced-NIC regime that single-NIC defaults systematically miss.

  6. Reward shaping: collective wall-clock matches GC3's metric. GC3 reports microbench latency (us) and end-to-end training speedup; both are direct functions of collective wall-clock. The 1.22-1.29x LM-inference and 1.10-1.89x MoE-training wins show that collective-level improvements translate to step-level improvements without further normalization, supporting r = -collective_wall_clock_us as the primary signal.

  7. Exploration budget: prefer offline-precomputed IRs to live recompilation. GC3 compilation is offline (15-60 min per algorithm; not a per-call decision). DynamICCL must amortize exploration against a fixed, pre-compiled action set; the policy should be discrete-categorical over loaded GC3-IR variants, not parametric-continuous over any synthesized space.

  8. Research positioning: GC3 generates, DynamICCL selects. GC3 authors (and synthesizers like SCCL/TACCL feeding GC3-IR) populate the algorithm catalog; DynamICCL learns which catalog entry to invoke per call. A joint stack — SCCL/TACCL synthesize, GC3 lowers and executes, DynamICCL picks at runtime — is the natural composition and cleanly positions DynamICCL's contribution as the missing online selector that GC3's "static / per-topology compilation" limitation explicitly leaves open (Limitations, Open Problem #4).

  9. Open-problem alignment: dynamic size-adaptive selection and automated r/tiling tuning. GC3's open problems #2 and #4 (automated tuning of r/tile-size and dynamic size-adaptive code generation) are precisely DynamICCL's contribution at the selection layer. Given a fixed library of GC3-IR variants, DynamICCL learns the (size x topology x load) -> best-IR mapping without requiring JIT compilation.