gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters — Detailed Summary
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur | UC Riverside / Argonne National Lab / Stevens / Iowa / Florida State / UC Merced | ICS 2024
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.
Abstract
- GPU-aware collective communication is a major bottleneck on modern HPC platforms because aggregate GPU bandwidth far outpaces typical network bandwidth (100 Gbps Slingshot, etc.).
- Naive integration of lossy compression into collectives fails for two reasons: GPUs are underutilized when chunks are small, and uncontrolled error accumulates across reduction rounds.
- gZCCL is presented as the first general framework for designing GPU-aware, compression-enabled collectives with explicit accuracy controls.
- Evaluation on 512 NVIDIA A100 GPUs achieves up to 4.5x speedup over NCCL on Allreduce and 28.7x over Cray MPI on Scatter, with high-fidelity output.
1. Introduction
- High-message collective communication (Allreduce, Scatter, Broadcast) is the dominant performance bottleneck for many exascale-class scientific simulations and large-model deep-learning workloads.
- Lossy compression has emerged as a leverage point because bandwidth growth has not kept pace with GPU compute or memory throughput.
- Prior compression-augmented collective work (CPRP2P, C-Coll) suffers from CPU-centric data staging and/or unbounded error from fixed-rate compressors.
- gZCCL contributes (1) a GPU-centric, fully device-resident pipeline, (2) collective algorithms co-designed with the compressor's GPU-utilization characteristics, (3) error-bound-controlled reduction, and (4) extensive evaluation at up to 512 GPUs.
2. Background and Motivation
- Modern collectives (NCCL, Cray MPI, RCCL) typically use ring-based algorithms: a D-byte buffer is split into D/N chunks for an N-process group.
- The cuSZp lossy compressor's GPU kernel launch overhead becomes dominant when the per-chunk size D/N drops below ~5 MB; below that point, the GPU is underutilized and any compression benefit is erased.
- Prior CPU-staged designs incur device-to-host (D2H) and host-to-device (H2D) transfers that consume up to ~45% of total runtime in compressed-collective pipelines.
- Existing GPU-resident compressors include cuSZp (error-bounded) and ZFP (fixed-rate, no error bound). gZCCL chooses cuSZp because absolute-error control is necessary for application correctness.
- Ring topology compounds another problem under compression: error propagates through N-1 reduction rounds, so the worst-case error grows linearly with process count.
3. Design and Optimization
3.1 GPU-Centric Pipeline
- All buffers (send, receive, intermediate, compressed-staging) are pre-allocated on device at MPI_Init time.
- All reductions execute as device kernels — no host involvement except for MPI control-plane calls.
- Multi-stream CUDA execution overlaps compression kernels, peer sends/receives, and decompression+reduction kernels.
3.2 Multi-Stream cuSZp
- The original cuSZp uses a single default stream. The authors extend
the API with
cuSZp_compress_stream(... cudaStream_t stream)so that several compression invocations can run concurrently on independent streams. - This is essential for Scatter where the root must compress N per-rank blocks before dispatching them to the binomial tree.
3.3 Two Algorithm Frameworks
3.3.1 Collective Computation Framework (Allreduce)
- Standard ring-allreduce performs N-1 reduce-scatter steps + N-1 allgather steps on D/N-sized blocks. Under compression, this means many small kernels (low GPU utilization) and N-1 lossy rounds (large accumulated error).
- gZCCL switches to recursive doubling (ReDoub): in
each of log N rounds, rank pairs exchange and reduce full-size data
blocks. Two benefits:
- Block size per round is the full D, not D/N — GPU utilization is maximized.
- Only log N compression/decompression operations occur — error propagation is logarithmic in N rather than linear.
3.3.2 Collective Data Movement Framework (Scatter / Broadcast)
- The root multi-stream-compresses all N per-rank blocks in parallel, packs the compressed blobs into a contiguous staging buffer, and disperses through a binomial-tree routing pattern (log N hops).
- This avoids the per-chunk overhead of one-by-one compression and exploits the binomial tree's logarithmic step count.
3.3.3 Algorithm-Selection Heuristic
- gZCCL selects Recursive Doubling when D/N is small (typical exascale regime, D/N < ~5 MB) and Ring when chunks are large enough to amortize compressor overhead.
3.4 Pseudocode (gZ-Allreduce ReDoub)
for step s in 0 .. log2(N) - 1:
peer = rank XOR (1 << s)
compressed_local = cuSZp_compress_stream(local, ABS_BOUND, stream_a)
isend(compressed_local -> peer, stream_a)
irecv(compressed_peer <- peer, stream_b)
decompressed_peer = cuSZp_decompress_stream(compressed_peer, stream_b)
local = reduce_kernel(local, decompressed_peer, stream_b) // device-only
4. Implementation
- Implemented on top of MPICH's Abstract Device Interface, with a compression adapter shim that lets the framework call cuSZp (or potentially ZFP) without changing the collective algorithm code.
- Recursive doubling and binomial tree are implemented as new algorithm choices selectable via gZCCL's collective dispatch.
- Pre-allocated buffer pool sizes are configurable; default assumes worst-case message size known at MPI_Init.
- Multi-stream CUDA orchestration uses CUDA events to enforce dependencies between compression, communication, and reduction stages.
5. Evaluation
5.1 Setup
- Cluster: 512 NVIDIA A100 80GB GPUs across 128 nodes (4 GPUs per node).
- Interconnect: HPE Slingshot 10 at 100 Gbps.
- Baselines: NCCL (Allreduce), Cray MPI (Allreduce, Scatter).
- Compressor: cuSZp with absolute-error-bound (ABS) modes 1E-3, 1E-4, 1E-5.
- Datasets: RTM (Reverse Time Migration) 3D SEG/EAGE Overthrust seismic model; image-stacking dataset.
5.2 Allreduce Results
- gZ-Allreduce (ReDoub) at 512 GPUs:
- 4.5x speedup vs NCCL.
- 20.2x speedup vs Cray MPI.
- gZ-Allreduce (Ring) underperforms NCCL for messages < 50 MB (low GPU utilization in small chunks); ReDoub closes this gap.
5.3 Scatter Results
- gZ-Scatter at 16 GPUs on 646 MB message: 28.7x over Cray MPI.
- Speedup decreases somewhat as message size grows (saturation regime): 20.6x at ~smaller sizes drops to ~17.4x at 600 MB.
5.4 Compression Ratio and Accuracy
- RTM dataset compression ratio: 73.35x (Setting 1) and 63.94x (Setting 2) at ABS=1E-4.
- PSNR of reconstructed RTM volumes: 55–88 dB across configurations.
5.5 Image-Stacking Application Study
- gZ-Allreduce (ReDoub) achieves PSNR 57.80 at ABS=1E-4 — better visual quality than ring-based compressed allreduce due to fewer compression rounds.
- End-to-end application speedup: 1.69x over NCCL.
5.6 Sensitivity / Ablations
- Error bound 1E-3 → highest compression ratio, lowest fidelity.
- Error bound 1E-5 → lowest ratio, highest fidelity.
- Ring algorithm scales poorly past ~256 GPUs under compression; ReDoub is the preferred choice at scale.
6. Related Work
- CPRP2P, C-Coll: predecessor compressed-collective frameworks; both CPU-centric and/or use unbounded fixed-rate compressors.
- ZFP, MGARD, SZ family: compression libraries — gZCCL's contribution is at the collective-algorithm layer rather than the compressor itself.
- The paper does not address RL-based or autotuning frameworks for collective selection.
7. Conclusion and Future Work
- gZCCL is the first general framework integrating GPU-aware lossy compression with co-designed collective algorithms and explicit error control.
- Future directions: extending to FPGAs and AI accelerators; integration with additional MPI implementations.
Knobs Exposed by gZCCL (DynamICCL action-space candidates)
| Knob | Type | Tested values / role |
|---|---|---|
| Compress on/off | Binary | Active vs passthrough |
| Compressor backend | Enum | cuSZp (only one evaluated) |
| Error-bound mode | Enum | ABS (REL, PSNR not evaluated) |
| Absolute error bound | Float | 1E-3, 1E-4, 1E-5 |
| Allreduce algorithm | Enum | Ring vs Recursive Doubling |
| Scatter/Broadcast algorithm | Enum | Binomial Tree |
| GPU buffer pool size | Int | Pre-allocated at MPI_Init |
| Compression stream count | Int | Parallelism for multi-stream compress |
| Algorithm-switch threshold | Float (D/N) | Static rule: ~5 MB |
Latency / Bandwidth / Accuracy Trade-Offs
| Dimension | Effect |
|---|---|
| Wall-clock latency | Reduced 4.5x–20.2x (Allreduce); 17.4x–28.7x (Scatter) |
| Effective bandwidth | Multiplied by compression ratio (up to 73.35x byte savings) |
| GPU utilization | High under ReDoub (whole-block), low under Ring at small D/N |
| Numerical error | Bounded by ABS error (1E-3..1E-5); accumulation log N (ReDoub) vs N-1 (Ring) |
| PSNR (image-stacking) | 57.80 dB at ABS=1E-4 |
| PSNR (RTM seismic) | 55–88 dB |
| Tail behavior | Speedup tapers in saturation regime (>~600 MB) |
Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters. gZCCL expands the action space and the reward formulation in concrete ways.
Direct relevance mapping
| gZCCL element | DynamICCL implication |
|---|---|
| Compress on/off per collective | New binary action dimension |
| Error bound (ABS=1E-3..1E-5) | New continuous/discrete action dimension |
| Ring vs Recursive Doubling under compression | Algorithm selector must know about this branch |
| Binomial Tree for Scatter under compression | Topology selector must include compressed variants |
| D/N < 5 MB → ReDoub | Static threshold replaceable by learned policy |
| Multi-stream cuSZp count | Parallelism knob (analogous to nChannels in spirit) |
| Pre-allocated GPU buffer pool | Constraint to surface to RL policy as memory budget |
| Application-level PSNR / accuracy | Reward must become multi-objective (latency + fidelity) |
| Compression ratio per call | New observation feature for state vector |
| Per-call compressor kernel time | New observation feature, decomposes total latency |
Key lessons for DynamICCL
- Compression is a first-class action knob, not just an underlying transport detail: gZCCL shows that the choice to compress (and how aggressively) changes which collective algorithm is optimal — they cannot be tuned independently. DynamICCL should treat them as a joint action.
- Algorithm choice depends on compression state: Ring is fine without compression but loses to ReDoub when compression is active and D/N is small. DynamICCL's algorithm head must condition on the compression action.
- Reward must include accuracy when compression is active: pure latency reward is insufficient. A multi-objective formulation r = -latency - lambda * error or a constraint max -latency s.t. error < epsilon is appropriate.
- Static thresholds are exactly what RL replaces: gZCCL's hand-coded D/N < 5 MB rule for switching Ring → ReDoub is a textbook fixed heuristic that DynamICCL can supersede with a learned policy that observes live compressor kernel times.
- State features to add to DynamICCL: per-call compression ratio, compressor kernel duration, error-bound in effect, post-collective reconstruction PSNR (when measurable). All are observable and inform future action selection.
- Generalization caveat: gZCCL was evaluated only with ABS error bound and only with cuSZp; DynamICCL training would need to either fix that backend or bring multiple compressors into the action space.
- gZCCL is complementary, not competing: gZCCL is a static-policy library answering "how to integrate compression"; DynamICCL is a dynamic-policy agent answering "when and how aggressively to use it". Layered together, gZCCL provides the substrate of compressed collective implementations; DynamICCL provides the per-call selector across them.