gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters — Detailed Summary

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur | UC Riverside / Argonne National Lab / Stevens / Iowa / Florida State / UC Merced | ICS 2024

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.

Abstract

GPU-aware collective communication is a major bottleneck on modern HPC platforms because aggregate GPU bandwidth far outpaces typical network bandwidth (100 Gbps Slingshot, etc.).
Naive integration of lossy compression into collectives fails for two reasons: GPUs are underutilized when chunks are small, and uncontrolled error accumulates across reduction rounds.
gZCCL is presented as the first general framework for designing GPU-aware, compression-enabled collectives with explicit accuracy controls.
Evaluation on 512 NVIDIA A100 GPUs achieves up to 4.5x speedup over NCCL on Allreduce and 28.7x over Cray MPI on Scatter, with high-fidelity output.

1. Introduction

High-message collective communication (Allreduce, Scatter, Broadcast) is the dominant performance bottleneck for many exascale-class scientific simulations and large-model deep-learning workloads.
Lossy compression has emerged as a leverage point because bandwidth growth has not kept pace with GPU compute or memory throughput.
Prior compression-augmented collective work (CPRP2P, C-Coll) suffers from CPU-centric data staging and/or unbounded error from fixed-rate compressors.
gZCCL contributes (1) a GPU-centric, fully device-resident pipeline, (2) collective algorithms co-designed with the compressor's GPU-utilization characteristics, (3) error-bound-controlled reduction, and (4) extensive evaluation at up to 512 GPUs.

2. Background and Motivation

Modern collectives (NCCL, Cray MPI, RCCL) typically use ring-based algorithms: a D-byte buffer is split into D/N chunks for an N-process group.
The cuSZp lossy compressor's GPU kernel launch overhead becomes dominant when the per-chunk size D/N drops below ~5 MB; below that point, the GPU is underutilized and any compression benefit is erased.
Prior CPU-staged designs incur device-to-host (D2H) and host-to-device (H2D) transfers that consume up to ~45% of total runtime in compressed-collective pipelines.
Existing GPU-resident compressors include cuSZp (error-bounded) and ZFP (fixed-rate, no error bound). gZCCL chooses cuSZp because absolute-error control is necessary for application correctness.
Ring topology compounds another problem under compression: error propagates through N-1 reduction rounds, so the worst-case error grows linearly with process count.

3. Design and Optimization

3.1 GPU-Centric Pipeline

All buffers (send, receive, intermediate, compressed-staging) are pre-allocated on device at MPI_Init time.
All reductions execute as device kernels — no host involvement except for MPI control-plane calls.
Multi-stream CUDA execution overlaps compression kernels, peer sends/receives, and decompression+reduction kernels.

3.2 Multi-Stream cuSZp

The original cuSZp uses a single default stream. The authors extend the API with cuSZp_compress_stream(... cudaStream_t stream) so that several compression invocations can run concurrently on independent streams.
This is essential for Scatter where the root must compress N per-rank blocks before dispatching them to the binomial tree.

3.3 Two Algorithm Frameworks

3.3.1 Collective Computation Framework (Allreduce)

Standard ring-allreduce performs N-1 reduce-scatter steps + N-1 allgather steps on D/N-sized blocks. Under compression, this means many small kernels (low GPU utilization) and N-1 lossy rounds (large accumulated error).
gZCCL switches to recursive doubling (ReDoub): in each of log N rounds, rank pairs exchange and reduce full-size data blocks. Two benefits:
- Block size per round is the full D, not D/N — GPU utilization is maximized.
- Only log N compression/decompression operations occur — error propagation is logarithmic in N rather than linear.

3.3.2 Collective Data Movement Framework (Scatter / Broadcast)

The root multi-stream-compresses all N per-rank blocks in parallel, packs the compressed blobs into a contiguous staging buffer, and disperses through a binomial-tree routing pattern (log N hops).
This avoids the per-chunk overhead of one-by-one compression and exploits the binomial tree's logarithmic step count.

3.3.3 Algorithm-Selection Heuristic

gZCCL selects Recursive Doubling when D/N is small (typical exascale regime, D/N < ~5 MB) and Ring when chunks are large enough to amortize compressor overhead.

3.4 Pseudocode (gZ-Allreduce ReDoub)

for step s in 0 .. log2(N) - 1:
    peer = rank XOR (1 << s)
    compressed_local = cuSZp_compress_stream(local, ABS_BOUND, stream_a)
    isend(compressed_local -> peer, stream_a)
    irecv(compressed_peer  <- peer, stream_b)
    decompressed_peer = cuSZp_decompress_stream(compressed_peer, stream_b)
    local = reduce_kernel(local, decompressed_peer, stream_b)  // device-only

4. Implementation

Implemented on top of MPICH's Abstract Device Interface, with a compression adapter shim that lets the framework call cuSZp (or potentially ZFP) without changing the collective algorithm code.
Recursive doubling and binomial tree are implemented as new algorithm choices selectable via gZCCL's collective dispatch.
Pre-allocated buffer pool sizes are configurable; default assumes worst-case message size known at MPI_Init.
Multi-stream CUDA orchestration uses CUDA events to enforce dependencies between compression, communication, and reduction stages.

5. Evaluation

5.1 Setup

Cluster: 512 NVIDIA A100 80GB GPUs across 128 nodes (4 GPUs per node).
Interconnect: HPE Slingshot 10 at 100 Gbps.
Baselines: NCCL (Allreduce), Cray MPI (Allreduce, Scatter).
Compressor: cuSZp with absolute-error-bound (ABS) modes 1E-3, 1E-4, 1E-5.
Datasets: RTM (Reverse Time Migration) 3D SEG/EAGE Overthrust seismic model; image-stacking dataset.

5.2 Allreduce Results

gZ-Allreduce (ReDoub) at 512 GPUs:
- 4.5x speedup vs NCCL.
- 20.2x speedup vs Cray MPI.
gZ-Allreduce (Ring) underperforms NCCL for messages < 50 MB (low GPU utilization in small chunks); ReDoub closes this gap.

5.3 Scatter Results

gZ-Scatter at 16 GPUs on 646 MB message: 28.7x over Cray MPI.
Speedup decreases somewhat as message size grows (saturation regime): 20.6x at ~smaller sizes drops to ~17.4x at 600 MB.

5.4 Compression Ratio and Accuracy

RTM dataset compression ratio: 73.35x (Setting 1) and 63.94x (Setting 2) at ABS=1E-4.
PSNR of reconstructed RTM volumes: 55–88 dB across configurations.

5.5 Image-Stacking Application Study

gZ-Allreduce (ReDoub) achieves PSNR 57.80 at ABS=1E-4 — better visual quality than ring-based compressed allreduce due to fewer compression rounds.
End-to-end application speedup: 1.69x over NCCL.

5.6 Sensitivity / Ablations

Error bound 1E-3 → highest compression ratio, lowest fidelity.
Error bound 1E-5 → lowest ratio, highest fidelity.
Ring algorithm scales poorly past ~256 GPUs under compression; ReDoub is the preferred choice at scale.

CPRP2P, C-Coll: predecessor compressed-collective frameworks; both CPU-centric and/or use unbounded fixed-rate compressors.
ZFP, MGARD, SZ family: compression libraries — gZCCL's contribution is at the collective-algorithm layer rather than the compressor itself.
The paper does not address RL-based or autotuning frameworks for collective selection.

7. Conclusion and Future Work

gZCCL is the first general framework integrating GPU-aware lossy compression with co-designed collective algorithms and explicit error control.
Future directions: extending to FPGAs and AI accelerators; integration with additional MPI implementations.

Knobs Exposed by gZCCL (DynamICCL action-space candidates)

Knob	Type	Tested values / role
Compress on/off	Binary	Active vs passthrough
Compressor backend	Enum	cuSZp (only one evaluated)
Error-bound mode	Enum	ABS (REL, PSNR not evaluated)
Absolute error bound	Float	1E-3, 1E-4, 1E-5
Allreduce algorithm	Enum	Ring vs Recursive Doubling
Scatter/Broadcast algorithm	Enum	Binomial Tree
GPU buffer pool size	Int	Pre-allocated at MPI_Init
Compression stream count	Int	Parallelism for multi-stream compress
Algorithm-switch threshold	Float (D/N)	Static rule: ~5 MB

Latency / Bandwidth / Accuracy Trade-Offs

Dimension	Effect
Wall-clock latency	Reduced 4.5x–20.2x (Allreduce); 17.4x–28.7x (Scatter)
Effective bandwidth	Multiplied by compression ratio (up to 73.35x byte savings)
GPU utilization	High under ReDoub (whole-block), low under Ring at small D/N
Numerical error	Bounded by ABS error (1E-3..1E-5); accumulation log N (ReDoub) vs N-1 (Ring)
PSNR (image-stacking)	57.80 dB at ABS=1E-4
PSNR (RTM seismic)	55–88 dB
Tail behavior	Speedup tapers in saturation regime (>~600 MB)

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters. gZCCL expands the action space and the reward formulation in concrete ways.

Direct relevance mapping

gZCCL element	DynamICCL implication
Compress on/off per collective	New binary action dimension
Error bound (ABS=1E-3..1E-5)	New continuous/discrete action dimension
Ring vs Recursive Doubling under compression	Algorithm selector must know about this branch
Binomial Tree for Scatter under compression	Topology selector must include compressed variants
D/N < 5 MB → ReDoub	Static threshold replaceable by learned policy
Multi-stream cuSZp count	Parallelism knob (analogous to nChannels in spirit)
Pre-allocated GPU buffer pool	Constraint to surface to RL policy as memory budget
Application-level PSNR / accuracy	Reward must become multi-objective (latency + fidelity)
Compression ratio per call	New observation feature for state vector
Per-call compressor kernel time	New observation feature, decomposes total latency

Key lessons for DynamICCL

Compression is a first-class action knob, not just an underlying transport detail: gZCCL shows that the choice to compress (and how aggressively) changes which collective algorithm is optimal — they cannot be tuned independently. DynamICCL should treat them as a joint action.
Algorithm choice depends on compression state: Ring is fine without compression but loses to ReDoub when compression is active and D/N is small. DynamICCL's algorithm head must condition on the compression action.
Reward must include accuracy when compression is active: pure latency reward is insufficient. A multi-objective formulation r = -latency - lambda * error or a constraint max -latency s.t. error < epsilon is appropriate.
Static thresholds are exactly what RL replaces: gZCCL's hand-coded D/N < 5 MB rule for switching Ring → ReDoub is a textbook fixed heuristic that DynamICCL can supersede with a learned policy that observes live compressor kernel times.
State features to add to DynamICCL: per-call compression ratio, compressor kernel duration, error-bound in effect, post-collective reconstruction PSNR (when measurable). All are observable and inform future action selection.
Generalization caveat: gZCCL was evaluated only with ABS error bound and only with cuSZp; DynamICCL training would need to either fix that backend or bring multiple compressors into the action space.
gZCCL is complementary, not competing: gZCCL is a static-policy library answering "how to integrate compression"; DynamICCL is a dynamic-policy agent answering "when and how aggressively to use it". Layered together, gZCCL provides the substrate of compressed collective implementations; DynamICCL provides the per-call selector across them.