gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur | UC Riverside / Argonne National Lab / Stevens / Iowa / Florida State / UC Merced | ICS 2024
Problem
GPU-aware collective communication (Allreduce, Scatter, Broadcast) over modest HPC interconnects such as HPE Slingshot 10 (100 Gbps) is bottlenecked by network bandwidth as message sizes grow into the hundreds of MB. Lossy compression is the obvious lever to shrink wire traffic, but two prior obstacles have blocked its adoption inside collective libraries: (1) existing compression-augmented collectives stage data through the host, so device-to-host transfers consume up to ~45% of runtime; (2) classical ring algorithms split a D-byte buffer into D/N chunks across N processes, and when D/N falls below ~5 MB the GPU lossy-compressor (cuSZp) kernel-launch overhead dominates and the GPU is severely underutilized. The result is that prior compressed-collective designs either lose to NCCL on small/medium messages or accumulate unbounded numerical error across N-1 reduction rounds.
Core Insight
Co-design the compression library with the collective algorithm itself: switch from ring topology to recursive-doubling for Allreduce so each round operates on the full data block (high GPU utilization) and so error propagates over only log N rounds instead of N-1. Combine this with a fully GPU-resident pipeline (pre-allocated device buffers, multi-stream cuSZp, device-only reduction) that removes host-staging overhead.
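To make the round-count argument concrete, the sketch below counts how many lossy exchange rounds each topology needs. It assumes a power-of-two process count and uses the N-1 figure quoted above for the ring; the function names are illustrative and not part of gZCCL.

```python
def ring_reduction_rounds(n_procs: int) -> int:
    """Ring reduction rounds as cited in the text above: N - 1."""
    return n_procs - 1


def recursive_doubling_rounds(n_procs: int) -> int:
    """Recursive-doubling rounds: log2(N), assuming N is a power of two."""
    if n_procs < 1 or n_procs & (n_procs - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    return n_procs.bit_length() - 1


for n in (8, 64, 512):
    print(n, ring_reduction_rounds(n), recursive_doubling_rounds(n))
```

At the 512-GPU scale used in the evaluation, this gives 511 ring rounds against 9 recursive-doubling rounds.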
Method
+---------------------------------------------------------+
| User Applications (RTM, Image Stacking, DL training)    |
+---------------------------------------------------------+
| gZCCL API   gZ-Allreduce   | gZ-Scatter   | gZ-Bcast    |
+----------------------------+--------------+-------------+
| Collective Computation     | Collective Data Movement   |
| Framework                  | Framework                  |
| (Recursive Doubling)       | (Binomial Tree)            |
| - log N comp/decomp ops    | - multi-stream cuSZp       |
| - whole-block per round    | - packed contig buffer     |
+----------------------------+----------------------------+
| Middleware: MPI P2P        | Compression Adapter        |
+----------------------------+----------------------------+
| Abstract Device Interface  | Lossy Compression Library  |
|                            | (cuSZp, error-bounded)     |
+----------------------------+----------------------------+
Compression sits inline in the collective pipeline: each process compresses its local block on the GPU before every point-to-point exchange and decompresses on arrival, all within the same CUDA stream chain that drives the reduction kernels. For Allreduce, gZCCL chooses recursive doubling (gZ-Allreduce-ReDoub): in each of the log N rounds, peer pairs exchange and reduce full blocks. For Scatter, the root compresses all per-rank blocks in parallel on multiple streams, packs them into a contiguous send buffer, and distributes them via a binomial tree.
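The recursive-doubling exchange described above can be simulated on the CPU in a few lines. This is a minimal sketch, not the gZCCL implementation: it assumes a power-of-two rank count, replaces cuSZp with plain uniform quantization that respects an absolute error bound, and models all ranks inside one process. The names lossy_roundtrip and redoub_allreduce are hypothetical.

```python
import random


def lossy_roundtrip(block, abs_err):
    """Stand-in for compress + decompress: uniform quantization whose
    per-element error is at most abs_err (this is not the cuSZp algorithm)."""
    step = 2.0 * abs_err
    return [round(x / step) * step for x in block]


def redoub_allreduce(blocks, abs_err):
    """Simulate a compressed recursive-doubling sum over len(blocks) ranks.

    blocks[r] is rank r's local vector. Returns every rank's final vector
    and the number of exchange rounds performed.
    """
    n = len(blocks)
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two rank count")
    state = [list(b) for b in blocks]
    rounds = 0
    dist = 1
    while dist < n:
        # Every rank compresses its full current block before sending it.
        wire = [lossy_roundtrip(b, abs_err) for b in state]
        # Rank r pairs with rank r XOR dist and adds the decompressed block.
        state = [
            [own + recv for own, recv in zip(state[r], wire[r ^ dist])]
            for r in range(n)
        ]
        dist *= 2
        rounds += 1
    return state, rounds


random.seed(0)
n_ranks, length, eb = 8, 16, 1e-4
data = [[random.uniform(-1.0, 1.0) for _ in range(length)] for _ in range(n_ranks)]
result, rounds = redoub_allreduce(data, eb)
exact = [sum(col) for col in zip(*data)]
worst = max(abs(v - e) for vec in result for v, e in zip(vec, exact))
print(f"rounds={rounds}, worst abs error={worst:.2e}")
```

With 8 ranks the loop runs 3 rounds, and each round sends the whole block rather than a D/N chunk. In this simulation the ranks need not end with bit-identical vectors, because each rank adds its own uncompressed block to its partner's decompressed one.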
The compressor is a modified cuSZp that exposes a cuSZp_compress_stream entry point so multiple compression jobs can run concurrently on independent CUDA streams, overlapping with peer sends/receives.
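The Scatter-side pattern (compress per-rank blocks concurrently, then pack them into one contiguous buffer) can be sketched on the CPU as follows. Everything here is a stand-in chosen for illustration: worker threads play the role of CUDA streams, and zlib (lossless) plays the role of the error-bounded compressor. The sketch does not call cuSZp_compress_stream, whose signature is not given in this summary; compress_and_pack and unpack_block are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_and_pack(blocks, n_workers=4):
    """Compress per-rank blocks concurrently, then pack them back to back.

    Returns (packed, offsets), where rank r's compressed bytes are
    packed[offsets[r]:offsets[r + 1]].
    """
    # Threads stand in for CUDA streams; zlib stands in for the compressor.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    offsets = [0]
    for chunk in compressed:
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(compressed), offsets


def unpack_block(packed, offsets, rank):
    """Extract and decompress the block destined for one rank."""
    return zlib.decompress(packed[offsets[rank]:offsets[rank + 1]])


blocks = [bytes([r]) * 4096 for r in range(8)]
packed, offsets = compress_and_pack(blocks)
print(f"{sum(map(len, blocks))} input bytes packed into {len(packed)}")
```

The offsets table is needed because compressed blocks have different sizes, so a receiver cannot locate its slice by rank index alone.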
Knobs Exposed
| Knob | Range / Meaning |
|---|---|
| Absolute error bound (ABS) | 1E-3, 1E-4, 1E-5 (tested values) |
| Algorithm choice | Ring vs Recursive Doubling (Allreduce); binomial tree (Scatter) |
| GPU buffer pool size | Pre-allocated at MPI_Init |
| Stream count | Parallelism for multi-stream compression |
| Message-size / process-count threshold | Switch to ReDoub when D/N < ~5 MB |
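The last knob reduces to a one-line rule. The sketch below is one possible encoding of it, assuming decimal megabytes (the summary does not say which convention applies) and treating the approximate 5 MB figure as an exact cutoff; pick_allreduce_algorithm and the returned labels are hypothetical names.

```python
MB = 1_000_000  # assumption: decimal megabytes


def pick_allreduce_algorithm(message_bytes, n_procs, chunk_threshold_bytes=5 * MB):
    """Return 'recursive_doubling' when the per-process ring chunk D/N
    falls below the threshold, otherwise 'ring'."""
    if n_procs < 1:
        raise ValueError("n_procs must be positive")
    chunk_bytes = message_bytes / n_procs
    return "recursive_doubling" if chunk_bytes < chunk_threshold_bytes else "ring"


for n in (16, 512):
    print(n, pick_allreduce_algorithm(600 * MB, n))
```

For a 600 MB message, 16 processes give 37.5 MB chunks (ring), while 512 processes give roughly 1.2 MB chunks (recursive doubling).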
Results
Evaluation on 512 NVIDIA A100 80GB GPUs across 128 nodes, HPE Slingshot 10 interconnect (100 Gbps). Baselines: NCCL (Allreduce), Cray MPI (Allreduce/Scatter).
- Allreduce: up to 4.5x vs NCCL and 20.2x vs Cray MPI at 512 GPUs.
- Scatter: up to 28.7x vs Cray MPI at 16 GPUs on a 646 MB message.
- Compression ratio: up to 73.35x on RTM seismic data at ABS=1E-4.
- Accuracy: image-stacking PSNR of 57.80 dB at ABS=1E-4 (gZ-Allreduce-ReDoub outperforms ring-based compression on quality).
- Application speedup: 1.69x end-to-end over NCCL on image-stacking.
Limitations
- gZ-Allreduce (Ring) variant loses to NCCL for messages below 50 MB because D/N chunks become too small to amortize GPU kernel-launch overhead.
- Speedup tapers as messages grow into the saturation regime (e.g., Scatter gain drops from 20.6x to 17.4x at 600 MB).
- Only absolute-error-bound mode (ABS) is evaluated; REL and PSNR modes are not.
- Algorithm switching between Ring and ReDoub is currently a static choice (process count and message size threshold) rather than learned.
Relevance to DynamICCL
gZCCL directly expands DynamICCL's action space: alongside (algorithm, protocol, nChannels, numThreads), an RL-controlled NCCL configurator can now also pick (compress?, compressor-mode, error-bound, ring-vs-recursive-doubling).
- New action dimension (compression): a binary compress/no-compress flag per collective, plus an error-bound knob (ABS = 1E-3 to 1E-5), becomes a tunable trade-off: compression cuts wire bytes but adds GPU kernel overhead and numerical error. A cost of collective time plus an accuracy penalty, negated to form the reward, captures this directly.
- New algorithm dimension (recursive doubling under compression): gZCCL shows that the best collective topology depends on whether compression is active and on D/N relative to GPU compressor overhead. DynamICCL's algorithm selector should condition on both.
- Threshold rule (D/N < 5 MB) becomes a learnable boundary: gZCCL's static switch from Ring to ReDoub is exactly the kind of fixed heuristic DynamICCL replaces with a learned policy that can react to live timing observations.
- Reward must include numerical fidelity: unlike pure-throughput collectives, compressed collectives bring an accuracy axis. DynamICCL's reward function should generalize to a multi-objective (latency, bandwidth, error) form when compressed-mode actions are enabled.
- State features to add: per-collective compression ratio achieved, compressor kernel time, error bound in effect — all observable post-hoc and feedable into the policy network alongside existing timing features.
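One way the extended action and the multi-objective reward from the bullets above might be written down is sketched here. It is an illustration only: the class, field names, penalty form, and error_weight value are all assumptions rather than anything taken from DynamICCL or gZCCL, and the timings in the usage lines are made-up numbers, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompressedAction:
    """Hypothetical extension of an NCCL-tuning action with compression knobs."""
    algorithm: str                            # e.g. "ring" or "recursive_doubling"
    compress: bool                            # compress / no-compress flag
    abs_error_bound: Optional[float] = None   # e.g. 1e-3, 1e-4, 1e-5 when compressing


def reward(collective_time_s, observed_max_abs_error, error_weight=1.0e3):
    """Higher is better: the negative of time plus a weighted error penalty.

    error_weight is an arbitrary illustrative value that converts
    absolute error into seconds-equivalent cost.
    """
    return -(collective_time_s + error_weight * observed_max_abs_error)


# Made-up observations for two candidate actions.
plain = CompressedAction("ring", compress=False)
lossy = CompressedAction("recursive_doubling", compress=True, abs_error_bound=1e-4)
print(plain, reward(0.010, 0.0))
print(lossy, reward(0.004, 1e-6))
```

A linear penalty is the simplest choice; a policy that must never exceed a fidelity budget could instead treat the error bound as a hard constraint and mask out actions that violate it.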