gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters — Detailed Summary

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur | UC Riverside / Argonne National Lab / Stevens / Iowa / Florida State / UC Merced | ICS 2024

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points.


Abstract


1. Introduction


2. Background and Motivation


3. Design and Optimization

3.1 GPU-Centric Pipeline

3.2 Multi-Stream cuSZp

3.3 Two Algorithm Frameworks

3.3.1 Collective Computation Framework (Allreduce)

3.3.2 Collective Data Movement Framework (Scatter / Broadcast)

3.3.3 Algorithm-Selection Heuristic

3.4 Pseudocode (gZ-Allreduce ReDoub)

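for step s in 0 .. log2(N) - 1:                     // N is assumed to be a power of two
    peer = rank XOR (1 << s)
    compressed_local = cuSZp_compress_stream(local, ABS_BOUND, stream_a)
    isend(compressed_local -> peer, stream_a)       // ordered after compression on stream_a
    irecv(compressed_peer  <- peer, stream_b)
    decompressed_peer = cuSZp_decompress_stream(compressed_peer, stream_b)
    stream_wait(stream_b, stream_a)                 // do not overwrite local while it is still being compressed
    local = reduce_kernel(local, decompressed_peer, stream_b)  // device-only
    stream_wait(stream_a, stream_b)                 // next step's compression must read the updated local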
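
The exchange pattern above can be checked on the CPU with a small simulation. This is a minimal sketch, not the gZCCL implementation: a uniform quantizer stands in for cuSZp (an assumption; it only shares the property that each value moves by at most the absolute bound), ranks are plain Python lists, and the names quantize and redoub_allreduce are hypothetical. Each partial sum is compressed once per step, so data passes through log2(N) lossy steps. Because a received partial sum already carries its sender's accumulated error, the demo compares against a conservative (N-1) * bound worst-case limit.

```python
import random


def quantize(values, abs_bound):
    """Stand-in for an error-bounded compressor: |x - q(x)| <= abs_bound."""
    step = 2.0 * abs_bound
    return [round(v / step) * step for v in values]


def redoub_allreduce(buffers, abs_bound):
    """Simulate a compressed recursive-doubling allreduce (sum) over N = 2^k ranks.

    buffers[r] is the input vector of rank r; the return value holds each
    rank's final (lossy) sum.
    """
    n = len(buffers)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("rank count must be a power of two")
    local = [list(b) for b in buffers]
    steps = n.bit_length() - 1  # log2(N)
    for s in range(steps):
        # Every rank "compresses" its current partial sum before sending it.
        sent = [quantize(local[r], abs_bound) for r in range(n)]
        # Every rank adds the decompressed data received from its XOR peer.
        local = [
            [a + b for a, b in zip(local[r], sent[r ^ (1 << s)])]
            for r in range(n)
        ]
    return local


# Small demo: 8 ranks, 16 values each, ABS bound 1E-4.
random.seed(0)
n_ranks, length, bound = 8, 16, 1e-4
buffers = [[random.uniform(-1.0, 1.0) for _ in range(length)] for _ in range(n_ranks)]
exact = [sum(column) for column in zip(*buffers)]
result = redoub_allreduce(buffers, bound)
max_err = max(abs(x - e) for row in result for x, e in zip(row, exact))
print(f"max abs error {max_err:.2e}, conservative limit {(n_ranks - 1) * bound:.2e}")
```

Ranks do not end up with bit-identical results in this sketch, because each rank adds its own uncompressed partial sum to a quantized copy of its peer's.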

4. Implementation


5. Evaluation

5.1 Setup

5.2 Allreduce Results

5.3 Scatter Results

5.4 Compression Ratio and Accuracy

5.5 Image-Stacking Application Study

5.6 Sensitivity / Ablations



7. Conclusion and Future Work


Knobs Exposed by gZCCL (DynamICCL action-space candidates)

Knob                        | Type        | Tested values / role
Compress on/off             | Binary      | Active vs passthrough
Compressor backend          | Enum        | cuSZp (only one evaluated)
Error-bound mode            | Enum        | ABS (REL, PSNR not evaluated)
Absolute error bound        | Float       | 1E-3, 1E-4, 1E-5
Allreduce algorithm         | Enum        | Ring vs Recursive Doubling
Scatter/Broadcast algorithm | Enum        | Binomial Tree
GPU buffer pool size        | Int         | Pre-allocated at MPI_Init
Compression stream count    | Int         | Parallelism for multi-stream compress
Algorithm-switch threshold  | Float (D/N) | Static rule: ~5 MB
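
The static rule in the last row can be written as a one-line selector. The sketch below is illustrative only: the function name, the string labels, and the choice of decimal megabytes are assumptions, since the summary gives the threshold only as "~5 MB" on D/N.

```python
MB = 1_000_000  # assumption: decimal megabytes; the summary does not say MB vs MiB


def select_allreduce_algorithm(total_bytes, n_ranks, threshold_bytes=5 * MB):
    """Pick Recursive Doubling when the per-rank share D/N is below the threshold, else Ring."""
    if n_ranks <= 0:
        raise ValueError("n_ranks must be positive")
    per_rank = total_bytes / n_ranks
    return "recursive_doubling" if per_rank < threshold_bytes else "ring"


print(select_allreduce_algorithm(64 * MB, 64))    # D/N = 1 MB
print(select_allreduce_algorithm(1024 * MB, 8))   # D/N = 128 MB
```

A learned policy would replace the fixed threshold_bytes argument with a decision conditioned on runtime observations.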

Latency / Bandwidth / Accuracy Trade-Offs

Dimension             | Effect
Wall-clock latency    | Reduced 4.5x–20.2x (Allreduce); 17.4x–28.7x (Scatter)
Effective bandwidth   | Multiplied by compression ratio (up to 73.35x byte savings)
GPU utilization       | High under ReDoub (whole-block), low under Ring at small D/N
Numerical error       | Per-compression error bounded by the ABS bound (1E-3 to 1E-5); error accumulates over log2 N steps (ReDoub) vs N-1 steps (Ring)
PSNR (image-stacking) | 57.80 dB at ABS=1E-4
PSNR (RTM seismic)    | 55–88 dB
Tail behavior         | Speedup tapers in saturation regime (>~600 MB)
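
Two rows of this table lend themselves to a quick calculation. The sketch below is a simplified model under stated assumptions, not a formula from the paper: effective_bandwidth treats compression and decompression time as a single additive codec_seconds term, and psnr_db uses the value-range definition of PSNR (range squared over mean squared error). Both function names are hypothetical.

```python
import math


def effective_bandwidth(raw_bytes, link_bw, ratio, codec_seconds=0.0):
    """Original bytes delivered per second when the payload is sent compressed.

    link_bw is in bytes per second; ratio is the compression ratio.
    With codec_seconds == 0 this reduces to ratio * link_bw.
    """
    wire_seconds = raw_bytes / (ratio * link_bw)
    return raw_bytes / (wire_seconds + codec_seconds)


def psnr_db(reference, reconstructed):
    """PSNR in dB, using the value range of the reference data as the peak."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf
    value_range = max(reference) - min(reference)
    if value_range == 0:
        raise ValueError("reference has zero value range")
    return 10.0 * math.log10(value_range ** 2 / mse)


# 1 GB over a 10 GB/s link at ratio 10, without and with 10 ms of codec time.
print(effective_bandwidth(1e9, 1e10, 10.0))
print(effective_bandwidth(1e9, 1e10, 10.0, codec_seconds=0.01))
print(psnr_db([0.0, 1.0], [0.1, 1.0]))
```

The second call illustrates why the "multiplied by compression ratio" figure is an upper limit: any codec time lowers the delivered rate.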

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects per-collective parameters (algorithm, protocol, nChannels, numThreads) to minimize collective completion time on HPC GPU clusters. gZCCL expands the action space and the reward formulation in concrete ways.

Direct relevance mapping

gZCCL element                                | DynamICCL implication
Compress on/off per collective               | New binary action dimension
Error bound (ABS=1E-3..1E-5)                 | New continuous/discrete action dimension
Ring vs Recursive Doubling under compression | Algorithm selector must know about this branch
Binomial Tree for Scatter under compression  | Topology selector must include compressed variants
D/N < 5 MB → ReDoub                          | Static threshold replaceable by learned policy
Multi-stream cuSZp count                     | Parallelism knob (analogous to nChannels in spirit)
Pre-allocated GPU buffer pool                | Constraint to surface to RL policy as memory budget
Application-level PSNR / accuracy            | Reward must become multi-objective (latency + fidelity)
Compression ratio per call                   | New observation feature for state vector
Per-call compressor kernel time              | New observation feature, decomposes total latency
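
One way to picture the mapping is as a small joint action set plus a flat observation vector. The sketch below is hypothetical throughout: the dictionary keys, the feature ordering, and the log scaling are illustrative choices, not part of gZCCL or DynamICCL. It covers only the Allreduce knobs listed in this summary and treats the error bound as meaningful only when compression is on.

```python
import math
from itertools import product

ERROR_BOUNDS = (1e-3, 1e-4, 1e-5)            # ABS bounds tested in the paper
ALGORITHMS = ("ring", "recursive_doubling")  # Allreduce algorithms compared


def joint_actions():
    """Enumerate (compress, abs_bound, algorithm) combinations as dicts."""
    actions = [
        {"compress": False, "abs_bound": None, "algorithm": algo}
        for algo in ALGORITHMS
    ]
    actions += [
        {"compress": True, "abs_bound": eb, "algorithm": algo}
        for algo, eb in product(ALGORITHMS, ERROR_BOUNDS)
    ]
    return actions


def observation(message_bytes, n_ranks, compression_ratio, codec_seconds, abs_bound):
    """Flat feature vector; abs_bound=None means the last call was uncompressed."""
    return [
        math.log2(message_bytes),
        math.log2(n_ranks),
        compression_ratio,
        codec_seconds,
        -math.log10(abs_bound) if abs_bound else 0.0,
    ]


print(len(joint_actions()))
print(observation(2 ** 20, 8, 10.0, 0.001, 1e-4))
```

Enumerating the combinations jointly, rather than as independent heads, keeps invalid pairs (an error bound with compression off) out of the action set.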

Key lessons for DynamICCL

  1. Compression is a first-class action knob, not just an underlying transport detail: gZCCL shows that the choice to compress (and how aggressively) changes which collective algorithm is optimal; the two cannot be tuned independently. DynamICCL should treat them as a joint action.
  2. Algorithm choice depends on compression state: Ring is fine without compression but loses to ReDoub when compression is active and D/N is small. DynamICCL's algorithm head must condition on the compression action.
  3. Reward must include accuracy when compression is active: pure latency reward is insufficient. Either a scalarized multi-objective reward, r = -latency - lambda * error, or a constrained formulation, max -latency s.t. error < epsilon, is appropriate.
  4. Static thresholds are exactly what RL replaces: gZCCL's hand-coded D/N < 5 MB rule for switching Ring → ReDoub is a textbook fixed heuristic that DynamICCL can supersede with a learned policy that observes live compressor kernel times.
  5. State features to add to DynamICCL: per-call compression ratio, compressor kernel duration, error-bound in effect, post-collective reconstruction PSNR (when measurable). All are observable and inform future action selection.
  6. Generalization caveat: gZCCL was evaluated only with ABS error bound and only with cuSZp; DynamICCL training would need to either fix that backend or bring multiple compressors into the action space.
  7. gZCCL is complementary, not competing: gZCCL is a static-policy library answering "how to integrate compression"; DynamICCL is a dynamic-policy agent answering "when and how aggressively to use it". Layered together, gZCCL provides the substrate of compressed collective implementations; DynamICCL provides the per-call selector across them.