A Survey of Methods for Collective Communication Optimization and Tuning

Udayanga Wickramasinghe, Andrew Lumsdaine | Indiana University & Pacific Northwest National Laboratory | arXiv:1611.06334 (2016)

Problem

Collective operations (Broadcast, Allreduce, Allgather, Alltoall, Scatter, Gather) are the dominant communication primitives in HPC and ML training, and they are frequently the application-level bottleneck. Each collective can be implemented by many different algorithms (Ring, Tree, Recursive Doubling, Bruck, Dissemination, etc.), each parameterized by knobs such as segment size, pipeline depth, topology mapping, and protocol. The "best" choice depends on message size, process count, network topology, link latency/bandwidth, and on-node hierarchy — yielding a combinatorial-explosion search space that grows with every new generation of hardware. Hand-tuning by MPI/CCL implementers does not scale to exascale, and brute-force empirical search is infeasible. This is the classical algorithm selection problem for collectives.

Core Insight

No single tuning methodology — analytical, empirical, graphical, or ML — is sufficient on its own. A unified, multidimensional pipeline that combines analytical performance models for coarse pruning with statistical/ML predictors for fine selection, fed by application-aware profiling, is required to make collective tuning tractable at exascale.

Method

The paper is a literature survey, not a new system; it organizes ~30 years of collective-tuning research into a four-class taxonomy and proposes a synthesis architecture (UMTAC) at the end.

The four tuning-method classes surveyed:

Analytical models — closed-form latency formulas (Hockney, LogP, LogGP, PLogP) parameterized by process count, message size, latency, and gap. Used to derive optimal segment sizes and predict crossover points between algorithms.
Statistical / empirical techniques — AEOS-style exhaustive offline search (LAM/MPI, ATLAS, FFTW) and online dynamic switching (STAR-MPI).
Graphical / logical encoding — Quad-tree decision structures over {process count, message size} that map regions to algorithm choices (Pjesivac-Grbovic).
Machine learning — Decision trees (C4.5, CART, ID3), regression trees (REPTree), Artificial Neural Networks, and rule-based runtime feedback loops.

The survey also covers application-centric tuning: compiler-driven non-blocking communication/computation overlap (CC-MPI, Hoefler's framework) and RDMA / one-sided optimizations (MVAPICH), which introduce additional knobs (buffer registration, queue pair counts, polling frequency, eager vs. rendezvous protocol).

The proposed Unified Multidimensional Tuning Architecture (UMTAC) chains application profiling, benchmark execution, data preprocessing, ensemble model generation, and model validation into a single hybrid tuning pipeline.

Results

The survey reports prior-work numbers rather than running new experiments:

Quad-tree decision structures reach within 10% of optimal with only 3 levels of depth (Pjesivac-Grbovic).
Regression trees (REPTree) on Jacobi and Integer Sort kernels achieve ~90% of the maximum possible performance gain.
Neural network predictors reach up to 95% of maximum gain but at high training-data cost.
Compiler-driven overlap yields 21% speedup on 3D FFT and ~16% on other scientific kernels (Hoefler).

Limitations

Analytical models are either too simple to capture congestion / noise or too parameter-heavy to fit in practice.
Empirical AEOS search does not scale to exascale parameter spaces.
Quad trees are limited to 2-D inputs (N, M) and miss sharp performance spikes / dips.
Neural networks are slow to train and offer little interpretability; decision trees overfit and impose axis-aligned rectangular decision regions that do not match real performance landscapes.
Static compiler analysis for overlap fails on pointer-heavy C/C++.
The survey predates NCCL, MSCCL, BytePS, Horovod, and modern GPU collectives, so RDMA discussion is InfiniBand/MPI-centric and the GPU-cluster setting is out of scope.

Relevance to DynamICCL

This survey is the canonical taxonomy of collective-tuning approaches and directly defines the problem DynamICCL's Agent-2 is solving. Key takeaways:

Confirms the action space: the configuration knobs the survey enumerates — algorithm, segment/chunk size, pipeline depth, topology mapping, communication protocol, RDMA buffer parameters — are the same set NCCL exposes (algorithm, protocol, nChannels, numThreads, chunkSize, numPipeOps). Agent-2's discrete combinatorial action space is exactly the one this paper surveys.
Positions DynamICCL in the taxonomy: DynamICCL is an online ML-based tuner with rule/feedback adaptation, occupying the most expressive cell of the survey's 4-class taxonomy. It inherits ML's accuracy advantage (~90-95% of max gain in prior work) at the cost of training overhead.
Supports the hybrid-modeling thesis: the survey's UMTAC vision (analytical priors + ML refinement) maps neatly onto DynamICCL's plausible architecture, where an analytical bandwidth model could prune actions before the RL policy chooses among the survivors.
Identifies prior baselines: STAR-MPI's measure/select dynamic loop is the closest classical analog to DynamICCL's online learning loop; the quad-tree approach is the strongest non-ML baseline to beat.
Highlights open problems DynamICCL targets directly: feature selection over modern HPC system signals, exascale-scale sparse search, and integrating low-level transport knobs with high-level algorithm selection are listed as open and remain so today.