A Survey of Methods for Collective Communication Optimization and Tuning — Detailed Summary

Udayanga Wickramasinghe, Andrew Lumsdaine | Indiana University & Pacific Northwest National Laboratory | arXiv:1611.06334 (2016)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points. The paper is a literature survey; the detailed walkthrough therefore tracks the taxonomy and the configuration design space it maps out, since those are the action-space DynamICCL's Agent-2 must learn over.


Abstract


1. Introduction

Motivation:

The algorithm-selection problem:

Two scopes the paper distinguishes:


2. Collective Algorithms and Implementation

Algorithm taxonomy (rooted vs. non-rooted):

Algorithm-by-collective examples:

Segmentation as a fundamental knob:


3. Collective Tuning

This is the core of the survey, organized into four method classes.

3.1 Analytical Models

Strengths/weaknesses:

3.2 Statistical / Empirical Techniques

Strengths/weaknesses:

3.3 Graphical / Logic Encoding

Strengths/weaknesses:

3.4 Machine Learning

Strengths/weaknesses:


4. Application Centric Tuning

4.1 Programmable Overlap (Communication / Computation)

4.2 RDMA / One-sided Optimizations


5. Discussion — Unified Multidimensional Tuning Architecture (UMTAC)

Argument:

Proposed UMTAC pipeline (conceptual block diagram):

+-----------------+   +----------------+   +--------------------+
| Application     |-->| Benchmark      |-->| Data Preprocessing |
| Profiling       |   | Execution      |   | / Feature Eng.     |
+-----------------+   +----------------+   +--------------------+
                                                    |
                                                    v
                                       +----------------------------+
                                       | Ensemble Model Generation  |
                                       | (analytical + ML + rules)  |
                                       +----------------------------+
                                                    |
                                                    v
                                       +----------------------------+
                                       | Model Validation & Deploy  |
                                       +----------------------------+

6. Final Remarks


Configuration Knobs Identified by the Survey

This is the action-space the survey enumerates — exactly what DynamICCL's Agent-2 must select among.

Knob Description Typical impact How prior work tunes it
Algorithm choice Ring vs. Tree vs. Bruck vs. Recursive-Doubling, etc. Largest single performance lever; varies by P and M Decision trees, quad trees, ANN, AEOS lookup
Segment / chunk size Bytes per pipelined segment Controls pipeline efficiency Closed-form via LogP/PLogP derivative; empirical sweep
Pipeline depth # of outstanding segments Latency hiding vs. memory pressure Empirical
Process / topology mapping Logical-to-physical rank assignment Critical on hierarchical fabrics Topology-aware heuristics
Communication protocol Eager vs. Rendezvous (RDMA) Determines copy overhead and crossover Static threshold or rule-based
RDMA buffer registration size Bytes pinned for zero-copy Throughput on InfiniBand / RoCE Tuned at install time
Queue pair count # of concurrent RDMA QPs Concurrency and injection rate Hand-tuned
Polling frequency Busy-poll vs. event-driven CPU utilization vs. latency Hand-tuned
Overlap window size Computation chunk that hides comm Application speedup Compiler-driven (CC-MPI)

Surveyed Systems Reference Table

System Tuning approach Class
LAM/MPI AEOS offline search Empirical
STAR-MPI Dynamic measure/select online Empirical (online)
OpenMPI Hand-coded MCA rules + dynamic feedback Rule-based
MPICH Manual heuristic algorithm selection Heuristic
MVAPICH RDMA-verb-level optimization Transport tuning
ATLAS / FFTW AEOS (precedent in BLAS / FFT) Empirical
CC-MPI Compiler-driven overlap Application-centric
Hoefler's framework Static analysis for overlap Application-centric
Pjesivac-Grbovic Quad-tree decision encoding Graphical
REPTree-based tuner Regression trees ML
ANN tuner Neural network regression ML

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer. Agent-2 (the Config Agent) selects, per collective, an action tuple (algorithm, protocol, nChannels, numThreads) — and this survey is the canonical reference for what that action space is and how prior work has tried to choose it.

Survey concept DynamICCL mapping
Algorithm-selection problem Exactly the problem Agent-2 solves
(Algorithm, segment size, pipeline depth) action space NCCL's (algorithm, protocol, nChannels, numThreads, chunkSize, numPipeOps)
Quad-tree decision encoding Strongest non-ML baseline for Agent-2 to beat
STAR-MPI measure/select loop Classical online-learning analog of DynamICCL's RL loop
AEOS offline benchmarking DynamICCL's offline trace replay / pre-training corpus
ANN / regression-tree predictors Function-approximator class for Agent-2's policy/value networks
Rule-based runtime feedback Bandit/RL precursor — motivates Agent-2's online updates
LogP / PLogP analytical models Candidate priors for shaping reward or pruning the action space
Application-centric overlap Forward-looking direction: extend DynamICCL beyond a single collective
UMTAC ensemble pipeline Plausible meta-architecture: analytical pruner -> RL policy -> measurement validator
Listed open problems (feature selection, sparse search at exascale, transport+algorithm coordination) DynamICCL's research contribution targets these directly

Takeaways for the dissertation literature review:

  1. The configuration design space DynamICCL operates over is not novel in its dimensions; it is novel in being learned online from real cluster measurements rather than statically tuned.
  2. The survey provides a clear comparative baseline taxonomy — DynamICCL should position itself as online-ML / RL with rule-based feedback, the most expressive corner of the survey's 2x2 (online-vs-offline) x (model-vs-measurement) grid.
  3. The reported numbers (90-95% of max gain for ML predictors) define the performance bar Agent-2 must clear, while STAR-MPI's online-measurement overhead is the cost model Agent-2 must come in under.
  4. The survey's UMTAC vision validates DynamICCL's plausible architectural choice of combining an analytical bandwidth/latency prior with an RL policy trained on measurements — this is the survey's stated path forward, never fully realized.