Optimization of Collective Communication Operations in MPICH — Detailed Summary

Rajeev Thakur, Rolf Rabenseifner, William Gropp | Argonne National Laboratory / HLRS, University of Stuttgart | International Journal of High Performance Computing Applications (IJHPCA), 2005 | Sage Publications

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction



3. Cost Model


4. Algorithms

4.1 Allgather

4.1.1 Recursive Doubling

4.1.2 Bruck Algorithm

4.1.3 Performance and Cutoffs

4.2 Broadcast

4.3 All-to-All

4.4 Reduce-Scatter

4.5 Reduce and Allreduce


5. Further Optimization of Allreduce and Reduce

5.1 Vector Halving and Distance Doubling

5.2 Binary Blocks Algorithm

5.3 Ring Algorithm

5.4 Choosing the Fastest Algorithm

5.5 Comparison with Vendor's MPI


6. Conclusions and Future Work


Quantitative Results Summary

Collective Old algorithm New algorithm(s) Cutoff
Allgather Ring Bruck / Rec-Dbl / Ring 80 KB / 512 KB
Broadcast Binomial Tree Tree / Van de Geijn 12 KB
All-to-All Naive Irecv-Isend Bruck / Irecv-Isend / Pairwise 256 B / 32 KB
Reduce-Scatter Tree-then-Scatterv Rec-Halving / Pairwise 512 KB
Reduce Binomial Tree Tree / Rabenseifner 2 KB
Allreduce Reduce + Bcast Rec-Dbl / Rabenseifner message-size dependent
Platform Headline speedup vs. baseline
IBM SP, pure MPI, 8-64 KB 1.5x
IBM SP, pure MPI, larger buffers 2-5x
IBM SP, hybrid MPI + OpenMP 1.5-3x
Myrinet cluster, pure MPI 3-7x
Myrinet cluster, hybrid MPI + OpenMP 2-5x
Cray T3E, MPI_SUM 3-5x
Cray T3E, MPI_MAXLOC up to 100x
Cray T3E, Allreduce production up to 20% time reduction
Cray T3E, Reduce production up to 54% time reduction

Cost-Equation Reference Card

T_ring_allgather       = (p-1)*alpha + ((p-1)/p) * n * beta
T_rec_dbl_allgather    = lg(p)*alpha + ((p-1)/p) * n * beta
T_bruck_allgather      = ceil(lg p)*alpha + ((p-1)/p) * n * beta
T_tree_bcast           = ceil(lg p) * (alpha + n*beta)
T_vandegeijn_bcast     = (lg p + p - 1)*alpha + 2*((p-1)/p)*n*beta
T_a2a_bruck            = lg(p)*alpha + (n/2)*lg(p)*beta
T_a2a_pairwise         = (p-1)*alpha + n*beta
T_rec_half_redscat     = lg(p)*alpha + ((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_pairwise_redscat     = (p-1)*alpha + ((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_rabenseifner_reduce  = 2*lg(p)*alpha + 2*((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_vh_dd_allreduce_p2   ~= 2*lg(p)*alpha + 2*n*beta + n*gamma
T_vh_dd_allreduce_npo2 ~= (3+2*floor(lg p))*alpha + 4*n*beta + (3/2)*n*gamma
T_ring_allreduce       = 2*(p-1)*alpha + 2*n*beta + n*gamma

Experimental Setup

Component Value
Myrinet cluster (Argonne) Linux cluster, Myrinet 2000, MPICH-GM
Heidelberg HELICS cluster Dual-CPU PCs, Myrinet
IBM SP (SDSC) Switch-connected, IBM's native MPI
Cray T3E 900 (Stuttgart, HLRS) Production system used for profiling and benchmarks
MPI implementations MPICH 1.2.6, MPICH2 0.971, IBM MPI, Cray MPT 1.4.0.4
Benchmarks SKaMPI, b_eff
Operators MPI_SUM, MPI_MAX, MPI_MAXLOC (derived datatype stress)
Workloads Pure MPI, hybrid MPI + OpenMP

Named Methods and Concepts


Limitations Stated by the Authors


Open Problems Explicitly Identified

  1. Automated, system-parameter-driven cutoff selection. The current policy uses fixed cutoffs (80 KB, 512 KB, 2 KB, 12 KB, 32 KB); replacing these with a runtime model would generalize across new hardware without rebenchmarking.
  2. Optimization of irregular collectivesMPI_Gatherv, MPI_Scatterv, MPI_Allgatherv, MPI_Alltoallv deserve their own regime-aware algorithm choices.
  3. Hierarchical-network optimizations for systems like TeraGrid: the flat cost model breaks down when latency varies by orders of magnitude between intra-rack and inter-site links.
  4. Integration with one-sided MPIMPI_Put / MPI_Get / MPI_Accumulate may enable lower-overhead implementations of some collectives.

Cross-Cutting Empirical Take-Aways

Take-away Evidence
No single algorithm is best across all (p, n, op) combinations Sec. 5.4 winning-algorithm matrix
Latency-optimal and bandwidth-optimal algorithms are different Tree vs Van de Geijn / Rabenseifner
Non-power-of-two p has its own performance cliff that demands separate algorithms Sec. 5.1 cost ratios; Bruck vs RecDbl tradeoff
Reduce-scatter + allgather is the right long-message pattern for Allreduce and Bcast Rabenseifner, Van de Geijn
Vendor-native MPIs are not always optimized — portable algorithms layered on top of point-to-point can beat them IBM SP and Cray T3E speedups
Static cutoffs are good enough for one platform but do not generalize Sec. 6 future work
Allreduce and Reduce alone account for ~40% of MPI time in production Sec. 1 motivation

Note on NCCL Tuning

This paper is the canonical statement of the message-size-dependent algorithm-selection problem that NCCL (and DynamICCL) inherits two decades later. The Section 5.4 "winning algorithm" matrix — a 2D map from (process count, message size) to a chosen algorithm — is structurally identical to NCCL's internal tuning table that picks algorithm (Ring vs Tree vs CollNet) and protocol (LL / LL128 / Simple). The paper's own Section 6 future-work item — "automatic determination of cutoffs based on system parameters instead of hand-tuned constants" — anticipates exactly the tuner-plugin problem: hand-tuned cutoffs do not generalize, and a learned or runtime-derived policy is needed. The 2 KB Reduce cutoff and 32 KB All-to-All cutoff are also remarkably close to NCCL's modern small/medium/large message thresholds, suggesting the underlying latency-vs-bandwidth physics has changed less than the hardware would suggest.