Optimization of Collective Communication Operations in MPICH

Rajeev Thakur, Rolf Rabenseifner, William Gropp | Argonne National Laboratory / HLRS, University of Stuttgart | International Journal of High Performance Computing Applications (IJHPCA), 2005 | Sage Publications


Problem

Collective communication is a heavily used component of MPI, yet the MPICH releases preceding this work shipped only rudimentary collective implementations. A five-year profiling study on a Cray T3E 900 showed that more than 40% of MPI time is spent inside MPI_Allreduce and MPI_Reduce, and 25% of execution time uses non-power-of-two process counts — so both reductions and non-power-of-two performance matter in practice. Existing collective implementations either picked a single algorithm regardless of message size, or used algorithms tuned only for power-of-two processes, leaving substantial performance on the table on real switched-network clusters such as Myrinet and the IBM SP.


Core Insight

No single algorithm is best across all (message size, process count, operator) combinations: latency-optimal short-message algorithms and bandwidth-optimal long-message algorithms must coexist, switched at empirically determined cutoffs, with separate variants for non-power-of- two process counts. Long-message reductions and broadcasts should be decomposed into reduce-scatter + allgather (or scatter + allgather) so the bandwidth term scales as ~2n rather than n * lg p.


Method

The authors adopt a simple Hockney-style cost model T = alpha + n * beta (with a per-byte compute term gamma for reductions) and use it to design a portfolio of algorithms per collective, all implemented as portable functions on top of MPI point-to-point primitives.

Per-collective algorithm portfolio (selected by message size + p)

  Allgather       : Bruck   |  Recursive Doubling  |  Ring
  Broadcast       : Binomial Tree              |  Van de Geijn
  All-to-All      : Bruck   |  Irecv-Isend     |  Pairwise
  Reduce-Scatter  : Recursive Halving          |  Pairwise
  Reduce          : Binomial Tree              |  Rabenseifner
  Allreduce       : Recursive Doubling         |  Rabenseifner
                                  ^                       ^
                              short messages       long messages

Section 5 develops a deeper analysis of Reduce and Allreduce, introducing vector halving / distance doubling, binary blocks (non-power-of-two), and ring variants, and codifies the result as a 2D (p, n) -> winning algorithm regime matrix.


Experimental Setup

Component Value
Myrinet cluster (Argonne) Linux cluster, Myrinet 2000, MPICH-GM
Heidelberg HELICS cluster Dual-CPU PCs, Myrinet
IBM SP (SDSC) Switch-connected, IBM's native MPI
Cray T3E 900 (Stuttgart, HLRS) Production system
MPI implementations compared MPICH 1.2.6, MPICH2 0.971, IBM MPI, Cray MPT 1.4.0.4
Benchmarks SKaMPI, b_eff
Operators MPI_SUM, MPI_MAX, MPI_MAXLOC
Workloads Pure MPI, hybrid MPI + OpenMP
Metric Wall-clock time per collective; speedup vs. baseline

Headline Quantitative Results

Algorithm cutoffs adopted in MPICH:

Collective Cutoffs
Allgather Bruck < 80 KB; Rec-Dbl < 512 KB; Ring >= 512 KB
Broadcast Tree < 12 KB; Van de Geijn otherwise
All-to-All Bruck <= 256 B/msg; Irecv-Isend up to 32 KB; Pairwise >= 32 KB
Reduce Tree <= 2 KB; Rabenseifner > 2 KB (predefined ops only)
Reduce-Scatter Recursive Halving <= 512 KB; Pairwise > 512 KB

Speedups vs. baselines:

Platform Speedup
IBM SP, pure MPI, 8-64 KB 1.5x
IBM SP, pure MPI, larger buffers 2-5x
IBM SP, hybrid MPI + OpenMP 1.5-3x
Myrinet cluster, pure MPI 3-7x
Myrinet cluster, hybrid MPI + OpenMP 2-5x
Cray T3E, MPI_SUM 3-5x
Cray T3E, MPI_MAXLOC up to 100x
Cray T3E, Allreduce production profiling up to 20% time reduction
Cray T3E, Reduce production profiling up to 54% time reduction

Cost reference (long-message Allreduce):


Limitations


Open Problems

  1. Automated, system-parameter-driven cutoff selection. Replace the fixed cutoffs (80 KB, 512 KB, 12 KB, 32 KB, 2 KB) with a runtime model that generalizes across new hardware without rebenchmarking.
  2. Optimization of irregular collectives — the "v" variants deserve their own regime-aware algorithm choices.
  3. Hierarchical-network optimizations for systems like TeraGrid, where the flat cost model breaks down once latency varies by orders of magnitude between intra-rack and inter-site links.
  4. Integration with one-sided MPIMPI_Put / MPI_Get / MPI_Accumulate may enable lower-overhead implementations of some collectives.

Note on NCCL Tuning

This paper is the canonical statement of the message-size-dependent algorithm-selection problem that NCCL inherits two decades later. The Section 5.4 "winning algorithm" matrix — a 2D map from (process count, message size) to a chosen algorithm — is structurally identical to NCCL's internal tuning table that picks algorithm (Ring vs Tree vs CollNet) and protocol (LL / LL128 / Simple). The paper's own future-work item — "automatic determination of cutoffs based on system parameters instead of hand-tuned constants" — anticipates exactly the tuner-plugin problem: hand-tuned cutoffs do not generalize, and a learned or runtime-derived policy is needed. The 2 KB Reduce cutoff and 32 KB All-to-All cutoff are also remarkably close to NCCL's modern small/medium/large message thresholds, suggesting the underlying latency-vs-bandwidth physics has changed less than the hardware would suggest.