Optimization of Collective Communication Operations in MPICH

Rajeev Thakur, Rolf Rabenseifner, William Gropp | Argonne National Laboratory / HLRS, University of Stuttgart | International Journal of High Performance Computing Applications (IJHPCA), 2005 | Sage Publications

Problem

Collective communication is a heavily used component of MPI, yet the MPICH releases preceding this work shipped only rudimentary collective implementations. A five-year profiling study on a Cray T3E 900 showed that more than 40% of MPI time is spent inside MPI_Allreduce and MPI_Reduce, and 25% of execution time uses non-power-of-two process counts — so both reductions and non-power-of-two performance matter in practice. Existing collective implementations either picked a single algorithm regardless of message size, or used algorithms tuned only for power-of-two processes, leaving substantial performance on the table on real switched-network clusters such as Myrinet and the IBM SP.

Core Insight

No single algorithm is best across all (message size, process count, operator) combinations: latency-optimal short-message algorithms and bandwidth-optimal long-message algorithms must coexist, switched at empirically determined cutoffs, with separate variants for non-power-of- two process counts. Long-message reductions and broadcasts should be decomposed into reduce-scatter + allgather (or scatter + allgather) so the bandwidth term scales as ~2n rather than n * lg p.

Method

The authors adopt a simple Hockney-style cost model T = alpha + n * beta (with a per-byte compute term gamma for reductions) and use it to design a portfolio of algorithms per collective, all implemented as portable functions on top of MPI point-to-point primitives.

Per-collective algorithm portfolio (selected by message size + p)

  Allgather       : Bruck   |  Recursive Doubling  |  Ring
  Broadcast       : Binomial Tree              |  Van de Geijn
  All-to-All      : Bruck   |  Irecv-Isend     |  Pairwise
  Reduce-Scatter  : Recursive Halving          |  Pairwise
  Reduce          : Binomial Tree              |  Rabenseifner
  Allreduce       : Recursive Doubling         |  Rabenseifner
                                  ^                       ^
                              short messages       long messages

Section 5 develops a deeper analysis of Reduce and Allreduce, introducing vector halving / distance doubling, binary blocks (non-power-of-two), and ring variants, and codifies the result as a 2D (p, n) -> winning algorithm regime matrix.

Experimental Setup

Component	Value
Myrinet cluster (Argonne)	Linux cluster, Myrinet 2000, MPICH-GM
Heidelberg HELICS cluster	Dual-CPU PCs, Myrinet
IBM SP (SDSC)	Switch-connected, IBM's native MPI
Cray T3E 900 (Stuttgart, HLRS)	Production system
MPI implementations compared	MPICH 1.2.6, MPICH2 0.971, IBM MPI, Cray MPT 1.4.0.4
Benchmarks	SKaMPI, b_eff
Operators	`MPI_SUM`, `MPI_MAX`, `MPI_MAXLOC`
Workloads	Pure MPI, hybrid MPI + OpenMP
Metric	Wall-clock time per collective; speedup vs. baseline

Headline Quantitative Results

Algorithm cutoffs adopted in MPICH:

Collective	Cutoffs
Allgather	Bruck `< 80 KB`; Rec-Dbl `< 512 KB`; Ring `>= 512 KB`
Broadcast	Tree `< 12 KB`; Van de Geijn otherwise
All-to-All	Bruck `<= 256 B/msg`; Irecv-Isend up to 32 KB; Pairwise `>= 32 KB`
Reduce	Tree `<= 2 KB`; Rabenseifner `> 2 KB` (predefined ops only)
Reduce-Scatter	Recursive Halving `<= 512 KB`; Pairwise `> 512 KB`

Speedups vs. baselines:

Platform	Speedup
IBM SP, pure MPI, 8-64 KB	1.5x
IBM SP, pure MPI, larger buffers	2-5x
IBM SP, hybrid MPI + OpenMP	1.5-3x
Myrinet cluster, pure MPI	3-7x
Myrinet cluster, hybrid MPI + OpenMP	2-5x
Cray T3E, `MPI_SUM`	3-5x
Cray T3E, `MPI_MAXLOC`	up to 100x
Cray T3E, Allreduce production profiling	up to 20% time reduction
Cray T3E, Reduce production profiling	up to 54% time reduction

Cost reference (long-message Allreduce):

Vector halving / distance doubling, power-of-two: ~ 2 lg(p) alpha + 2 n beta + n gamma
Same, non-power-of-two: ~ (3 + 2 floor(lg p)) alpha + 4 n beta + (3/2) n gamma — bandwidth and compute terms roughly double, exposing a "p must be power of two" performance cliff.
Ring Allreduce: 2 (p-1) alpha + 2 n beta + n gamma — bandwidth-optimal, latency-poor.

Limitations

The cost model is flat: it ignores hierarchical / multi-level network structure (NUMA, intra-rack vs. inter-rack), which can yield suboptimal choices on real hierarchical networks.
Cutoffs are static and hand-tuned per platform; an automated, system-parameter-driven cutoff search is left to future work.
Irregular ("v") collectives (MPI_Gatherv, MPI_Scatterv, MPI_Allgatherv, MPI_Alltoallv) use regular techniques without targeted optimization.
Splitting user-defined derived datatypes across the rounds of a recursive algorithm is described as "tricky" and is not fully implemented for all cases.
The bake-off is limited to Myrinet, IBM SP, and Cray T3E; modern InfiniBand / RoCE / GPU-direct interconnects are out of scope.

Open Problems

Automated, system-parameter-driven cutoff selection. Replace the fixed cutoffs (80 KB, 512 KB, 12 KB, 32 KB, 2 KB) with a runtime model that generalizes across new hardware without rebenchmarking.
Optimization of irregular collectives — the "v" variants deserve their own regime-aware algorithm choices.
Hierarchical-network optimizations for systems like TeraGrid, where the flat cost model breaks down once latency varies by orders of magnitude between intra-rack and inter-site links.
Integration with one-sided MPI — MPI_Put / MPI_Get / MPI_Accumulate may enable lower-overhead implementations of some collectives.

Note on NCCL Tuning

This paper is the canonical statement of the message-size-dependent algorithm-selection problem that NCCL inherits two decades later. The Section 5.4 "winning algorithm" matrix — a 2D map from (process count, message size) to a chosen algorithm — is structurally identical to NCCL's internal tuning table that picks algorithm (Ring vs Tree vs CollNet) and protocol (LL / LL128 / Simple). The paper's own future-work item — "automatic determination of cutoffs based on system parameters instead of hand-tuned constants" — anticipates exactly the tuner-plugin problem: hand-tuned cutoffs do not generalize, and a learned or runtime-derived policy is needed. The 2 KB Reduce cutoff and 32 KB All-to-All cutoff are also remarkably close to NCCL's modern small/medium/large message thresholds, suggesting the underlying latency-vs-bandwidth physics has changed less than the hardware would suggest.