Optimization of Collective Communication Operations in MPICH
Rajeev Thakur, Rolf Rabenseifner, William Gropp | Argonne National Laboratory / HLRS, University of Stuttgart | International Journal of High Performance Computing Applications (IJHPCA), 2005 | Sage Publications
Problem
Collective communication is a heavily used component of MPI, yet the
MPICH releases preceding this work shipped only rudimentary collective
implementations. A five-year profiling study on a Cray T3E 900 showed
that more than 40% of MPI time is spent inside
MPI_Allreduce and MPI_Reduce, and 25% of
execution time uses non-power-of-two process counts — so both reductions
and non-power-of-two performance matter in practice. Existing collective
implementations either picked a single algorithm regardless of message
size, or used algorithms tuned only for power-of-two processes, leaving
substantial performance on the table on real switched-network clusters
such as Myrinet and the IBM SP.
Core Insight
No single algorithm is best across all
(message size, process count, operator) combinations:
latency-optimal short-message algorithms and bandwidth-optimal
long-message algorithms must coexist, switched at empirically determined
cutoffs, with separate variants for non-power-of- two process counts.
Long-message reductions and broadcasts should be decomposed into
reduce-scatter + allgather (or scatter + allgather) so the bandwidth
term scales as ~2n rather than n * lg p.
Method
The authors adopt a simple Hockney-style cost model
T = alpha + n * beta (with a per-byte compute term
gamma for reductions) and use it to design a portfolio of
algorithms per collective, all implemented as portable functions on top
of MPI point-to-point primitives.
Per-collective algorithm portfolio (selected by message size + p)
Allgather : Bruck | Recursive Doubling | Ring
Broadcast : Binomial Tree | Van de Geijn
All-to-All : Bruck | Irecv-Isend | Pairwise
Reduce-Scatter : Recursive Halving | Pairwise
Reduce : Binomial Tree | Rabenseifner
Allreduce : Recursive Doubling | Rabenseifner
^ ^
short messages long messages
Section 5 develops a deeper analysis of Reduce and
Allreduce, introducing vector halving / distance
doubling, binary blocks (non-power-of-two),
and ring variants, and codifies the result as a 2D
(p, n) -> winning algorithm regime matrix.
Experimental Setup
| Component | Value |
|---|---|
| Myrinet cluster (Argonne) | Linux cluster, Myrinet 2000, MPICH-GM |
| Heidelberg HELICS cluster | Dual-CPU PCs, Myrinet |
| IBM SP (SDSC) | Switch-connected, IBM's native MPI |
| Cray T3E 900 (Stuttgart, HLRS) | Production system |
| MPI implementations compared | MPICH 1.2.6, MPICH2 0.971, IBM MPI, Cray MPT 1.4.0.4 |
| Benchmarks | SKaMPI, b_eff |
| Operators | MPI_SUM, MPI_MAX,
MPI_MAXLOC |
| Workloads | Pure MPI, hybrid MPI + OpenMP |
| Metric | Wall-clock time per collective; speedup vs. baseline |
Headline Quantitative Results
Algorithm cutoffs adopted in MPICH:
| Collective | Cutoffs |
|---|---|
| Allgather | Bruck < 80 KB; Rec-Dbl < 512 KB;
Ring >= 512 KB |
| Broadcast | Tree < 12 KB; Van de Geijn otherwise |
| All-to-All | Bruck <= 256 B/msg; Irecv-Isend up to 32 KB;
Pairwise >= 32 KB |
| Reduce | Tree <= 2 KB; Rabenseifner > 2 KB
(predefined ops only) |
| Reduce-Scatter | Recursive Halving <= 512 KB; Pairwise
> 512 KB |
Speedups vs. baselines:
| Platform | Speedup |
|---|---|
| IBM SP, pure MPI, 8-64 KB | 1.5x |
| IBM SP, pure MPI, larger buffers | 2-5x |
| IBM SP, hybrid MPI + OpenMP | 1.5-3x |
| Myrinet cluster, pure MPI | 3-7x |
| Myrinet cluster, hybrid MPI + OpenMP | 2-5x |
Cray T3E, MPI_SUM |
3-5x |
Cray T3E, MPI_MAXLOC |
up to 100x |
| Cray T3E, Allreduce production profiling | up to 20% time reduction |
| Cray T3E, Reduce production profiling | up to 54% time reduction |
Cost reference (long-message Allreduce):
- Vector halving / distance doubling, power-of-two:
~ 2 lg(p) alpha + 2 n beta + n gamma - Same, non-power-of-two:
~ (3 + 2 floor(lg p)) alpha + 4 n beta + (3/2) n gamma— bandwidth and compute terms roughly double, exposing a "p must be power of two" performance cliff. - Ring Allreduce:
2 (p-1) alpha + 2 n beta + n gamma— bandwidth-optimal, latency-poor.
Limitations
- The cost model is flat: it ignores hierarchical / multi-level network structure (NUMA, intra-rack vs. inter-rack), which can yield suboptimal choices on real hierarchical networks.
- Cutoffs are static and hand-tuned per platform; an automated, system-parameter-driven cutoff search is left to future work.
- Irregular ("v") collectives (
MPI_Gatherv,MPI_Scatterv,MPI_Allgatherv,MPI_Alltoallv) use regular techniques without targeted optimization. - Splitting user-defined derived datatypes across the rounds of a recursive algorithm is described as "tricky" and is not fully implemented for all cases.
- The bake-off is limited to Myrinet, IBM SP, and Cray T3E; modern InfiniBand / RoCE / GPU-direct interconnects are out of scope.
Open Problems
- Automated, system-parameter-driven cutoff selection. Replace the fixed cutoffs (80 KB, 512 KB, 12 KB, 32 KB, 2 KB) with a runtime model that generalizes across new hardware without rebenchmarking.
- Optimization of irregular collectives — the "v" variants deserve their own regime-aware algorithm choices.
- Hierarchical-network optimizations for systems like TeraGrid, where the flat cost model breaks down once latency varies by orders of magnitude between intra-rack and inter-site links.
- Integration with one-sided MPI —
MPI_Put/MPI_Get/MPI_Accumulatemay enable lower-overhead implementations of some collectives.
Note on NCCL Tuning
This paper is the canonical statement of the message-size-dependent
algorithm-selection problem that NCCL inherits two decades later. The
Section 5.4 "winning algorithm" matrix — a 2D map from
(process count, message size) to a chosen algorithm — is
structurally identical to NCCL's internal tuning table that picks
algorithm (Ring vs Tree vs CollNet) and
protocol (LL / LL128 / Simple). The paper's own future-work
item — "automatic determination of cutoffs based on system parameters
instead of hand-tuned constants" — anticipates exactly the tuner-plugin
problem: hand-tuned cutoffs do not generalize, and a learned or
runtime-derived policy is needed. The 2 KB Reduce cutoff and 32 KB
All-to-All cutoff are also remarkably close to NCCL's modern
small/medium/large message thresholds, suggesting the underlying
latency-vs-bandwidth physics has changed less than the hardware would
suggest.