Optimization of Collective Communication Operations in MPICH — Detailed Summary

Rajeev Thakur, Rolf Rabenseifner, William Gropp | Argonne National Laboratory / HLRS, University of Stuttgart | International Journal of High Performance Computing Applications (IJHPCA), 2005 | Sage Publications

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

The authors describe efforts to improve the performance of collective communication operations in MPICH for clusters connected by switched networks (Myrinet, IBM SP).
Multiple algorithms are used per collective, selected by message size: short-message algorithms minimize latency, long-message algorithms minimize bandwidth.
New algorithms are introduced for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce.
Performance results on a Myrinet-connected Linux cluster and an IBM SP show significant improvements over old MPICH algorithms, and in many cases over IBM's native MPI library.
Detailed optimization is provided for allreduce and reduce, particularly for long messages and non-power-of-two process counts.
The paper concludes that achieving best performance across the full parameter space requires choosing different algorithms per (message size, process count) regime — there is no single best algorithm.

1. Introduction

Collective communication is a heavily used component of MPI and a major source of optimization potential; previous MPICH releases shipped only rudimentary collective implementations.
The initial target is clusters connected by switches such as Myrinet and the IBM SP. The strategy is to identify, improve, or develop new algorithms and implement them efficiently on top of MPI point-to-point.
Multiple algorithms are used per operation depending on message size: one optimized for latency (short messages), another for bandwidth (long messages); experimentally determined cutoffs select between them.
A five-year profiling study on a Cray T3E 900 motivates the work: more than 40% of MPI time is spent inside MPI_Allreduce and MPI_Reduce, and 25% of execution time uses non-power-of-two process counts — so non-power-of-two performance matters in practice.
The paper is organized as follows: related work (Sec. 2), cost model (Sec. 3), new algorithms and their performance (Sec. 4), detailed optimization of reduce / allreduce (Sec. 5), conclusions and future work (Sec. 6).

Earlier collective-algorithm research focused on specific topologies (hypercube, mesh, fat tree) and minimized link/node contention plus hop distance. Cited threads include automatically tuned algorithms (Vadhiyar et al.), wide-area optimizations (Argonne / Holland groups), SMP-cluster optimizations, and message-size-dependent algorithms (Van de Geijn, Rabenseifner, Kale).
Specific named algorithms predating this work: dissemination allgather (Benson et al.), efficient short-message allgather and all-to-all variants (Bruck et al.), and reduce-scatter analysis in the LogGP model (Iannello).

3. Cost Model

A simple latency-bandwidth cost model is adopted, similar to the Hockney / Van de Geijn formulation; the authors argue that more elaborate models such as LogP / LogGP are unnecessary for the design decisions made here.
The base equation is:
```
T = alpha + n * beta
```
where alpha is per-message latency, beta is per-byte transfer time, and n is bytes. Assumptions: distance-independent latency, bidirectional links, single-ported NICs, and a per-byte computation cost gamma for reduction operators.
For the Section 5 reduction analysis the model is refined to distinguish bidirectional (alpha + n*beta) from unidirectional (alpha_uni + n*beta_uni) link usage. Two ratios are defined:
```
f_alpha = alpha_uni / alpha
f_beta  = beta_uni  / beta
```
typically in the range 0.5 to 1.0 depending on hardware.

4. Algorithms

Performance is measured with the SKaMPI benchmark on two main platforms: an Argonne Myrinet 2000 Linux cluster running MPICH-GM, and an IBM SP at SDSC running IBM's native MPI. All algorithms are implemented as portable functions on top of MPI point-to-point primitives so they ride on whatever transport the host MPI provides.

4.1 Allgather

Definition: MPI_Allgather gathers data from every process to every process. The "old" MPICH used a ring method: data is forwarded around a virtual ring in p - 1 steps.
Cost of the ring method:
```
T_ring = (p - 1) * alpha + ((p - 1) / p) * n * beta
```
The bandwidth term is fixed (every process must receive n/p data from p - 1 peers), but latency can be reduced from p - 1 rounds to lg p rounds.

4.1.1 Recursive Doubling

In recursive doubling, processes that are distance 1 apart exchange their data in round 1, distance 2 in round 2, distance 4 in round 3, and so on. Cost:
```
T_rec_dbl = lg(p) * alpha + ((p - 1) / p) * n * beta
```
For non-power-of-two process counts, an extra round of communication is used inside the largest power-of-two peer subset to ensure correctness, bounding the step count to 2 * floor(lg p).

4.1.2 Bruck Algorithm

Bruck's algorithm is a variant of dissemination that avoids noncontiguous data issues. In step k, process i sends to (i - 2^k) mod p and receives from (i + 2^k) mod p.
The full procedure: a local copy, then lg p communication steps, then a final local memory shift to permute data into the canonical output order. Cost:
```
T_bruck = ceil(lg p) * alpha + ((p - 1) / p) * n * beta
```

4.1.3 Performance and Cutoffs

Bruck wins on non-power-of-two process counts because its ceil(lg p) step bound dominates recursive doubling's 2 * floor(lg p). Recursive doubling wins on power-of-two counts because Bruck's final memory permutation is wasted work.
MPICH cutoff policy:
- Bruck for short messages (< 80 KB total) and non-power-of-two process counts.
- Recursive doubling for power-of-two and short / medium messages (< 512 KB).
- Ring for long messages (>= 512 KB) or medium non-power-of-two, because nearest-neighbor traffic achieves higher effective bandwidth on Myrinet and the IBM SP.

4.2 Broadcast

Old MPICH used a binomial tree: at each step, a process holding the data sends to a partner that does not yet have it. Cost:
```
T_tree = ceil(lg p) * (alpha + n * beta)
```
Excellent for short messages (logarithmic latency), poor for long messages because the bandwidth term is paid lg p times.
For long messages, Van de Geijn's algorithm decomposes broadcast into a scatter (binomial) followed by an allgather (ring). Cost:
```
T_vandegeijn = (lg p + p - 1) * alpha + 2 * ((p - 1) / p) * n * beta
```
The bandwidth term drops to ~2n instead of n * lg p, an improvement factor of roughly (lg p) / 2.
MPICH cutoffs: binomial tree for < 12 KB or fewer than 8 processes; Van de Geijn otherwise.

4.3 All-to-All

The old algorithm posted all MPI_Irecvs and then all MPI_Isends. The new algorithm uses a (rank + i) mod p schedule so every round has uniform pairwise traffic, avoiding rank-0 hotspots.
Short messages (<= 256 bytes/msg) use Bruck's store-and-forward algorithm in ceil(lg p) steps. Cost:
```
T_a2a_bruck = lg(p) * alpha + (n / 2) * lg(p) * beta
```
Bruck's all-to-all internals: a local rotation, then lg p rounds in which a process sends data destined for ranks whose k-th bit is 1, followed by an inverse rotation to restore canonical order.
Medium messages (256 bytes to 32 KB / msg) use a simple irecv- isend schedule with the rotated index pattern.
Long messages (>= 32 KB / msg) use pairwise exchange: XOR-based pairing for power-of-two p; rank +/- k pairing otherwise. Cost:
```
T_long = (p - 1) * alpha + n * beta
```

4.4 Reduce-Scatter

Old MPICH: binomial tree reduce to rank 0, then linear scatterv. Cost:

T_old = (lg p + p - 1) * alpha
      + (lg p + (p - 1) / p) * n * beta
      + n * lg(p) * gamma

Commutative, short messages (<= 512 KB): recursive halving. Communication distance halves each step (p/2, p/4, ...). Cost:
```
T_rec_half = lg(p) * alpha
           + ((p - 1) / p) * n * beta
           + ((p - 1) / p) * n * gamma
```
For non-power-of-two p, an initial reduction to the nearest power- of-two subset is performed before recursive halving begins.
Non-commutative, short messages (< 512 bytes): recursive doubling variant that interleaves local reductions, since reordering operands is forbidden.

Long messages (>= 512 KB commutative; >= 512 bytes non-commutative): pairwise exchange. Cost:

T_long = (p - 1) * alpha
       + ((p - 1) / p) * n * beta
       + ((p - 1) / p) * n * gamma

4.5 Reduce and Allreduce

Old MPICH Reduce was a binomial tree; old Allreduce was Reduce followed by Broadcast.

Rabenseifner's algorithm for long messages implements Reduce as a reduce-scatter followed by a gather. Cost:

T_rabenseifner = 2 * lg(p) * alpha
               + 2 * ((p - 1) / p) * n * beta
               + ((p - 1) / p) * n * gamma

Cutoffs for Reduce: Rabenseifner for long messages (> 2 KB) with predefined operators; binomial tree for short messages (<= 2 KB) or user-defined operators.
Cutoffs for Allreduce: recursive doubling for short messages or user-defined ops; Rabenseifner (reduce-scatter + allgather) for long predefined-op messages.

5. Further Optimization of Allreduce and Reduce

Five algorithms are studied side-by-side: binomial tree, recursive doubling, recursive halving / doubling, binary blocks, and ring.
Two terminology conventions are introduced explicitly:
- Recursive vector halving / doubling — the per-process vector size changes every step.
- Recursive distance halving / doubling — the communication partner distance changes every step.

5.1 Vector Halving and Distance Doubling

The algorithm is a reduce-scatter (vector halving + distance doubling) followed by either a gather (for Reduce) or an allgather (for Allreduce).
For non-power-of-two p, define r = p - 2^floor(lg p). The first r "extra" processes are folded in via an initial reduction so that the main algorithm can run on a power-of-two subset.

Power-of-two Allreduce cost:

T ~= 2 * lg(p) * alpha + 2 * n * beta + n * gamma

Non-power-of-two Allreduce cost:
```
T ~= (3 + 2 * floor(lg p)) * alpha + 4 * n * beta + (3/2) * n * gamma
```
The bandwidth and compute terms roughly double for non-power-of-two, exposing the well-known "p must be power of two" performance cliff.

5.2 Binary Blocks Algorithm

Decomposes a non-power-of-two p into a sum of power-of-two blocks (the binary representation of p). Each block runs internal recursive halving / doubling, then blocks are merged.
Performance is governed by the load imbalance metric delta_expo_max: the maximum gap between consecutive set bits in the binary representation of p. Smaller gaps yield more balanced blocks and better speedup.

5.3 Ring Algorithm

Pairwise exchange for the reduce-scatter phase, then a ring allgather. Excellent bandwidth for very large vectors but latency scales linearly with p. Cost:
```
T_all_ring = 2 * (p - 1) * alpha + 2 * n * beta + n * gamma
```

5.4 Choosing the Fastest Algorithm

On the Cray T3E:
- Recursive doubling is best for vectors <= 32 bytes.
- Binomial tree is best for <= 1 KB.
- Ring is competitive at some sizes when p < 32.
The binary blocks algorithm beats vector halving / distance doubling when both delta_expo_max < lg(vector_length) / 2.0 - 2.5 and the vector is >= 16 KB.
Figure 14 in the paper presents a (P, message size) -> winning algorithm matrix that codifies the regime map.

5.5 Comparison with Vendor's MPI

Vendor baselines: IBM SP native MPI; Heidelberg HELICS Myrinet cluster MPI; Cray T3E native library.
IBM SP: new MPICH algorithms are 1.5x faster for buffers between 8 KB and 64 KB (pure MPI), and 2 to 5x faster for larger buffers. Hybrid (MPI + OpenMP) sees 1.5 to 3x speedups.
Myrinet cluster (HELICS, Heidelberg): 3 to 7x faster for pure MPI; 2 to 5x faster for hybrid MPI + OpenMP.
Cray T3E: 3 to 5x faster for MPI_SUM; up to 100x faster for MPI_MAXLOC (Cray's native library handles the derived datatype poorly). Production profiling shows up to 20% reduction in Allreduce time and up to 54% reduction in Reduce time.

6. Conclusions and Future Work

Optimized collective communications provide substantial benefits, but no single algorithm is best across all (message size, process count, operator) combinations; multi-algorithm selection at runtime is essential.
Future work directions:
- Automatic determination of cutoffs based on system parameters instead of hand-tuned constants.
- Optimization of the irregular ("v") collectives (Gatherv, Scatterv, Allgatherv, Alltoallv) that currently inherit less-tuned implementations.
- Hierarchical-network optimizations for grid environments such as TeraGrid.
- Integration with one-sided communication primitives.

Quantitative Results Summary

Collective	Old algorithm	New algorithm(s)	Cutoff
Allgather	Ring	Bruck / Rec-Dbl / Ring	80 KB / 512 KB
Broadcast	Binomial Tree	Tree / Van de Geijn	12 KB
All-to-All	Naive Irecv-Isend	Bruck / Irecv-Isend / Pairwise	256 B / 32 KB
Reduce-Scatter	Tree-then-Scatterv	Rec-Halving / Pairwise	512 KB
Reduce	Binomial Tree	Tree / Rabenseifner	2 KB
Allreduce	Reduce + Bcast	Rec-Dbl / Rabenseifner	message-size dependent

Platform	Headline speedup vs. baseline
IBM SP, pure MPI, 8-64 KB	1.5x
IBM SP, pure MPI, larger buffers	2-5x
IBM SP, hybrid MPI + OpenMP	1.5-3x
Myrinet cluster, pure MPI	3-7x
Myrinet cluster, hybrid MPI + OpenMP	2-5x
Cray T3E, `MPI_SUM`	3-5x
Cray T3E, `MPI_MAXLOC`	up to 100x
Cray T3E, Allreduce production	up to 20% time reduction
Cray T3E, Reduce production	up to 54% time reduction

Cost-Equation Reference Card

T_ring_allgather       = (p-1)*alpha + ((p-1)/p) * n * beta
T_rec_dbl_allgather    = lg(p)*alpha + ((p-1)/p) * n * beta
T_bruck_allgather      = ceil(lg p)*alpha + ((p-1)/p) * n * beta
T_tree_bcast           = ceil(lg p) * (alpha + n*beta)
T_vandegeijn_bcast     = (lg p + p - 1)*alpha + 2*((p-1)/p)*n*beta
T_a2a_bruck            = lg(p)*alpha + (n/2)*lg(p)*beta
T_a2a_pairwise         = (p-1)*alpha + n*beta
T_rec_half_redscat     = lg(p)*alpha + ((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_pairwise_redscat     = (p-1)*alpha + ((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_rabenseifner_reduce  = 2*lg(p)*alpha + 2*((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_vh_dd_allreduce_p2   ~= 2*lg(p)*alpha + 2*n*beta + n*gamma
T_vh_dd_allreduce_npo2 ~= (3+2*floor(lg p))*alpha + 4*n*beta + (3/2)*n*gamma
T_ring_allreduce       = 2*(p-1)*alpha + 2*n*beta + n*gamma

Experimental Setup

Component	Value
Myrinet cluster (Argonne)	Linux cluster, Myrinet 2000, MPICH-GM
Heidelberg HELICS cluster	Dual-CPU PCs, Myrinet
IBM SP (SDSC)	Switch-connected, IBM's native MPI
Cray T3E 900 (Stuttgart, HLRS)	Production system used for profiling and benchmarks
MPI implementations	MPICH 1.2.6, MPICH2 0.971, IBM MPI, Cray MPT 1.4.0.4
Benchmarks	SKaMPI, b_eff
Operators	`MPI_SUM`, `MPI_MAX`, `MPI_MAXLOC` (derived datatype stress)
Workloads	Pure MPI, hybrid MPI + OpenMP

Named Methods and Concepts

Algorithms: recursive doubling, recursive halving, Bruck's algorithm, Van de Geijn broadcast (scatter + allgather), Rabenseifner reduce (reduce-scatter + gather), binary blocks decomposition, dissemination barrier variant.
Concepts: vector halving vs. distance halving; pairwise exchange; XOR-based pairing for power-of-two; rank +/- k pairing for non- power-of-two; nearest-neighbor traffic patterns; LogP / LogGP cost models (referenced but not used); SMP-cluster awareness; TeraGrid / hierarchical-network awareness.

Limitations Stated by the Authors

The cost model is flat — it ignores hierarchical / multi-level network structure (e.g., NUMA + intra-rack + inter-rack). May yield suboptimal choices on hierarchical networks.
Cutoffs are static, hand-tuned per platform; an automated system-parameter-driven cutoff search is left to future work.
Irregular ("v") collectives (MPI_Gatherv, MPI_Alltoallv, etc.) use regular techniques without targeted optimization.
Handling of derived datatypes inside recursive algorithms — splitting user-defined types across rounds — is described as "tricky" and is not fully implemented for all cases.
The bake-off is limited to a small set of HPC platforms (Myrinet, IBM SP, Cray T3E); modern InfiniBand / RoCE / GPU-direct interconnects are out of scope.

Open Problems Explicitly Identified

Automated, system-parameter-driven cutoff selection. The current policy uses fixed cutoffs (80 KB, 512 KB, 2 KB, 12 KB, 32 KB); replacing these with a runtime model would generalize across new hardware without rebenchmarking.
Optimization of irregular collectives — MPI_Gatherv, MPI_Scatterv, MPI_Allgatherv, MPI_Alltoallv deserve their own regime-aware algorithm choices.
Hierarchical-network optimizations for systems like TeraGrid: the flat cost model breaks down when latency varies by orders of magnitude between intra-rack and inter-site links.
Integration with one-sided MPI — MPI_Put / MPI_Get / MPI_Accumulate may enable lower-overhead implementations of some collectives.

Cross-Cutting Empirical Take-Aways

Take-away	Evidence
No single algorithm is best across all `(p, n, op)` combinations	Sec. 5.4 winning-algorithm matrix
Latency-optimal and bandwidth-optimal algorithms are different	Tree vs Van de Geijn / Rabenseifner
Non-power-of-two `p` has its own performance cliff that demands separate algorithms	Sec. 5.1 cost ratios; Bruck vs RecDbl tradeoff
Reduce-scatter + allgather is the right long-message pattern for `Allreduce` and `Bcast`	Rabenseifner, Van de Geijn
Vendor-native MPIs are not always optimized — portable algorithms layered on top of point-to-point can beat them	IBM SP and Cray T3E speedups
Static cutoffs are good enough for one platform but do not generalize	Sec. 6 future work
Allreduce and Reduce alone account for ~40% of MPI time in production	Sec. 1 motivation

Note on NCCL Tuning

This paper is the canonical statement of the message-size-dependent algorithm-selection problem that NCCL (and DynamICCL) inherits two decades later. The Section 5.4 "winning algorithm" matrix — a 2D map from (process count, message size) to a chosen algorithm — is structurally identical to NCCL's internal tuning table that picks algorithm (Ring vs Tree vs CollNet) and protocol (LL / LL128 / Simple). The paper's own Section 6 future-work item — "automatic determination of cutoffs based on system parameters instead of hand-tuned constants" — anticipates exactly the tuner-plugin problem: hand-tuned cutoffs do not generalize, and a learned or runtime-derived policy is needed. The 2 KB Reduce cutoff and 32 KB All-to-All cutoff are also remarkably close to NCCL's modern small/medium/large message thresholds, suggesting the underlying latency-vs-bandwidth physics has changed less than the hardware would suggest.