Optimization of Collective Communication Operations in MPICH — Detailed Summary
Rajeev Thakur, Rolf Rabenseifner, William Gropp | Argonne National Laboratory / HLRS, University of Stuttgart | International Journal of High Performance Computing Applications (IJHPCA), 2005 | Sage Publications
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- The authors describe efforts to improve the performance of collective communication operations in MPICH for clusters connected by switched networks (Myrinet, IBM SP).
- Multiple algorithms are used per collective, selected by message size: short-message algorithms minimize latency, long-message algorithms minimize bandwidth.
- New algorithms are introduced for
allgather,broadcast,all-to-all,reduce-scatter,reduce, andallreduce. - Performance results on a Myrinet-connected Linux cluster and an IBM SP show significant improvements over old MPICH algorithms, and in many cases over IBM's native MPI library.
- Detailed optimization is provided for
allreduceandreduce, particularly for long messages and non-power-of-two process counts. - The paper concludes that achieving best performance across the full parameter space requires choosing different algorithms per (message size, process count) regime — there is no single best algorithm.
1. Introduction
- Collective communication is a heavily used component of MPI and a major source of optimization potential; previous MPICH releases shipped only rudimentary collective implementations.
- The initial target is clusters connected by switches such as Myrinet and the IBM SP. The strategy is to identify, improve, or develop new algorithms and implement them efficiently on top of MPI point-to-point.
- Multiple algorithms are used per operation depending on message size: one optimized for latency (short messages), another for bandwidth (long messages); experimentally determined cutoffs select between them.
- A five-year profiling study on a Cray T3E 900 motivates the work:
more than 40% of MPI time is spent inside
MPI_AllreduceandMPI_Reduce, and 25% of execution time uses non-power-of-two process counts — so non-power-of-two performance matters in practice. - The paper is organized as follows: related work (Sec. 2), cost model (Sec. 3), new algorithms and their performance (Sec. 4), detailed optimization of reduce / allreduce (Sec. 5), conclusions and future work (Sec. 6).
2. Related Work
- Earlier collective-algorithm research focused on specific topologies (hypercube, mesh, fat tree) and minimized link/node contention plus hop distance. Cited threads include automatically tuned algorithms (Vadhiyar et al.), wide-area optimizations (Argonne / Holland groups), SMP-cluster optimizations, and message-size-dependent algorithms (Van de Geijn, Rabenseifner, Kale).
- Specific named algorithms predating this work: dissemination allgather (Benson et al.), efficient short-message allgather and all-to-all variants (Bruck et al.), and reduce-scatter analysis in the LogGP model (Iannello).
3. Cost Model
A simple latency-bandwidth cost model is adopted, similar to the Hockney / Van de Geijn formulation; the authors argue that more elaborate models such as LogP / LogGP are unnecessary for the design decisions made here.
The base equation is:
T = alpha + n * betawhere
alphais per-message latency,betais per-byte transfer time, andnis bytes. Assumptions: distance-independent latency, bidirectional links, single-ported NICs, and a per-byte computation costgammafor reduction operators.For the Section 5 reduction analysis the model is refined to distinguish bidirectional
(alpha + n*beta)from unidirectional(alpha_uni + n*beta_uni)link usage. Two ratios are defined:f_alpha = alpha_uni / alpha f_beta = beta_uni / betatypically in the range
0.5to1.0depending on hardware.
4. Algorithms
- Performance is measured with the SKaMPI benchmark on two main platforms: an Argonne Myrinet 2000 Linux cluster running MPICH-GM, and an IBM SP at SDSC running IBM's native MPI. All algorithms are implemented as portable functions on top of MPI point-to-point primitives so they ride on whatever transport the host MPI provides.
4.1 Allgather
Definition:
MPI_Allgathergathers data from every process to every process. The "old" MPICH used a ring method: data is forwarded around a virtual ring inp - 1steps.Cost of the ring method:
T_ring = (p - 1) * alpha + ((p - 1) / p) * n * betaThe bandwidth term is fixed (every process must receive
n/pdata fromp - 1peers), but latency can be reduced fromp - 1rounds tolg prounds.
4.1.1 Recursive Doubling
In recursive doubling, processes that are distance 1 apart exchange their data in round 1, distance 2 in round 2, distance 4 in round 3, and so on. Cost:
T_rec_dbl = lg(p) * alpha + ((p - 1) / p) * n * betaFor non-power-of-two process counts, an extra round of communication is used inside the largest power-of-two peer subset to ensure correctness, bounding the step count to
2 * floor(lg p).
4.1.2 Bruck Algorithm
Bruck's algorithm is a variant of dissemination that avoids noncontiguous data issues. In step
k, processisends to(i - 2^k) mod pand receives from(i + 2^k) mod p.The full procedure: a local copy, then
lg pcommunication steps, then a final local memory shift to permute data into the canonical output order. Cost:T_bruck = ceil(lg p) * alpha + ((p - 1) / p) * n * beta
4.1.3 Performance and Cutoffs
- Bruck wins on non-power-of-two process counts because its
ceil(lg p)step bound dominates recursive doubling's2 * floor(lg p). Recursive doubling wins on power-of-two counts because Bruck's final memory permutation is wasted work. - MPICH cutoff policy:
- Bruck for short messages (
< 80 KBtotal) and non-power-of-two process counts. - Recursive doubling for power-of-two and short /
medium messages (
< 512 KB). - Ring for long messages (
>= 512 KB) or medium non-power-of-two, because nearest-neighbor traffic achieves higher effective bandwidth on Myrinet and the IBM SP.
- Bruck for short messages (
4.2 Broadcast
Old MPICH used a binomial tree: at each step, a process holding the data sends to a partner that does not yet have it. Cost:
T_tree = ceil(lg p) * (alpha + n * beta)Excellent for short messages (logarithmic latency), poor for long messages because the bandwidth term is paid
lg ptimes.For long messages, Van de Geijn's algorithm decomposes broadcast into a scatter (binomial) followed by an allgather (ring). Cost:
T_vandegeijn = (lg p + p - 1) * alpha + 2 * ((p - 1) / p) * n * betaThe bandwidth term drops to
~2ninstead ofn * lg p, an improvement factor of roughly(lg p) / 2.MPICH cutoffs: binomial tree for
< 12 KBor fewer than 8 processes; Van de Geijn otherwise.
4.3 All-to-All
The old algorithm posted all
MPI_Irecvs and then allMPI_Isends. The new algorithm uses a(rank + i) mod pschedule so every round has uniform pairwise traffic, avoiding rank-0 hotspots.Short messages (
<= 256 bytes/msg) use Bruck's store-and-forward algorithm inceil(lg p)steps. Cost:T_a2a_bruck = lg(p) * alpha + (n / 2) * lg(p) * betaBruck's all-to-all internals: a local rotation, then
lg prounds in which a process sends data destined for ranks whosek-th bit is1, followed by an inverse rotation to restore canonical order.Medium messages (256 bytes to 32 KB / msg) use a simple
irecv-isendschedule with the rotated index pattern.Long messages (
>= 32 KB / msg) use pairwise exchange: XOR-based pairing for power-of-twop;rank +/- kpairing otherwise. Cost:T_long = (p - 1) * alpha + n * beta
4.4 Reduce-Scatter
Old MPICH: binomial tree reduce to rank 0, then linear scatterv. Cost:
T_old = (lg p + p - 1) * alpha + (lg p + (p - 1) / p) * n * beta + n * lg(p) * gammaCommutative, short messages (
<= 512 KB): recursive halving. Communication distance halves each step (p/2, p/4, ...). Cost:T_rec_half = lg(p) * alpha + ((p - 1) / p) * n * beta + ((p - 1) / p) * n * gammaFor non-power-of-two
p, an initial reduction to the nearest power- of-two subset is performed before recursive halving begins.Non-commutative, short messages (
< 512 bytes): recursive doubling variant that interleaves local reductions, since reordering operands is forbidden.Long messages (
>= 512 KBcommutative;>= 512 bytesnon-commutative): pairwise exchange. Cost:T_long = (p - 1) * alpha + ((p - 1) / p) * n * beta + ((p - 1) / p) * n * gamma
4.5 Reduce and Allreduce
Old MPICH
Reducewas a binomial tree; oldAllreducewas Reduce followed by Broadcast.Rabenseifner's algorithm for long messages implements
Reduceas a reduce-scatter followed by a gather. Cost:T_rabenseifner = 2 * lg(p) * alpha + 2 * ((p - 1) / p) * n * beta + ((p - 1) / p) * n * gammaCutoffs for
Reduce: Rabenseifner for long messages (> 2 KB) with predefined operators; binomial tree for short messages (<= 2 KB) or user-defined operators.Cutoffs for
Allreduce: recursive doubling for short messages or user-defined ops; Rabenseifner (reduce-scatter + allgather) for long predefined-op messages.
5. Further Optimization of Allreduce and Reduce
- Five algorithms are studied side-by-side: binomial tree, recursive doubling, recursive halving / doubling, binary blocks, and ring.
- Two terminology conventions are introduced explicitly:
- Recursive vector halving / doubling — the per-process vector size changes every step.
- Recursive distance halving / doubling — the communication partner distance changes every step.
5.1 Vector Halving and Distance Doubling
The algorithm is a reduce-scatter (vector halving + distance doubling) followed by either a gather (for
Reduce) or an allgather (forAllreduce).For non-power-of-two
p, definer = p - 2^floor(lg p). The firstr"extra" processes are folded in via an initial reduction so that the main algorithm can run on a power-of-two subset.Power-of-two
Allreducecost:T ~= 2 * lg(p) * alpha + 2 * n * beta + n * gammaNon-power-of-two
Allreducecost:T ~= (3 + 2 * floor(lg p)) * alpha + 4 * n * beta + (3/2) * n * gammaThe bandwidth and compute terms roughly double for non-power-of-two, exposing the well-known "p must be power of two" performance cliff.
5.2 Binary Blocks Algorithm
- Decomposes a non-power-of-two
pinto a sum of power-of-two blocks (the binary representation ofp). Each block runs internal recursive halving / doubling, then blocks are merged. - Performance is governed by the load imbalance metric
delta_expo_max: the maximum gap between consecutive set bits in the binary representation ofp. Smaller gaps yield more balanced blocks and better speedup.
5.3 Ring Algorithm
Pairwise exchange for the reduce-scatter phase, then a ring allgather. Excellent bandwidth for very large vectors but latency scales linearly with
p. Cost:T_all_ring = 2 * (p - 1) * alpha + 2 * n * beta + n * gamma
5.4 Choosing the Fastest Algorithm
- On the Cray T3E:
- Recursive doubling is best for vectors
<= 32 bytes. - Binomial tree is best for
<= 1 KB. - Ring is competitive at some sizes when
p < 32.
- Recursive doubling is best for vectors
- The binary blocks algorithm beats vector halving / distance doubling
when both
delta_expo_max < lg(vector_length) / 2.0 - 2.5and the vector is>= 16 KB. - Figure 14 in the paper presents a (P, message size) -> winning algorithm matrix that codifies the regime map.
5.5 Comparison with Vendor's MPI
- Vendor baselines: IBM SP native MPI; Heidelberg HELICS Myrinet cluster MPI; Cray T3E native library.
- IBM SP: new MPICH algorithms are 1.5x faster for buffers between 8 KB and 64 KB (pure MPI), and 2 to 5x faster for larger buffers. Hybrid (MPI + OpenMP) sees 1.5 to 3x speedups.
- Myrinet cluster (HELICS, Heidelberg): 3 to 7x faster for pure MPI; 2 to 5x faster for hybrid MPI + OpenMP.
- Cray T3E: 3 to 5x faster for
MPI_SUM; up to 100x faster forMPI_MAXLOC(Cray's native library handles the derived datatype poorly). Production profiling shows up to 20% reduction in Allreduce time and up to 54% reduction in Reduce time.
6. Conclusions and Future Work
- Optimized collective communications provide substantial benefits,
but no single algorithm is best across all
(message size, process count, operator)combinations; multi-algorithm selection at runtime is essential. - Future work directions:
- Automatic determination of cutoffs based on system parameters instead of hand-tuned constants.
- Optimization of the irregular ("v") collectives
(
Gatherv,Scatterv,Allgatherv,Alltoallv) that currently inherit less-tuned implementations. - Hierarchical-network optimizations for grid environments such as TeraGrid.
- Integration with one-sided communication primitives.
Quantitative Results Summary
| Collective | Old algorithm | New algorithm(s) | Cutoff |
|---|---|---|---|
| Allgather | Ring | Bruck / Rec-Dbl / Ring | 80 KB / 512 KB |
| Broadcast | Binomial Tree | Tree / Van de Geijn | 12 KB |
| All-to-All | Naive Irecv-Isend | Bruck / Irecv-Isend / Pairwise | 256 B / 32 KB |
| Reduce-Scatter | Tree-then-Scatterv | Rec-Halving / Pairwise | 512 KB |
| Reduce | Binomial Tree | Tree / Rabenseifner | 2 KB |
| Allreduce | Reduce + Bcast | Rec-Dbl / Rabenseifner | message-size dependent |
| Platform | Headline speedup vs. baseline |
|---|---|
| IBM SP, pure MPI, 8-64 KB | 1.5x |
| IBM SP, pure MPI, larger buffers | 2-5x |
| IBM SP, hybrid MPI + OpenMP | 1.5-3x |
| Myrinet cluster, pure MPI | 3-7x |
| Myrinet cluster, hybrid MPI + OpenMP | 2-5x |
Cray T3E, MPI_SUM |
3-5x |
Cray T3E, MPI_MAXLOC |
up to 100x |
| Cray T3E, Allreduce production | up to 20% time reduction |
| Cray T3E, Reduce production | up to 54% time reduction |
Cost-Equation Reference Card
T_ring_allgather = (p-1)*alpha + ((p-1)/p) * n * beta
T_rec_dbl_allgather = lg(p)*alpha + ((p-1)/p) * n * beta
T_bruck_allgather = ceil(lg p)*alpha + ((p-1)/p) * n * beta
T_tree_bcast = ceil(lg p) * (alpha + n*beta)
T_vandegeijn_bcast = (lg p + p - 1)*alpha + 2*((p-1)/p)*n*beta
T_a2a_bruck = lg(p)*alpha + (n/2)*lg(p)*beta
T_a2a_pairwise = (p-1)*alpha + n*beta
T_rec_half_redscat = lg(p)*alpha + ((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_pairwise_redscat = (p-1)*alpha + ((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_rabenseifner_reduce = 2*lg(p)*alpha + 2*((p-1)/p)*n*beta + ((p-1)/p)*n*gamma
T_vh_dd_allreduce_p2 ~= 2*lg(p)*alpha + 2*n*beta + n*gamma
T_vh_dd_allreduce_npo2 ~= (3+2*floor(lg p))*alpha + 4*n*beta + (3/2)*n*gamma
T_ring_allreduce = 2*(p-1)*alpha + 2*n*beta + n*gamma
Experimental Setup
| Component | Value |
|---|---|
| Myrinet cluster (Argonne) | Linux cluster, Myrinet 2000, MPICH-GM |
| Heidelberg HELICS cluster | Dual-CPU PCs, Myrinet |
| IBM SP (SDSC) | Switch-connected, IBM's native MPI |
| Cray T3E 900 (Stuttgart, HLRS) | Production system used for profiling and benchmarks |
| MPI implementations | MPICH 1.2.6, MPICH2 0.971, IBM MPI, Cray MPT 1.4.0.4 |
| Benchmarks | SKaMPI, b_eff |
| Operators | MPI_SUM, MPI_MAX, MPI_MAXLOC
(derived datatype stress) |
| Workloads | Pure MPI, hybrid MPI + OpenMP |
Named Methods and Concepts
- Algorithms: recursive doubling, recursive halving, Bruck's algorithm, Van de Geijn broadcast (scatter + allgather), Rabenseifner reduce (reduce-scatter + gather), binary blocks decomposition, dissemination barrier variant.
- Concepts: vector halving vs. distance halving;
pairwise exchange; XOR-based pairing for power-of-two;
rank +/- kpairing for non- power-of-two; nearest-neighbor traffic patterns; LogP / LogGP cost models (referenced but not used); SMP-cluster awareness; TeraGrid / hierarchical-network awareness.
Limitations Stated by the Authors
- The cost model is flat — it ignores hierarchical / multi-level network structure (e.g., NUMA + intra-rack + inter-rack). May yield suboptimal choices on hierarchical networks.
- Cutoffs are static, hand-tuned per platform; an automated system-parameter-driven cutoff search is left to future work.
- Irregular ("v") collectives (
MPI_Gatherv,MPI_Alltoallv, etc.) use regular techniques without targeted optimization. - Handling of derived datatypes inside recursive algorithms — splitting user-defined types across rounds — is described as "tricky" and is not fully implemented for all cases.
- The bake-off is limited to a small set of HPC platforms (Myrinet, IBM SP, Cray T3E); modern InfiniBand / RoCE / GPU-direct interconnects are out of scope.
Open Problems Explicitly Identified
- Automated, system-parameter-driven cutoff selection. The current policy uses fixed cutoffs (80 KB, 512 KB, 2 KB, 12 KB, 32 KB); replacing these with a runtime model would generalize across new hardware without rebenchmarking.
- Optimization of irregular collectives —
MPI_Gatherv,MPI_Scatterv,MPI_Allgatherv,MPI_Alltoallvdeserve their own regime-aware algorithm choices. - Hierarchical-network optimizations for systems like TeraGrid: the flat cost model breaks down when latency varies by orders of magnitude between intra-rack and inter-site links.
- Integration with one-sided MPI —
MPI_Put/MPI_Get/MPI_Accumulatemay enable lower-overhead implementations of some collectives.
Cross-Cutting Empirical Take-Aways
| Take-away | Evidence |
|---|---|
No single algorithm is best across all (p, n, op)
combinations |
Sec. 5.4 winning-algorithm matrix |
| Latency-optimal and bandwidth-optimal algorithms are different | Tree vs Van de Geijn / Rabenseifner |
Non-power-of-two p has its own performance cliff that
demands separate algorithms |
Sec. 5.1 cost ratios; Bruck vs RecDbl tradeoff |
Reduce-scatter + allgather is the right long-message pattern for
Allreduce and Bcast |
Rabenseifner, Van de Geijn |
| Vendor-native MPIs are not always optimized — portable algorithms layered on top of point-to-point can beat them | IBM SP and Cray T3E speedups |
| Static cutoffs are good enough for one platform but do not generalize | Sec. 6 future work |
| Allreduce and Reduce alone account for ~40% of MPI time in production | Sec. 1 motivation |
Note on NCCL Tuning
This paper is the canonical statement of the message-size-dependent
algorithm-selection problem that NCCL (and DynamICCL) inherits two
decades later. The Section 5.4 "winning algorithm" matrix — a 2D map
from (process count, message size) to a chosen algorithm —
is structurally identical to NCCL's internal tuning table that picks
algorithm (Ring vs Tree vs CollNet) and
protocol (LL / LL128 / Simple). The paper's own Section 6
future-work item — "automatic determination of cutoffs based on system
parameters instead of hand-tuned constants" — anticipates exactly the
tuner-plugin problem: hand-tuned cutoffs do not generalize, and a
learned or runtime-derived policy is needed. The 2 KB Reduce cutoff and
32 KB All-to-All cutoff are also remarkably close to NCCL's modern
small/medium/large message thresholds, suggesting the underlying
latency-vs-bandwidth physics has changed less than the hardware would
suggest.