Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design — Detailed Summary
Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. (DK) Panda | The Ohio State University | SC17 (International Conference for High Performance Computing, Networking, Storage and Analysis), Denver, CO, Nov 12-17, 2017 | DOI: 10.1145/3126908.3126954
Note on filename: the PDF is filed under
0050_MVAPICH2-GDR.pdfbut the actual paper is the SC17 DPML Allreduce work by Bayatpour et al. The implementation is done inside the MVAPICH2 codebase (the OSU MPI stack from which MVAPICH2-GDR derives), which likely explains the filename.
Per-section summary organized by paper headings. Each section provides paragraph-level bullets and preserves all quantitative results, equations, table notations, and named methods.
Abstract
- Existing MPI_Allreduce designs do not exploit the vast parallelism of modern multi-/many-core processors (Intel Xeon, Xeon Phi/KNL) or the increases in message-rate and end-to-end throughput of modern interconnects (InfiniBand, Omni-Path).
- The paper proposes a high-performance Data Partitioning-based Multi-Leader (DPML) design for MPI_Allreduce that exploits intra-node parallelism and the high inter-node throughput available from IB and Omni-Path simultaneously.
- DPML is theoretically modeled to compare communication and computation cost against pure recursive doubling and a single-leader hierarchical baseline.
- Microbenchmark results: up to 3.5x performance improvement on MPI_Allreduce on multiple HPC systems at scale.
- Application-level results: up to 35% improvement for HPCG and 60% for miniAMR.
Keywords: MPI_Allreduce, Data Partitioning, Multi-Leader, SHArP, Collectives, MPI.
1. Introduction
Background and motivation:
- Modern supercomputers deliver multi-petaflop performance and scale to tens of thousands of processors; growth has been driven by multi-/many-core architectures and commodity high-performance RDMA-enabled interconnects such as InfiniBand and Omni-Path.
- MPI is the dominant programming model for HPC; the MPI Standard offers point-to-point, collective, and synchronization primitives, but MPI_Allreduce in particular accounts for the largest share of collective time in many production workloads. A 5-year study by Rabenseifner reported that 37% of the time in MPI routines was in MPI_Allreduce.
- Allreduce is no longer a small-message-only operation. While traditional scientific codes call it on small vectors, newer fields including deep learning use medium-to-large message reductions extensively.
Limitations of existing designs:
- State-of-the-art MPI implementations follow a hierarchical strategy with one or two leader processes per node aggregating data from local processes via shared memory and then communicating across nodes; this approach underuses the increased core count of modern CPUs (Intel KNL has 64-72 cores).
- Modern interconnects like Omni-Path expose very high message rates for small/medium messages and can sustain multiple concurrent transfers from one node, but as message size grows the per-pair throughput stops scaling — naively adding more concurrent senders does not help.
- Therefore: a designer needs to (a) parallelize compute across many cores for medium/large reductions and (b) carefully match the number of concurrent inter-node transfers to the message-size regime of the interconnect.
Hardware offload:
- Vendors (Mellanox) have introduced hardware-offload primitives — Core-direct and SHArP — that move communication and reduction compute to the network fabric. SHArP performs reduction inside switches as data ascends a reduction tree. These features are most effective on small messages, but must be combined carefully with intra-node schemes.
Driving research question (verbatim from paper):
Can we design novel communication algorithms for Allreduce primitive that takes advantage of the vast amount of parallelism available in modern multi-/many-core architectures as well as high throughput and high-end features exposed by modern interconnects like InfiniBand and Omni-Path to deliver best performance and scalability?
Concrete design challenges raised:
- Effectively use multiple leaders per node to load-balance compute and accelerate communication.
- Use high-throughput interconnect characteristics together with multiple leaders for large messages.
- Exploit emerging hardware (KNL, Omni-Path) for new collective protocols.
- Accelerate Allreduce with SHArP.
- Design hybrid schemes that pick the optimal algorithm for each message size.
- Demonstrate microbenchmark and application-level benefit.
1.1 Contributions
- Design and implement multi-/many-core-aware DPML designs for MPI_Allreduce.
- Design multi-leader collectives that take advantage of high throughput and network offload features (SHArP) of modern interconnects.
- Theoretical modeling and analysis of the DPML framework.
- Evaluation on three HPC architectures: Xeon + InfiniBand, Xeon + Omni-Path, Xeon Phi (KNL) + Omni-Path.
- Demonstrate >=3.5x microbenchmark Allreduce latency improvement; up to 80% small-message improvement using SHArP socket-leader design; application-level gains up to 60% for miniAMR and 35% for HPCG.
- Claimed to be the first paper to analyze a data-partitioning multi-leader design for MPI collectives in conjunction with modern high-end interconnect features.
2. Background
2.1 Reduction Collectives in MPI
- Reduction collectives MPI_Reduce and MPI_Allreduce combine data across a group using ops like sum, max, or a user-defined op; widely used in HPC codes (miniAMR, HPCG) and increasingly in DL.
- MPI_Reduce returns the result to one root process; MPI_Allreduce returns the result to all processes — which makes Allreduce more communication-intensive because it requires more communication steps.
- State-of-the-art implementations (MPICH2, Open-MPI, MVAPICH2) all use shared-memory aware hierarchical algorithms. MVAPICH2's flow: (i) processes within a node form a "shared memory communicator"; (ii) one process per node is selected as the leader and joins a "leader-communicator"; (iii) intra-node SHM-based reduction accumulates local data at the leader; (iv) leaders perform inter-node point-to-point reduction; (v) leaders SHM-broadcast the final result back to local ranks.
2.2 Overview of SHArP
- SHArP (Scalable Hierarchical Aggregation Protocol) offloads reduction to the IB switch fabric. Reduction occurs as data climbs the reduction tree; data volume contracts as it ascends rather than the CPU repeatedly seeing intermediate results.
- A SHArP "reduction tree" is a set of network elements organized as hierarchical data objects describing data topologies and collective groups; leaves are data sources, interior vertices are aggregation nodes.
- The technology lets data be manipulated while in flight in the network, rather than waiting for it to reach a CPU.
3. Communication Characteristics of Modern Architectures
The authors first measure raw point-to-point characteristics of
intra-node shared-memory and inter-node IB/OPA fabrics using
osu_mbw_mr from the OSU Microbenchmark Suite. The benchmark
records throughput of N concurrent pairs relative to one pair.
Intra-node (shared memory on KNL — Figure 1(a)):
- Relative throughput stays close to the number of communicating pairs even at large message sizes — shared memory supports very high concurrency.
- Implication: shallow hierarchies with many children per parent are preferable to deep hierarchies for intra-node aggregation.
Inter-node InfiniBand (Figure 1(b), Mellanox EDR 100Gb/s):
- Multiple processes per node performing concurrent transfers improve total per-node throughput across all message sizes.
- Implication: increasing parallelism (more leaders) is beneficial across the entire message-size spectrum on IB.
Inter-node Omni-Path (Figure 1(c)/(d), 100Gb/s):
- Concurrent communication helps on small messages but the relative throughput collapses to ~1 for large messages — naive parallelism does not help, the link is bandwidth-limited.
- Implication: large messages on OPA must be partitioned into smaller chunks that individually fall within the message-rate-limited zone, and those smaller chunks dispatched by multiple processes.
Reasoning about existing algorithms:
- Recursive Doubling is flat; doubles the distance between communicating pairs each step; optimal in compute-cost terms because the computation decreases each round, but relies on point-to-point ops with extra copies.
- Hierarchical (single-leader): workers SHM-copy into a
leader process; the leader performs
ppn-1reductions (computationally expensive on KNL), and only one process per node communicates — leaving inter-node concurrent throughput on the table.
4. Proposed Designs
4.1 Data Partitioning-based Multi-Leader Allreduce (DPML)
DPML designates l leader processes per node which (a)
share computation and (b) drive l concurrent inter-node
transfers. The collective runs in four phases (Figure 2):
Phase 1 — Local Copy to Shared Memory:
- Every process splits its input vector into
lpartitions (one per leader) and copies each partition into a shared-memory region; the j-th partition from process with local rankilands atstart_addr(Leader_j) + i * sizeof(partition). - Equivalent to
lindependent gathers running in parallel.
Phase 2 — Intra-node Reduction by Leaders:
- Each
Leader_jreduces the j-th partitions from allppnlocal processes in parallel; total compute load is now spread overlcores instead of 1. - Leader_j is responsible for
ppn-1reductions on an/l-byte buffer.
Phase 3 — Inter-node Allreduce by Leaders:
- Each leader independently runs an inter-node Allreduce with the same-index leader on every other node (e.g., Leader_0 on each node forms one communicator, Leader_1 forms another). The Allreduce algorithm used for this inner step is dynamically chosen by the MPI library based on message size, system size, and architecture (e.g., recursive-doubling or reduce-scatter + allgather).
Phase 4 — Local Copy to Individual Processes:
- Each leader holds the fully-reduced n/l-byte partition; this is
SHM-copied back to the receive buffer of every local rank — equivalent
to
lparallel broadcasts.
4.2 Taking Advantage of High Message Rate: Pipelined Data Transfer
The Omni-Path message-size-vs-throughput curve has three zones:
- Zone A (smallest messages): throughput is roughly independent of message size and scales near-linearly with concurrency. Limited by message rate.
- Zone C (largest messages): throughput is independent of concurrency and does not improve with more parallelism. Limited by bandwidth.
- Zone B (medium): transition zone — throughput depends on both concurrency and message size and is not directly limited by either rate or bandwidth.
DPML's design implications:
- If the original message lands in Zone B, partitioning (Phase 3 sees
n/l-byte chunks) shifts the per-leader transfer into Zone A and benefits fully from concurrency. Example: a 16 KB message split across 16 leaders becomes 1 KB per leader and benefits from increased parallelism. - The number of leaders per node is bounded by
ppn. In practice the chosenlis smaller thanppnto avoid creating extremely small per-leader chunks. - For very large messages the per-leader chunks may still be in Zone C — no further improvement from leader parallelism alone.
DPML-Pipelined:
- For very large messages, each leader further partitions its
post-Phase-2 buffer into
ksub-partitions (related inversely to message size and number of leaders), and dispatches them with non-blocking Allreduce calls followed by awaitallto overlap communication. - Cost equation:
T_comm_k = k * lg(h) * (a + (n*b)/(l*k) + (n*c)/(l*k)) = lg(h) * (a*k + (n*b)/l + (n*c)/l)
4.3 Taking Advantage of Advanced Network Offload: SHArP
- IB's relative throughput does not collapse for large messages (Fig 1(b)), so DPML-Pipelined is not expected to help on IB. Instead, for small messages on IB the paper integrates SHArP.
- SHArP supports only a small number of concurrent communicators, so using all DPML leaders for SHArP would be infeasible.
Two SHArP integration designs:
Node-level Leader (one leader per node):
- One leader per node creates the SHArP communicator, gathers from local processes, transmits to the switch, and broadcasts the result back. SHArP itself distributes computation across the IB switches.
- Drawback: in dual-socket nodes, gathering across sockets incurs a QPI inter-socket cost that dominates on multi-core machines.
Socket-level Leader (one leader per socket; ppn/2 processes per leader):
- One leader on each socket, communicating only with local-socket ranks; total number of processes participating in SHArP stays small (within SHArP's concurrency budget) but the QPI cost is avoided.
- Each leader uses its local closest HCA — beneficial on multi-HCA machines because each HCA is exercised concurrently.
5. Modeling the Cost of Allreduce Operations
5.1 Cost Model (extends Rabenseifner [25])
Treats shared-memory and inter-node costs separately under a full-duplex assumption.
Notations (Table 1):
| Symbol | Description |
|---|---|
| p | Number of MPI processes |
| h | Number of nodes |
| l | Number of leader processes per node |
| n | Size of input vector in bytes |
| a | Startup time per inter-node message |
| b | Transfer time per byte for inter-node messages |
| a' | Startup time for shared-memory copy |
| b' | Transfer time per byte for shared-memory copy |
| c | Computation cost per byte of one reduction op |
| k | Number of sub-partitions used in DPML-Pipelined |
Pure recursive doubling (Eq. 1):
T_{r.d} = ceil(lg p) * (a + n*b + n*c)
5.2 Phases of the Algorithm (Eqs. 2-7)
- Copy to local leaders:
T_copy = l * (a' + b' * (n/l)) - Intra-node reduction by leaders:
T_comp = ((p / (h*l)) - 1) * n * c - Inter-node Allreduce by leaders (recursive doubling inner step):
T_comm = ceil(lg h) * (a + (n*b)/l + (n*c)/l) - Local copy back:
T_bcast = l * (a' + b' * (n/l))
Total (Eq. 7):
T_allreduce = 2 * l * (a' + b' * (n/l)) + ((p / (h*l)) - 1) * n * c + ceil(lg h) * (a + (n*b)/l + (n*c)/l)
DPML-Pipelined (Eq. 5):
T_comm_k = lg(h) * (a*k + (n*b)/l + (n*c)/l)
5.3 Discussion
- Because
a' << aandb' << b, the operation is dominated by Phase 2 (intra-node compute) and Phase 3 (inter-node comm). - Compared with pure recursive doubling, DPML cuts the number of
communication steps from
lg ptolg h— a major savings on many-core systems where p >> h. - The size of each inter-node message is reduced by a factor of
l. - Combined effect: increasing
lis expected to lower the latency for medium and large messages wheren >> 1(compute and comm both scale withn).
6. Performance Evaluation
6.1 Experimental Setup
- MPI implementation: MVAPICH2 codebase modified to add DPML.
- Baselines: MVAPICH2-2.2 ("MVAPICH2") and Intel MPI 2017.1.132 ("Intel MPI"), both selecting algorithms internally based on message size, system size, and CPU/interconnect.
- Mode: full subscription (all cores busy) unless noted.
- Iterations: at least 1,000 microbenchmark iterations; 5 application runs averaged.
Cluster summary (also visualized in Figure 3 — KNL + IB excluded due to unavailability of large clusters with that combination):
| Cluster | Processor | Nodes | Cores/node | RAM | Interconnect |
|---|---|---|---|---|---|
| A — Xeon + IB w/ SHArP | Dual Haswell, 2.40 GHz | 40 | 28 (1,120 total) | 128 GB | Mellanox MT4115 EDR ConnectX-4 (100 Gbps), PCIe Gen3, CentOS 7.2.1511, kernel 3.10.0.2827.10.1.el7, OFED 3.4-2 |
| B — Xeon + IB w/o SHArP | Dell PowerEdge C6320, Dual Broadwell E5-2680 v4, 2.40 GHz | 648 | 28 | 128 GB | Mellanox MT4115 EDR ConnectX-4 (100 Gbps), PCIe Gen3 |
| C — Xeon + Omni-Path | Dual Haswell, 2.30 GHz | 752 | 28 | 128 GB DDR4 | Intel Omni-Path 100 Series (100 Gbps), PCIe Gen3 |
| D — KNL + Omni-Path | Intel Xeon Phi 7250 (68 cores, 4 HW threads/core, 1.4 GHz, 9 racks) | 508 | up to 64 used (no oversubscription) | 96 GB DDR4 + 16 GB MCDRAM cache mode | OPA fat-tree, 8 core-switches, 320 leaf switches, 5/4 oversubscription; 112 GB local SSD; CentOS 7 |
6.2 Impact of Number of Leaders on Performance
Tested with l ∈ {1, 2, 4, 8, 16} on Xeon+IB (Fig 4,
Cluster A 448 procs/16 nodes/28 ppn), Xeon+IB (Fig 5, Cluster B 1,792
procs/64 nodes/28 ppn), Xeon+OPA (Fig 6, Cluster C 1,792/64/28), KNL+OPA
(Fig 7, Cluster D 1,024/32/32). Op = MPI_SUM, datatype = MPI_FLOAT.
- Small messages (<1 KB): more leaders does not improve performance and may slightly degrade it — compute parallelism gives no benefit because per-leader compute is already tiny.
- Medium and large messages: more leaders systematically reduce latency because (a) compute is shared across cores and (b) concurrent transfers exploit the interconnect.
- Headline numbers:
- Cluster B (Xeon+IB), 512 KB Allreduce: 16 leaders are 4.9x lower latency than 1 leader.
- Cluster C (Xeon+OPA), 512 KB Allreduce: 16 leaders are 4.3x lower latency than 1 leader.
- Best leader count for 8 KB: 4 on Clusters A & B (Xeon+IB); 16 on Clusters C & D (Xeon/KNL + OPA).
- Optimal
ldepends jointly on message size, system size, and the CPU/interconnect combination — motivating dynamic algorithm selection.
6.3 Impact of SHArP on Communication Performance
(Cluster A, 16 nodes, Figure 8; node-leader vs socket-leader designs.)
- 1 process per node (Fig 8(a)): SHArP-based reduces are up to 2.5x faster than the default host-based scheme; benefit shrinks once message size grows past ~2 KB and disappears at 4 KB. This confirms SHArP is effective primarily for small messages.
- 4 processes per node (Fig 8(b)): node-leader and socket-leader SHArP designs are up to 80% and 100% faster (1.8x and 2x) than the default host-based design.
- 28 ppn / full subscription (Fig 8(c)): socket-leader provides up to 73% improvement over default host-based; node-leader provides up to 46%. Socket-leader wins more strongly at high process count because it avoids the QPI cross-socket cost.
- Summary: node-leader SHArP is preferable when ppn is small; socket-leader SHArP wins as ppn grows and inter-socket cost matters.
6.4 Comparison with State-of-the-art MPI Libraries
(Figure 9, optimal l chosen empirically per cluster;
Intel MPI not available on Clusters A and B.)
- Cluster A (448 procs / 16 nodes): DPML up to 3.59x faster than default MVAPICH2.
- Cluster B (1,792 / 64): DPML up to 3.08x faster than default MVAPICH2.
- Cluster C (1,792 / 64): DPML up to 2.98x faster than Intel MPI and 1.4x faster than MVAPICH2.
- Cluster D (2,048 / 32 ppn = 64): DPML up to 2.3x faster than Intel MPI and 3.31x faster than MVAPICH2.
- Large-scale (Figure 10, Cluster D, 10,240 processes / 160 nodes): DPML outperforms MVAPICH2 by 207% and Intel MPI by 48%, demonstrating scalability.
6.5 Performance of HPCG
- HPCG calls MPI_Allreduce with small messages (in
DDOT). SHArP-friendly regime. - Cluster A, 28 ppn, with 56-448 processes (Figure 11(a)):
- Node-leader and socket-leader SHArP designs improve average DDOT timing by up to 35% at 56 processes and up to 10% at 224 processes.
- Improvement decreases as scale grows because the
countargument to Allreduce is unchanged with process count, so the percent of total time spent in Allreduce shrinks. (HPCG is weak-scaled.)
- Only Mellanox-based systems (Cluster A) support SHArP, so HPCG measurements are restricted to Cluster A.
6.6 Performance of miniAMR
- miniAMR is a 3D stencil with Adaptive Mesh Refinement; many
large-message Allreduce calls; the
Overall Mesh Refinement time(averaged across processes, with mesh-refinement frequency = 1,000 making it >98% of application time) is the metric. - Cluster C, 28 ppn (Figure 11(b)): DPML provides up to 40% gain over MVAPICH2 and up to 20% over Intel MPI at 1,792 processes.
- Cluster D, 32 ppn (Figure 11(c)): DPML provides up to 60% gain over MVAPICH2 and up to 20% over Intel MPI at up to 2,048 processes.
- Authors note technical issues prevented a complete miniAMR data set on Clusters A and B.
7. Related Work
Three categories surveyed:
Modeling and redesigning collective algorithms:
- Rabenseifner [25] modeled and proposed a new Reduce/Allreduce algorithm; the paper extends his shared-memory-vs-network differentiation.
- Pjesivac-Grbovic et al. [23] adapted point-to-point models (Hockney, LogP/LogGP, PLogP) to collectives in MPICH and introduced a tree-based split-binary broadcast.
- Thakur et al. [26] presented MPICH algorithm-selection heuristics by message size and process count.
- This work extends those models with shared-memory cost and proposes a new data-partitioning multi-leader algorithm designed for SHArP.
Hardware offloading mechanisms:
- Bloch et al. [8] described SHArP technology.
- Kandalla et al. [13] used Mellanox Core-Direct for non-blocking collective offload.
Shared-memory-based collectives:
- Li et al. [15, 16] optimized collectives for NUMA, including a performance model for shared-memory-based collectives.
- Zhang et al. [27, 28] focused on SR-IOV-enabled IB clusters and virtualized environments, designing intra-VM Shmem-based MPI for efficient intra-node communication on virtualized HPC.
8. Conclusion and Future Work
- MPI_Allreduce is critical for scientific applications; existing hierarchical designs cannot fully exploit modern multi-/many-core architectures or modern interconnect features (high message rate, high concurrent throughput, in-network reduction).
- The paper proposed, modeled, and analyzed a multi-/many-core-aware DPML design for MPI_Allreduce, with SHArP integration at node and socket granularity.
- Evaluations on three architectures (Xeon+IB, Xeon+OPA, KNL+OPA) showed microbenchmark gains up to 3.5x and application-level gains up to 35% (HPCG) and 60% (miniAMR).
- Future work: explore DPML for other blocking and non-blocking collectives, and integrate SHArP with non-blocking collectives.
Appendix A — Named Methods, Algorithms, and Benchmarks
Named designs introduced:
- DPML — Data Partitioning-based Multi-Leader Allreduce.
- DPML-Pipelined — DPML + per-leader sub-partition non-blocking Allreduce.
- Node-level Leader SHArP design.
- Socket-level Leader SHArP design.
Existing/baseline algorithms used:
- Recursive Doubling (used as inner-step Allreduce within DPML Phase 3 and as an analytical baseline).
- Reduce-Scatter + Allgather (alternative inner-step algorithm).
- Single-leader hierarchical Allreduce (MVAPICH2 default baseline).
Hardware-offload primitives:
- SHArP (Scalable Hierarchical Aggregation Protocol).
- Core-Direct (referenced via prior work).
Microbenchmarks and applications:
osu_mbw_mr(concurrent-pair throughput, OSU Microbenchmarks Suite).osu_allreduce(Allreduce latency).- HPCG (
DDOTaveraged execution time). - miniAMR (
Overall Mesh Refinement time, mesh refinement frequency 1,000).
Appendix B — Quantitative Results Compendium
| Regime | Result |
|---|---|
| 512 KB Allreduce, Cluster B (Xeon+IB) | 16 leaders 4.9x faster than 1 leader |
| 512 KB Allreduce, Cluster C (Xeon+OPA) | 16 leaders 4.3x faster than 1 leader |
Best l for 8 KB on IB clusters |
l = 4 |
Best l for 8 KB on OPA clusters |
l = 16 |
| 1 ppn, small messages, SHArP | 2.5x over host-based |
| 4 ppn, small messages, node/socket-leader SHArP | 1.8x / 2x over host-based |
| 28 ppn, small messages, socket-leader SHArP | 73% over host-based |
| 28 ppn, small messages, node-leader SHArP | 46% over host-based |
| Cluster A vs MVAPICH2 default | up to 3.59x |
| Cluster B vs MVAPICH2 default | up to 3.08x |
| Cluster C vs Intel MPI / MVAPICH2 | up to 2.98x / 1.4x |
| Cluster D vs Intel MPI / MVAPICH2 | up to 2.3x / 3.31x |
| 10,240 procs, Cluster D vs MVAPICH2 / Intel MPI | 207% / 48% |
| HPCG, 56 procs, Cluster A | 35% improvement |
| HPCG, 224 procs, Cluster A | 10% improvement |
| miniAMR, 1,792 procs, Cluster C | 40% over MVAPICH2 / 20% over Intel MPI |
| miniAMR, up to 2,048 procs, Cluster D | 60% over MVAPICH2 / 20% over Intel MPI |
Note on NCCL Tuning
DPML's central observation — that the optimal number of concurrent
leaders per node depends jointly on message size, process count, and the
interconnect's throughput-vs-concurrency curve — is the same trade-off
NCCL exposes via nChannels. The 3-zone characterization of
Omni-Path (Zone A rate-limited / Zone B transition / Zone C
bandwidth-limited) is a hardware-grounded explanation for why a fixed
nChannels is suboptimal across message sizes: small
messages live in a regime where extra channels buy throughput, but large
messages on a saturated link gain nothing from more channels. The
paper's measured 4.3-4.9x latency reduction at 512 KB when going from 1
to 16 leaders is direct evidence that medium/large collectives benefit
from aggressive parallelism, while the small-message result (more
leaders hurt) is direct evidence that latency-bound regimes prefer the
minimum channel count — a pattern any per-collective NCCL config
selector should respect.