Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design — Detailed Summary

Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. (DK) Panda | The Ohio State University | SC17 (International Conference for High Performance Computing, Networking, Storage and Analysis), Denver, CO, Nov 12-17, 2017 | DOI: 10.1145/3126908.3126954

Note on filename: the PDF is filed under 0050_MVAPICH2-GDR.pdf but the actual paper is the SC17 DPML Allreduce work by Bayatpour et al. The implementation is done inside the MVAPICH2 codebase (the OSU MPI stack from which MVAPICH2-GDR derives), which likely explains the filename.

Per-section summary organized by paper headings. Each section provides paragraph-level bullets and preserves all quantitative results, equations, table notations, and named methods.

Abstract

Existing MPI_Allreduce designs do not exploit the vast parallelism of modern multi-/many-core processors (Intel Xeon, Xeon Phi/KNL) or the increases in message-rate and end-to-end throughput of modern interconnects (InfiniBand, Omni-Path).
The paper proposes a high-performance Data Partitioning-based Multi-Leader (DPML) design for MPI_Allreduce that exploits intra-node parallelism and the high inter-node throughput available from IB and Omni-Path simultaneously.
DPML is theoretically modeled to compare communication and computation cost against pure recursive doubling and a single-leader hierarchical baseline.
Microbenchmark results: up to 3.5x performance improvement on MPI_Allreduce on multiple HPC systems at scale.
Application-level results: up to 35% improvement for HPCG and 60% for miniAMR.

Keywords: MPI_Allreduce, Data Partitioning, Multi-Leader, SHArP, Collectives, MPI.

1. Introduction

Background and motivation:

Modern supercomputers deliver multi-petaflop performance and scale to tens of thousands of processors; growth has been driven by multi-/many-core architectures and commodity high-performance RDMA-enabled interconnects such as InfiniBand and Omni-Path.
MPI is the dominant programming model for HPC; the MPI Standard offers point-to-point, collective, and synchronization primitives, but MPI_Allreduce in particular accounts for the largest share of collective time in many production workloads. A 5-year study by Rabenseifner reported that 37% of the time in MPI routines was in MPI_Allreduce.
Allreduce is no longer a small-message-only operation. While traditional scientific codes call it on small vectors, newer fields including deep learning use medium-to-large message reductions extensively.

Limitations of existing designs:

State-of-the-art MPI implementations follow a hierarchical strategy with one or two leader processes per node aggregating data from local processes via shared memory and then communicating across nodes; this approach underuses the increased core count of modern CPUs (Intel KNL has 64-72 cores).
Modern interconnects like Omni-Path expose very high message rates for small/medium messages and can sustain multiple concurrent transfers from one node, but as message size grows the per-pair throughput stops scaling — naively adding more concurrent senders does not help.
Therefore: a designer needs to (a) parallelize compute across many cores for medium/large reductions and (b) carefully match the number of concurrent inter-node transfers to the message-size regime of the interconnect.

Hardware offload:

Vendors (Mellanox) have introduced hardware-offload primitives — Core-direct and SHArP — that move communication and reduction compute to the network fabric. SHArP performs reduction inside switches as data ascends a reduction tree. These features are most effective on small messages, but must be combined carefully with intra-node schemes.

Driving research question (verbatim from paper):

Can we design novel communication algorithms for Allreduce primitive that takes advantage of the vast amount of parallelism available in modern multi-/many-core architectures as well as high throughput and high-end features exposed by modern interconnects like InfiniBand and Omni-Path to deliver best performance and scalability?

Concrete design challenges raised:

Effectively use multiple leaders per node to load-balance compute and accelerate communication.
Use high-throughput interconnect characteristics together with multiple leaders for large messages.
Exploit emerging hardware (KNL, Omni-Path) for new collective protocols.
Accelerate Allreduce with SHArP.
Design hybrid schemes that pick the optimal algorithm for each message size.
Demonstrate microbenchmark and application-level benefit.

1.1 Contributions

Design and implement multi-/many-core-aware DPML designs for MPI_Allreduce.
Design multi-leader collectives that take advantage of high throughput and network offload features (SHArP) of modern interconnects.
Theoretical modeling and analysis of the DPML framework.
Evaluation on three HPC architectures: Xeon + InfiniBand, Xeon + Omni-Path, Xeon Phi (KNL) + Omni-Path.
Demonstrate >=3.5x microbenchmark Allreduce latency improvement; up to 80% small-message improvement using SHArP socket-leader design; application-level gains up to 60% for miniAMR and 35% for HPCG.
Claimed to be the first paper to analyze a data-partitioning multi-leader design for MPI collectives in conjunction with modern high-end interconnect features.

2. Background

2.1 Reduction Collectives in MPI

Reduction collectives MPI_Reduce and MPI_Allreduce combine data across a group using ops like sum, max, or a user-defined op; widely used in HPC codes (miniAMR, HPCG) and increasingly in DL.
MPI_Reduce returns the result to one root process; MPI_Allreduce returns the result to all processes — which makes Allreduce more communication-intensive because it requires more communication steps.
State-of-the-art implementations (MPICH2, Open-MPI, MVAPICH2) all use shared-memory aware hierarchical algorithms. MVAPICH2's flow: (i) processes within a node form a "shared memory communicator"; (ii) one process per node is selected as the leader and joins a "leader-communicator"; (iii) intra-node SHM-based reduction accumulates local data at the leader; (iv) leaders perform inter-node point-to-point reduction; (v) leaders SHM-broadcast the final result back to local ranks.

2.2 Overview of SHArP

SHArP (Scalable Hierarchical Aggregation Protocol) offloads reduction to the IB switch fabric. Reduction occurs as data climbs the reduction tree; data volume contracts as it ascends rather than the CPU repeatedly seeing intermediate results.
A SHArP "reduction tree" is a set of network elements organized as hierarchical data objects describing data topologies and collective groups; leaves are data sources, interior vertices are aggregation nodes.
The technology lets data be manipulated while in flight in the network, rather than waiting for it to reach a CPU.

3. Communication Characteristics of Modern Architectures

The authors first measure raw point-to-point characteristics of intra-node shared-memory and inter-node IB/OPA fabrics using osu_mbw_mr from the OSU Microbenchmark Suite. The benchmark records throughput of N concurrent pairs relative to one pair.

Intra-node (shared memory on KNL — Figure 1(a)):

Relative throughput stays close to the number of communicating pairs even at large message sizes — shared memory supports very high concurrency.
Implication: shallow hierarchies with many children per parent are preferable to deep hierarchies for intra-node aggregation.

Inter-node InfiniBand (Figure 1(b), Mellanox EDR 100Gb/s):

Multiple processes per node performing concurrent transfers improve total per-node throughput across all message sizes.
Implication: increasing parallelism (more leaders) is beneficial across the entire message-size spectrum on IB.

Inter-node Omni-Path (Figure 1(c)/(d), 100Gb/s):

Concurrent communication helps on small messages but the relative throughput collapses to ~1 for large messages — naive parallelism does not help, the link is bandwidth-limited.
Implication: large messages on OPA must be partitioned into smaller chunks that individually fall within the message-rate-limited zone, and those smaller chunks dispatched by multiple processes.

Reasoning about existing algorithms:

Recursive Doubling is flat; doubles the distance between communicating pairs each step; optimal in compute-cost terms because the computation decreases each round, but relies on point-to-point ops with extra copies.
Hierarchical (single-leader): workers SHM-copy into a leader process; the leader performs ppn-1 reductions (computationally expensive on KNL), and only one process per node communicates — leaving inter-node concurrent throughput on the table.

4. Proposed Designs

4.1 Data Partitioning-based Multi-Leader Allreduce (DPML)

DPML designates l leader processes per node which (a) share computation and (b) drive l concurrent inter-node transfers. The collective runs in four phases (Figure 2):

Phase 1 — Local Copy to Shared Memory:

Every process splits its input vector into l partitions (one per leader) and copies each partition into a shared-memory region; the j-th partition from process with local rank i lands at start_addr(Leader_j) + i * sizeof(partition).
Equivalent to l independent gathers running in parallel.

Phase 2 — Intra-node Reduction by Leaders:

Each Leader_j reduces the j-th partitions from all ppn local processes in parallel; total compute load is now spread over l cores instead of 1.
Leader_j is responsible for ppn-1 reductions on a n/l-byte buffer.

Phase 3 — Inter-node Allreduce by Leaders:

Each leader independently runs an inter-node Allreduce with the same-index leader on every other node (e.g., Leader_0 on each node forms one communicator, Leader_1 forms another). The Allreduce algorithm used for this inner step is dynamically chosen by the MPI library based on message size, system size, and architecture (e.g., recursive-doubling or reduce-scatter + allgather).

Phase 4 — Local Copy to Individual Processes:

Each leader holds the fully-reduced n/l-byte partition; this is SHM-copied back to the receive buffer of every local rank — equivalent to l parallel broadcasts.

4.2 Taking Advantage of High Message Rate: Pipelined Data Transfer

The Omni-Path message-size-vs-throughput curve has three zones:

Zone A (smallest messages): throughput is roughly independent of message size and scales near-linearly with concurrency. Limited by message rate.
Zone C (largest messages): throughput is independent of concurrency and does not improve with more parallelism. Limited by bandwidth.
Zone B (medium): transition zone — throughput depends on both concurrency and message size and is not directly limited by either rate or bandwidth.

DPML's design implications:

If the original message lands in Zone B, partitioning (Phase 3 sees n/l-byte chunks) shifts the per-leader transfer into Zone A and benefits fully from concurrency. Example: a 16 KB message split across 16 leaders becomes 1 KB per leader and benefits from increased parallelism.
The number of leaders per node is bounded by ppn. In practice the chosen l is smaller than ppn to avoid creating extremely small per-leader chunks.
For very large messages the per-leader chunks may still be in Zone C — no further improvement from leader parallelism alone.

DPML-Pipelined:

For very large messages, each leader further partitions its post-Phase-2 buffer into k sub-partitions (related inversely to message size and number of leaders), and dispatches them with non-blocking Allreduce calls followed by a waitall to overlap communication.
Cost equation: T_comm_k = k * lg(h) * (a + (n*b)/(l*k) + (n*c)/(l*k)) = lg(h) * (a*k + (n*b)/l + (n*c)/l)

4.3 Taking Advantage of Advanced Network Offload: SHArP

IB's relative throughput does not collapse for large messages (Fig 1(b)), so DPML-Pipelined is not expected to help on IB. Instead, for small messages on IB the paper integrates SHArP.
SHArP supports only a small number of concurrent communicators, so using all DPML leaders for SHArP would be infeasible.

Two SHArP integration designs:

Node-level Leader (one leader per node):
- One leader per node creates the SHArP communicator, gathers from local processes, transmits to the switch, and broadcasts the result back. SHArP itself distributes computation across the IB switches.
- Drawback: in dual-socket nodes, gathering across sockets incurs a QPI inter-socket cost that dominates on multi-core machines.
Socket-level Leader (one leader per socket; ppn/2 processes per leader):
- One leader on each socket, communicating only with local-socket ranks; total number of processes participating in SHArP stays small (within SHArP's concurrency budget) but the QPI cost is avoided.
- Each leader uses its local closest HCA — beneficial on multi-HCA machines because each HCA is exercised concurrently.

5. Modeling the Cost of Allreduce Operations

5.1 Cost Model (extends Rabenseifner [25])

Treats shared-memory and inter-node costs separately under a full-duplex assumption.

Notations (Table 1):

Symbol	Description
p	Number of MPI processes
h	Number of nodes
l	Number of leader processes per node
n	Size of input vector in bytes
a	Startup time per inter-node message
b	Transfer time per byte for inter-node messages
a'	Startup time for shared-memory copy
b'	Transfer time per byte for shared-memory copy
c	Computation cost per byte of one reduction op
k	Number of sub-partitions used in DPML-Pipelined

Pure recursive doubling (Eq. 1): T_{r.d} = ceil(lg p) * (a + n*b + n*c)

5.2 Phases of the Algorithm (Eqs. 2-7)

Copy to local leaders: T_copy = l * (a' + b' * (n/l))
Intra-node reduction by leaders: T_comp = ((p / (h*l)) - 1) * n * c
Inter-node Allreduce by leaders (recursive doubling inner step): T_comm = ceil(lg h) * (a + (n*b)/l + (n*c)/l)
Local copy back: T_bcast = l * (a' + b' * (n/l))

Total (Eq. 7): T_allreduce = 2 * l * (a' + b' * (n/l)) + ((p / (h*l)) - 1) * n * c + ceil(lg h) * (a + (n*b)/l + (n*c)/l)

DPML-Pipelined (Eq. 5): T_comm_k = lg(h) * (a*k + (n*b)/l + (n*c)/l)

5.3 Discussion

Because a' << a and b' << b, the operation is dominated by Phase 2 (intra-node compute) and Phase 3 (inter-node comm).
Compared with pure recursive doubling, DPML cuts the number of communication steps from lg p to lg h — a major savings on many-core systems where p >> h.
The size of each inter-node message is reduced by a factor of l.
Combined effect: increasing l is expected to lower the latency for medium and large messages where n >> 1 (compute and comm both scale with n).

6. Performance Evaluation

6.1 Experimental Setup

MPI implementation: MVAPICH2 codebase modified to add DPML.
Baselines: MVAPICH2-2.2 ("MVAPICH2") and Intel MPI 2017.1.132 ("Intel MPI"), both selecting algorithms internally based on message size, system size, and CPU/interconnect.
Mode: full subscription (all cores busy) unless noted.
Iterations: at least 1,000 microbenchmark iterations; 5 application runs averaged.

Cluster summary (also visualized in Figure 3 — KNL + IB excluded due to unavailability of large clusters with that combination):

Cluster	Processor	Nodes	Cores/node	RAM	Interconnect
A — Xeon + IB w/ SHArP	Dual Haswell, 2.40 GHz	40	28 (1,120 total)	128 GB	Mellanox MT4115 EDR ConnectX-4 (100 Gbps), PCIe Gen3, CentOS 7.2.1511, kernel 3.10.0.2827.10.1.el7, OFED 3.4-2
B — Xeon + IB w/o SHArP	Dell PowerEdge C6320, Dual Broadwell E5-2680 v4, 2.40 GHz	648	28	128 GB	Mellanox MT4115 EDR ConnectX-4 (100 Gbps), PCIe Gen3
C — Xeon + Omni-Path	Dual Haswell, 2.30 GHz	752	28	128 GB DDR4	Intel Omni-Path 100 Series (100 Gbps), PCIe Gen3
D — KNL + Omni-Path	Intel Xeon Phi 7250 (68 cores, 4 HW threads/core, 1.4 GHz, 9 racks)	508	up to 64 used (no oversubscription)	96 GB DDR4 + 16 GB MCDRAM cache mode	OPA fat-tree, 8 core-switches, 320 leaf switches, 5/4 oversubscription; 112 GB local SSD; CentOS 7

6.2 Impact of Number of Leaders on Performance

Tested with l ∈ {1, 2, 4, 8, 16} on Xeon+IB (Fig 4, Cluster A 448 procs/16 nodes/28 ppn), Xeon+IB (Fig 5, Cluster B 1,792 procs/64 nodes/28 ppn), Xeon+OPA (Fig 6, Cluster C 1,792/64/28), KNL+OPA (Fig 7, Cluster D 1,024/32/32). Op = MPI_SUM, datatype = MPI_FLOAT.

Small messages (<1 KB): more leaders does not improve performance and may slightly degrade it — compute parallelism gives no benefit because per-leader compute is already tiny.
Medium and large messages: more leaders systematically reduce latency because (a) compute is shared across cores and (b) concurrent transfers exploit the interconnect.
Headline numbers:
- Cluster B (Xeon+IB), 512 KB Allreduce: 16 leaders are 4.9x lower latency than 1 leader.
- Cluster C (Xeon+OPA), 512 KB Allreduce: 16 leaders are 4.3x lower latency than 1 leader.
- Best leader count for 8 KB: 4 on Clusters A & B (Xeon+IB); 16 on Clusters C & D (Xeon/KNL + OPA).
Optimal l depends jointly on message size, system size, and the CPU/interconnect combination — motivating dynamic algorithm selection.

6.3 Impact of SHArP on Communication Performance

(Cluster A, 16 nodes, Figure 8; node-leader vs socket-leader designs.)

1 process per node (Fig 8(a)): SHArP-based reduces are up to 2.5x faster than the default host-based scheme; benefit shrinks once message size grows past ~2 KB and disappears at 4 KB. This confirms SHArP is effective primarily for small messages.
4 processes per node (Fig 8(b)): node-leader and socket-leader SHArP designs are up to 80% and 100% faster (1.8x and 2x) than the default host-based design.
28 ppn / full subscription (Fig 8(c)): socket-leader provides up to 73% improvement over default host-based; node-leader provides up to 46%. Socket-leader wins more strongly at high process count because it avoids the QPI cross-socket cost.
Summary: node-leader SHArP is preferable when ppn is small; socket-leader SHArP wins as ppn grows and inter-socket cost matters.

6.4 Comparison with State-of-the-art MPI Libraries

(Figure 9, optimal l chosen empirically per cluster; Intel MPI not available on Clusters A and B.)

Cluster A (448 procs / 16 nodes): DPML up to 3.59x faster than default MVAPICH2.
Cluster B (1,792 / 64): DPML up to 3.08x faster than default MVAPICH2.
Cluster C (1,792 / 64): DPML up to 2.98x faster than Intel MPI and 1.4x faster than MVAPICH2.
Cluster D (2,048 / 32 ppn = 64): DPML up to 2.3x faster than Intel MPI and 3.31x faster than MVAPICH2.
Large-scale (Figure 10, Cluster D, 10,240 processes / 160 nodes): DPML outperforms MVAPICH2 by 207% and Intel MPI by 48%, demonstrating scalability.

6.5 Performance of HPCG

HPCG calls MPI_Allreduce with small messages (in DDOT). SHArP-friendly regime.
Cluster A, 28 ppn, with 56-448 processes (Figure 11(a)):
- Node-leader and socket-leader SHArP designs improve average DDOT timing by up to 35% at 56 processes and up to 10% at 224 processes.
- Improvement decreases as scale grows because the count argument to Allreduce is unchanged with process count, so the percent of total time spent in Allreduce shrinks. (HPCG is weak-scaled.)
Only Mellanox-based systems (Cluster A) support SHArP, so HPCG measurements are restricted to Cluster A.

6.6 Performance of miniAMR

miniAMR is a 3D stencil with Adaptive Mesh Refinement; many large-message Allreduce calls; the Overall Mesh Refinement time (averaged across processes, with mesh-refinement frequency = 1,000 making it >98% of application time) is the metric.
Cluster C, 28 ppn (Figure 11(b)): DPML provides up to 40% gain over MVAPICH2 and up to 20% over Intel MPI at 1,792 processes.
Cluster D, 32 ppn (Figure 11(c)): DPML provides up to 60% gain over MVAPICH2 and up to 20% over Intel MPI at up to 2,048 processes.
Authors note technical issues prevented a complete miniAMR data set on Clusters A and B.

Three categories surveyed:

Modeling and redesigning collective algorithms:

Rabenseifner [25] modeled and proposed a new Reduce/Allreduce algorithm; the paper extends his shared-memory-vs-network differentiation.
Pjesivac-Grbovic et al. [23] adapted point-to-point models (Hockney, LogP/LogGP, PLogP) to collectives in MPICH and introduced a tree-based split-binary broadcast.
Thakur et al. [26] presented MPICH algorithm-selection heuristics by message size and process count.
This work extends those models with shared-memory cost and proposes a new data-partitioning multi-leader algorithm designed for SHArP.

Hardware offloading mechanisms:

Bloch et al. [8] described SHArP technology.
Kandalla et al. [13] used Mellanox Core-Direct for non-blocking collective offload.

Shared-memory-based collectives:

Li et al. [15, 16] optimized collectives for NUMA, including a performance model for shared-memory-based collectives.
Zhang et al. [27, 28] focused on SR-IOV-enabled IB clusters and virtualized environments, designing intra-VM Shmem-based MPI for efficient intra-node communication on virtualized HPC.

8. Conclusion and Future Work

MPI_Allreduce is critical for scientific applications; existing hierarchical designs cannot fully exploit modern multi-/many-core architectures or modern interconnect features (high message rate, high concurrent throughput, in-network reduction).
The paper proposed, modeled, and analyzed a multi-/many-core-aware DPML design for MPI_Allreduce, with SHArP integration at node and socket granularity.
Evaluations on three architectures (Xeon+IB, Xeon+OPA, KNL+OPA) showed microbenchmark gains up to 3.5x and application-level gains up to 35% (HPCG) and 60% (miniAMR).
Future work: explore DPML for other blocking and non-blocking collectives, and integrate SHArP with non-blocking collectives.

Appendix A — Named Methods, Algorithms, and Benchmarks

Named designs introduced:

DPML — Data Partitioning-based Multi-Leader Allreduce.
DPML-Pipelined — DPML + per-leader sub-partition non-blocking Allreduce.
Node-level Leader SHArP design.
Socket-level Leader SHArP design.

Existing/baseline algorithms used:

Recursive Doubling (used as inner-step Allreduce within DPML Phase 3 and as an analytical baseline).
Reduce-Scatter + Allgather (alternative inner-step algorithm).
Single-leader hierarchical Allreduce (MVAPICH2 default baseline).

Hardware-offload primitives:

SHArP (Scalable Hierarchical Aggregation Protocol).
Core-Direct (referenced via prior work).

Microbenchmarks and applications:

osu_mbw_mr (concurrent-pair throughput, OSU Microbenchmarks Suite).
osu_allreduce (Allreduce latency).
HPCG (DDOT averaged execution time).
miniAMR (Overall Mesh Refinement time, mesh refinement frequency 1,000).

Appendix B — Quantitative Results Compendium

Regime	Result
512 KB Allreduce, Cluster B (Xeon+IB)	16 leaders 4.9x faster than 1 leader
512 KB Allreduce, Cluster C (Xeon+OPA)	16 leaders 4.3x faster than 1 leader
Best `l` for 8 KB on IB clusters	l = 4
Best `l` for 8 KB on OPA clusters	l = 16
1 ppn, small messages, SHArP	2.5x over host-based
4 ppn, small messages, node/socket-leader SHArP	1.8x / 2x over host-based
28 ppn, small messages, socket-leader SHArP	73% over host-based
28 ppn, small messages, node-leader SHArP	46% over host-based
Cluster A vs MVAPICH2 default	up to 3.59x
Cluster B vs MVAPICH2 default	up to 3.08x
Cluster C vs Intel MPI / MVAPICH2	up to 2.98x / 1.4x
Cluster D vs Intel MPI / MVAPICH2	up to 2.3x / 3.31x
10,240 procs, Cluster D vs MVAPICH2 / Intel MPI	207% / 48%
HPCG, 56 procs, Cluster A	35% improvement
HPCG, 224 procs, Cluster A	10% improvement
miniAMR, 1,792 procs, Cluster C	40% over MVAPICH2 / 20% over Intel MPI
miniAMR, up to 2,048 procs, Cluster D	60% over MVAPICH2 / 20% over Intel MPI

Note on NCCL Tuning

DPML's central observation — that the optimal number of concurrent leaders per node depends jointly on message size, process count, and the interconnect's throughput-vs-concurrency curve — is the same trade-off NCCL exposes via nChannels. The 3-zone characterization of Omni-Path (Zone A rate-limited / Zone B transition / Zone C bandwidth-limited) is a hardware-grounded explanation for why a fixed nChannels is suboptimal across message sizes: small messages live in a regime where extra channels buy throughput, but large messages on a saturated link gain nothing from more channels. The paper's measured 4.3-4.9x latency reduction at 512 KB when going from 1 to 16 leaders is direct evidence that medium/large collectives benefit from aggressive parallelism, while the small-message result (more leaders hurt) is direct evidence that latency-bound regimes prefer the minimum channel count — a pattern any per-collective NCCL config selector should respect.