Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design

Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. (DK) Panda | The Ohio State University | SC17, Denver, CO, Nov 2017 | DOI: 10.1145/3126908.3126954

Note: filed under 0050_MVAPICH2-GDR.pdf; the actual paper is the SC17 DPML Allreduce work, implemented inside the MVAPICH2 codebase from which MVAPICH2-GDR derives.

Problem

Existing MPI_Allreduce designs in MPICH2, Open-MPI, and MVAPICH2 use a hierarchical strategy with one (or two) leader process(es) per node that gathers local data via shared memory, performs the inter-node reduction, and shared-memory-broadcasts the result back. This wastes the parallelism of modern multi-/many-core CPUs (Intel KNL has 64-72 cores), forces a single core to do ppn-1 reductions, and only exercises one inter-node transfer at a time even when the interconnect (Mellanox EDR IB or Intel Omni-Path at 100 Gb/s) can sustain many concurrent transfers from one node. Allreduce is the largest single source of MPI time in production HPC (37% per a 5-year study cited from Rabenseifner) and is increasingly used on medium-and-large messages by deep learning workloads — so this gap directly caps both scientific and DL scaling.

Core Insight

A multi-leader design that partitions each message into l chunks and assigns one leader per chunk simultaneously (a) parallelizes intra-node reduction compute over l cores, (b) reduces each per-leader inter-node message size by a factor of l, and (c) drives l concurrent inter-node transfers — which together cut Allreduce communication steps from lg p to lg h and exploit the interconnect's per-pair-vs-concurrent-pair curve in whatever regime (rate-limited Zone A, transition Zone B, or bandwidth-limited Zone C) the per-leader chunk lands in.

Method

DPML runs in four phases (Figure 2 of the paper):

Local Copy to Shared Memory: every process splits its input into l partitions and copies them into specific SHM regions; equivalent to l parallel gathers.
Intra-node Reduction by Leaders: each Leader_j reduces ppn-1 buffers of size n/l in parallel, sharing compute over l cores.
Inter-node Allreduce by Leaders: each leader runs an inter-node Allreduce with the same-index leader on every other node, using the library's dynamically chosen inner algorithm (recursive doubling or reduce-scatter + allgather).
Local Copy to Individual Processes: SHM-broadcast the fully-reduced partitions back to all local ranks.

DPML-Pipelined further sub-partitions each leader's post-Phase-2 buffer into k chunks, dispatching them via non-blocking Allreduce + waitall. Designed for very-large messages on Omni-Path that would otherwise still sit in the bandwidth-limited Zone C.

SHArP integration for small messages on IB:

Node-level leader: one leader/node calls SHArP; SHArP distributes reduction across IB switches. Suffers QPI cost on dual-socket nodes.
Socket-level leader: one leader/socket (typically ppn/2 per leader) avoids QPI and uses each node's nearest HCA — beneficial on multi-HCA systems.

A theoretical cost model extends Rabenseifner's by separating shared-memory costs (a', b') from inter-node costs (a, b), yielding the total DPML cost T_allreduce = 2*l*(a' + b'*(n/l)) + ((p/(h*l)) - 1)*n*c + ceil(lg h)*(a + (n*b)/l + (n*c)/l), showing communication steps reduced from lg p to lg h and per-step message size reduced by a factor of l.

Experimental Setup

Component	Value
MPI implementation	DPML inside MVAPICH2 codebase
Baselines	MVAPICH2-2.2 ("MVAPICH2"), Intel MPI 2017.1.132 ("Intel MPI")
Mode	Full subscription
Iterations	1,000+ microbench, 5 application runs avg
Cluster A	40 nodes, Dual Haswell 14c @ 2.40 GHz, 128 GB, Mellanox EDR ConnectX-4 100 Gb/s, SHArP-capable, CentOS 7.2.1511, OFED 3.4-2
Cluster B	648 nodes, Dell C6320, Dual Broadwell E5-2680v4 14c @ 2.40 GHz, 128 GB, Mellanox EDR ConnectX-4 100 Gb/s
Cluster C	752 nodes, Dual Haswell 14c @ 2.30 GHz, 128 GB DDR4, Intel Omni-Path 100 Series 100 Gb/s
Cluster D	508 nodes, Intel Xeon Phi 7250 KNL (68c, 4 HWT, 1.4 GHz), 96 GB DDR4 + 16 GB MCDRAM cache mode, Omni-Path fat-tree (8 core / 320 leaf, 5/4 oversubscription), CentOS 7
Op / dtype	MPI_SUM / MPI_FLOAT
Microbenchmarks	`osu_mbw_mr`, `osu_allreduce`
Applications	HPCG (`DDOT` time), miniAMR (`Overall Mesh Refinement` time at refinement frequency = 1,000)
Leader sweep	l ∈ {1, 2, 4, 8, 16}
Largest scale	10,240 procs / 160 nodes on Cluster D

Headline Quantitative Results

Architectural microbenchmark (Section 3 / Figure 1):

Intra-node SHM (KNL): relative throughput tracks number of pairs across all message sizes — shallow hierarchies preferred.
IB EDR: concurrent transfers improve total throughput across all message sizes.
Omni-Path: concurrent transfers help small messages only; relative throughput collapses to ~1 for large messages — must partition.

Number-of-leaders sweep:

512 KB Allreduce, Cluster B (Xeon + IB): 16 leaders 4.9x faster than 1 leader.
512 KB Allreduce, Cluster C (Xeon + OPA): 16 leaders 4.3x faster than 1 leader.
Best l for 8 KB: 4 on IB clusters; 16 on OPA clusters.
For sub-1-KB messages, more leaders does not help and can slightly degrade performance.

SHArP designs (Cluster A, 16 nodes, Figure 8):

1 ppn: SHArP up to 2.5x over host-based; benefit fades by 4 KB.
4 ppn: node-leader and socket-leader SHArP up to 80% / 100% faster than host-based.
28 ppn: socket-leader 73% faster, node-leader 46% faster.

Comparison with state-of-the-art MPI libraries (Figure 9):

Cluster A: DPML up to 3.59x vs MVAPICH2.
Cluster B: up to 3.08x vs MVAPICH2.
Cluster C: up to 2.98x vs Intel MPI, 1.4x vs MVAPICH2.
Cluster D: up to 2.3x vs Intel MPI, 3.31x vs MVAPICH2.
10,240 procs / Cluster D: DPML beats MVAPICH2 by 207% and Intel MPI by 48%.

Application-level (Figure 11):

HPCG (Cluster A, weak-scaled, 56-448 procs): up to 35% DDOT improvement at 56 procs; 10% at 224 procs (decreases with scale because Allreduce count is fixed and its share of total time shrinks).
miniAMR Cluster C (1,792 procs, 28 ppn): up to 40% over MVAPICH2, 20% over Intel MPI.
miniAMR Cluster D (up to 2,048 procs, 32 ppn): up to 60% over MVAPICH2, 20% over Intel MPI.

Limitations

Allreduce only — DPML is not yet ported to other blocking or non-blocking collectives (Reduce, Bcast, Allgather, Ialltoall, etc.); flagged as future work.
KNL + InfiniBand combination excluded due to lack of large-cluster availability with that pairing.
SHArP measurements restricted to Mellanox-equipped Cluster A; HPCG results only reported for Cluster A for the same reason.
Optimal l chosen empirically per cluster and message size; the paper does not produce a closed-form selector or runtime auto-tuner.
DPML-Pipelined k is described in terms of inverse-scaling with message size and l but no programmatic rule for choosing k is given.
miniAMR results incomplete on Clusters A and B due to "technical issues."
All workloads are CPU-based; despite the filename, the paper does not evaluate GPU collectives, GPUDirect RDMA, or NCCL.

Open Problems

Apply DPML to additional blocking and non-blocking collectives — Reduce, Allgather, Alltoall, Iallreduce — and quantify benefit.
Integrate SHArP with non-blocking collectives.
Develop a runtime auto-selector that picks l (and k for DPML-Pipelined) from message size, system size, CPU/interconnect identity, and ppn rather than relying on offline empirical tuning.
Extend the cost model to handle the SHArP socket-leader case and multi-HCA NUMA topologies analytically.
Re-evaluate on the (excluded) KNL + InfiniBand combination and on newer architectures with GPU acceleration / GPUDirect RDMA.

Note on NCCL Tuning

DPML's measurement of a 4.3-4.9x latency reduction at 512 KB when going from 1 to 16 leaders, paired with the finding that more leaders degrade sub-1-KB Allreduce, is a direct empirical statement of the same trade-off that NCCL exposes via nChannels: latency-bound small messages prefer minimum channel count, bandwidth/throughput-bound large messages prefer maximum, and the optimum depends on the interconnect's throughput-vs-concurrency curve (the paper's Zone A / Zone B / Zone C characterization on Omni-Path). The result that the optimal l is 4 on IB but 16 on OPA at the same 8 KB message size is a concrete example of why per-collective channel selection must condition on the interconnect identity, not just the message size.