Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design
Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. (DK) Panda | The Ohio State University | SC17, Denver, CO, Nov 2017 | DOI: 10.1145/3126908.3126954
Note: filed under
0050_MVAPICH2-GDR.pdf; the actual paper is the SC17 DPML Allreduce work, implemented inside the MVAPICH2 codebase from which MVAPICH2-GDR derives.
Problem
Existing MPI_Allreduce designs in MPICH2, Open-MPI, and MVAPICH2 use
a hierarchical strategy with one (or two) leader process(es) per node
that gathers local data via shared memory, performs the inter-node
reduction, and shared-memory-broadcasts the result back. This wastes the
parallelism of modern multi-/many-core CPUs (Intel KNL has 64-72 cores),
forces a single core to do ppn-1 reductions, and only
exercises one inter-node transfer at a time even when the interconnect
(Mellanox EDR IB or Intel Omni-Path at 100 Gb/s) can sustain many
concurrent transfers from one node. Allreduce is the largest single
source of MPI time in production HPC (37% per a 5-year study cited from
Rabenseifner) and is increasingly used on medium-and-large messages by
deep learning workloads — so this gap directly caps both scientific and
DL scaling.
Core Insight
A multi-leader design that partitions each message into
l chunks and assigns one leader per chunk simultaneously
(a) parallelizes intra-node reduction compute over l cores,
(b) reduces each per-leader inter-node message size by a factor of
l, and (c) drives l concurrent inter-node
transfers — which together cut Allreduce communication steps from
lg p to lg h and exploit the interconnect's
per-pair-vs-concurrent-pair curve in whatever regime (rate-limited Zone
A, transition Zone B, or bandwidth-limited Zone C) the per-leader chunk
lands in.
Method
DPML runs in four phases (Figure 2 of the paper):
- Local Copy to Shared Memory: every process splits
its input into
lpartitions and copies them into specific SHM regions; equivalent tolparallel gathers. - Intra-node Reduction by Leaders: each
Leader_jreducesppn-1buffers of sizen/lin parallel, sharing compute overlcores. - Inter-node Allreduce by Leaders: each leader runs an inter-node Allreduce with the same-index leader on every other node, using the library's dynamically chosen inner algorithm (recursive doubling or reduce-scatter + allgather).
- Local Copy to Individual Processes: SHM-broadcast the fully-reduced partitions back to all local ranks.
DPML-Pipelined further sub-partitions each leader's
post-Phase-2 buffer into k chunks, dispatching them via
non-blocking Allreduce + waitall. Designed for very-large
messages on Omni-Path that would otherwise still sit in the
bandwidth-limited Zone C.
SHArP integration for small messages on IB:
- Node-level leader: one leader/node calls SHArP; SHArP distributes reduction across IB switches. Suffers QPI cost on dual-socket nodes.
- Socket-level leader: one leader/socket (typically
ppn/2per leader) avoids QPI and uses each node's nearest HCA — beneficial on multi-HCA systems.
A theoretical cost model extends Rabenseifner's by separating
shared-memory costs (a', b') from inter-node
costs (a, b), yielding the total DPML cost
T_allreduce = 2*l*(a' + b'*(n/l)) + ((p/(h*l)) - 1)*n*c + ceil(lg h)*(a + (n*b)/l + (n*c)/l),
showing communication steps reduced from lg p to
lg h and per-step message size reduced by a factor of
l.
Experimental Setup
| Component | Value |
|---|---|
| MPI implementation | DPML inside MVAPICH2 codebase |
| Baselines | MVAPICH2-2.2 ("MVAPICH2"), Intel MPI 2017.1.132 ("Intel MPI") |
| Mode | Full subscription |
| Iterations | 1,000+ microbench, 5 application runs avg |
| Cluster A | 40 nodes, Dual Haswell 14c @ 2.40 GHz, 128 GB, Mellanox EDR ConnectX-4 100 Gb/s, SHArP-capable, CentOS 7.2.1511, OFED 3.4-2 |
| Cluster B | 648 nodes, Dell C6320, Dual Broadwell E5-2680v4 14c @ 2.40 GHz, 128 GB, Mellanox EDR ConnectX-4 100 Gb/s |
| Cluster C | 752 nodes, Dual Haswell 14c @ 2.30 GHz, 128 GB DDR4, Intel Omni-Path 100 Series 100 Gb/s |
| Cluster D | 508 nodes, Intel Xeon Phi 7250 KNL (68c, 4 HWT, 1.4 GHz), 96 GB DDR4 + 16 GB MCDRAM cache mode, Omni-Path fat-tree (8 core / 320 leaf, 5/4 oversubscription), CentOS 7 |
| Op / dtype | MPI_SUM / MPI_FLOAT |
| Microbenchmarks | osu_mbw_mr, osu_allreduce |
| Applications | HPCG (DDOT time), miniAMR
(Overall Mesh Refinement time at refinement frequency =
1,000) |
| Leader sweep | l ∈ {1, 2, 4, 8, 16} |
| Largest scale | 10,240 procs / 160 nodes on Cluster D |
Headline Quantitative Results
Architectural microbenchmark (Section 3 / Figure 1):
- Intra-node SHM (KNL): relative throughput tracks number of pairs across all message sizes — shallow hierarchies preferred.
- IB EDR: concurrent transfers improve total throughput across all message sizes.
- Omni-Path: concurrent transfers help small messages only; relative throughput collapses to ~1 for large messages — must partition.
Number-of-leaders sweep:
- 512 KB Allreduce, Cluster B (Xeon + IB): 16 leaders 4.9x faster than 1 leader.
- 512 KB Allreduce, Cluster C (Xeon + OPA): 16 leaders 4.3x faster than 1 leader.
- Best
lfor 8 KB: 4 on IB clusters; 16 on OPA clusters. - For sub-1-KB messages, more leaders does not help and can slightly degrade performance.
SHArP designs (Cluster A, 16 nodes, Figure 8):
- 1 ppn: SHArP up to 2.5x over host-based; benefit fades by 4 KB.
- 4 ppn: node-leader and socket-leader SHArP up to 80% / 100% faster than host-based.
- 28 ppn: socket-leader 73% faster, node-leader 46% faster.
Comparison with state-of-the-art MPI libraries (Figure 9):
- Cluster A: DPML up to 3.59x vs MVAPICH2.
- Cluster B: up to 3.08x vs MVAPICH2.
- Cluster C: up to 2.98x vs Intel MPI, 1.4x vs MVAPICH2.
- Cluster D: up to 2.3x vs Intel MPI, 3.31x vs MVAPICH2.
- 10,240 procs / Cluster D: DPML beats MVAPICH2 by 207% and Intel MPI by 48%.
Application-level (Figure 11):
- HPCG (Cluster A, weak-scaled, 56-448 procs): up to
35% DDOT improvement at 56 procs; 10%
at 224 procs (decreases with scale because Allreduce
countis fixed and its share of total time shrinks). - miniAMR Cluster C (1,792 procs, 28 ppn): up to 40% over MVAPICH2, 20% over Intel MPI.
- miniAMR Cluster D (up to 2,048 procs, 32 ppn): up to 60% over MVAPICH2, 20% over Intel MPI.
Limitations
- Allreduce only — DPML is not yet ported to other blocking or non-blocking collectives (Reduce, Bcast, Allgather, Ialltoall, etc.); flagged as future work.
- KNL + InfiniBand combination excluded due to lack of large-cluster availability with that pairing.
- SHArP measurements restricted to Mellanox-equipped Cluster A; HPCG results only reported for Cluster A for the same reason.
- Optimal
lchosen empirically per cluster and message size; the paper does not produce a closed-form selector or runtime auto-tuner. - DPML-Pipelined
kis described in terms of inverse-scaling with message size andlbut no programmatic rule for choosingkis given. - miniAMR results incomplete on Clusters A and B due to "technical issues."
- All workloads are CPU-based; despite the filename, the paper does not evaluate GPU collectives, GPUDirect RDMA, or NCCL.
Open Problems
- Apply DPML to additional blocking and non-blocking collectives — Reduce, Allgather, Alltoall, Iallreduce — and quantify benefit.
- Integrate SHArP with non-blocking collectives.
- Develop a runtime auto-selector that picks
l(andkfor DPML-Pipelined) from message size, system size, CPU/interconnect identity, andppnrather than relying on offline empirical tuning. - Extend the cost model to handle the SHArP socket-leader case and multi-HCA NUMA topologies analytically.
- Re-evaluate on the (excluded) KNL + InfiniBand combination and on newer architectures with GPU acceleration / GPUDirect RDMA.
Note on NCCL Tuning
DPML's measurement of a 4.3-4.9x latency reduction at 512 KB when
going from 1 to 16 leaders, paired with the finding that more leaders
degrade sub-1-KB Allreduce, is a direct empirical statement of
the same trade-off that NCCL exposes via nChannels:
latency-bound small messages prefer minimum channel count,
bandwidth/throughput-bound large messages prefer maximum, and the
optimum depends on the interconnect's throughput-vs-concurrency curve
(the paper's Zone A / Zone B / Zone C characterization on Omni-Path).
The result that the optimal l is 4 on IB but 16 on OPA at
the same 8 KB message size is a concrete example of why per-collective
channel selection must condition on the interconnect identity, not just
the message size.