Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design

Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. (DK) Panda | The Ohio State University | SC17, Denver, CO, Nov 2017 | DOI: 10.1145/3126908.3126954

Note: filed under 0050_MVAPICH2-GDR.pdf; the actual paper is the SC17 DPML Allreduce work, implemented inside the MVAPICH2 codebase from which MVAPICH2-GDR derives.


Problem

Existing MPI_Allreduce designs in MPICH2, Open-MPI, and MVAPICH2 use a hierarchical strategy with one (or two) leader process(es) per node that gathers local data via shared memory, performs the inter-node reduction, and shared-memory-broadcasts the result back. This wastes the parallelism of modern multi-/many-core CPUs (Intel KNL has 64-72 cores), forces a single core to do ppn-1 reductions, and only exercises one inter-node transfer at a time even when the interconnect (Mellanox EDR IB or Intel Omni-Path at 100 Gb/s) can sustain many concurrent transfers from one node. Allreduce is the largest single source of MPI time in production HPC (37% per a 5-year study cited from Rabenseifner) and is increasingly used on medium-and-large messages by deep learning workloads — so this gap directly caps both scientific and DL scaling.


Core Insight

A multi-leader design that partitions each message into l chunks and assigns one leader per chunk simultaneously (a) parallelizes intra-node reduction compute over l cores, (b) reduces each per-leader inter-node message size by a factor of l, and (c) drives l concurrent inter-node transfers — which together cut Allreduce communication steps from lg p to lg h and exploit the interconnect's per-pair-vs-concurrent-pair curve in whatever regime (rate-limited Zone A, transition Zone B, or bandwidth-limited Zone C) the per-leader chunk lands in.


Method

DPML runs in four phases (Figure 2 of the paper):

  1. Local Copy to Shared Memory: every process splits its input into l partitions and copies them into specific SHM regions; equivalent to l parallel gathers.
  2. Intra-node Reduction by Leaders: each Leader_j reduces ppn-1 buffers of size n/l in parallel, sharing compute over l cores.
  3. Inter-node Allreduce by Leaders: each leader runs an inter-node Allreduce with the same-index leader on every other node, using the library's dynamically chosen inner algorithm (recursive doubling or reduce-scatter + allgather).
  4. Local Copy to Individual Processes: SHM-broadcast the fully-reduced partitions back to all local ranks.

DPML-Pipelined further sub-partitions each leader's post-Phase-2 buffer into k chunks, dispatching them via non-blocking Allreduce + waitall. Designed for very-large messages on Omni-Path that would otherwise still sit in the bandwidth-limited Zone C.

SHArP integration for small messages on IB:

A theoretical cost model extends Rabenseifner's by separating shared-memory costs (a', b') from inter-node costs (a, b), yielding the total DPML cost T_allreduce = 2*l*(a' + b'*(n/l)) + ((p/(h*l)) - 1)*n*c + ceil(lg h)*(a + (n*b)/l + (n*c)/l), showing communication steps reduced from lg p to lg h and per-step message size reduced by a factor of l.


Experimental Setup

Component Value
MPI implementation DPML inside MVAPICH2 codebase
Baselines MVAPICH2-2.2 ("MVAPICH2"), Intel MPI 2017.1.132 ("Intel MPI")
Mode Full subscription
Iterations 1,000+ microbench, 5 application runs avg
Cluster A 40 nodes, Dual Haswell 14c @ 2.40 GHz, 128 GB, Mellanox EDR ConnectX-4 100 Gb/s, SHArP-capable, CentOS 7.2.1511, OFED 3.4-2
Cluster B 648 nodes, Dell C6320, Dual Broadwell E5-2680v4 14c @ 2.40 GHz, 128 GB, Mellanox EDR ConnectX-4 100 Gb/s
Cluster C 752 nodes, Dual Haswell 14c @ 2.30 GHz, 128 GB DDR4, Intel Omni-Path 100 Series 100 Gb/s
Cluster D 508 nodes, Intel Xeon Phi 7250 KNL (68c, 4 HWT, 1.4 GHz), 96 GB DDR4 + 16 GB MCDRAM cache mode, Omni-Path fat-tree (8 core / 320 leaf, 5/4 oversubscription), CentOS 7
Op / dtype MPI_SUM / MPI_FLOAT
Microbenchmarks osu_mbw_mr, osu_allreduce
Applications HPCG (DDOT time), miniAMR (Overall Mesh Refinement time at refinement frequency = 1,000)
Leader sweep l ∈ {1, 2, 4, 8, 16}
Largest scale 10,240 procs / 160 nodes on Cluster D

Headline Quantitative Results

Architectural microbenchmark (Section 3 / Figure 1):

Number-of-leaders sweep:

SHArP designs (Cluster A, 16 nodes, Figure 8):

Comparison with state-of-the-art MPI libraries (Figure 9):

Application-level (Figure 11):


Limitations


Open Problems

  1. Apply DPML to additional blocking and non-blocking collectives — Reduce, Allgather, Alltoall, Iallreduce — and quantify benefit.
  2. Integrate SHArP with non-blocking collectives.
  3. Develop a runtime auto-selector that picks l (and k for DPML-Pipelined) from message size, system size, CPU/interconnect identity, and ppn rather than relying on offline empirical tuning.
  4. Extend the cost model to handle the SHArP socket-leader case and multi-HCA NUMA topologies analytically.
  5. Re-evaluate on the (excluded) KNL + InfiniBand combination and on newer architectures with GPU acceleration / GPUDirect RDMA.

Note on NCCL Tuning

DPML's measurement of a 4.3-4.9x latency reduction at 512 KB when going from 1 to 16 leaders, paired with the finding that more leaders degrade sub-1-KB Allreduce, is a direct empirical statement of the same trade-off that NCCL exposes via nChannels: latency-bound small messages prefer minimum channel count, bandwidth/throughput-bound large messages prefer maximum, and the optimum depends on the interconnect's throughput-vs-concurrency curve (the paper's Zone A / Zone B / Zone C characterization on Omni-Path). The result that the optimal l is 4 on IB but 16 on OPA at the same 8 KB message size is a concrete example of why per-collective channel selection must condition on the interconnect identity, not just the message size.