Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design — Detailed Summary

Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. (DK) Panda | The Ohio State University | SC17 (International Conference for High Performance Computing, Networking, Storage and Analysis), Denver, CO, Nov 12-17, 2017 | DOI: 10.1145/3126908.3126954

Note on filename: the PDF is filed under 0050_MVAPICH2-GDR.pdf but the actual paper is the SC17 DPML Allreduce work by Bayatpour et al. The implementation is done inside the MVAPICH2 codebase (the OSU MPI stack from which MVAPICH2-GDR derives), which likely explains the filename.

Per-section summary organized by paper headings. Each section provides paragraph-level bullets and preserves all quantitative results, equations, table notations, and named methods.


Abstract

Keywords: MPI_Allreduce, Data Partitioning, Multi-Leader, SHArP, Collectives, MPI.


1. Introduction

Background and motivation:

Limitations of existing designs:

Hardware offload:

Driving research question (verbatim from paper):

Can we design novel communication algorithms for Allreduce primitive that takes advantage of the vast amount of parallelism available in modern multi-/many-core architectures as well as high throughput and high-end features exposed by modern interconnects like InfiniBand and Omni-Path to deliver best performance and scalability?

Concrete design challenges raised:

1.1 Contributions


2. Background

2.1 Reduction Collectives in MPI

2.2 Overview of SHArP


3. Communication Characteristics of Modern Architectures

The authors first measure raw point-to-point characteristics of intra-node shared-memory and inter-node IB/OPA fabrics using osu_mbw_mr from the OSU Microbenchmark Suite. The benchmark records throughput of N concurrent pairs relative to one pair.

Intra-node (shared memory on KNL — Figure 1(a)):

Inter-node InfiniBand (Figure 1(b), Mellanox EDR 100Gb/s):

Inter-node Omni-Path (Figure 1(c)/(d), 100Gb/s):

Reasoning about existing algorithms:


4. Proposed Designs

4.1 Data Partitioning-based Multi-Leader Allreduce (DPML)

DPML designates l leader processes per node which (a) share computation and (b) drive l concurrent inter-node transfers. The collective runs in four phases (Figure 2):

Phase 1 — Local Copy to Shared Memory:

Phase 2 — Intra-node Reduction by Leaders:

Phase 3 — Inter-node Allreduce by Leaders:

Phase 4 — Local Copy to Individual Processes:

4.2 Taking Advantage of High Message Rate: Pipelined Data Transfer

The Omni-Path message-size-vs-throughput curve has three zones:

DPML's design implications:

DPML-Pipelined:

4.3 Taking Advantage of Advanced Network Offload: SHArP

Two SHArP integration designs:

  1. Node-level Leader (one leader per node):

    • One leader per node creates the SHArP communicator, gathers from local processes, transmits to the switch, and broadcasts the result back. SHArP itself distributes computation across the IB switches.
    • Drawback: in dual-socket nodes, gathering across sockets incurs a QPI inter-socket cost that dominates on multi-core machines.
  2. Socket-level Leader (one leader per socket; ppn/2 processes per leader):

    • One leader on each socket, communicating only with local-socket ranks; total number of processes participating in SHArP stays small (within SHArP's concurrency budget) but the QPI cost is avoided.
    • Each leader uses its local closest HCA — beneficial on multi-HCA machines because each HCA is exercised concurrently.

5. Modeling the Cost of Allreduce Operations

5.1 Cost Model (extends Rabenseifner [25])

Treats shared-memory and inter-node costs separately under a full-duplex assumption.

Notations (Table 1):

Symbol Description
p Number of MPI processes
h Number of nodes
l Number of leader processes per node
n Size of input vector in bytes
a Startup time per inter-node message
b Transfer time per byte for inter-node messages
a' Startup time for shared-memory copy
b' Transfer time per byte for shared-memory copy
c Computation cost per byte of one reduction op
k Number of sub-partitions used in DPML-Pipelined

Pure recursive doubling (Eq. 1): T_{r.d} = ceil(lg p) * (a + n*b + n*c)

5.2 Phases of the Algorithm (Eqs. 2-7)

  1. Copy to local leaders: T_copy = l * (a' + b' * (n/l))
  2. Intra-node reduction by leaders: T_comp = ((p / (h*l)) - 1) * n * c
  3. Inter-node Allreduce by leaders (recursive doubling inner step): T_comm = ceil(lg h) * (a + (n*b)/l + (n*c)/l)
  4. Local copy back: T_bcast = l * (a' + b' * (n/l))

Total (Eq. 7): T_allreduce = 2 * l * (a' + b' * (n/l)) + ((p / (h*l)) - 1) * n * c + ceil(lg h) * (a + (n*b)/l + (n*c)/l)

DPML-Pipelined (Eq. 5): T_comm_k = lg(h) * (a*k + (n*b)/l + (n*c)/l)

5.3 Discussion


6. Performance Evaluation

6.1 Experimental Setup

Cluster summary (also visualized in Figure 3 — KNL + IB excluded due to unavailability of large clusters with that combination):

Cluster Processor Nodes Cores/node RAM Interconnect
A — Xeon + IB w/ SHArP Dual Haswell, 2.40 GHz 40 28 (1,120 total) 128 GB Mellanox MT4115 EDR ConnectX-4 (100 Gbps), PCIe Gen3, CentOS 7.2.1511, kernel 3.10.0.2827.10.1.el7, OFED 3.4-2
B — Xeon + IB w/o SHArP Dell PowerEdge C6320, Dual Broadwell E5-2680 v4, 2.40 GHz 648 28 128 GB Mellanox MT4115 EDR ConnectX-4 (100 Gbps), PCIe Gen3
C — Xeon + Omni-Path Dual Haswell, 2.30 GHz 752 28 128 GB DDR4 Intel Omni-Path 100 Series (100 Gbps), PCIe Gen3
D — KNL + Omni-Path Intel Xeon Phi 7250 (68 cores, 4 HW threads/core, 1.4 GHz, 9 racks) 508 up to 64 used (no oversubscription) 96 GB DDR4 + 16 GB MCDRAM cache mode OPA fat-tree, 8 core-switches, 320 leaf switches, 5/4 oversubscription; 112 GB local SSD; CentOS 7

6.2 Impact of Number of Leaders on Performance

Tested with l ∈ {1, 2, 4, 8, 16} on Xeon+IB (Fig 4, Cluster A 448 procs/16 nodes/28 ppn), Xeon+IB (Fig 5, Cluster B 1,792 procs/64 nodes/28 ppn), Xeon+OPA (Fig 6, Cluster C 1,792/64/28), KNL+OPA (Fig 7, Cluster D 1,024/32/32). Op = MPI_SUM, datatype = MPI_FLOAT.

6.3 Impact of SHArP on Communication Performance

(Cluster A, 16 nodes, Figure 8; node-leader vs socket-leader designs.)

6.4 Comparison with State-of-the-art MPI Libraries

(Figure 9, optimal l chosen empirically per cluster; Intel MPI not available on Clusters A and B.)

6.5 Performance of HPCG

6.6 Performance of miniAMR


Three categories surveyed:

Modeling and redesigning collective algorithms:

Hardware offloading mechanisms:

Shared-memory-based collectives:


8. Conclusion and Future Work


Appendix A — Named Methods, Algorithms, and Benchmarks

Named designs introduced:

Existing/baseline algorithms used:

Hardware-offload primitives:

Microbenchmarks and applications:


Appendix B — Quantitative Results Compendium

Regime Result
512 KB Allreduce, Cluster B (Xeon+IB) 16 leaders 4.9x faster than 1 leader
512 KB Allreduce, Cluster C (Xeon+OPA) 16 leaders 4.3x faster than 1 leader
Best l for 8 KB on IB clusters l = 4
Best l for 8 KB on OPA clusters l = 16
1 ppn, small messages, SHArP 2.5x over host-based
4 ppn, small messages, node/socket-leader SHArP 1.8x / 2x over host-based
28 ppn, small messages, socket-leader SHArP 73% over host-based
28 ppn, small messages, node-leader SHArP 46% over host-based
Cluster A vs MVAPICH2 default up to 3.59x
Cluster B vs MVAPICH2 default up to 3.08x
Cluster C vs Intel MPI / MVAPICH2 up to 2.98x / 1.4x
Cluster D vs Intel MPI / MVAPICH2 up to 2.3x / 3.31x
10,240 procs, Cluster D vs MVAPICH2 / Intel MPI 207% / 48%
HPCG, 56 procs, Cluster A 35% improvement
HPCG, 224 procs, Cluster A 10% improvement
miniAMR, 1,792 procs, Cluster C 40% over MVAPICH2 / 20% over Intel MPI
miniAMR, up to 2,048 procs, Cluster D 60% over MVAPICH2 / 20% over Intel MPI

Note on NCCL Tuning

DPML's central observation — that the optimal number of concurrent leaders per node depends jointly on message size, process count, and the interconnect's throughput-vs-concurrency curve — is the same trade-off NCCL exposes via nChannels. The 3-zone characterization of Omni-Path (Zone A rate-limited / Zone B transition / Zone C bandwidth-limited) is a hardware-grounded explanation for why a fixed nChannels is suboptimal across message sizes: small messages live in a regime where extra channels buy throughput, but large messages on a saturated link gain nothing from more channels. The paper's measured 4.3-4.9x latency reduction at 512 KB when going from 1 to 16 leaders is direct evidence that medium/large collectives benefit from aggressive parallelism, while the small-message result (more leaders hurt) is direct evidence that latency-bound regimes prefer the minimum channel count — a pattern any per-collective NCCL config selector should respect.