A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters — Detailed Summary

Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo | Tsinghua University, ByteDance, Google | OSDI 2020 (14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction

Background:

Two existing distributed-training architectures:

Motivating gap:

Key design ideas:

Contributions enumerated:

  1. New distributed DNN training architecture for heterogeneous GPU/CPU clusters that achieves communication optimality and contains all-reduce and PS as special cases.
  2. Optimal intra-machine communication strategies tailored to PCIe-only and NVLink-based GPU machines.
  3. Summation Service abstraction that removes the CPU bottleneck of classical PS.

2. Background

2.1 Distributed DNN Training

2.2 All-reduce

2.3 Parameter Server (PS)

Architecture Per-iteration comm time Optimal?
All-reduce 2(n-1)M/(nB) Only if k = 0
Non-Colocated PS max(M/B, nM/(kB)) Only if k = n
Colocated PS 2(n-1)M/(nB) Only if k = 0

3. Motivation and BytePS Architecture

3.1 Motivation

BytePS goals:

  1. Always communication-optimal for any 0 <= k <= n (and adaptive as the cluster scheduler changes k over time), with proofs.
  2. Optimal intra-machine communication for diverse PCIe/NVLink topologies; all-reduce and PS are special cases.
  3. Communication time close to the theoretical optimum in practice — original PS is far from its theoretical limit, so BytePS removes the implementation bottlenecks.
  4. Generic to DNN training; supports TensorFlow, PyTorch, MXNet.

3.2 Architecture Overview


4. BytePS Communication Design

4.1 Inter-machine Communication

4.1.1 Communication Efficiency Analysis

4.2 Intra-machine Communication

4.2.1 PCIe-only Topology

4.2.3 Discussion


5. Summation Service

Asynchronous training support:


6. Implementation

6.1 Multi-Stage Pipeline

6.2 Address RDMA Performance Issues

Solution Throughput (Gb/s) Speedup vs. baseline
baseline 41 1.00x
+shm 52 1.27x
+shm +aligned 76 1.85x
all (above + 1 SGE) 89 2.17x

6.3 BytePS Usage


7. Evaluation

Fidelity highlights:

7.1 Inter-machine Microbenchmarks

7.2 Leverage CPU Machines

7.3 Adapt to Intra-machine Topology

7.4 Scalability

Comparison Speedup range
BytePS (with CPU) over all-reduce 10% — 84%
BytePS (no CPU) over all-reduce 9% — 53%
BytePS over native PS up to 245% (and 52% on VGG-16 specifically)
Model BytePS scaling efficiency
ResNet-50 97.5% (all-reduce: 88%)
5 of 6 models other than UGATIT >= 91.6%
UGATIT (least scalable) 74% (BytePS), with the largest gap over all-reduce — 84% gain

8. Observations and Discussion



10. Conclusion


11. Acknowledgement


Note on NCCL Tuning

BytePS measures and uses two intra-machine collective layouts that NCCL itself selects between (reduce-scatter+all-gather for symmetric PCIe-only topologies vs. reduce/broadcast with a non-NIC root for asymmetric NVLink topologies); the paper shows root choice alone changes throughput substantially (Fig. 14b: root=2 best, root=0 worst). For NCCL configuration tuning, the directly relevant finding is that the optimal collective and its root depend on whether the NIC shares a PCIe switch with one of the GPUs — a property invisible to default NCCL ranking heuristics on asymmetric machines. The paper also quantifies the small-tensor partitioning bound (4 MB) at which pipelining and load-balancing become beneficial, a useful prior on chunk granularity.