A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo | Tsinghua University, ByteDance, Google | OSDI 2020


Problem

Production GPU clusters are heterogeneous: GPUs are usually fully utilized (~96% peak allocation), but a 3-month ByteDance trace shows CPU utilization averages only 20–35% and 20–45% of GPU machines have idle NIC bandwidth because they are running non-distributed jobs. Existing data-parallel architectures cannot use these spare resources: all-reduce participates only on GPU machines and is bandwidth-optimal only at k = 0 extra CPU machines, while non-colocated PS is communication-optimal only at k = n and additionally suffers severe implementation bottlenecks. For arbitrary 0 < k < n, neither is optimal, so distributed-training records have been dominated by all-reduce despite PS having a higher theoretical ceiling.


Core Insight

A single workload-allocation rule, parameterized by the number of GPU machines n and CPU machines k, is provably communication-optimal across the full 0 ≤ k ≤ n regime — and reduces to all-reduce at k = 0 and to non-colocated PS at k = n as boundary cases. The remaining gap between theory and practice is closed by (i) a Summation Service that splits the optimizer so only AVX-friendly gradient summation runs on CPU while parameter update runs on GPU, and (ii) topology-aware intra-machine schedules.


Method

Three-level taxonomy of contributions:

+-------------------------------------------------------------+
| Inter-machine Communication (proved optimal)                |
|   Two service classes: SSCPU (on CPU machines)              |
|                        SSGPU (on GPU machines)              |
|   Workload assignment (Theorem 1):                          |
|     M_SSCPU = 2(n-1) / (n^2 + kn - 2k) * M                  |
|     M_SSGPU = (n - k) / (n^2 + kn - 2k) * M                 |
|   Optimal time:  t_opt = 2n(n-1)M / [(n^2+kn-2k) * B]       |
|   Speedups:  g_a vs all-reduce, g_p vs non-colocated PS     |
|   Tensors partitioned into <=4 MB units, consistently       |
|     hashed into [0, n^2+kn-2k) for load balancing.          |
+-------------------------------------------------------------+
| Intra-machine Communication (topology-aware)                |
|   PCIe-only:  reduce-scatter + CPU-assisted aggregation     |
|     + all-gather (6-step pipeline).                         |
|   NVLink: reduce/broadcast with a non-NIC GPU root          |
|     so that the NIC's PCIe path stays free.                 |
|   Generalized rule on (Sn, Sg) — PCIe switches with         |
|     NIC vs without.                                         |
+-------------------------------------------------------------+
| Summation Service (split optimizer)                         |
|   CPU runs gradient summation only (AVX + OpenMP),          |
|     <3 cores saturate 100 Gb/s NIC.                         |
|   GPU runs parameter update (<0.5% extra FLOPs).            |
|   Async-training preserved by sending delta-parameter       |
|     w'_t+1 - w_t to SS (Theorem 2: equivalent to            |
|     Async Parallel).                                        |
+-------------------------------------------------------------+
| RDMA Implementation (closes the gap to t_opt)               |
|   Reuse registered tensor buffers; shm path for loopback;   |
|   page-aligned memory; one SGE per RDMA WRITE.              |
|   Multi-stage pipeline; priority scheduling per [34, 55].   |
+-------------------------------------------------------------+

The all-reduce vs. non-colocated PS speedups are g_a = (n^2 + kn - 2k) / n^2 and g_p = (n^2 + kn - 2k) / [2k(n-1)]; with n = 32, k = 16 this gives g_a = 1.46 and g_p = 1.52. At k = n and n -> infinity, g_a doubles to 2.


Experimental Setup

Component Value
Cluster ByteDance production datacenter, RoCEv2 full-bisection
GPU machines up to 32 machines, each 8x V100 32 GB (PCIe-only or NVLink-based)
CPU machines scheduler-allocated, < 4 cores per machine for SS
NIC 1x 100 GbE RDMA per machine; point-to-point ≈ 90 Gb/s
Frameworks TensorFlow, PyTorch, MXNet (BytePS plugins)
Baselines Horovod 0.19 + NCCL 2.5.7 (all-reduce); native TF PS, MXNet PS-RDMA
Models ResNet-50, VGG-16, UGATIT GAN (CV); Transformer, BERT-Large, GPT-2 (NLP)
Scale 8 → 256 GPUs (1 → 32 GPU machines)
Tensor partition 4 MB
RDMA verb RDMA WRITE; DCQCN congestion control [75]
Code change to switch backends < 20 LoC per model

Headline Quantitative Results

RDMA implementation speedup (Table 2, microbenchmark, 1 CPU + 1 GPU machine):

Solution Throughput (Gb/s) Speedup
baseline 41 1.00x
+shm 52 1.27x
+shm +aligned 76 1.85x
all (above + 1 SGE) 89 2.17x

Inter-machine microbenchmark (Fig. 12, 8x 1-GPU + 0..8 CPU): BytePS within 1–9% of theoretical optimum across all k; all-reduce flat in k; PS bottlenecked.

Leverage CPU machines (Fig. 13, 8x 8-V100 + 0..8 CPU):

Intra-machine (Fig. 14, no CPU machines):

Scaling at 256 V100 32 GB GPUs (Figs. 15, 16):

VGG-16 improvement breakdown vs native PS (52% total at 256 GPUs): 19% intra-server design + 18% Summation Service + 15% RDMA/pipelining implementation.

Theoretical analysis numbers: PCIe-only optimal job time J* = 0.129 s; pure CPU-assisted J(0) = 0.141 s (9% above optimum, but chosen because brute-force copy stresses CPU memory by 4x); CPU-assisted vs. ring all-reduce: 23.7% smaller completion time at production link speeds (p=2, n=4, b_bottleneck=80 Gb/s, b(S_j,C_j)=105 Gb/s).

Summation Service overhead (parameter-update FLOPs / FP+BP FLOPs, SGD): VGG-16 0.43%; ResNet-50 0.33%; BERT-Large 0.078%.


Limitations


Open Problems

  1. Distributed profiler that builds a complete cross-node timeline to verify the asynchronicity hypothesis (why BytePS beats all-reduce at k = 0).
  2. Elastic CPU scheduling for in-flight training jobs, exploiting that k only affects performance (not convergence).
  3. Generalization of BytePS principles to model parallelism (Megatron-LM [62], Mesh-TensorFlow [61]), where the all-reduce primitive used is also accelerable by BytePS.
  4. Integration with new hardware — InfiniBand-switch ASIC aggregation [28], P4 in-network aggregation [58, 59], SmartNICs [46] — to offload Summation Service away from host CPU.
  5. Generalizing the (Sn, Sg) intra-machine rule to multi-NIC and richer accelerator-attached topologies beyond the two production topologies measured.

Note on NCCL Tuning

The paper shows that on asymmetric NVLink machines (one PCIe switch carries the NIC), the choice of reduce/broadcast root materially changes throughput: root=2 (non-NIC) is optimal, root=0 (default NCCL behavior, GPU adjacent to NIC) is worst (Fig. 14b). The directly relevant finding for NCCL configuration tuning is that the right collective and its root depend on whether a GPU shares a PCIe switch with the NIC — a property invisible to default NCCL ranking heuristics on asymmetric servers. The 4 MB partition size used by BytePS is also a useful prior for chunk granularity when pipelining sends and receives.