A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo | Tsinghua University, ByteDance, Google | OSDI 2020

Problem

Production GPU clusters are heterogeneous: GPUs are usually fully utilized (~96% peak allocation), but a 3-month ByteDance trace shows CPU utilization averages only 20–35% and 20–45% of GPU machines have idle NIC bandwidth because they are running non-distributed jobs. Existing data-parallel architectures cannot use these spare resources: all-reduce participates only on GPU machines and is bandwidth-optimal only at k = 0 extra CPU machines, while non-colocated PS is communication-optimal only at k = n and additionally suffers severe implementation bottlenecks. For arbitrary 0 < k < n, neither is optimal, so distributed-training records have been dominated by all-reduce despite PS having a higher theoretical ceiling.

Core Insight

A single workload-allocation rule, parameterized by the number of GPU machines n and CPU machines k, is provably communication-optimal across the full 0 ≤ k ≤ n regime — and reduces to all-reduce at k = 0 and to non-colocated PS at k = n as boundary cases. The remaining gap between theory and practice is closed by (i) a Summation Service that splits the optimizer so only AVX-friendly gradient summation runs on CPU while parameter update runs on GPU, and (ii) topology-aware intra-machine schedules.

Method

Three-level taxonomy of contributions:

+-------------------------------------------------------------+
| Inter-machine Communication (proved optimal)                |
|   Two service classes: SSCPU (on CPU machines)              |
|                        SSGPU (on GPU machines)              |
|   Workload assignment (Theorem 1):                          |
|     M_SSCPU = 2(n-1) / (n^2 + kn - 2k) * M                  |
|     M_SSGPU = (n - k) / (n^2 + kn - 2k) * M                 |
|   Optimal time:  t_opt = 2n(n-1)M / [(n^2+kn-2k) * B]       |
|   Speedups:  g_a vs all-reduce, g_p vs non-colocated PS     |
|   Tensors partitioned into <=4 MB units, consistently       |
|     hashed into [0, n^2+kn-2k) for load balancing.          |
+-------------------------------------------------------------+
| Intra-machine Communication (topology-aware)                |
|   PCIe-only:  reduce-scatter + CPU-assisted aggregation     |
|     + all-gather (6-step pipeline).                         |
|   NVLink: reduce/broadcast with a non-NIC GPU root          |
|     so that the NIC's PCIe path stays free.                 |
|   Generalized rule on (Sn, Sg) — PCIe switches with         |
|     NIC vs without.                                         |
+-------------------------------------------------------------+
| Summation Service (split optimizer)                         |
|   CPU runs gradient summation only (AVX + OpenMP),          |
|     <3 cores saturate 100 Gb/s NIC.                         |
|   GPU runs parameter update (<0.5% extra FLOPs).            |
|   Async-training preserved by sending delta-parameter       |
|     w'_t+1 - w_t to SS (Theorem 2: equivalent to            |
|     Async Parallel).                                        |
+-------------------------------------------------------------+
| RDMA Implementation (closes the gap to t_opt)               |
|   Reuse registered tensor buffers; shm path for loopback;   |
|   page-aligned memory; one SGE per RDMA WRITE.              |
|   Multi-stage pipeline; priority scheduling per [34, 55].   |
+-------------------------------------------------------------+

The all-reduce vs. non-colocated PS speedups are g_a = (n^2 + kn - 2k) / n^2 and g_p = (n^2 + kn - 2k) / [2k(n-1)]; with n = 32, k = 16 this gives g_a = 1.46 and g_p = 1.52. At k = n and n -> infinity, g_a doubles to 2.

Experimental Setup

Component	Value
Cluster	ByteDance production datacenter, RoCEv2 full-bisection
GPU machines	up to 32 machines, each 8x V100 32 GB (PCIe-only or NVLink-based)
CPU machines	scheduler-allocated, < 4 cores per machine for SS
NIC	1x 100 GbE RDMA per machine; point-to-point ≈ 90 Gb/s
Frameworks	TensorFlow, PyTorch, MXNet (BytePS plugins)
Baselines	Horovod 0.19 + NCCL 2.5.7 (all-reduce); native TF PS, MXNet PS-RDMA
Models	ResNet-50, VGG-16, UGATIT GAN (CV); Transformer, BERT-Large, GPT-2 (NLP)
Scale	8 → 256 GPUs (1 → 32 GPU machines)
Tensor partition	4 MB
RDMA verb	RDMA WRITE; DCQCN congestion control [75]
Code change to switch backends	< 20 LoC per model

Headline Quantitative Results

RDMA implementation speedup (Table 2, microbenchmark, 1 CPU + 1 GPU machine):

Solution	Throughput (Gb/s)	Speedup
baseline	41	1.00x
+shm	52	1.27x
+shm +aligned	76	1.85x
all (above + 1 SGE)	89	2.17x

Inter-machine microbenchmark (Fig. 12, 8x 1-GPU + 0..8 CPU): BytePS within 1–9% of theoretical optimum across all k; all-reduce flat in k; PS bottlenecked.

Leverage CPU machines (Fig. 13, 8x 8-V100 + 0..8 CPU):

BytePS with CPU vs without: up to 20% end-to-end speedup; SS uses < 4 cores per CPU machine — effectively free.
BytePS over all-reduce: up to 45%.
NVLink machines benefit more than PCIe-only machines.

Intra-machine (Fig. 14, no CPU machines):

PCIe-only: CPU-assisted aggregation gives up to 20% over the strawman.
NVLink: root=2 (non-NIC) optimal; root=2,3 ties; root=all worse; root=0 (≈ default NCCL) worst.

Scaling at 256 V100 32 GB GPUs (Figs. 15, 16):

BytePS (with CPU) over all-reduce: 10% – 84% across six models.
BytePS (no CPU) over all-reduce: 9% – 53%.
BytePS over native PS: up to 245%.
ResNet-50 scaling efficiency: 97.5% (BytePS) vs. 88% (all-reduce); 5/6 models > 91.6% with BytePS; UGATIT 74% — least scalable but largest gain over all-reduce (84%).

VGG-16 improvement breakdown vs native PS (52% total at 256 GPUs): 19% intra-server design + 18% Summation Service + 15% RDMA/pipelining implementation.

Theoretical analysis numbers: PCIe-only optimal job time J* = 0.129 s; pure CPU-assisted J(0) = 0.141 s (9% above optimum, but chosen because brute-force copy stresses CPU memory by 4x); CPU-assisted vs. ring all-reduce: 23.7% smaller completion time at production link speeds (p=2, n=4, b_bottleneck=80 Gb/s, b(S_j,C_j)=105 Gb/s).

Summation Service overhead (parameter-update FLOPs / FP+BP FLOPs, SGD): VGG-16 0.43%; ResNet-50 0.33%; BERT-Large 0.078%.

Limitations

Authors cannot yet rigorously explain why BytePS beats all-reduce at k = 0 even without intra-machine optimization (Fig. 14 surprise); they hypothesize reduced cross-node ordering vs all-reduce but say a distributed profiler is needed.
Elasticity (dynamically scaling k mid-training) is acknowledged as future work; the current scheduler still pins k.
GPU-Direct RDMA is excluded because PCIe-only topology cannot meet its same-switch requirement and most clouds (e.g., AWS) do not expose it.
Results target 100 GbE RoCEv2 with full bisection; behavior on slower or non-bisection networks is not characterized.
Convergence comparisons are framed as system scalability; algorithm-level scalability (hyperparameters, convergence with very large batches) is intentionally outside scope.
Evaluation is on V100 32 GB nodes; A100 / H100 with NVSwitch are not tested.
Only one tensor-partition size (4 MB) is fixed across all experiments — sensitivity to partition size is not quantified.

Open Problems

Distributed profiler that builds a complete cross-node timeline to verify the asynchronicity hypothesis (why BytePS beats all-reduce at k = 0).
Elastic CPU scheduling for in-flight training jobs, exploiting that k only affects performance (not convergence).
Generalization of BytePS principles to model parallelism (Megatron-LM [62], Mesh-TensorFlow [61]), where the all-reduce primitive used is also accelerable by BytePS.
Integration with new hardware — InfiniBand-switch ASIC aggregation [28], P4 in-network aggregation [58, 59], SmartNICs [46] — to offload Summation Service away from host CPU.
Generalizing the (Sn, Sg) intra-machine rule to multi-NIC and richer accelerator-attached topologies beyond the two production topologies measured.

Note on NCCL Tuning

The paper shows that on asymmetric NVLink machines (one PCIe switch carries the NIC), the choice of reduce/broadcast root materially changes throughput: root=2 (non-NIC) is optimal, root=0 (default NCCL behavior, GPU adjacent to NIC) is worst (Fig. 14b). The directly relevant finding for NCCL configuration tuning is that the right collective and its root depend on whether a GPU shares a PCIe switch with the NIC — a property invisible to default NCCL ranking heuristics on asymmetric servers. The 4 MB partition size used by BytePS is also a useful prior for chunk granularity when pipelining sends and receives.