A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters
Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo | Tsinghua University, ByteDance, Google | OSDI 2020
Problem
Production GPU clusters are heterogeneous: GPUs are usually fully
utilized (~96% peak allocation), but a 3-month ByteDance trace shows CPU
utilization averages only 20–35% and 20–45% of GPU machines have idle
NIC bandwidth because they are running non-distributed jobs. Existing
data-parallel architectures cannot use these spare resources: all-reduce
runs only on GPU machines and is bandwidth-optimal only when there are
k = 0 extra CPU machines, while non-colocated PS is
communication-optimal only at k = n and additionally
suffers severe implementation bottlenecks. For arbitrary
0 < k < n, neither is optimal, so
distributed-training records have been dominated by all-reduce despite
PS having a higher theoretical ceiling.
Core Insight
A single workload-allocation rule, parameterized by the number of GPU
machines n and CPU machines k, is
provably communication-optimal across the full
0 ≤ k ≤ n regime — and reduces to all-reduce at
k = 0 and to non-colocated PS at k = n as
boundary cases. The remaining gap between theory and practice is closed
by (i) a Summation Service that splits the optimizer so
only AVX-friendly gradient summation runs on CPU while parameter update
runs on GPU, and (ii) topology-aware intra-machine schedules.
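The split described in (i) can be sketched in a few lines. This is a minimal illustration, not BytePS code: the function names are hypothetical, plain SGD on the averaged gradient is an assumed optimizer, and Python lists stand in for tensors. The async variant follows the delta-parameter scheme listed under Method, where the worker sends w'_{t+1} - w_t instead of a gradient.

```python
# Sketch of the split-optimizer idea: the Summation Service (SS) only sums,
# and each worker applies the parameter update itself.
# All names are hypothetical and not taken from BytePS.

def ss_sum(gradients):
    """CPU side: element-wise sum of per-worker gradients (no optimizer state)."""
    return [sum(vals) for vals in zip(*gradients)]

def worker_sgd_update(params, summed_grad, lr, num_workers):
    """GPU side: plain SGD step on the averaged gradient (assumed optimizer)."""
    return [w - lr * g / num_workers for w, g in zip(params, summed_grad)]

def sync_step(params, gradients, lr):
    """One synchronous step: SS sums once, every worker updates identically."""
    summed = ss_sum(gradients)
    n = len(gradients)
    return [worker_sgd_update(params, summed, lr, n) for _ in range(n)]

def async_worker_step(pulled_params, grad, lr):
    """Async worker: update locally, then report only the delta w' - w."""
    new_params = [w - lr * g for w, g in zip(pulled_params, grad)]
    return [nw - w for nw, w in zip(new_params, pulled_params)]

def async_ss_apply_delta(ss_params, delta):
    """Async SS: still summation only, adding the delta to its parameter copy."""
    return [w + d for w, d in zip(ss_params, delta)]
```

In both modes the service performs only additions, which is the property that lets it run on a few CPU cores while optimizer-specific arithmetic stays on the GPU.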
Method
Four-part breakdown of contributions:
+-------------------------------------------------------------+
| Inter-machine Communication (proved optimal) |
| Two service classes: SSCPU (on CPU machines) |
| SSGPU (on GPU machines) |
| Workload assignment (Theorem 1): |
| M_SSCPU = 2(n-1) / (n^2 + kn - 2k) * M |
| M_SSGPU = (n - k) / (n^2 + kn - 2k) * M |
| Optimal time: t_opt = 2n(n-1)M / [(n^2+kn-2k) * B] |
| Speedups: g_a vs all-reduce, g_p vs non-colocated PS |
| Tensors partitioned into <=4 MB units, consistently |
| hashed into [0, n^2+kn-2k) for load balancing. |
+-------------------------------------------------------------+
| Intra-machine Communication (topology-aware) |
| PCIe-only: reduce-scatter + CPU-assisted aggregation |
| + all-gather (6-step pipeline). |
| NVLink: reduce/broadcast with a non-NIC GPU root |
| so that the NIC's PCIe path stays free. |
| Generalized rule on (Sn, Sg) — PCIe switches with |
| NIC vs without. |
+-------------------------------------------------------------+
| Summation Service (split optimizer) |
| CPU runs gradient summation only (AVX + OpenMP), |
| <3 cores saturate 100 Gb/s NIC. |
| GPU runs parameter update (<0.5% extra FLOPs). |
| Async-training preserved by sending delta-parameter |
| w'_{t+1} - w_t to SS (Theorem 2: equivalent to |
| Async Parallel). |
+-------------------------------------------------------------+
| RDMA Implementation (closes the gap to t_opt) |
| Reuse registered tensor buffers; shm path for loopback; |
| page-aligned memory; one SGE per RDMA WRITE. |
| Multi-stage pipeline; priority scheduling per [34, 55]. |
+-------------------------------------------------------------+
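The partition-and-hash step in the inter-machine box can be illustrated as follows. This is a sketch under stated assumptions rather than the BytePS implementation: the slot layout (each SSCPU owning 2(n-1) slots and each SSGPU owning n-k slots) is one way to make the hash range [0, n^2+kn-2k) reproduce the Theorem 1 ratios, CRC32 is an arbitrary stand-in for the unspecified hash, and all function names are hypothetical. It assumes n >= 2 and 0 <= k <= n.

```python
import zlib

PARTITION_BYTES = 4 * 1024 * 1024  # 4 MB unit size, as stated in the summary

def partition_sizes(tensor_bytes, unit=PARTITION_BYTES):
    """Split a tensor of `tensor_bytes` into chunks no larger than `unit`."""
    full, rest = divmod(tensor_bytes, unit)
    return [unit] * full + ([rest] if rest else [])

def total_slots(n, k):
    """Size of the hash range [0, n^2 + kn - 2k)."""
    return n * n + k * n - 2 * k

def server_for_slot(slot, n, k):
    """Map a slot to a server so that load follows the Theorem 1 ratios.

    Assumed layout: each SSCPU owns 2(n-1) slots and each SSGPU owns (n-k)
    slots, which totals k*2(n-1) + n*(n-k) = n^2 + kn - 2k.
    """
    cpu_slots = 2 * (n - 1)
    if slot < k * cpu_slots:
        return ("cpu", slot // cpu_slots)
    # Only reached when k < n, so the divisor (n - k) is never zero.
    return ("gpu", (slot - k * cpu_slots) // (n - k))

def server_for_partition(index, n, k):
    """Hash a partition index into the slot range (CRC32 is an assumption)."""
    slot = zlib.crc32(index.to_bytes(8, "little")) % total_slots(n, k)
    return server_for_slot(slot, n, k)
```

With many small partitions, the fraction of traffic each server receives approaches its share of slots, which is why a fixed small unit size helps load balancing.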
The speedups over all-reduce and over non-colocated PS are
g_a = (n^2 + kn - 2k) / n^2 and
g_p = (n^2 + kn - 2k) / [2k(n-1)], respectively; with
n = 32, k = 16 this gives g_a = 1.46 and
g_p = 1.52. At k = n, g_a approaches 2 as
n -> infinity.
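The formulas above are easy to check numerically. The helper below is a small sketch that evaluates the Theorem 1 workload split, the optimal time, and both speedups; the function names are hypothetical, and M and B are in whatever consistent units the caller chooses (for example bytes and bytes per second).

```python
def denom(n, k):
    """Common term n^2 + kn - 2k shared by all formulas."""
    return n * n + k * n - 2 * k

def workload(n, k, M):
    """Per-server traffic (M_SSCPU, M_SSGPU) from Theorem 1."""
    d = denom(n, k)
    return 2 * (n - 1) * M / d, (n - k) * M / d

def t_opt(n, k, M, B):
    """Optimal communication time for model size M and NIC bandwidth B."""
    return 2 * n * (n - 1) * M / (denom(n, k) * B)

def speedup_vs_allreduce(n, k):
    """g_a = (n^2 + kn - 2k) / n^2."""
    return denom(n, k) / (n * n)

def speedup_vs_ps(n, k):
    """g_p = (n^2 + kn - 2k) / [2k(n-1)]; requires k >= 1 and n >= 2."""
    return denom(n, k) / (2 * k * (n - 1))

if __name__ == "__main__":
    print(speedup_vs_allreduce(32, 16), speedup_vs_ps(32, 16))
```

For n = 32 and k = 16 the common term is 1504, so g_a = 1504/1024 and g_p = 1504/992. The boundary cases also fall out directly: g_a = 1 at k = 0 and g_p = 1 at k = n, matching the statement that the rule reduces to all-reduce and to non-colocated PS at the two extremes.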
Experimental Setup
| Component | Value |
|---|---|
| Cluster | ByteDance production datacenter, RoCEv2 full-bisection |
| GPU machines | up to 32 machines, each 8x V100 32 GB (PCIe-only or NVLink-based) |
| CPU machines | scheduler-allocated, < 4 cores per machine for SS |
| NIC | 1x 100 GbE RDMA per machine; point-to-point ≈ 90 Gb/s |
| Frameworks | TensorFlow, PyTorch, MXNet (BytePS plugins) |
| Baselines | Horovod 0.19 + NCCL 2.5.7 (all-reduce); native TF PS, MXNet PS-RDMA |
| Models | ResNet-50, VGG-16, UGATIT GAN (CV); Transformer, BERT-Large, GPT-2 (NLP) |
| Scale | 8 → 256 GPUs (1 → 32 GPU machines) |
| Tensor partition | 4 MB |
| RDMA verb | RDMA WRITE; DCQCN congestion control [75] |
| Code change to switch backends | < 20 LoC per model |
Headline Quantitative Results
RDMA implementation speedup (Table 2, microbenchmark, 1 CPU + 1 GPU machine):
| Solution | Throughput (Gb/s) | Speedup |
|---|---|---|
| baseline | 41 | 1.00x |
| +shm | 52 | 1.27x |
| +shm +aligned | 76 | 1.85x |
| all (above + 1 SGE) | 89 | 2.17x |
Inter-machine microbenchmark (Fig. 12, 8x 1-GPU + 0..8
CPU): BytePS within 1–9% of theoretical optimum across all
k; all-reduce flat in k; PS bottlenecked.
Leverage CPU machines (Fig. 13, 8x 8-V100 + 0..8 CPU):
- BytePS with CPU vs without: up to 20% end-to-end speedup; SS uses < 4 cores per CPU machine — effectively free.
- BytePS over all-reduce: up to 45%.
- NVLink machines benefit more than PCIe-only machines.
Intra-machine (Fig. 14, no CPU machines):
- PCIe-only: CPU-assisted aggregation gives up to 20% over the strawman.
- NVLink: root=2 (non-NIC) optimal; root=2,3 ties; root=all worse; root=0 (≈ default NCCL) worst.
Scaling at 256 V100 32 GB GPUs (Figs. 15, 16):
- BytePS (with CPU) over all-reduce: 10% – 84% across six models.
- BytePS (no CPU) over all-reduce: 9% – 53%.
- BytePS over native PS: up to 245%.
- ResNet-50 scaling efficiency: 97.5% (BytePS) vs. 88% (all-reduce); 5/6 models > 91.6% with BytePS; UGATIT 74% — least scalable but largest gain over all-reduce (84%).
VGG-16 improvement breakdown vs native PS (52% total at 256 GPUs): 19% intra-server design + 18% Summation Service + 15% RDMA/pipelining implementation.
Theoretical analysis numbers: PCIe-only optimal job
time J* = 0.129 s; pure CPU-assisted
J(0) = 0.141 s (9% above optimum, but chosen because
brute-force copy stresses CPU memory by 4x); CPU-assisted vs. ring
all-reduce: 23.7% smaller completion time at production
link speeds
(p=2, n=4, b_bottleneck=80 Gb/s, b(S_j,C_j)=105 Gb/s).
Summation Service overhead (parameter-update FLOPs / FP+BP FLOPs, SGD): VGG-16 0.43%; ResNet-50 0.33%; BERT-Large 0.078%.
Limitations
- The authors cannot yet rigorously explain why BytePS beats all-reduce at k = 0 even without intra-machine optimization (Fig. 14 surprise); they hypothesize reduced cross-node ordering vs all-reduce but say a distributed profiler is needed.
- Elasticity (dynamically scaling k mid-training) is acknowledged as future work; the current scheduler still pins k.
- GPU-Direct RDMA is excluded because PCIe-only topology cannot meet its same-switch requirement and most clouds (e.g., AWS) do not expose it.
- Results target 100 GbE RoCEv2 with full bisection; behavior on slower or non-bisection networks is not characterized.
- Convergence comparisons are framed as system scalability; algorithm-level scalability (hyperparameters, convergence with very large batches) is intentionally outside scope.
- Evaluation is on V100 32 GB nodes; A100 / H100 with NVSwitch are not tested.
- A single tensor-partition size (4 MB) is used across all experiments; sensitivity to partition size is not quantified.
Open Problems
- Distributed profiler that builds a complete cross-node timeline to verify the asynchronicity hypothesis (why BytePS beats all-reduce at k = 0).
- Elastic CPU scheduling for in-flight training jobs, exploiting that k only affects performance (not convergence).
- Generalization of BytePS principles to model parallelism (Megatron-LM [62], Mesh-TensorFlow [61]), where the all-reduce primitive used is also accelerable by BytePS.
- Integration with new hardware — InfiniBand-switch ASIC aggregation [28], P4 in-network aggregation [58, 59], SmartNICs [46] — to offload Summation Service away from host CPU.
- Generalizing the (Sn, Sg) intra-machine rule to multi-NIC and richer accelerator-attached topologies beyond the two production topologies measured.
Note on NCCL Tuning
The paper shows that on asymmetric NVLink machines (one PCIe switch carries the NIC), the choice of reduce/broadcast root materially changes throughput: root=2 (non-NIC) is optimal, root=0 (default NCCL behavior, GPU adjacent to NIC) is worst (Fig. 14b). The directly relevant finding for NCCL configuration tuning is that the right collective and its root depend on whether a GPU shares a PCIe switch with the NIC — a property invisible to default NCCL ranking heuristics on asymmetric servers. The 4 MB partition size used by BytePS is also a useful prior for chunk granularity when pipelining sends and receives.
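The root-selection rule in this note can be expressed as a small helper. This is a hedged sketch only: the topology dictionary, the switch labels, and the function name are hypothetical, and the code does not call NCCL or discover hardware. It simply encodes the rule "prefer a GPU that does not share a PCIe switch with the NIC".

```python
def pick_reduce_root(switch_to_gpus, nic_switch):
    """Return the lowest-numbered GPU that is not under the NIC's PCIe switch.

    `switch_to_gpus` maps a switch label to the GPU ids attached to it, and
    `nic_switch` is the label of the switch that also carries the NIC.
    """
    candidates = sorted(
        gpu
        for switch, gpus in switch_to_gpus.items()
        if switch != nic_switch
        for gpu in gpus
    )
    if candidates:
        return candidates[0]
    # Fallback: every GPU shares the NIC's switch, so no better choice exists.
    return min(switch_to_gpus[nic_switch])

# Hypothetical 8-GPU layout with the NIC on the same switch as GPUs 0 and 1.
topology = {"P0": [0, 1], "P1": [2, 3], "P2": [4, 5], "P3": [6, 7]}
root = pick_reduce_root(topology, nic_switch="P0")
```

Under this assumed layout the helper selects GPU 2, consistent with the root=2 result reported for Fig. 14b; on a real machine the mapping would have to come from the actual PCIe topology.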