A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters — Detailed Summary

Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo | Tsinghua University, ByteDance, Google | OSDI 2020 (14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

Production data center clusters that run DNN training jobs are inherently heterogeneous: GPUs and CPUs both provide compute, and network bandwidth is the binding distributed-training resource.
Existing distributed DNN training architectures — all-reduce and Parameter Server (PS) — cannot fully exploit such heterogeneous resources, leaving spare CPU and bandwidth on the table.
The paper presents BytePS, a new distributed DNN training architecture that leverages spare CPU cores and NIC bandwidth in the cluster to accelerate GPU-resident training jobs.
BytePS is proven communication-optimal for any allocation 0 <= k <= n of CPU machines to n GPU machines, and is unified: existing all-reduce (k=0) and non-colocated PS (k=n) are two special cases of BytePS.
BytePS introduces a Summation Service (SS) abstraction that splits a parameter optimizer into (i) gradient summation, accelerated on CPUs via AVX, and (ii) parameter update, executed on GPUs.
BytePS supports TensorFlow, PyTorch, and MXNet via Horovod-like and native APIs, and outperforms state-of-the-art open-source all-reduce by up to 84% and PS by up to 245% on training jobs running with up to 256 GPUs.

1. Introduction

Background:

DNN research has produced breakthroughs in CV, speech, NLP, and reinforcement learning; large GPU clusters with thousands of GPUs are standard for training.
GPU clusters contain not only GPUs but also high-end CPUs and high-speed networks (Ethernet or InfiniBand). CPU-only machines also exist for data preprocessing.
Production observation: GPUs are usually fully utilized, while CPU and network bandwidth often have spare capacity.

Two existing distributed-training architectures:

All-reduce: only GPU machines participate; gradients are aggregated via the all-reduce primitive.
Parameter Server (PS): GPU workers push gradients to PS processes (commonly on CPU machines); PS aggregates and runs the optimizer (SGD, Adam, etc.) and pushes the updated weights back.
All-reduce is bandwidth-optimal for a homogeneous GPU-only set [54], but loses optimality once additional CPU+bandwidth resources are available.

Motivating gap:

In theory, with extra CPU machines, PS can outperform all-reduce; in practice, existing PS implementations have inferior performance due to design and implementation bottlenecks, so distributed-training records [27, 49, 73] are dominated by all-reduce.
BytePS is therefore designed to be communication-optimal both in theory and in practice for arbitrary GPU/CPU machine ratios and arbitrary intra-machine PCIe/NVLink topologies.

Key design ideas:

BytePS unifies all-reduce and PS as boundary cases (k=0 and k=n) of a single optimal traffic-allocation strategy.
BytePS removes the CPU bottleneck observed in classical PS by splitting the optimizer: gradient summation stays on CPU (fits memory bandwidth and AVX); parameter update moves to GPU.
BytePS adopts pipelining and priority scheduling from prior work [34, 55] and resolves several RDMA performance issues.

Contributions enumerated:

New distributed DNN training architecture for heterogeneous GPU/CPU clusters that achieves communication optimality and contains all-reduce and PS as special cases.
Optimal intra-machine communication strategies tailored to PCIe-only and NVLink-based GPU machines.
Summation Service abstraction that removes the CPU bottleneck of classical PS.

BytePS has been deployed internally at ByteDance and open-sourced [4]; evaluation uses six DNN models, three frameworks, and 256 GPUs.

2. Background

2.1 Distributed DNN Training

Training a DNN consists of three steps per iteration: forward propagation (FP), backward propagation (BP), and parameter update via an optimizer (SGD [76], Adam [42], etc.).
The dominant distributed-training paradigm is data parallelism: each GPU keeps a full model replica, processes a different mini-batch, and synchronizes gradients every iteration.
Modern GPU clusters host hundreds-to-thousands of machines connected by RDMA networks; per-machine: multiple GPUs, tens of CPU cores, hundreds of GB DRAM, one to several 100 Gb/s NICs.
A public cluster trace [35] shows 50% of hosts have CPU utilization < 30% — supporting the spare-CPU thesis.
Notation used throughout: n = number of GPU machines; M = model size in bytes; B = network bandwidth per machine.

2.2 All-reduce

All-reduce originated from HPC and aggregates gradients collectively without any extra CPU machine; ring all-reduce is the most common algorithm.
A ring all-reduce splits into reduce-scatter then all-gather (Fig. 1):
- Reduce-scatter: each node sends (n-1)M/n bytes; M is partitioned into n segments.
- All-gather: each node again sends (n-1)M/n bytes broadcasting its reduced segment.
Total per-node traffic: 2(n-1)M/n; completion time: 2(n-1)M/(nB), proven optimal for uniform-bandwidth topologies with no extra resources [54].
For non-uniform / hierarchical topologies, the optimum is 2(n'-1)M/(n'B') where B' is the slowest link bandwidth and n' the number of nodes on the slowest layer; for simplicity the paper assumes one GPU per machine, n=n', B=B'.
All-reduce cannot use additional non-worker nodes — designed for homogeneous setups.

2.3 Parameter Server (PS)

PS [44] has two roles: workers (run FP/BP, push gradients, pull weights) and PS (aggregate gradients, run optimizer). PS commonly runs on CPU machines because GPUs are far more expensive (AWS p3.16xlarge: ~$25/hr; r4.16xlarge same minus GPUs: $4.2/hr).
Two PS placement strategies (Fig. 2):
- Non-colocated mode (Fig. 2a): PS runs on k dedicated CPU machines. Each GPU worker sends/receives M bytes per iteration; each CPU machine handles nM/k bytes.
- Colocated mode (Fig. 2b): PS processes run on the GPU machines themselves and reuse spare CPU resources; loop-back communication for local PS-worker traffic.
Theoretical PS communication time: max(M/B, nM/(kB)).
When k=n, PS is communication optimal and faster than all-reduce; when k <= n/2, non-colocated PS becomes slower than all-reduce because the CPU machines bottleneck.
Comparison summary (paper Table 1):

Architecture	Per-iteration comm time	Optimal?
All-reduce	`2(n-1)M/(nB)`	Only if `k = 0`
Non-Colocated PS	`max(M/B, nM/(kB))`	Only if `k = n`
Colocated PS	`2(n-1)M/(nB)`	Only if `k = 0`

Non-colocated PS can leverage extra CPU+bandwidth but may under-utilize GPU machine resources; colocated PS and all-reduce use GPU-machine resources well but cannot consume extra CPU machines.
PS supports asynchronous training (mitigates stragglers) but converges more slowly; the paper focuses on synchronous training and addresses async briefly in §5.

3. Motivation and BytePS Architecture

3.1 Motivation

ByteDance's internal users mostly chose all-reduce because existing PS performed worse, except where async-PS was preferred for straggler tolerance.
Opportunity (Fig. 3): A 3-month trace (2020-01-01 to 2020-03-31) of a cluster with thousands of GPUs shows GPUs near 96% allocation at peaks but only 55%-80% of GPU machines are running distributed jobs at any time, leaving the network bandwidth of 20%-45% of GPU machines idle. Cluster-wide CPU utilization is 20%-35%, consistent with [35].
Existing architectures are insufficient: all-reduce and colocated PS only use GPU-worker resources; non-colocated PS may under-utilize GPU-worker NIC and CPU. For 0 < k < n, neither is optimal.
End-to-end demonstration (Fig. 4): VGG-16, 32x V100 GPUs (4 GPU machines), 100 GbE RDMA, batch size 32 per GPU. MXNet PS-RDMA [1] and NCCL-2.5.7 [13] all-reduce both fall far below 32x linear scaling. Even ByteScheduler [55] (state-of-the-art) on top of PS or all-reduce remains far from linear.
These results motivate BytePS to leverage spare CPUs/bandwidth dynamically.

BytePS goals:

Always communication-optimal for any 0 <= k <= n (and adaptive as the cluster scheduler changes k over time), with proofs.
Optimal intra-machine communication for diverse PCIe/NVLink topologies; all-reduce and PS are special cases.
Communication time close to the theoretical optimum in practice — original PS is far from its theoretical limit, so BytePS removes the implementation bottlenecks.
Generic to DNN training; supports TensorFlow, PyTorch, MXNet.

3.2 Architecture Overview

BytePS components (Fig. 5):
- Communication Service (CS) — runs on every GPU machine, internally aggregates/broadcasts among local GPUs and externally exchanges tensors with SS.
- Summation Service (SS) — runs on the CPU of every machine (CPU-only machines and GPU machines), receives tensors, sums them, returns sums to CS.
"CPU machine" includes both dedicated CPU-only machines and GPU machines whose spare CPU cores and NIC are scheduled to run SS — improving cluster-wide utilization.
SS is much simpler than classical PS: only sum, no full optimizer state. The full optimizer (parameter update step) lives in CS on GPU machines.
CS responsibilities: (1) decide traffic allocation to each SS (analyzed in §4.1), (2) pick local tensor aggregation strategy per intra-machine topology (§4.2), (3) be RDMA-optimized (§6.2).
BytePS gracefully degenerates: k=0 -> SS only on GPU machines (= all-reduce); k=n -> SS only on CPU machines (= non-colocated PS); for intermediate k, all SS instances cooperate with the analytically-derived optimal split.

4. BytePS Communication Design

4.1 Inter-machine Communication

All inter-machine traffic is between CS and SS; the design balances per-machine traffic so no node is the bottleneck.
Assumes full-bisection-bandwidth network (common in DL clusters [52]) and modern RDMA congestion control such as DCQCN [75], so any node can hit its NIC bandwidth B.
SS is split into SSCPU (running on CPU machines, traffic governed only by the CPU machine's NIC) and SSGPU (running on GPU machines, sharing NIC bandwidth with that machine's CS).
BytePS assigns workload (bytes summed per machine) per the equations:
- Eq. 1: M_{SSCPU} = 2(n-1) / (n^2 + kn - 2k) * M
- Eq. 2: M_{SSGPU} = (n-k) / (n^2 + kn - 2k) * M
- Constraints k >= 1, n >= 2, k <= n. Outside these, BytePS degenerates to all-reduce or PS.
Since DNN tensors have variable sizes, BytePS partitions tensors into parts of at most 4 MB and consistently hashes each part's index into [0, n^2 + kn - 2k) to approximate Eq. 1 and Eq. 2; consistent indexing/hashing guarantees the same part from all GPUs lands at the same SS.

4.1.1 Communication Efficiency Analysis

The paper proves Theorem 1: the workload assignment in Eq. 1 and Eq. 2 minimizes inter-machine communication time.
Per-machine traffic accounting (assuming M >> 4 MB partition):
- GPU machine runs CS and SS; its CS sends/receives M - M_{SSGPU} bytes externally (the SSGPU portion stays loop-back). Its SSGPU receives (n-1) * M_{SSGPU} bytes from n-1 other GPU machines.
- Eq. 3: t_g = [M + (n-2) * M_{SSGPU}] / B.
- Eq. 4: t_c = M_{SSCPU} / B (per CPU machine).
- Eq. 5: M = k * M_{SSCPU} + n * M_{SSGPU} (total work conservation).
Final communication time = max(t_c, t_g). Minimum occurs when t_c = t_g; solving gives Eq. 1 and Eq. 2.
Combining Eq. 3 and Eq. 2 yields the optimal time:
- Eq. 6: t_opt = 2n(n-1)M / [(n^2 + kn - 2k) B].
Two acceleration ratios are defined:
- Eq. 7 (left): g_a = (n^2 + kn - 2k) / n^2 — speedup vs. plain all-reduce.
- Eq. 7 (right): g_p = (n^2 + kn - 2k) / [2k(n-1)] — speedup vs. non-colocated PS.
Boundary behaviour:
- When k = n and n -> infinity: g_a = 2. BytePS doubles all-reduce throughput in the large-cluster, abundant-CPU regime.
- When k is small, g_p is large, because non-colocated PS bottlenecks on CPU NICs.
- Concrete example: n = 32, k = 16 -> g_a = 1.46, g_p = 1.52 — BytePS is 46% faster than all-reduce and 52% faster than PS.
Adding more CPU machines beyond k = n does not help because the GPU NIC becomes the wall.

4.2 Intra-machine Communication

Intra-machine matters because multiple GPUs per machine must aggregate/broadcast tensors before/after CS-SS exchange; uncoordinated traffic causes PCIe contention and prevents NIC saturation.
Internal topologies vary widely; paper covers two representative ones from production and gives a generalization rule.

4.2.1 PCIe-only Topology

Topology (Fig. 6a): 2 NUMA CPUs joined by QPI, 8 GPUs split into two groups of 4, each group on a PCIe switch (PCIe 3.0 x16 = 128 Gb/s theoretical), 100 Gb/s NIC connected via PCIe to one CPU. CPU memory and QPI exceed 300 Gb/s — not the bottleneck.
Measured GPU-to-GPU memory copy: ~105 Gb/s within same PCIe switch vs ~80 Gb/s across PCIe switches.
TensorFlow PS, MXNet PS, and Horovod's "hierarchical all-reduce" do reduce / reduce-scatter across all 8 GPUs, hitting the slow cross-switch path.
BytePS approach: CPU-assisted aggregation — let GPUs under one PCIe switch reduce locally, copy to CPU, let CPU do global summation, broadcast back. Six steps:
1. Reduce-Scatter within each PCIe switch (l GPUs): (l-1)M/l traffic inside the switch.
2. GPU-CPU Copy: each GPU sends M/l to CPU memory.
3. CPU-Reduce across switches: no PCIe traffic.
4. Networking: CS sends to SS, receives reduced result.
5. CPU-GPU Copy: each GPU pulls back M/l.
6. All-Gather under each PCIe switch: (l-1)M/l traffic.
With l = 4, cross-PCIe-switch GPU-GPU traffic is replaced by M bytes on the PCIe-CPU link, much less than direct collective on 8 GPUs (which would put 7M/4 traffic on the PCIe-to-GPU link).
Optimality analysis (Fig. 7). Generic graph G = (V, E) with leaf GPUs N, switches S, CPU nodes C; assumptions: edges duplex with equal-direction bandwidth; symmetric per layer; memory/QPI not the bottleneck.
Combining brute-force copy on x of data with CPU-assisted on y = 1 - x:
- Eq. 8: t(S_j, C_j) = n * xM + (yM/n) * n = (nx + y) M.
- Eq. 9: t(N_i, S_j) = xM + (2(n-1)/n + 1/n) * yM = ((2n-1)/n * y + x) M.
- Job time J = max(t(N_i, S_j) / b(N_i, S_j), t(S_j, C_j) / b(S_j, C_j)).
Plugging in the production machine: b(N_i, S_j) = b(S_j, C_j) = 13.1 GB/s = 105 Gb/s, M = 1 GB, n = 4 -> arg min_x J(x) = max((3x+1)/13.1, (7-3x)/52.4). Optimum at x* = 1/5, J* = 0.129 s.
Pure CPU-assisted (x = 0) gives J(0) = 0.141 s — 9% worse than the true optimum, but brute-force copy stresses CPU memory by 4x, especially under FP16 summation (Fig. 9b), so BytePS sticks with CPU-assisted aggregation only.
CPU-assisted aggregation beats ring all-reduce. Ring all-reduce time J_ar = 2(np-1)M / (np * b_bottleneck); CPU-assisted time J_ca = (M / b(S_j, C_j)) * max(1, (2n-1)/(kn)) with kappa = b(N_i, S_j) / b(S_j, C_j). With production values (p=2, n=4, b_bottleneck = 80 Gb/s, b(S_j, C_j) = 105 Gb/s), J_ca is 23.7% smaller than J_ar.

4.2.2 NVLink-based Topology

Topology (Fig. 8a): 4 PCIe switches each connecting 2 GPUs; all 8 GPUs are on an NVLink mesh giving 1.2 Tb/s GPU-GPU bandwidth; NIC sits on one PCIe switch (P0_CPU0).
NVLink lets GPU-GPU communication bypass PCIe, so CPU-assisted aggregation is not needed. NCCL is still suboptimal because the topology is asymmetric: NIC and the two GPUs under P0_CPU0 share the same PCIe path, and SS on the same machine also uses it.
BytePS approach: reduce/broadcast with a non-NIC root. Pick a GPU under a PCIe switch that does not carry the NIC (e.g., GPU2) as the reduce/broadcast root. All GPUs reduce to GPU2 over NVLink; GPU2 copies the result to CPU0 memory; CS sends to SS; on the way back, GPU2 receives the reduced data and broadcasts it over NVLink. This frees P0_CPU0 entirely for the NIC.
The hotspot at GPU2 is irrelevant because NVLink is much wider than PCIe; the P1_CPU0 PCIe link used for the GPU-CPU copy is ~100 Gb/s, matching the NIC, so it is also not a bottleneck.
NCCL by default tends to use GPUs adjacent to the NIC, hitting the bottleneck.

4.2.3 Discussion

The PCIe-only and NVLink optima differ; intra-machine communication is not one-size-fits-all; the strategies generalize via two principles:
1. Avoid direct GPU-GPU memory copy when GPUs are not under the same PCIe switch.
2. Minimize traffic on the PCIe-CPU link shared between GPUs and NIC.
Practical procedure. Let Sn = #PCIe switches that host both GPUs and NIC; Sg = #PCIe switches with only GPUs:
- Sn > 0 and Sg > 0 (asymmetric, NVLink-like): CS uses reduce/broadcast with non-NIC roots.
- Sn = 0 or Sg = 0 (symmetric, PCIe-only-like): CS uses reduce-scatter / all-gather; if no NVLink, apply CPU-assisted aggregation.
Multi-NIC topologies follow the same principles by adjusting Sn and Sg.
GPU-Direct RDMA (GDR) is not adopted: requires GPU and NIC on the same PCIe switch (else throughput drops below 50 Gb/s on a 100 GbE NIC [12]); PCIe-only topology violates that, NVLink-based already avoids the bottleneck, and most clouds (e.g., AWS) do not support GDR.

5. Summation Service

Why not run the full optimizer on CPU (the classical PS choice)?
Experiment: 1x V100 GPU machine + 1x Intel Xeon Platinum CPU machine (32 cores w/ HT, MKL [7]), 100 GbE; CPU runs the full optimizer.
Fig. 9a: even with 32 cores + MKL, running the optimizer on CPU slows end-to-end training; the more complex the optimizer (SGD -> RMSProp), the worse the bottleneck.
Root cause: CPU memory bandwidth. Adam, e.g., needs >10x memory accesses per parameter update; a 6-channel DDR4-2666 setup peaks at 1024 Gb/s read+write [8]; a 100 Gb/s NIC alone consumes 200 Gb/s of memory bandwidth, so the residual is insufficient for Adam over a 100 Gb/s gradient stream.
CPU is great at summation only. AVX [47] gives modern x86 CPUs >200 Gb/s summation throughput at FP16 and FP32 (Fig. 9b), exceeding the 100 Gb/s NIC; SS will not bottleneck.
BytePS solution: split optimizer. Keep gradient summation on CPU (= SS); move parameter update (more compute) to GPU.
Fig. 10 compares component placement between PS, all-reduce, and BytePS:
- PS: FP/BP on GPU; comm; sum + update on CPU (full optimizer on CPU).
- All-reduce: FP/BP on GPU; sum across GPUs; update on GPU (no CPU role).
- BytePS: FP/BP on GPU; comm; sum on CPU, update on GPU.
AVX + OpenMP implementation: SS uses fewer than 3 CPU cores at 100 Gb/s throughput.
SS overhead ratio = parameter-update FLOPs / (FP + BP FLOPs):
- VGG-16 (SGD): 138 MFLOPs / 32 GFLOPs ≈ 0.43%
- ResNet-50 (SGD): 26 MFLOPs / 7.8 GFLOPs ≈ 0.33%
- BERT-Large (SGD): 387 MFLOPs / 494 GFLOPs ≈ 0.078%
With realistic batch sizes (tens to hundreds), the per-iteration overhead is even smaller because update happens once per batch.
Horovod's "CPU all-reduce" option only relocates the gradient aggregation to CPU on GPU machines; it does not use additional CPU machines, so it provides no communication advantage. BytePS is fundamentally different in that it uses spare CPU machines for SS.

Asynchronous training support:

Splitting summation and update breaks classical async semantics (which keep the latest parameters on PS).
BytePS redesigns the async workflow (Fig. 11b): GPU computes new params, sends delta Δw_t = w'_{t+1} - w_t to SS; SS keeps w_t and overwrites with w_t + Δw_t.
Theorem 2. The BytePS async algorithm is equivalent to Async Parallel [25] in terms of convergence.
Proof sketch: induction on t. Base case t=1: workers and CSes start from w_0, so Δw_{i,1} = f(g_{i,1}). Inductive step: assuming Δw_{i,k} = f(g_{i,k}) for t = k, gradients at t = k+1 are computed from the same w_k, so w_{ps,k+1} = w_k + f(g_{i,k+1}) = w_{byteps,k+1}. Therefore w_ps = w_byteps after any T iterations. (Eq. 10 and Eq. 11).

6. Implementation

BytePS core in C++ (10K LoC); plugins for TensorFlow, PyTorch, MXNet in C++ and Python (7.8K LoC Python). Open-sourced [4].

6.1 Multi-Stage Pipeline

BytePS borrows tensor partitioning + pipelining from prior work [34, 55]. Each step of CS becomes a pipeline stage; PCIe-only path has 6 stages -> 6-stage pipeline.
Each stage = independent thread with a priority queue; tensors >4 MB are partitioned; priorities follow [34, 55] (former layers have higher priority because they are needed sooner for the next iteration's FP).

6.2 Address RDMA Performance Issues

Network: RDMA RoCEv2, one 100 GbE NIC per machine, full bisection.
RDMA Memory Management: BytePS uses RDMA WRITE for performance [39]; conventional one-sided WRITE/READ needs two round trips (get remote address, then write/read). BytePS exploits the fact that DNN training reuses the same tensor set every iteration: register and exchange addresses once on the first iteration, then reuse.
Slow Receiver Symptom (PFC storms) [30]: three causes identified, each with a fix.
1. Internal RDMA loopback creates 2:1 incast on NIC between RX and loopback; both compete for DMA-to-memory egress. Fix: detect SS on same machine, switch to shared memory (shm) path; SS reads from shm, writes summation back to shm.
2. Page alignment. RDMA must use page-aligned memory; misalignment forces multi-page DMA and triggers PFC.
3. Single SGE per WRITE. Receiver throughput is impacted by sender-side scatter-gather count; BytePS enforces exactly one SGE per RDMA WRITE.
Table 2 — RDMA optimization gains (CPU machine + GPU machine microbenchmark):

Solution	Throughput (Gb/s)	Speedup vs. baseline
baseline	41	1.00x
+shm	52	1.27x
+shm +aligned	76	1.85x
all (above + 1 SGE)	89	2.17x

After all three fixes, NIC PFC generation is negligible.
Many-to-one comms in BytePS would cause TCP incast [66], but RDMA RoCEv2 + DCQCN [75] congestion control prevents it.

6.3 BytePS Usage

Python interfaces near-identical to Horovod / PyTorch native API / TensorFlow native API.
Migrating Horovod-MNIST [19] takes one line: import horovod -> import byteps. Most internal Horovod jobs were converted automatically.

7. Evaluation

Fidelity highlights:

All resources allocated by the production cluster scheduler (non-preemptive); even the largest job in the paper uses < 5% of GPUs in a busy cluster.
Large per-GPU batch sizes are used (close to GPU memory limit), so reported speedups are a lower bound on what smaller-batch jobs would achieve.
Models are representative production workloads; code public [5].
Comparisons use unmodified state-of-the-art PS and all-reduce; the §6.2 RDMA fixes are not applied to the baselines.
All cluster machines have one 100 GbE NIC; full-bisection RoCEv2.

7.1 Inter-machine Microbenchmarks

Setup: 8 1-GPU machines, all GPU workers reduce dummy tensors and record goodput. Point-to-point RDMA goodput in this network ≈ 90 Gb/s.
Fig. 12: BytePS is within 1-9% of theoretical optimum (line based on B = 90 Gb/s and §4.1 analysis) for all k ∈ [0, 8]. All-reduce matches optimum only at k = 0 and is flat in k. MXNet PS (no optimizer in this benchmark) is bottlenecked by the §6.2 issues; with the optimizer, PS would be worse than all-reduce even at k = n.
Because of SS, BytePS is unaffected by the optimizer issues that plague classical PS in real workloads.

7.2 Leverage CPU Machines

8 GPU machines, each 8 V100 32 GB GPUs (PCIe-only or NVLink); CPU machines varied 0..8; baseline = Horovod 0.19 + NCCL 2.5.7. Models: UGATIT GAN (PyTorch, batch=2 images), GPT-2 (PyTorch, batch=80 tokens).
Fig. 13: BytePS goes up to 20% faster with extra CPU machines vs. without; SS uses < 4 CPU cores per CPU machine, easy to schedule among non-distributed jobs — effectively free speedup for the cluster.
BytePS is consistently faster than all-reduce; the best-case speedup vs. all-reduce is 45%. NVLink-based machines see larger gains because the network (rather than PCIe) is the binding constraint.
Communication-intensive models (e.g., GAN UGATIT) gain more end-to-end.

7.3 Adapt to Intra-machine Topology

Same hardware as §7.2; runs without CPU machines so the §7.2 advantage is removed. Compares BytePS strategies head-to-head.
PCIe-only (Fig. 14a): the §4.2.1 optimal (CPU-assisted aggregation) gives up to 20% over a strawman all-reduce-style strategy.
NVLink-based (Fig. 14b): roots {root=2}, {root=2,3 round-robin}, root=all (≈ Horovod hierarchical), root=0 (≈ plain NCCL).
- root=2 is optimal as predicted (§4.2.2);
- root=2,3 ties with root=2 because GPU3 also avoids NIC contention;
- root=all is worse;
- root=0 is the worst because it competes hardest with the NIC.
Even without intra-machine optimization, BytePS still beats all-reduce (analyzed in §8).

7.4 Scalability

6 jobs scaled 8 -> 256 V100 32 GB GPUs (1 -> 32 GPU machines), NVLink-based machines only.
Frameworks: TensorFlow, PyTorch, MXNet. Models: ResNet-50 [32], VGG-16 [63], Transformer [67], BERT-Large [26], UGATIT [41], GPT-2 [57]. Each model required <20 LoC change to switch among PS, all-reduce, and BytePS.
For BytePS, two configurations: with CPU machines (k = n) and without; baselines: Horovod + NCCL (all-reduce), TensorFlow native PS, MXNet native PS (RDMA enabled). PyTorch lacks official PS.
Fig. 15 (CV models) and Fig. 16 (NLP models): consistent ranking — BytePS-with-CPU > BytePS-no-CPU > all-reduce > PS.
Headline numbers at 256 GPUs:

Comparison	Speedup range
BytePS (with CPU) over all-reduce	10% — 84%
BytePS (no CPU) over all-reduce	9% — 53%
BytePS over native PS	up to 245% (and 52% on VGG-16 specifically)

Scaling efficiency at 256 GPUs:

Model	BytePS scaling efficiency
ResNet-50	97.5% (all-reduce: 88%)
5 of 6 models other than UGATIT	>= 91.6%
UGATIT (least scalable)	74% (BytePS), with the largest gap over all-reduce — 84% gain

Improvement breakdown on VGG-16 at 256 GPUs (BytePS vs. native PS, total 52%): 19% from optimal communication design (intra-server), 18% from Summation Service, 15% from RDMA / pipelining implementation.

8. Observations and Discussion

BytePS beats all-reduce even without extra CPU machines. Theoretically equal at k = 0; in practice, BytePS still wins, partly from better intra-machine strategy (§7.3) but also without that optimization (Fig. 14). Authors hypothesize BytePS allows more "asynchronicity" — all-reduce needs cross-node order synchronization, BytePS does not. A distributed profiler is needed to confirm.
Cluster scheduler should consider dynamic CPU resources. Because BytePS adapts to any k and k only affects performance (not convergence), the scheduler can scale CPU machines in/out per job mid-training. GPU count usually cannot change because of convergence concerns [16, 74]. Adding elasticity to BytePS is future work.
Model-parallelism support. BytePS accelerates any tensor reduction across GPUs, including those used by Megatron-LM [62] and Mesh-TensorFlow [61], which rely on all-reduce.

Compute acceleration: cuDNN [10], MKL [7], TVM [23], XLA [17], Astra [64], Tensor Fusion [14], graph substitution [37] — complementary, can be combined with BytePS.
Gradient compression [21, 45] reduces traffic at potential accuracy cost — orthogonal.
Communication scheduling and pipelining [31, 34, 55]: priority + tensor partitioning, integrated into BytePS implementation. PipeDream [51] adds parallelism between batches; BytePS can accelerate its data-parallel stages.
Hierarchical all-reduce [24, 49] minimizes traffic at bottleneck links but assumes homogeneous resources, ignoring CPU; recent NCCL has tree-based hierarchical all-reduce, which §7 results show does not outperform BytePS.
Intra-machine optimization: Blink [68] uses hybrid NVLink+PCIe transfers but only inside a single machine; in distributed training, the binding constraint is the NIC and its PCIe path, which Blink does not address.
New hardware: TPU [38], Habana [6] — BytePS is not GPU-specific and applies to any PCIe-attached accelerator. InfiniBand-switch ASIC for all-reduce [28], P4-switch in-network aggregation [58, 59], SmartNICs (E3 [46]) potentially complement BytePS. PHub [48] proposes a rack-scale parameter server with custom hardware (10 NICs/server); BytePS targets commodity datacenter hardware instead.

10. Conclusion

BytePS is a unified, communication-optimal distributed-training architecture for heterogeneous GPU/CPU clusters; all-reduce and PS are special cases.
Summation Service splits a DNN optimizer into CPU-resident gradient summation (AVX, OpenMP, <3 cores at 100 Gb/s) and GPU-resident parameter update.
Numerous implementation details — including RDMA performance fixes (shm, page alignment, single SGE), priority-pipelining, and topology-aware intra-machine schedules — bring real-world performance close to the theoretical optimum.
Deployed at ByteDance, open-sourced [4], with production speedups of up to 84% over all-reduce and 245% over PS at 256 GPUs across six models and three frameworks. Artifact appendix at [3].

11. Acknowledgement

Shepherd Rachit Agarwal and the OSDI reviewers acknowledged.
Yimin Jiang and Yong Cui supported by NSFC No. 61872211 and National Key R&D Program of China No. 2018YFB1800303.

Note on NCCL Tuning

BytePS measures and uses two intra-machine collective layouts that NCCL itself selects between (reduce-scatter+all-gather for symmetric PCIe-only topologies vs. reduce/broadcast with a non-NIC root for asymmetric NVLink topologies); the paper shows root choice alone changes throughput substantially (Fig. 14b: root=2 best, root=0 worst). For NCCL configuration tuning, the directly relevant finding is that the optimal collective and its root depend on whether the NIC shares a PCIe switch with one of the GPUs — a property invisible to default NCCL ranking heuristics on asymmetric machines. The paper also quantifies the small-tensor partitioning bound (4 MB) at which pipelining and load-balancing become beneficial, a useful prior on chunk granularity.