A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters — Detailed Summary
Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo | Tsinghua University, ByteDance, Google | OSDI 2020 (14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020)
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- Production data center clusters that run DNN training jobs are inherently heterogeneous: GPUs and CPUs both provide compute, and network bandwidth is the binding distributed-training resource.
- Existing distributed DNN training architectures — all-reduce and Parameter Server (PS) — cannot fully exploit such heterogeneous resources, leaving spare CPU and bandwidth on the table.
- The paper presents BytePS, a new distributed DNN training architecture that leverages spare CPU cores and NIC bandwidth in the cluster to accelerate GPU-resident training jobs.
- BytePS is proven communication-optimal for any
allocation
0 <= k <= nof CPU machines tonGPU machines, and is unified: existing all-reduce (k=0) and non-colocated PS (k=n) are two special cases of BytePS. - BytePS introduces a Summation Service (SS) abstraction that splits a parameter optimizer into (i) gradient summation, accelerated on CPUs via AVX, and (ii) parameter update, executed on GPUs.
- BytePS supports TensorFlow, PyTorch, and MXNet via Horovod-like and native APIs, and outperforms state-of-the-art open-source all-reduce by up to 84% and PS by up to 245% on training jobs running with up to 256 GPUs.
1. Introduction
Background:
- DNN research has produced breakthroughs in CV, speech, NLP, and reinforcement learning; large GPU clusters with thousands of GPUs are standard for training.
- GPU clusters contain not only GPUs but also high-end CPUs and high-speed networks (Ethernet or InfiniBand). CPU-only machines also exist for data preprocessing.
- Production observation: GPUs are usually fully utilized, while CPU and network bandwidth often have spare capacity.
Two existing distributed-training architectures:
- All-reduce: only GPU machines participate; gradients are aggregated via the all-reduce primitive.
- Parameter Server (PS): GPU workers push gradients to PS processes (commonly on CPU machines); PS aggregates and runs the optimizer (SGD, Adam, etc.) and pushes the updated weights back.
- All-reduce is bandwidth-optimal for a homogeneous GPU-only set [54], but loses optimality once additional CPU+bandwidth resources are available.
Motivating gap:
- In theory, with extra CPU machines, PS can outperform all-reduce; in practice, existing PS implementations have inferior performance due to design and implementation bottlenecks, so distributed-training records [27, 49, 73] are dominated by all-reduce.
- BytePS is therefore designed to be communication-optimal both in theory and in practice for arbitrary GPU/CPU machine ratios and arbitrary intra-machine PCIe/NVLink topologies.
Key design ideas:
- BytePS unifies all-reduce and PS as boundary cases (k=0 and k=n) of a single optimal traffic-allocation strategy.
- BytePS removes the CPU bottleneck observed in classical PS by splitting the optimizer: gradient summation stays on CPU (fits memory bandwidth and AVX); parameter update moves to GPU.
- BytePS adopts pipelining and priority scheduling from prior work [34, 55] and resolves several RDMA performance issues.
Contributions enumerated:
- New distributed DNN training architecture for heterogeneous GPU/CPU clusters that achieves communication optimality and contains all-reduce and PS as special cases.
- Optimal intra-machine communication strategies tailored to PCIe-only and NVLink-based GPU machines.
- Summation Service abstraction that removes the CPU bottleneck of classical PS.
- BytePS has been deployed internally at ByteDance and open-sourced [4]; evaluation uses six DNN models, three frameworks, and 256 GPUs.
2. Background
2.1 Distributed DNN Training
- Training a DNN consists of three steps per iteration: forward propagation (FP), backward propagation (BP), and parameter update via an optimizer (SGD [76], Adam [42], etc.).
- The dominant distributed-training paradigm is data parallelism: each GPU keeps a full model replica, processes a different mini-batch, and synchronizes gradients every iteration.
- Modern GPU clusters host hundreds-to-thousands of machines connected by RDMA networks; per-machine: multiple GPUs, tens of CPU cores, hundreds of GB DRAM, one to several 100 Gb/s NICs.
- A public cluster trace [35] shows 50% of hosts have CPU utilization < 30% — supporting the spare-CPU thesis.
- Notation used throughout:
n= number of GPU machines;M= model size in bytes;B= network bandwidth per machine.
2.2 All-reduce
- All-reduce originated from HPC and aggregates gradients collectively without any extra CPU machine; ring all-reduce is the most common algorithm.
- A ring all-reduce splits into reduce-scatter then all-gather (Fig.
1):
- Reduce-scatter: each node sends
(n-1)M/nbytes; M is partitioned intonsegments. - All-gather: each node again sends
(n-1)M/nbytes broadcasting its reduced segment.
- Reduce-scatter: each node sends
- Total per-node traffic:
2(n-1)M/n; completion time:2(n-1)M/(nB), proven optimal for uniform-bandwidth topologies with no extra resources [54]. - For non-uniform / hierarchical topologies, the optimum is
2(n'-1)M/(n'B')whereB'is the slowest link bandwidth andn'the number of nodes on the slowest layer; for simplicity the paper assumes one GPU per machine,n=n',B=B'. - All-reduce cannot use additional non-worker nodes — designed for homogeneous setups.
2.3 Parameter Server (PS)
- PS [44] has two roles: workers (run FP/BP, push gradients, pull weights) and PS (aggregate gradients, run optimizer). PS commonly runs on CPU machines because GPUs are far more expensive (AWS p3.16xlarge: ~$25/hr; r4.16xlarge same minus GPUs: $4.2/hr).
- Two PS placement strategies (Fig. 2):
- Non-colocated mode (Fig. 2a): PS runs on
kdedicated CPU machines. Each GPU worker sends/receivesMbytes per iteration; each CPU machine handlesnM/kbytes. - Colocated mode (Fig. 2b): PS processes run on the GPU machines themselves and reuse spare CPU resources; loop-back communication for local PS-worker traffic.
- Non-colocated mode (Fig. 2a): PS runs on
- Theoretical PS communication time:
max(M/B, nM/(kB)). - When
k=n, PS is communication optimal and faster than all-reduce; whenk <= n/2, non-colocated PS becomes slower than all-reduce because the CPU machines bottleneck. - Comparison summary (paper Table 1):
| Architecture | Per-iteration comm time | Optimal? |
|---|---|---|
| All-reduce | 2(n-1)M/(nB) |
Only if k = 0 |
| Non-Colocated PS | max(M/B, nM/(kB)) |
Only if k = n |
| Colocated PS | 2(n-1)M/(nB) |
Only if k = 0 |
- Non-colocated PS can leverage extra CPU+bandwidth but may under-utilize GPU machine resources; colocated PS and all-reduce use GPU-machine resources well but cannot consume extra CPU machines.
- PS supports asynchronous training (mitigates stragglers) but converges more slowly; the paper focuses on synchronous training and addresses async briefly in §5.
3. Motivation and BytePS Architecture
3.1 Motivation
- ByteDance's internal users mostly chose all-reduce because existing PS performed worse, except where async-PS was preferred for straggler tolerance.
- Opportunity (Fig. 3): A 3-month trace (2020-01-01 to 2020-03-31) of a cluster with thousands of GPUs shows GPUs near 96% allocation at peaks but only 55%-80% of GPU machines are running distributed jobs at any time, leaving the network bandwidth of 20%-45% of GPU machines idle. Cluster-wide CPU utilization is 20%-35%, consistent with [35].
- Existing architectures are insufficient: all-reduce
and colocated PS only use GPU-worker resources; non-colocated PS may
under-utilize GPU-worker NIC and CPU. For
0 < k < n, neither is optimal. - End-to-end demonstration (Fig. 4): VGG-16, 32x V100 GPUs (4 GPU machines), 100 GbE RDMA, batch size 32 per GPU. MXNet PS-RDMA [1] and NCCL-2.5.7 [13] all-reduce both fall far below 32x linear scaling. Even ByteScheduler [55] (state-of-the-art) on top of PS or all-reduce remains far from linear.
- These results motivate BytePS to leverage spare CPUs/bandwidth dynamically.
BytePS goals:
- Always communication-optimal for any
0 <= k <= n(and adaptive as the cluster scheduler changeskover time), with proofs. - Optimal intra-machine communication for diverse PCIe/NVLink topologies; all-reduce and PS are special cases.
- Communication time close to the theoretical optimum in practice — original PS is far from its theoretical limit, so BytePS removes the implementation bottlenecks.
- Generic to DNN training; supports TensorFlow, PyTorch, MXNet.
3.2 Architecture Overview
- BytePS components (Fig. 5):
- Communication Service (CS) — runs on every GPU machine, internally aggregates/broadcasts among local GPUs and externally exchanges tensors with SS.
- Summation Service (SS) — runs on the CPU of every machine (CPU-only machines and GPU machines), receives tensors, sums them, returns sums to CS.
- "CPU machine" includes both dedicated CPU-only machines and GPU machines whose spare CPU cores and NIC are scheduled to run SS — improving cluster-wide utilization.
- SS is much simpler than classical PS: only sum, no full optimizer state. The full optimizer (parameter update step) lives in CS on GPU machines.
- CS responsibilities: (1) decide traffic allocation to each SS (analyzed in §4.1), (2) pick local tensor aggregation strategy per intra-machine topology (§4.2), (3) be RDMA-optimized (§6.2).
- BytePS gracefully degenerates:
k=0-> SS only on GPU machines (= all-reduce);k=n-> SS only on CPU machines (= non-colocated PS); for intermediatek, all SS instances cooperate with the analytically-derived optimal split.
4. BytePS Communication Design
4.1 Inter-machine Communication
- All inter-machine traffic is between CS and SS; the design balances per-machine traffic so no node is the bottleneck.
- Assumes full-bisection-bandwidth network (common in DL clusters
[52]) and modern RDMA congestion control such as DCQCN [75], so any node
can hit its NIC bandwidth
B. - SS is split into SSCPU (running on CPU machines, traffic governed only by the CPU machine's NIC) and SSGPU (running on GPU machines, sharing NIC bandwidth with that machine's CS).
- BytePS assigns workload (bytes summed per machine) per the
equations:
- Eq. 1:
M_{SSCPU} = 2(n-1) / (n^2 + kn - 2k) * M - Eq. 2:
M_{SSGPU} = (n-k) / (n^2 + kn - 2k) * M - Constraints
k >= 1,n >= 2,k <= n. Outside these, BytePS degenerates to all-reduce or PS.
- Eq. 1:
- Since DNN tensors have variable sizes, BytePS partitions tensors
into parts of at most 4 MB and consistently hashes each
part's index into
[0, n^2 + kn - 2k)to approximate Eq. 1 and Eq. 2; consistent indexing/hashing guarantees the same part from all GPUs lands at the same SS.
4.1.1 Communication Efficiency Analysis
- The paper proves Theorem 1: the workload assignment in Eq. 1 and Eq. 2 minimizes inter-machine communication time.
- Per-machine traffic accounting (assuming
M >> 4 MBpartition):- GPU machine runs CS and SS; its CS sends/receives
M - M_{SSGPU}bytes externally (the SSGPU portion stays loop-back). Its SSGPU receives(n-1) * M_{SSGPU}bytes fromn-1other GPU machines. - Eq. 3:
t_g = [M + (n-2) * M_{SSGPU}] / B. - Eq. 4:
t_c = M_{SSCPU} / B(per CPU machine). - Eq. 5:
M = k * M_{SSCPU} + n * M_{SSGPU}(total work conservation).
- GPU machine runs CS and SS; its CS sends/receives
- Final communication time =
max(t_c, t_g). Minimum occurs whent_c = t_g; solving gives Eq. 1 and Eq. 2. - Combining Eq. 3 and Eq. 2 yields the optimal time:
- Eq. 6:
t_opt = 2n(n-1)M / [(n^2 + kn - 2k) B].
- Eq. 6:
- Two acceleration ratios are defined:
- Eq. 7 (left):
g_a = (n^2 + kn - 2k) / n^2— speedup vs. plain all-reduce. - Eq. 7 (right):
g_p = (n^2 + kn - 2k) / [2k(n-1)]— speedup vs. non-colocated PS.
- Eq. 7 (left):
- Boundary behaviour:
- When
k = nandn -> infinity:g_a = 2. BytePS doubles all-reduce throughput in the large-cluster, abundant-CPU regime. - When
kis small,g_pis large, because non-colocated PS bottlenecks on CPU NICs. - Concrete example:
n = 32,k = 16->g_a = 1.46,g_p = 1.52— BytePS is 46% faster than all-reduce and 52% faster than PS.
- When
- Adding more CPU machines beyond
k = ndoes not help because the GPU NIC becomes the wall.
4.2 Intra-machine Communication
- Intra-machine matters because multiple GPUs per machine must aggregate/broadcast tensors before/after CS-SS exchange; uncoordinated traffic causes PCIe contention and prevents NIC saturation.
- Internal topologies vary widely; paper covers two representative ones from production and gives a generalization rule.
4.2.1 PCIe-only Topology
- Topology (Fig. 6a): 2 NUMA CPUs joined by QPI, 8 GPUs split into two groups of 4, each group on a PCIe switch (PCIe 3.0 x16 = 128 Gb/s theoretical), 100 Gb/s NIC connected via PCIe to one CPU. CPU memory and QPI exceed 300 Gb/s — not the bottleneck.
- Measured GPU-to-GPU memory copy: ~105 Gb/s within same PCIe switch vs ~80 Gb/s across PCIe switches.
- TensorFlow PS, MXNet PS, and Horovod's "hierarchical all-reduce" do reduce / reduce-scatter across all 8 GPUs, hitting the slow cross-switch path.
- BytePS approach: CPU-assisted aggregation — let
GPUs under one PCIe switch reduce locally, copy to CPU, let CPU do
global summation, broadcast back. Six steps:
- Reduce-Scatter within each PCIe switch (l GPUs):
(l-1)M/ltraffic inside the switch. - GPU-CPU Copy: each GPU sends
M/lto CPU memory. - CPU-Reduce across switches: no PCIe traffic.
- Networking: CS sends to SS, receives reduced result.
- CPU-GPU Copy: each GPU pulls back
M/l. - All-Gather under each PCIe switch:
(l-1)M/ltraffic.
- Reduce-Scatter within each PCIe switch (l GPUs):
- With
l = 4, cross-PCIe-switch GPU-GPU traffic is replaced byMbytes on the PCIe-CPU link, much less than direct collective on 8 GPUs (which would put7M/4traffic on the PCIe-to-GPU link). - Optimality analysis (Fig. 7). Generic graph
G = (V, E)with leaf GPUsN, switchesS, CPU nodesC; assumptions: edges duplex with equal-direction bandwidth; symmetric per layer; memory/QPI not the bottleneck. - Combining brute-force copy on
xof data with CPU-assisted ony = 1 - x:- Eq. 8:
t(S_j, C_j) = n * xM + (yM/n) * n = (nx + y) M. - Eq. 9:
t(N_i, S_j) = xM + (2(n-1)/n + 1/n) * yM = ((2n-1)/n * y + x) M. - Job time
J = max(t(N_i, S_j) / b(N_i, S_j), t(S_j, C_j) / b(S_j, C_j)).
- Eq. 8:
- Plugging in the production machine:
b(N_i, S_j) = b(S_j, C_j) = 13.1 GB/s = 105 Gb/s,M = 1 GB,n = 4->arg min_x J(x) = max((3x+1)/13.1, (7-3x)/52.4). Optimum atx* = 1/5,J* = 0.129 s. - Pure CPU-assisted (
x = 0) givesJ(0) = 0.141 s— 9% worse than the true optimum, but brute-force copy stresses CPU memory by 4x, especially under FP16 summation (Fig. 9b), so BytePS sticks with CPU-assisted aggregation only. - CPU-assisted aggregation beats ring all-reduce.
Ring all-reduce time
J_ar = 2(np-1)M / (np * b_bottleneck); CPU-assisted timeJ_ca = (M / b(S_j, C_j)) * max(1, (2n-1)/(kn))withkappa = b(N_i, S_j) / b(S_j, C_j). With production values (p=2, n=4, b_bottleneck = 80 Gb/s, b(S_j, C_j) = 105 Gb/s),J_cais 23.7% smaller thanJ_ar.
4.2.2 NVLink-based Topology
- Topology (Fig. 8a): 4 PCIe switches each connecting 2 GPUs; all 8 GPUs are on an NVLink mesh giving 1.2 Tb/s GPU-GPU bandwidth; NIC sits on one PCIe switch (P0_CPU0).
- NVLink lets GPU-GPU communication bypass PCIe, so CPU-assisted aggregation is not needed. NCCL is still suboptimal because the topology is asymmetric: NIC and the two GPUs under P0_CPU0 share the same PCIe path, and SS on the same machine also uses it.
- BytePS approach: reduce/broadcast with a non-NIC root. Pick a GPU under a PCIe switch that does not carry the NIC (e.g., GPU2) as the reduce/broadcast root. All GPUs reduce to GPU2 over NVLink; GPU2 copies the result to CPU0 memory; CS sends to SS; on the way back, GPU2 receives the reduced data and broadcasts it over NVLink. This frees P0_CPU0 entirely for the NIC.
- The hotspot at GPU2 is irrelevant because NVLink is much wider than PCIe; the P1_CPU0 PCIe link used for the GPU-CPU copy is ~100 Gb/s, matching the NIC, so it is also not a bottleneck.
- NCCL by default tends to use GPUs adjacent to the NIC, hitting the bottleneck.
4.2.3 Discussion
- The PCIe-only and NVLink optima differ; intra-machine communication
is not one-size-fits-all; the strategies generalize via
two principles:
- Avoid direct GPU-GPU memory copy when GPUs are not under the same PCIe switch.
- Minimize traffic on the PCIe-CPU link shared between GPUs and NIC.
- Practical procedure. Let
Sn= #PCIe switches that host both GPUs and NIC;Sg= #PCIe switches with only GPUs:Sn > 0andSg > 0(asymmetric, NVLink-like): CS uses reduce/broadcast with non-NIC roots.Sn = 0orSg = 0(symmetric, PCIe-only-like): CS uses reduce-scatter / all-gather; if no NVLink, apply CPU-assisted aggregation.
- Multi-NIC topologies follow the same principles by adjusting
SnandSg. - GPU-Direct RDMA (GDR) is not adopted: requires GPU and NIC on the same PCIe switch (else throughput drops below 50 Gb/s on a 100 GbE NIC [12]); PCIe-only topology violates that, NVLink-based already avoids the bottleneck, and most clouds (e.g., AWS) do not support GDR.
5. Summation Service
- Why not run the full optimizer on CPU (the classical PS choice)?
- Experiment: 1x V100 GPU machine + 1x Intel Xeon Platinum CPU machine (32 cores w/ HT, MKL [7]), 100 GbE; CPU runs the full optimizer.
- Fig. 9a: even with 32 cores + MKL, running the optimizer on CPU slows end-to-end training; the more complex the optimizer (SGD -> RMSProp), the worse the bottleneck.
- Root cause: CPU memory bandwidth. Adam, e.g., needs >10x memory accesses per parameter update; a 6-channel DDR4-2666 setup peaks at 1024 Gb/s read+write [8]; a 100 Gb/s NIC alone consumes 200 Gb/s of memory bandwidth, so the residual is insufficient for Adam over a 100 Gb/s gradient stream.
- CPU is great at summation only. AVX [47] gives modern x86 CPUs >200 Gb/s summation throughput at FP16 and FP32 (Fig. 9b), exceeding the 100 Gb/s NIC; SS will not bottleneck.
- BytePS solution: split optimizer. Keep gradient summation on CPU (= SS); move parameter update (more compute) to GPU.
- Fig. 10 compares component placement between PS,
all-reduce, and BytePS:
- PS: FP/BP on GPU; comm; sum + update on CPU (full optimizer on CPU).
- All-reduce: FP/BP on GPU; sum across GPUs; update on GPU (no CPU role).
- BytePS: FP/BP on GPU; comm; sum on CPU, update on GPU.
- AVX + OpenMP implementation: SS uses fewer than 3 CPU cores at 100 Gb/s throughput.
- SS overhead ratio = parameter-update FLOPs / (FP +
BP FLOPs):
- VGG-16 (SGD): 138 MFLOPs / 32 GFLOPs ≈ 0.43%
- ResNet-50 (SGD): 26 MFLOPs / 7.8 GFLOPs ≈ 0.33%
- BERT-Large (SGD): 387 MFLOPs / 494 GFLOPs ≈ 0.078%
- With realistic batch sizes (tens to hundreds), the per-iteration overhead is even smaller because update happens once per batch.
- Horovod's "CPU all-reduce" option only relocates the gradient aggregation to CPU on GPU machines; it does not use additional CPU machines, so it provides no communication advantage. BytePS is fundamentally different in that it uses spare CPU machines for SS.
Asynchronous training support:
- Splitting summation and update breaks classical async semantics (which keep the latest parameters on PS).
- BytePS redesigns the async workflow (Fig. 11b): GPU computes new
params, sends delta
Δw_t = w'_{t+1} - w_tto SS; SS keepsw_tand overwrites withw_t + Δw_t. - Theorem 2. The BytePS async algorithm is equivalent to Async Parallel [25] in terms of convergence.
- Proof sketch: induction on
t. Base caset=1: workers and CSes start fromw_0, soΔw_{i,1} = f(g_{i,1}). Inductive step: assumingΔw_{i,k} = f(g_{i,k})fort = k, gradients att = k+1are computed from the samew_k, sow_{ps,k+1} = w_k + f(g_{i,k+1}) = w_{byteps,k+1}. Thereforew_ps = w_bytepsafter anyTiterations. (Eq. 10 and Eq. 11).
6. Implementation
- BytePS core in C++ (10K LoC); plugins for TensorFlow, PyTorch, MXNet in C++ and Python (7.8K LoC Python). Open-sourced [4].
6.1 Multi-Stage Pipeline
- BytePS borrows tensor partitioning + pipelining from prior work [34, 55]. Each step of CS becomes a pipeline stage; PCIe-only path has 6 stages -> 6-stage pipeline.
- Each stage = independent thread with a priority queue; tensors >4 MB are partitioned; priorities follow [34, 55] (former layers have higher priority because they are needed sooner for the next iteration's FP).
6.2 Address RDMA Performance Issues
- Network: RDMA RoCEv2, one 100 GbE NIC per machine, full bisection.
- RDMA Memory Management: BytePS uses RDMA WRITE for performance [39]; conventional one-sided WRITE/READ needs two round trips (get remote address, then write/read). BytePS exploits the fact that DNN training reuses the same tensor set every iteration: register and exchange addresses once on the first iteration, then reuse.
- Slow Receiver Symptom (PFC storms) [30]: three
causes identified, each with a fix.
- Internal RDMA loopback creates 2:1 incast on NIC between RX and loopback; both compete for DMA-to-memory egress. Fix: detect SS on same machine, switch to shared memory (shm) path; SS reads from shm, writes summation back to shm.
- Page alignment. RDMA must use page-aligned memory; misalignment forces multi-page DMA and triggers PFC.
- Single SGE per WRITE. Receiver throughput is impacted by sender-side scatter-gather count; BytePS enforces exactly one SGE per RDMA WRITE.
- Table 2 — RDMA optimization gains (CPU machine + GPU machine microbenchmark):
| Solution | Throughput (Gb/s) | Speedup vs. baseline |
|---|---|---|
| baseline | 41 | 1.00x |
| +shm | 52 | 1.27x |
| +shm +aligned | 76 | 1.85x |
| all (above + 1 SGE) | 89 | 2.17x |
- After all three fixes, NIC PFC generation is negligible.
- Many-to-one comms in BytePS would cause TCP incast [66], but RDMA RoCEv2 + DCQCN [75] congestion control prevents it.
6.3 BytePS Usage
- Python interfaces near-identical to Horovod / PyTorch native API / TensorFlow native API.
- Migrating Horovod-MNIST [19] takes one line:
import horovod->import byteps. Most internal Horovod jobs were converted automatically.
7. Evaluation
Fidelity highlights:
- All resources allocated by the production cluster scheduler (non-preemptive); even the largest job in the paper uses < 5% of GPUs in a busy cluster.
- Large per-GPU batch sizes are used (close to GPU memory limit), so reported speedups are a lower bound on what smaller-batch jobs would achieve.
- Models are representative production workloads; code public [5].
- Comparisons use unmodified state-of-the-art PS and all-reduce; the §6.2 RDMA fixes are not applied to the baselines.
- All cluster machines have one 100 GbE NIC; full-bisection RoCEv2.
7.1 Inter-machine Microbenchmarks
- Setup: 8 1-GPU machines, all GPU workers reduce dummy tensors and record goodput. Point-to-point RDMA goodput in this network ≈ 90 Gb/s.
- Fig. 12: BytePS is within 1-9% of
theoretical optimum (line based on
B = 90 Gb/sand §4.1 analysis) for allk ∈ [0, 8]. All-reduce matches optimum only atk = 0and is flat ink. MXNet PS (no optimizer in this benchmark) is bottlenecked by the §6.2 issues; with the optimizer, PS would be worse than all-reduce even atk = n. - Because of SS, BytePS is unaffected by the optimizer issues that plague classical PS in real workloads.
7.2 Leverage CPU Machines
- 8 GPU machines, each 8 V100 32 GB GPUs (PCIe-only or NVLink); CPU machines varied 0..8; baseline = Horovod 0.19 + NCCL 2.5.7. Models: UGATIT GAN (PyTorch, batch=2 images), GPT-2 (PyTorch, batch=80 tokens).
- Fig. 13: BytePS goes up to 20% faster with extra CPU machines vs. without; SS uses < 4 CPU cores per CPU machine, easy to schedule among non-distributed jobs — effectively free speedup for the cluster.
- BytePS is consistently faster than all-reduce; the best-case speedup vs. all-reduce is 45%. NVLink-based machines see larger gains because the network (rather than PCIe) is the binding constraint.
- Communication-intensive models (e.g., GAN UGATIT) gain more end-to-end.
7.3 Adapt to Intra-machine Topology
- Same hardware as §7.2; runs without CPU machines so the §7.2 advantage is removed. Compares BytePS strategies head-to-head.
- PCIe-only (Fig. 14a): the §4.2.1 optimal (CPU-assisted aggregation) gives up to 20% over a strawman all-reduce-style strategy.
- NVLink-based (Fig. 14b): roots {root=2}, {root=2,3
round-robin}, root=all (≈ Horovod hierarchical), root=0 (≈ plain NCCL).
- root=2 is optimal as predicted (§4.2.2);
- root=2,3 ties with root=2 because GPU3 also avoids NIC contention;
- root=all is worse;
- root=0 is the worst because it competes hardest with the NIC.
- Even without intra-machine optimization, BytePS still beats all-reduce (analyzed in §8).
7.4 Scalability
- 6 jobs scaled 8 -> 256 V100 32 GB GPUs (1 -> 32 GPU machines), NVLink-based machines only.
- Frameworks: TensorFlow, PyTorch, MXNet. Models: ResNet-50 [32], VGG-16 [63], Transformer [67], BERT-Large [26], UGATIT [41], GPT-2 [57]. Each model required <20 LoC change to switch among PS, all-reduce, and BytePS.
- For BytePS, two configurations: with CPU machines
(
k = n) and without; baselines: Horovod + NCCL (all-reduce), TensorFlow native PS, MXNet native PS (RDMA enabled). PyTorch lacks official PS. - Fig. 15 (CV models) and Fig. 16 (NLP models): consistent ranking — BytePS-with-CPU > BytePS-no-CPU > all-reduce > PS.
- Headline numbers at 256 GPUs:
| Comparison | Speedup range |
|---|---|
| BytePS (with CPU) over all-reduce | 10% — 84% |
| BytePS (no CPU) over all-reduce | 9% — 53% |
| BytePS over native PS | up to 245% (and 52% on VGG-16 specifically) |
- Scaling efficiency at 256 GPUs:
| Model | BytePS scaling efficiency |
|---|---|
| ResNet-50 | 97.5% (all-reduce: 88%) |
| 5 of 6 models other than UGATIT | >= 91.6% |
| UGATIT (least scalable) | 74% (BytePS), with the largest gap over all-reduce — 84% gain |
- Improvement breakdown on VGG-16 at 256 GPUs (BytePS vs. native PS, total 52%): 19% from optimal communication design (intra-server), 18% from Summation Service, 15% from RDMA / pipelining implementation.
8. Observations and Discussion
- BytePS beats all-reduce even without extra CPU
machines. Theoretically equal at
k = 0; in practice, BytePS still wins, partly from better intra-machine strategy (§7.3) but also without that optimization (Fig. 14). Authors hypothesize BytePS allows more "asynchronicity" — all-reduce needs cross-node order synchronization, BytePS does not. A distributed profiler is needed to confirm. - Cluster scheduler should consider dynamic CPU
resources. Because BytePS adapts to any
kandkonly affects performance (not convergence), the scheduler can scale CPU machines in/out per job mid-training. GPU count usually cannot change because of convergence concerns [16, 74]. Adding elasticity to BytePS is future work. - Model-parallelism support. BytePS accelerates any tensor reduction across GPUs, including those used by Megatron-LM [62] and Mesh-TensorFlow [61], which rely on all-reduce.
9. Related Work
- Compute acceleration: cuDNN [10], MKL [7], TVM [23], XLA [17], Astra [64], Tensor Fusion [14], graph substitution [37] — complementary, can be combined with BytePS.
- Gradient compression [21, 45] reduces traffic at potential accuracy cost — orthogonal.
- Communication scheduling and pipelining [31, 34, 55]: priority + tensor partitioning, integrated into BytePS implementation. PipeDream [51] adds parallelism between batches; BytePS can accelerate its data-parallel stages.
- Hierarchical all-reduce [24, 49] minimizes traffic at bottleneck links but assumes homogeneous resources, ignoring CPU; recent NCCL has tree-based hierarchical all-reduce, which §7 results show does not outperform BytePS.
- Intra-machine optimization: Blink [68] uses hybrid NVLink+PCIe transfers but only inside a single machine; in distributed training, the binding constraint is the NIC and its PCIe path, which Blink does not address.
- New hardware: TPU [38], Habana [6] — BytePS is not GPU-specific and applies to any PCIe-attached accelerator. InfiniBand-switch ASIC for all-reduce [28], P4-switch in-network aggregation [58, 59], SmartNICs (E3 [46]) potentially complement BytePS. PHub [48] proposes a rack-scale parameter server with custom hardware (10 NICs/server); BytePS targets commodity datacenter hardware instead.
10. Conclusion
- BytePS is a unified, communication-optimal distributed-training architecture for heterogeneous GPU/CPU clusters; all-reduce and PS are special cases.
- Summation Service splits a DNN optimizer into CPU-resident gradient summation (AVX, OpenMP, <3 cores at 100 Gb/s) and GPU-resident parameter update.
- Numerous implementation details — including RDMA performance fixes (shm, page alignment, single SGE), priority-pipelining, and topology-aware intra-machine schedules — bring real-world performance close to the theoretical optimum.
- Deployed at ByteDance, open-sourced [4], with production speedups of up to 84% over all-reduce and 245% over PS at 256 GPUs across six models and three frameworks. Artifact appendix at [3].
11. Acknowledgement
- Shepherd Rachit Agarwal and the OSDI reviewers acknowledged.
- Yimin Jiang and Yong Cui supported by NSFC No. 61872211 and National Key R&D Program of China No. 2018YFB1800303.
Note on NCCL Tuning
BytePS measures and uses two intra-machine collective layouts that NCCL itself selects between (reduce-scatter+all-gather for symmetric PCIe-only topologies vs. reduce/broadcast with a non-NIC root for asymmetric NVLink topologies); the paper shows root choice alone changes throughput substantially (Fig. 14b: root=2 best, root=0 worst). For NCCL configuration tuning, the directly relevant finding is that the optimal collective and its root depend on whether the NIC shares a PCIe switch with one of the GPUs — a property invisible to default NCCL ranking heuristics on asymmetric machines. The paper also quantifies the small-tensor partitioning bound (4 MB) at which pipelining and load-balancing become beneficial, a useful prior on chunk granularity.