Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Eric P. Xing | Carnegie Mellon University / Petuum Inc. / Tsinghua | USENIX ATC '17 | July 2017
Problem
Modern DNNs take days or weeks to train on a single GPU and so are typically scaled across many GPU machines, yet existing distributed implementations scale poorly because parameter synchronization saturates commodity Ethernet. The root cause is the gap between modern GPU compute throughput and inter-node bandwidth: each iteration finishes very quickly, so synchronization is invoked frequently, and each invocation has to push very large gradient matrices for fully-connected (FC) layers. A concrete failure case frames the paper: open-source TensorFlow on 32 machines is actually slower than a single machine for VGG19-22K (229 M parameters), and even with 40 GbE plus Titan X GPUs the same network only reaches 8x speedup on 16 machines. Two structural causes are identified: gradient updates for large FC layers are dense matrices that saturate the network, and the iterative SGD pattern produces bursty traffic concentrated at the end of each iteration with idle gaps in between.
Core Insight
A scalable distributed DL system needs both algorithm-level scheduling that overlaps computation with communication and matrix-level structure exploitation that reduces bytes on the wire. The two ideas apply independently per layer, so each layer can use the cheaper of several transport schemes. Concretely, the gradient of a fully-connected layer for a single training sample is a rank-1 outer product ∇θ = u v^T, so a batch of K samples is fully described by K pairs of small Sufficient Factors (u, v). Broadcasting those factors (Sufficient Factor Broadcasting, SFB) is much cheaper than pushing a full M×N matrix to a parameter server (PS) when M and N are large and the cluster size is moderate; convolutional gradients have no such structure, so PS is the right transport for them. Picking the cheaper transport per layer, and starting each layer's communication the instant its backward pass finishes, is enough to get near-linear scaling at 32 nodes.
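As a minimal sketch of why sufficient factors are cheaper, the pure-Python snippet below rebuilds a small batch gradient from its (u, v) pairs and compares float counts for a 4096×4096 layer. The helper names (`outer`, `reconstruct`, `floats_full`, `floats_factors`) and the toy values are illustrative assumptions, not taken from the paper.

```python
# Illustrative only: a K-sample FC gradient is a sum of K rank-1 outer products.

def outer(u, v):
    """Return the M x N outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def reconstruct(factors, m, n):
    """Sum the outer products of all (u, v) sufficient-factor pairs."""
    grad = [[0.0] * n for _ in range(m)]
    for u, v in factors:
        for i, row in enumerate(outer(u, v)):
            for j, x in enumerate(row):
                grad[i][j] += x
    return grad

def floats_full(m, n):
    """Floats needed to ship the dense M x N gradient."""
    return m * n

def floats_factors(m, n, k):
    """Floats needed to ship K pairs of factors (u in R^M, v in R^N)."""
    return k * (m + n)

# Toy batch of K = 2 samples for a 2 x 3 layer (made-up numbers).
factors = [([1.0, 2.0], [3.0, 4.0, 5.0]), ([0.5, -1.0], [2.0, 0.0, 1.0])]
grad = reconstruct(factors, 2, 3)   # [[4.0, 4.0, 5.5], [4.0, 8.0, 9.0]]

dense = floats_full(4096, 4096)             # 16777216
compact = floats_factors(4096, 4096, 32)    # 262144
```

For one sender and one receiver the factors are 64 times smaller in this example; the advantage shrinks as the number of peers grows, because every worker must receive every other worker's factors.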
Method
Poseidon's two main techniques layer cleanly on existing frameworks (Caffe, TensorFlow) and address the two structural causes of poor scaling.
Wait-Free Backpropagation (WFBP):
Decompose per-layer sync s_t^l = [o_t^l, i_t^l] (send gradient, receive new weights).
Both halves are independent of backward ops b_t^i for i < l.
-> Start sync(l) immediately after BackwardThrough(l); overlap with b_t^{l-1}, ..., b_t^1.
Most beneficial when comm concentrates at top FC layers and compute at bottom CONV layers
(VGG-style nets: ~90% comm at top, ~90% compute at bottom).
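The benefit of overlapping can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the paper's analysis: layers are visited top to bottom, a single shared link serves syncs in FIFO order, and the function name `iteration_time` and the time units are hypothetical.

```python
def iteration_time(compute, comm, overlap):
    """Backward-pass time for layers listed from top (L) to bottom (1).

    compute[i] and comm[i] are the backward and sync times of the i-th
    layer visited. Assumes one shared link that serves syncs in FIFO order.
    """
    if not overlap:
        # All communication waits until the whole backward pass is done.
        return sum(compute) + sum(comm)
    t_compute = 0.0  # time at which the current layer's backward op finishes
    t_link = 0.0     # time at which the link finishes its last sync
    for c, s in zip(compute, comm):
        t_compute += c
        # A sync starts once its gradient exists and the link is free.
        t_link = max(t_link, t_compute) + s
    return max(t_compute, t_link)

# VGG-like split: top layers hold ~90% of comm, bottom layers ~90% of compute.
serial = iteration_time([1, 9], [9, 1], overlap=False)     # 20
wfbp = iteration_time([1, 9], [9, 1], overlap=True)        # 11.0
# Reversed split: heavy communication arrives last, so little can be hidden.
reversed_case = iteration_time([9, 1], [1, 9], overlap=True)  # 19.0
```

In this toy setting the VGG-like split hides almost all communication behind the bottom-layer compute, while the reversed split gains almost nothing.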
Hybrid Communication (HybComm) — per-layer transport selection:
For an M x N parameter matrix, P_1 workers, P_2 server shards, batch K:
PS combined cost = 2 M N (P_1 + P_2 - 2) / P_2
SFB cost (per worker) = 2 K (P_1 - 1)(M + N)
Algorithm 1 (BestScheme(l)):
  if layer l is FC and 2 K (P_1 - 1)(M + N) <= 2 M N (P_1 + P_2 - 2) / P_2:
    return SFB
  else:
    return PS
CONV layers always use PS (no rank-1 structure).
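The selection rule above can be sketched directly from the two cost formulas. The function names and the string labels are illustrative; the example calls assume server shards are colocated with workers (P_2 = P_1), which the summary does not state explicitly.

```python
def ps_cost(m, n, p1, p2):
    """Combined PS cost for an M x N layer: 2MN(P1 + P2 - 2) / P2."""
    return 2 * m * n * (p1 + p2 - 2) / p2

def sfb_cost(m, n, p1, k):
    """Per-worker SFB cost: 2K(P1 - 1)(M + N)."""
    return 2 * k * (p1 - 1) * (m + n)

def best_scheme(layer_type, m, n, p1, p2, k):
    """Return "SFB" only for FC layers where SFB is no more costly than PS."""
    if layer_type == "FC" and sfb_cost(m, n, p1, k) <= ps_cost(m, n, p1, p2):
        return "SFB"
    return "PS"

# 4096 x 4096 FC layer, 8 workers, batch 32 (assumed P2 = P1 = 8).
vgg_fc = best_scheme("FC", 4096, 4096, 8, 8, 32)          # "SFB"
# GoogLeNet's 1000 x 1024 FC layer, 16 workers, batch 128 (assumed P2 = 16).
googlenet_fc = best_scheme("FC", 1000, 1024, 16, 16, 128)  # "PS"
# CONV layers never qualify, whatever their shape.
conv = best_scheme("CONV", 512, 4608, 8, 8, 32)            # "PS"
```

The GoogLeNet call shows the fallback case: with a thin FC layer and a large batch, the factors cost more than the dense matrix, so PS is chosen.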
Implementation:
Components: Coordinator | KV Store (PS) | per-worker Client Library (Syncers)
APIs: BestScheme, Query, Send/Receive, Move
KV Store: bulk synchronous; tensors split into ~2 MB KV pairs for load balance
Consistency: BSP, with binary-vector C tracking Syncer completion
Stragglers: dropped from that step's aggregation
Algorithm 2: forward pass; for l = L..1: net.BackwardThrough(l); thread_pool.Schedule(sync(l))
Integration: ~150 LoC into Caffe; ~250 LoC into TensorFlow
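Algorithm 2 and the completion vector C can be sketched with Python's standard thread pool. This is a hedged illustration rather than Poseidon's client library: `train_step`, `backward_through`, and `sync` are hypothetical callables, the forward pass is omitted, and the real system schedules Syncers inside Caffe or TensorFlow.

```python
from concurrent.futures import ThreadPoolExecutor

def train_step(num_layers, backward_through, sync, pool):
    """One WFBP iteration: schedule sync(l) right after backward of layer l."""
    done = [False] * (num_layers + 1)  # binary vector C; index 0 is unused
    futures = {}
    for l in range(num_layers, 0, -1):
        backward_through(l)                # compute gradients of layer l
        futures[l] = pool.submit(sync, l)  # communicate while lower layers run
    for l, fut in futures.items():
        fut.result()                       # wait; re-raises any sync error
        done[l] = True
    return all(done[1:])                   # BSP: every layer is synchronized

# Toy usage: record the order of events instead of doing real work.
log = []
with ThreadPoolExecutor(max_workers=2) as pool:
    ok = train_step(
        3,
        lambda l: log.append(("bwd", l)),
        lambda l: log.append(("sync", l)),
        pool,
    )
```

The step returns only after every layer's sync has completed, which mirrors the bulk-synchronous behaviour described above; the exact interleaving of sync events in `log` depends on thread scheduling.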
Experimental Setup
| Component | Value |
|---|---|
| GPU | NVIDIA Titan X |
| CPU / RAM | Intel 16-core / 64 GB per node |
| Network | 40 GbE (also tested at 2/5/10 GbE) |
| OS / Stack | Ubuntu 16.04, CUDA 8.0, cuDNN v5 |
| Frameworks | Caffe (Poseidon-Caffe, ~150 LoC) and TensorFlow (~250 LoC) |
| Scale | up to 32 single-GPU nodes; also 4 nodes × 8 K80 (AWS p2.8xlarge) |
| Datasets | CIFAR-10, ILSVRC12 (1.28 M images / 1000 classes), ImageNet22K (14.2 M / 22000) |
| Models | CIFAR-10 quick (145.6 K, batch 100), GoogLeNet (5 M, 128), Inception-V3 (27 M, 32), VGG19 (143 M, 32), VGG19-22K (229 M, 32), ResNet-152 (60.2 M, 32) |
| Metric | throughput (images/sec, iters/sec); speedup vs single-GPU; convergence (top-1 error, training loss) for sanity |
Headline Quantitative Results
Single-node sanity check (Caffe).
- Poseidon-Caffe matches stock Caffe (GoogLeNet 257 vs 257 images/s; VGG19 35.5 vs 35.5).
- A naive PS retrofit on Caffe drops VGG19 throughput to 21.3 images/s, isolating transport as the source of degradation.
Distributed scaling at 32 nodes.
| Workload | Framework | Method | Speedup vs 1 GPU |
|---|---|---|---|
| VGG19-22K (229 M params) | Caffe | WFBP only | 21.5x |
| VGG19-22K | Caffe | Full Poseidon (WFBP + HybComm) | 29.5x |
| Inception-V3 | TensorFlow | Poseidon | 31.5x |
| Inception-V3 | TensorFlow | stock TF | ~10x |
| VGG19-22K | TensorFlow | stock TF | fails to scale |
- WFBP alone delivers 21.5x on VGG19-22K; adding HybComm raises this to 29.5x (8 more units of speedup), with the biggest gain on FC-heavy nets.
- Stock TF's coarse tensor partitioning concentrates large FC tensors on a single shard; Poseidon's 2 MB KV pairs balance load and account for the 31.5x vs ~10x gap.
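The load-balancing effect of fine-grained KV pairs can be seen in a small sketch. Round-robin placement, the tensor sizes, and the function names are assumptions made for illustration; the summary only states that tensors are split into roughly 2 MB pairs.

```python
MB = 1024 * 1024
CHUNK_BYTES = 2 * MB  # ~2 MB per KV pair, as described above

def split_into_kv_pairs(num_bytes, chunk_bytes=CHUNK_BYTES):
    """Return the chunk sizes that cover a tensor of num_bytes."""
    full, rest = divmod(num_bytes, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])

def shard_loads(tensor_sizes, num_shards, chunked):
    """Bytes held per shard under round-robin placement (an assumed policy)."""
    loads = [0] * num_shards
    slot = 0
    for size in tensor_sizes:
        pieces = split_into_kv_pairs(size) if chunked else [size]
        for piece in pieces:
            loads[slot % num_shards] += piece
            slot += 1
    return loads

# One large FC tensor plus three small CONV tensors on 4 shards (made-up sizes).
tensors = [64 * MB, 1 * MB, 1 * MB, 1 * MB]
coarse = shard_loads(tensors, 4, chunked=False)  # one shard holds all 64 MB
fine = shard_loads(tensors, 4, chunked=True)     # 16-17 MB on every shard
```

With whole-tensor placement the busiest shard carries 64 MB; with 2 MB pairs it carries 17 MB, so the hottest shard's traffic drops by almost 4x in this toy case.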
Multi-GPU nodes (4 AWS p2.8xlarge nodes × 8 K80 = 32 GPUs).
- GoogLeNet: 32x speedup. VGG19: 28x speedup.
Statistical performance.
- ResNet-152 on 16 and 32 nodes reaches 0.24 top-1 error in fewer than 90 epochs, mirroring the 8-node baseline — speedups are not cancelled by worse convergence.
Bandwidth experiments (10 GbE, VGG19, 16 nodes).
- Naive PS reaches only 8x; Poseidon reaches near-linear scaling.
- Across 2 / 5 / 10 GbE, Poseidon outperforms Caffe + WFBP for all benchmarked models, with the gap widening as bandwidth shrinks.
- HybComm correctly falls back to PS for GoogLeNet's thin 1000×1024 FC layer — SFB is not forced.
Comparisons.
- Project Adam on 8 nodes for VGG19: 5x speedup, hampered by imbalanced PS pull loads.
- CNTK 1-bit on VGG19, 32 nodes: ~20x speedup; Poseidon: near-linear ~30x.
- CNTK 1-bit converges to ~0.5 error at 3 K iterations on the comparison task; Poseidon reaches ~0.3 error at ~1 K iterations — Poseidon is faster and converges better.
Sample worked decision.
- FC layer 4096×4096 on 8 worker nodes with batch size K = 32: PS transfers 2MN ≈ 34 M parameters per worker; SFB transfers 2K(P_1 - 1)(M + N) ≈ 3.7 M floats. SFB wins by roughly an order of magnitude, matching Algorithm 1's choice.
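The two figures can be checked by hand. The arithmetic below assumes K = 32 (the VGG19 batch size in the setup table) and counts only the worker-side PS cost 2MN, since those are the values that reproduce the quoted numbers.

```latex
\begin{align*}
\text{PS (worker side):}\quad & 2MN = 2 \cdot 4096 \cdot 4096 = 33{,}554{,}432 \approx 34\,\text{M} \\
\text{SFB:}\quad & 2K(P_1 - 1)(M + N) = 2 \cdot 32 \cdot 7 \cdot 8192 = 3{,}670{,}016 \approx 3.7\,\text{M} \\
\text{ratio:}\quad & 33{,}554{,}432 \,/\, 3{,}670{,}016 = 64/7 \approx 9.1
\end{align*}
```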
Limitations
- Stragglers are handled by dropping them from the step's aggregation; the consequences at very large cluster sizes are not characterized.
- 1-bit quantization residuals act like delayed updates that hurt image-task convergence; Poseidon avoids the issue by not quantizing rather than solving it.
- Per-worker SFB cost grows linearly in P_1, so aggregate SFB traffic grows as O(P_1^2). When server shards scale with the workers (P_2 = P_1), HybComm therefore reverts to PS once the cluster is large enough; the paper's evaluation caps at 32 nodes.
- Evaluation is one hardware generation (Titan X, 40 GbE); modern NVLink/NVSwitch and 100 / 400 Gb fabrics could shift the WFBP and HybComm trade-offs.
- Integration covers data parallelism only, on Caffe and TensorFlow graph mode; eager execution, pipeline parallelism, and model parallelism are out of scope.
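The SFB scaling limit listed above can be made concrete by searching for the cluster size at which Algorithm 1's inequality flips. This is a sketch under the assumption P_1 = P_2 = p; the function names and the search bound `p_max` are hypothetical.

```python
def sfb_wins(m, n, p, k):
    """Algorithm 1's inequality with P1 = P2 = p (an assumption)."""
    return 2 * k * (p - 1) * (m + n) <= 2 * m * n * (2 * p - 2) / p

def largest_sfb_cluster(m, n, k, p_max=4096):
    """Largest p in [2, p_max] for which SFB is still chosen, or None."""
    winners = [p for p in range(2, p_max + 1) if sfb_wins(m, n, p, k)]
    return max(winners) if winners else None

# 4096 x 4096 FC layer, batch 32: SFB is chosen up to 128 workers.
vgg_limit = largest_sfb_cluster(4096, 4096, 32)        # 128
# GoogLeNet's 1000 x 1024 FC layer, batch 128: SFB only up to 7 workers.
googlenet_limit = largest_sfb_cluster(1000, 1024, 128)  # 7
```

Under this assumption the inequality simplifies to p <= 2MN / (K(M + N)), which gives the same two limits and shows that larger batches or thinner layers move the crossover to smaller clusters.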
Open Problems
- Generalizing the per-layer transport-selection rule beyond {PS, SFB} to additional schemes (e.g., quantization, sparsification) without losing the per-layer independence guarantee that makes HybComm safe.
- Characterizing straggler-drop behavior at large scale, where dropped workers may bias the gradient.
- Closing the SFB scaling gap: the O(P_1^2) aggregate cost is fundamental for the all-broadcast pattern, so a different low-rank exchange scheme would be needed for very large clusters.
- Extending the design to NVLink / NVSwitch interconnects and to non-data-parallel training regimes (pipeline, model, hybrid).
Note on NCCL Tuning
Poseidon's BestScheme(l) rule is a per-layer transport-selection decision driven by an analytical cost model in (M, N, P_1, P_2, K). It is structurally the same kind of decision an NCCL tuner makes when picking an algorithm and protocol per collective. The paper's evidence that fine-grained 2 MB KV partitioning outperforms coarse tensor splitting at 32-node scale (31.5x vs ~10x for Inception-V3) is a concrete data point that chunk granularity is a first-class tuning axis. The cost-model formulas in the paper's Table 1 are also useful as analytical priors for per-collective transport cost as a function of message shape and worker count.