Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Eric P. Xing | Carnegie Mellon University / Petuum Inc. / Tsinghua | USENIX ATC '17 | July 2017

Problem

Modern DNNs take days or weeks to train on a single GPU and so are typically scaled across many GPU machines, yet existing distributed implementations scale poorly because parameter synchronization saturates commodity Ethernet. The root cause is the gap between modern GPU compute throughput and inter-node bandwidth: each iteration finishes very quickly, so synchronization is invoked frequently, and each invocation has to push very large gradient matrices for fully-connected layers. A concrete failure case frames the paper — open-source TensorFlow on 32 machines is actually slower than a single machine for VGG19-22K (229 M parameters), and even with 40 GbE Ethernet plus Titan X GPUs the same network only reaches 8x speedup on 16 machines. Two structural causes are identified: gradient updates for large FC layers are dense matrices that saturate the network, and the iterative SGD pattern produces bursty traffic concentrated at the end of each iteration with idle gaps in between.

Core Insight

A scalable distributed DL system needs both algorithmic-level scheduling that overlaps compute and comm, and matrix-level structure exploitation that reduces bytes on the wire — and these two ideas are independent across layers, so each layer can be handled with the cheaper of multiple transport schemes. Concretely, fully-connected gradients are rank-1 outer products ∇θ = u v^T, so broadcasting the small Sufficient Factors u, v (Sufficient Factor Broadcasting, SFB) is much cheaper than pushing a full M×N matrix to a parameter server when M, N are large and the cluster size is moderate; convolutional gradients have no such structure, so PS is the right transport for them. Picking the cheaper transport per layer, and starting each layer's communication the instant its backward pass finishes, is enough to get near-linear scaling at 32 nodes.

Method

Poseidon's two main techniques layer cleanly on existing frameworks (Caffe, TensorFlow) and address the two structural causes of poor scaling.

Wait-Free Backpropagation (WFBP):
  Decompose per-layer sync s_t^l = [o_t^l, i_t^l] (send gradient, receive new weights).
  Both halves are independent of backward ops b_t^i for i < l.
  -> Start sync(l) immediately after BackwardThrough(l); overlap with b_t^{l-1}, ..., b_t^1.
  Most beneficial when comm concentrates at top FC layers and compute at bottom CONV layers
  (VGG-style nets: ~90% comm at top, ~90% compute at bottom).

Hybrid Communication (HybComm) — per-layer transport selection:
  For an M x N parameter matrix, P_1 workers, P_2 server shards, batch K:
    PS combined cost   = 2 M N (P_1 + P_2 - 2) / P_2
    SFB cost (per worker) = 2 K (P_1 - 1)(M + N)
  Algorithm 1 (BestScheme(l)):
    if layer l is FC and 2 K (P_1 - 1)(M + N) <= 2 M N (P_1 + P_2 - 2) / P_2:
        return SFB
    else: return PS
  CONV layers always use PS (no rank-1 structure).

Implementation:
  Components:    Coordinator | KV Store (PS) | per-worker Client Library (Syncers)
  APIs:          BestScheme, Query, Send/Receive, Move
  KV Store:      bulk synchronous; tensors split into ~2 MB KV pairs for load balance
  Consistency:   BSP, with binary-vector C tracking Syncer completion
  Stragglers:    dropped from that step's aggregation
  Algorithm 2:   forward pass; for l = L..1: net.BackwardThrough(l); thread_pool.Schedule(sync(l))
  Integration:   ~150 LoC into Caffe; ~250 LoC into TensorFlow

Experimental Setup

Component	Value
GPU	NVIDIA Titan X
CPU / RAM	Intel 16-core / 64 GB per node
Network	40 GbE Ethernet (also tested at 2/5/10 GbE)
OS / Stack	Ubuntu 16.04, CUDA 8.0, cuDNN v5
Frameworks	Caffe (Poseidon-Caffe, ~150 LoC) and TensorFlow (~250 LoC)
Scale	up to 32 single-GPU nodes; also 4 nodes × 8 K80 (AWS p2.8xlarge)
Datasets	CIFAR-10, ILSVRC12 (1.28 M images / 1000 classes), ImageNet22K (14.2 M / 22000)
Models	CIFAR-10 quick (145.6 K, batch 100), GoogLeNet (5 M, 128), Inception-V3 (27 M, 32), VGG19 (143 M, 32), VGG19-22K (229 M, 32), ResNet-152 (60.2 M, 32)
Metric	throughput (images/sec, iters/sec); speedup vs single-GPU; convergence (top-1 error, training loss) for sanity

Headline Quantitative Results

Single-node sanity check (Caffe).

Poseidon-Caffe matches stock Caffe (GoogLeNet 257 vs 257 images/s; VGG19 35.5 vs 35.5).
A naive PS retrofit on Caffe drops VGG19 throughput to 21.3 images/s, isolating transport as the source of degradation.

Distributed scaling at 32 nodes.

Workload	Framework	Best method	Speedup vs 1 GPU
VGG19-22K (229 M params)	Caffe	WFBP only	21.5x
VGG19-22K	Caffe	Full Poseidon (WFBP + HybComm)	29.5x
Inception-V3	TensorFlow	Poseidon	31.5x
Inception-V3	TensorFlow	stock TF	~10x
VGG19-22K	stock TF	—	fails to scale

WFBP alone delivers 21.5x; HybComm adds ~8x on top, biggest gain on FC-heavy nets.
Stock TF's coarse tensor partitioning concentrates large FC tensors on a single shard; Poseidon's 2 MB KV pairs balance load and account for the 31.5x vs ~10x gap.

Multi-GPU within a node (AWS p2.8xlarge, 4 × 8 K80 = 32 GPUs).

GoogLeNet: 32x speedup. VGG19: 28x speedup.

Statistical performance.

ResNet-152 on 16 and 32 nodes reaches 0.24 top-1 error in fewer than 90 epochs, mirroring the 8-node baseline — speedups are not cancelled by worse convergence.

Bandwidth experiments (10 GbE, VGG19, 16 nodes).

Naive PS reaches only 8x; Poseidon reaches near-linear scaling.
Across 2 / 5 / 10 GbE, Poseidon outperforms Caffe + WFBP for all benchmarked models, with the gap widening as bandwidth shrinks.
HybComm correctly falls back to PS for GoogLeNet's thin 1000×1024 FC layer — SFB is not forced.

Comparisons.

Project Adam on 8 nodes for VGG19: 5x speedup, hampered by imbalanced PS pull loads.
CNTK 1-bit on VGG19, 32 nodes: ~20x speedup; Poseidon: near-linear ~30x.
CNTK 1-bit converges to ~0.5 error at 3 K iterations on the comparison task; Poseidon reaches ~0.3 error at ~1 K iterations — Poseidon is faster and converges better.

Sample worked decision.

FC layer 4096×4096 on 8 worker nodes: PS transfers ~34 M parameters per worker; SFB transfers ~3.7 M floats — SFB wins by roughly an order of magnitude, matching Algorithm 1's choice.

Limitations

Stragglers are handled by dropping them from the step's aggregation; the consequences at very large cluster sizes are not characterized.
1-bit quantization residuals act like delayed updates that hurt image-task convergence; Poseidon avoids the issue by not quantizing rather than solving it.
SFB cost grows as O(P_1^2) in worker count, so HybComm necessarily reverts to PS once cluster size is large enough; the paper's evaluation caps at 32 nodes.
Evaluation is one hardware generation (Titan X, 40 GbE); modern NVLink/NVSwitch and 100 / 400 Gb fabrics could shift the WFBP and HybComm trade-offs.
Integration is data-parallelism-only on Caffe / TensorFlow graph mode; eager-mode, pipeline parallelism, and model parallelism beyond data parallelism are out of scope.

Open Problems

Generalizing the per-layer transport-selection rule beyond {PS, SFB} to additional schemes (e.g., quantization, sparsification) without losing the per-layer independence guarantee that makes HybComm safe.
Characterizing straggler-drop behavior at large scale, where dropped workers may bias the gradient.
Closing the SFB scaling gap — the O(P_1^2) cost is fundamental for the all-broadcast pattern, so a different low-rank exchange scheme would be needed for very large clusters.
Extending the design to NVLink / NVSwitch interconnects and to non-data-parallel training regimes (pipeline, model, hybrid).

Note on NCCL Tuning

Poseidon's BestScheme(l) rule is a per-layer transport-selection decision driven by an analytical cost model in (M, N, P, K) — structurally the same kind of decision an NCCL tuner makes when picking algorithm and protocol per collective. The paper's evidence that fine-grained 2 MB KV partitioning outperforms coarse tensor splitting at 32-node scale (31.5x vs ~10x for Inception-V3) is a concrete data point that chunk granularity is a first-class tuning axis. The cost-model formulas in Table 1 are also useful as analytical priors for per-collective transport cost as a function of message shape and worker count.