Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Eric P. Xing | Carnegie Mellon University / Petuum Inc. / Tsinghua | USENIX ATC '17 | July 2017


Problem

Modern DNNs take days or weeks to train on a single GPU and so are typically scaled across many GPU machines, yet existing distributed implementations scale poorly because parameter synchronization saturates commodity Ethernet. The root cause is the gap between modern GPU compute throughput and inter-node bandwidth: each iteration finishes very quickly, so synchronization is invoked frequently, and each invocation has to push very large gradient matrices for fully-connected layers. A concrete failure case frames the paper — open-source TensorFlow on 32 machines is actually slower than a single machine for VGG19-22K (229 M parameters), and even with 40 GbE Ethernet plus Titan X GPUs the same network only reaches 8x speedup on 16 machines. Two structural causes are identified: gradient updates for large FC layers are dense matrices that saturate the network, and the iterative SGD pattern produces bursty traffic concentrated at the end of each iteration with idle gaps in between.


Core Insight

A scalable distributed DL system needs both algorithmic-level scheduling that overlaps compute and comm, and matrix-level structure exploitation that reduces bytes on the wire — and these two ideas are independent across layers, so each layer can be handled with the cheaper of multiple transport schemes. Concretely, fully-connected gradients are rank-1 outer products ∇θ = u v^T, so broadcasting the small Sufficient Factors u, v (Sufficient Factor Broadcasting, SFB) is much cheaper than pushing a full M×N matrix to a parameter server when M, N are large and the cluster size is moderate; convolutional gradients have no such structure, so PS is the right transport for them. Picking the cheaper transport per layer, and starting each layer's communication the instant its backward pass finishes, is enough to get near-linear scaling at 32 nodes.


Method

Poseidon's two main techniques layer cleanly on existing frameworks (Caffe, TensorFlow) and address the two structural causes of poor scaling.

Wait-Free Backpropagation (WFBP):
  Decompose per-layer sync s_t^l = [o_t^l, i_t^l] (send gradient, receive new weights).
  Both halves are independent of backward ops b_t^i for i < l.
  -> Start sync(l) immediately after BackwardThrough(l); overlap with b_t^{l-1}, ..., b_t^1.
  Most beneficial when comm concentrates at top FC layers and compute at bottom CONV layers
  (VGG-style nets: ~90% comm at top, ~90% compute at bottom).

Hybrid Communication (HybComm) — per-layer transport selection:
  For an M x N parameter matrix, P_1 workers, P_2 server shards, batch K:
    PS combined cost   = 2 M N (P_1 + P_2 - 2) / P_2
    SFB cost (per worker) = 2 K (P_1 - 1)(M + N)
  Algorithm 1 (BestScheme(l)):
    if layer l is FC and 2 K (P_1 - 1)(M + N) <= 2 M N (P_1 + P_2 - 2) / P_2:
        return SFB
    else: return PS
  CONV layers always use PS (no rank-1 structure).

Implementation:
  Components:    Coordinator | KV Store (PS) | per-worker Client Library (Syncers)
  APIs:          BestScheme, Query, Send/Receive, Move
  KV Store:      bulk synchronous; tensors split into ~2 MB KV pairs for load balance
  Consistency:   BSP, with binary-vector C tracking Syncer completion
  Stragglers:    dropped from that step's aggregation
  Algorithm 2:   forward pass; for l = L..1: net.BackwardThrough(l); thread_pool.Schedule(sync(l))
  Integration:   ~150 LoC into Caffe; ~250 LoC into TensorFlow

Experimental Setup

Component Value
GPU NVIDIA Titan X
CPU / RAM Intel 16-core / 64 GB per node
Network 40 GbE Ethernet (also tested at 2/5/10 GbE)
OS / Stack Ubuntu 16.04, CUDA 8.0, cuDNN v5
Frameworks Caffe (Poseidon-Caffe, ~150 LoC) and TensorFlow (~250 LoC)
Scale up to 32 single-GPU nodes; also 4 nodes × 8 K80 (AWS p2.8xlarge)
Datasets CIFAR-10, ILSVRC12 (1.28 M images / 1000 classes), ImageNet22K (14.2 M / 22000)
Models CIFAR-10 quick (145.6 K, batch 100), GoogLeNet (5 M, 128), Inception-V3 (27 M, 32), VGG19 (143 M, 32), VGG19-22K (229 M, 32), ResNet-152 (60.2 M, 32)
Metric throughput (images/sec, iters/sec); speedup vs single-GPU; convergence (top-1 error, training loss) for sanity

Headline Quantitative Results

Single-node sanity check (Caffe).

Distributed scaling at 32 nodes.

Workload Framework Best method Speedup vs 1 GPU
VGG19-22K (229 M params) Caffe WFBP only 21.5x
VGG19-22K Caffe Full Poseidon (WFBP + HybComm) 29.5x
Inception-V3 TensorFlow Poseidon 31.5x
Inception-V3 TensorFlow stock TF ~10x
VGG19-22K stock TF fails to scale

Multi-GPU within a node (AWS p2.8xlarge, 4 × 8 K80 = 32 GPUs).

Statistical performance.

Bandwidth experiments (10 GbE, VGG19, 16 nodes).

Comparisons.

Sample worked decision.


Limitations


Open Problems

  1. Generalizing the per-layer transport-selection rule beyond {PS, SFB} to additional schemes (e.g., quantization, sparsification) without losing the per-layer independence guarantee that makes HybComm safe.
  2. Characterizing straggler-drop behavior at large scale, where dropped workers may bias the gradient.
  3. Closing the SFB scaling gap — the O(P_1^2) cost is fundamental for the all-broadcast pattern, so a different low-rank exchange scheme would be needed for very large clusters.
  4. Extending the design to NVLink / NVSwitch interconnects and to non-data-parallel training regimes (pipeline, model, hybrid).

Note on NCCL Tuning

Poseidon's BestScheme(l) rule is a per-layer transport-selection decision driven by an analytical cost model in (M, N, P, K) — structurally the same kind of decision an NCCL tuner makes when picking algorithm and protocol per collective. The paper's evidence that fine-grained 2 MB KV partitioning outperforms coarse tensor splitting at 32-node scale (31.5x vs ~10x for Inception-V3) is a concrete data point that chunk granularity is a first-class tuning axis. The cost-model formulas in Table 1 are also useful as analytical priors for per-collective transport cost as a function of message shape and worker count.