Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters — Detailed Summary

Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Eric P. Xing | Carnegie Mellon University / Petuum Inc. / Tsinghua | USENIX ATC '17 | July 2017

Per-section summary organized by paper headings. Each section provides paragraph-level bullets, with all named techniques, equations, and quantitative results preserved.


Abstract


1. Introduction

Motivation.

Why scaling fails.

Two primary causes of poor scaling.

Design requirements.

Poseidon's two ideas.

Headline results in introduction.


2. Large-Scale Deep Learning (Background)

2.1 Distributed Deep Learning

2.2 Parallel DL on Distributed GPUs


3. Poseidon Design

3.1 Wait-Free Backpropagation (WFBP)

3.2 Hybrid Communication (HybComm)


4. Implementation

4.1 System Implementation and APIs

4.2 Integrate Poseidon with DL Libraries


5. Evaluation

Setup

Model | # Params | Batch size (per worker)
CIFAR-10 quick | 145.6 K | 100
GoogLeNet | 5 M | 128
Inception-V3 | 27 M | 32
VGG19 | 143 M | 32
VGG19-22K | 229 M | 32
ResNet-152 | 60.2 M | 32

5.1 Scalability

5.2 Bandwidth Experiments

5.3 Comparisons



7. Conclusion


8. Limitations and Open Questions Implied by the Paper


9. Cross-Cutting Empirical Take-aways

Take-away | Quantitative evidence
WFBP alone gives a substantial fraction of the speedup | 21.5x for WFBP alone vs 29.5x for full Poseidon on VGG19-22K (32 nodes)
HybComm matters most for FC-heavy networks | VGG19-22K (229 M params, FC-dominated) goes from 21.5x to 29.5x at 32 nodes when HybComm is added to WFBP
Fine-grained PS partitioning is a real win | Poseidon's 2 MB KV pairs vs TF's coarse partitioning on 32-node Inception-V3 (31.5x vs 10x)
Bandwidth scarcity widens Poseidon's lead | On VGG19 with 16 nodes and 10 GbE, Caffe-PS reaches 8x while Poseidon stays near-linear
Statistical performance is preserved | ResNet-152 reaches 0.24 top-1 error in <90 epochs at 16 and 32 nodes
SFB is conditional, not universal | GoogLeNet's thin 1000×1024 FC layer is correctly served by PS; Poseidon does not force SFB
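
The partitioning take-away can be illustrated with a small load-balance sketch. It compares assigning whole tensors to servers against dealing fixed 2 MB chunks round-robin. The tensor sizes, the server count, the round-robin policy, and the function names (coarse_loads, chunked_loads) are assumptions made for this illustration; the paper's actual placement policy may differ.

```python
MB = 1024 * 1024
CHUNK_BYTES = 2 * MB  # the 2 MB KV-pair granularity named in the table above


def coarse_loads(tensor_bytes, num_servers):
    """Place each whole tensor on one server, round-robin by tensor index."""
    loads = [0] * num_servers
    for i, size in enumerate(tensor_bytes):
        loads[i % num_servers] += size
    return loads


def chunked_loads(tensor_bytes, num_servers, chunk_bytes=CHUNK_BYTES):
    """Split every tensor into fixed-size chunks and deal the chunks round-robin."""
    loads = [0] * num_servers
    next_server = 0
    for size in tensor_bytes:
        remaining = size
        while remaining > 0:
            piece = min(chunk_bytes, remaining)
            loads[next_server] += piece
            next_server = (next_server + 1) % num_servers
            remaining -= piece
    return loads


# Hypothetical per-layer parameter sizes: one large FC tensor plus smaller ones.
TENSORS = [400 * MB, 64 * MB, 16 * MB, 8 * MB, 4 * MB, 1 * MB]
NUM_SERVERS = 4

if __name__ == "__main__":
    for name, loads in [
        ("coarse", coarse_loads(TENSORS, NUM_SERVERS)),
        ("2 MB chunks", chunked_loads(TENSORS, NUM_SERVERS)),
    ]:
        print(name, [load // MB for load in loads], "MB per server")
```

With whole-tensor placement, the server holding the largest tensor becomes the hotspot. With fixed-size chunks, every server ends up with close to an equal share of the bytes, which is the load-balancing effect the table attributes to fine-grained partitioning.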

10. Named Methods Catalogue

Name | Purpose | Location
Wait-Free Backpropagation (WFBP) | Overlap per-layer comm with backward compute | §3.1
Hybrid Communication (HybComm) | Per-layer choice of PS vs SFB | §3.2
Sufficient Factor Broadcasting (SFB) | Send rank-1 factors u, v instead of full FC gradient | §2.1, §3.2
Parameter Server (PS) | Centralized push/pull aggregation | §2.1
BestScheme(l) (Algorithm 1) | Select PS or SFB per layer using cost-model inequality | §4.1
Pipelined training loop (Algorithm 2) | Schedule per-layer Syncer immediately after BackwardThrough(l) | §4.1
2 MB KV-pair partitioning | Load-balanced PS sharding | §4.1
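
The overlap that WFBP and the pipelined loop aim for can be sketched with a toy timeline model. It is not the paper's implementation: it assumes a single network link on which syncs run one at a time in backward order, and every layer name and timing below is made up for illustration. The function names (sequential_time, wfbp_time) are hypothetical.

```python
def sequential_time(layers):
    """Finish the whole backward pass, then send every layer's gradients."""
    total_compute = sum(compute for _, compute, _ in layers)
    total_comm = sum(comm for _, _, comm in layers)
    return total_compute + total_comm


def wfbp_time(layers):
    """Start each layer's sync as soon as its backward step has finished.

    Assumption: syncs share one link, so a sync also waits for the previous one.
    """
    compute_done = 0.0  # time at which the current layer's backward step ends
    link_free = 0.0     # time at which the link finishes the previous sync
    for _, compute, comm in layers:
        compute_done += compute
        start = max(compute_done, link_free)
        link_free = start + comm
    return max(compute_done, link_free)


# (name, backward compute time, communication time), listed in backward order,
# top layer first. FC layers: cheap compute, heavy comm. Conv layers: the reverse.
LAYERS = [
    ("fc8", 1.0, 4.0),
    ("fc7", 1.0, 6.0),
    ("conv5", 5.0, 1.0),
    ("conv1", 5.0, 1.0),
]

if __name__ == "__main__":
    print("sequential:", sequential_time(LAYERS))
    print("overlapped:", wfbp_time(LAYERS))
```

In this toy setting, the heavy communication of the top layers hides behind the backward compute of the lower layers, so only the sync of the last layer remains exposed.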

Note on NCCL Tuning

Poseidon's PS-vs-SFB cost inequality (Algorithm 1) is essentially a per-layer transport-selection rule driven by layer shape, batch size, and worker count. Structurally, it is the same kind of decision as choosing among NCCL collective algorithms and protocols for each collective. The paper's empirical finding that fine-grained 2 MB partitioning beats coarse-grained tensor splitting on the same hardware (31.5x vs ~10x at 32 nodes for Inception-V3) is direct evidence that chunk granularity is a first-class tuning axis, not a second-order detail. The cost-model formulas in Table 1 are also a useful reference for analytical priors on per-collective transport cost as a function of layer shape (M, N), batch size, and node count P, which closely parallel the inputs an NCCL tuner sees.
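
A minimal sketch of such a per-layer selection rule follows. The cost expressions are written from memory of the paper's Table 1 and Algorithm 1 and do not appear in this summary, so they should be checked against the original before being relied on. The function names (ps_cost, sfb_cost, best_scheme) are hypothetical. The 4096×4096 FC shape comes from the standard VGG architecture, the batch sizes from the Setup table, and the node count of 16 is illustrative.

```python
def ps_cost(m, n, p1, p2):
    """Assumed per-node PS traffic for an M x N layer when a node is both
    a worker and a server shard (p1 workers, p2 servers)."""
    return 2 * m * n * (p1 + p2 - 2) / p2


def sfb_cost(m, n, k, p1):
    """Assumed per-node SFB traffic: K pairs of factors (sizes M and N)
    exchanged with each of the other p1 - 1 workers."""
    return 2 * k * (p1 - 1) * (m + n)


def best_scheme(layer_type, m, n, k, p1, p2):
    """Return "SFB" or "PS" for one layer under the assumed cost model."""
    if layer_type != "fc":
        return "PS"  # only FC gradients are treated as factorizable here
    if sfb_cost(m, n, k, p1) <= ps_cost(m, n, p1, p2):
        return "SFB"
    return "PS"


if __name__ == "__main__":
    nodes = 16  # illustrative; every node is both worker and server
    # Large square FC layer with a small batch: factors are much cheaper.
    print("VGG-style 4096x4096 FC, K=32:",
          best_scheme("fc", 4096, 4096, 32, nodes, nodes))
    # Thin FC layer with a large batch: the full matrix is cheaper to send.
    print("GoogLeNet 1000x1024 FC, K=128:",
          best_scheme("fc", 1000, 1024, 128, nodes, nodes))
    print("any conv layer:", best_scheme("conv", 512, 4608, 32, nodes, nodes))
```

Under these assumed formulas, the thin GoogLeNet layer falls on the PS side at 16 nodes, which agrees with the take-away that SFB is conditional. The same structure (a few closed-form costs compared per call) is what makes the rule usable as an analytical prior for a tuner.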