Brief Summary: PipeDream

Full title: PipeDream: Generalized Pipeline Parallelism for DNN Training
Authors: Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, Matei Zaharia (Microsoft Research, CMU, Stanford)
Year: 2019
Venue: SOSP '19


Problem

Data-parallel (DP) training on multi-GPU clusters suffers from growing communication overhead as worker count increases; for some models (VGG-16, GNMT), communication consumes up to 90% of total training time at 32 GPUs. The root cause is that DP synchronizes all model parameters via AllReduce after every backward pass, at a cost that grows with both model size and worker count. Model parallelism (MP) reduces per-worker parameter storage but severely under-utilizes GPUs: with a single minibatch flowing through the stages, at most one GPU is active at a time.

Core Insight

DNN training can be viewed as a computation pipeline: if different inputs (minibatches) are processed concurrently through different pipeline stages, GPUs can be kept busy in steady state with no pipeline bubbles. This inter-batch pipelining reduces communication to only the layer activations and gradients passed between adjacent stages, point-to-point sends rather than all-to-all AllReduce, which dramatically cuts communication volume for models whose boundary activations are small relative to the full parameter set.
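
A back-of-envelope comparison makes the volume argument concrete. The ring-AllReduce formula below is standard; the model and activation sizes are illustrative assumptions, not measurements from the paper:

```python
# Per-iteration communication volume: DP ring AllReduce vs. a
# PipeDream-style stage boundary. Sizes are assumed for illustration.

def ring_allreduce_bytes_per_worker(model_bytes: float, n_workers: int) -> float:
    # Ring AllReduce: each worker sends (and receives) 2*(n-1)/n of the model.
    return 2 * (n_workers - 1) / n_workers * model_bytes

def pipeline_bytes_per_boundary(activation_bytes: float) -> float:
    # A stage boundary carries the activation forward and a
    # same-sized gradient backward for each minibatch.
    return 2 * activation_bytes

MB = 1e6
model = 138e6 * 4   # ~138M fp32 parameters (VGG-16-like) -> ~552 MB
act = 25 * MB       # assumed activation size at the chosen stage cut

print(f"DP, 16 workers: {ring_allreduce_bytes_per_worker(model, 16) / MB:.0f} MB/worker/iteration")
print(f"Pipeline cut:   {pipeline_bytes_per_boundary(act) / MB:.0f} MB/boundary/minibatch")
```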

Method

PipeDream combines intra-batch parallelism (DP, MP) with inter-batch pipelining through three innovations:

1. Automated Work Partitioning (dynamic-programming algorithm): A profiler records per-layer compute time (T_l), output activation size (a_l), and parameter size (w_l) on a single GPU over ~1000 minibatches. A dynamic-programming algorithm then finds the optimal assignment of layers to stages and the replication factor per stage, minimizing the time taken by the slowest stage while respecting hardware topology (hierarchical bandwidth levels B_k). It runs in under 8 seconds; a simplified partitioner sketch follows this list.

2. 1F1B Scheduling: In steady state, each worker strictly alternates between one forward pass and one backward pass, each for a different minibatch. This eliminates pipeline bubbles (no idle time in steady state) and caps the number of in-flight minibatches at the minimum needed to keep the pipeline full (NOAM = ⌈num_workers / num_input_stage_replicas⌉). Memory overhead is therefore bounded: at most NOAM weight and activation versions are stashed per stage.

3. Weight Stashing: To keep gradients numerically valid, each stage stores multiple weight versions, one per active in-flight minibatch. A minibatch's backward pass uses the exact weight version that was used during its forward pass, preventing the forward/backward weight mismatch that would corrupt gradients in a naive pipeline. A toy simulation of 1F1B with stashing follows this list.
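
To make the partitioning step concrete, here is a minimal sketch of the dynamic program, assuming contiguous stages, a single bandwidth level, and no stage replication (the paper's algorithm also handles replication and the hierarchical bandwidths B_k); all names are illustrative:

```python
# Split L layers into S contiguous stages so that the slowest stage
# (compute time plus the time to receive its input activation) is as
# fast as possible. Simplified sketch; not the paper's full algorithm.
import functools

def partition(T, a, n_stages, bandwidth):
    # T[l]: per-layer compute time; a[l]: layer l's output activation bytes.
    prefix = [0.0]
    for t in T:
        prefix.append(prefix[-1] + t)

    def comm(l):
        # Time to ship layer l's activation to the next stage.
        return a[l] / bandwidth if l >= 0 else 0.0

    @functools.lru_cache(maxsize=None)
    def best(j, s):
        # Minimal bottleneck time for layers 0..j-1 split into s stages.
        if s == 1:
            return prefix[j]                          # first stage reads raw input
        return min(
            max(best(i, s - 1),                       # slowest earlier stage
                comm(i - 1) + prefix[j] - prefix[i])  # this stage's time
            for i in range(s - 1, j)
        )

    return best(len(T), n_stages)

# Toy profile: 6 layers split into 2 stages -> bottleneck time 8.0.
print(partition(T=[4, 2, 2, 1, 1, 2], a=[8, 8, 4, 4, 2, 2],
                n_stages=2, bandwidth=4.0))
```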
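
And a toy single-stage simulation of 1F1B with weight stashing, using integer version counters in place of real weight tensors; NOAM follows the formula above, everything else is an illustrative assumption:

```python
# One pipeline stage under 1F1B: stash the weight version each
# minibatch saw in its forward pass, and reuse it in its backward pass.
from collections import deque
import math

num_workers, input_stage_replicas = 4, 1
noam = math.ceil(num_workers / input_stage_replicas)  # in-flight minibatches

weights = 0        # integer "version" standing in for real weight tensors
stash = {}         # minibatch id -> weight version stashed at forward time
inflight = deque()

def forward(mb):
    stash[mb] = weights          # record the exact version used in forward
    inflight.append(mb)

def backward():
    global weights
    mb = inflight.popleft()
    used = stash.pop(mb)         # gradient computed w.r.t. the stashed version
    weights += 1                 # placeholder for the optimizer step
    return mb, used

# Startup phase: admit NOAM minibatches to fill the pipeline...
next_mb = 0
for _ in range(noam):
    forward(next_mb); next_mb += 1

# ...steady state: strictly alternate one backward, one forward.
for _ in range(6):
    mb, used = backward()
    print(f"backward mb={mb}: used weight v{used}, weights now at v{weights}")
    forward(next_mb); next_mb += 1
```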

Optional Vertical Sync: Extends weight stashing to enforce a consistent weight version across all stages (semantically equivalent to BSP with n-step staleness). It adds coordination overhead but guarantees that a given minibatch sees the same weight version at every stage.

Communication: Point-to-point sends of activations (forward) and gradients (backward) between adjacent pipeline stages. For replicated stages, NCCL AllReduce is used for gradient aggregation within the stage group. For inter-stage communication, Gloo is used (faster than NCCL for small activation/gradient tensors).
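
A minimal PyTorch-style sketch of this mixed pattern, assuming a hypothetical 4-rank layout with two replicas per stage; this is my mapping onto torch.distributed, not the paper's implementation (launch with torchrun --nproc_per_node=4 on a 4-GPU host):

```python
# Stage 0 (ranks 0-1) sends activations to stage 1 (ranks 2-3) over a
# Gloo group; each stage aggregates its gradients with NCCL AllReduce.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # default backend: intra-stage AllReduce
rank = dist.get_rank()
stage = 0 if rank < 2 else 1

# Every process must create every group, in the same order.
stage_groups = [dist.new_group([0, 1]), dist.new_group([2, 3])]  # NCCL groups
p2p_group = dist.new_group(backend="gloo")   # all-rank Gloo group for small sends

if stage == 0:
    act = torch.randn(64, 1024)              # CPU tensor: Gloo moves host memory
    dist.send(act, dst=rank + 2, group=p2p_group)   # forward-pass activations
else:
    act = torch.empty(64, 1024)
    dist.recv(act, src=rank - 2, group=p2p_group)   # gradients flow back the same way

# Within a replicated stage, aggregate weight gradients with NCCL AllReduce.
grad = torch.randn(1024, 1024, device=f"cuda:{rank % torch.cuda.device_count()}")
dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=stage_groups[stage])
grad /= 2                                    # average across the 2 replicas
```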

Key Results

Model      Cluster config    PipeDream config   Speedup over DP
VGG-16     4×4 V100s (A)     15-1               5.28×
VGG-16     2×8 V100s (B)     15-1               2.98×
GNMT-16    4×4 V100s (A)     Straight           2.34×
GNMT-8     2×8 V100s (B)     Straight           2.95×
AWD LM     1×4 V100s         Straight           4.25×
S2VT       4×1 Titan X       2-1-1              3.01×

Limitations

Weight stashing stores up to NOAM weight versions per stage, a memory overhead that grows with pipeline depth. Backward passes use weights that are several updates stale, so without vertical sync the update semantics differ from BSP and statistical efficiency is not formally guaranteed to match DP. Partitioning is computed once from an offline profile and cannot adapt to runtime changes in compute or network performance.

Relevance to DynamICCL

PipeDream establishes a fundamentally different communication pattern from DP: point-to-point activation/gradient sends between pipeline stages, plus AllReduce within replicated stages. This is directly relevant to DynamICCL in two ways.

First, PipeDream's empirical data (Figure 1, Figure 17) provides strong motivation for DynamICCL: DP communication overhead reaches 90% for large models on commodity clusters. This is precisely the problem DynamICCL aims to mitigate by choosing optimal NCCL configurations that reduce AllReduce latency and improve overlap.

Second, the co-existence of pipeline point-to-point sends (inter-stage) and AllReduce (intra-stage, for replicated stages) creates a mixed collective environment. DynamICCL's Config Agent must distinguish these traffic types: in a modern NCCL-based stack the inter-stage transfers map to NCCL Send/Recv, which call for different channel and protocol settings than large AllReduce operations. The paper's choice of Gloo over NCCL for inter-stage sends (NCCL was slower for small tensors) is direct confirmation that the choice of communication library and protocol matters significantly at the per-message level, which is exactly DynamICCL's optimization target.
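
For illustration, these are real NCCL environment knobs that plausibly diverge between the two traffic classes; the specific values, and the idea of keeping one profile per traffic class, are assumptions about what such a tuner might choose, not settings from the paper:

```python
# Two hypothetical NCCL tuning profiles: low-latency settings for the
# small inter-stage tensors, bandwidth-oriented settings for the large
# intra-stage AllReduce. Values are illustrative, not recommendations.
import os

small_message_profile = {
    "NCCL_PROTO": "LL",           # low-latency protocol, suited to small messages
    "NCCL_MIN_NCHANNELS": "1",    # fewer channels: lower launch overhead
    "NCCL_MAX_NCHANNELS": "2",
}

large_allreduce_profile = {
    "NCCL_PROTO": "Simple",       # highest-bandwidth protocol for large messages
    "NCCL_ALGO": "Ring",
    "NCCL_MAX_NCHANNELS": "16",
}

# NCCL reads these variables when a communicator is created, so applying
# a different profile per traffic class requires a separate communicator
# per profile, each initialized after the matching profile is exported.
os.environ.update(small_message_profile)
```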