Brief Summary: P3
Full title: P3: Distributed Deep Graph Learning at Scale
Authors: Swapnil Gandhi, Anand Padmanabha Iyer (Microsoft Research)
Year: 2021
Venue: OSDI '21
Problem
Training Graph Neural Networks (GNNs) on large real-world graphs (billions of nodes and edges) in a distributed setting is dominated by network communication, not computation. In existing frameworks such as DGL, GPUs sit blocked on communication approximately 80% of the time. This happens because GNNs must pull k-hop neighborhood features from remote machines to construct each node's computation graph, and those features are large (hundreds to thousands of dimensions per node), making feature movement across the network the dominant cost. Standard distributed graph partitioning techniques (METIS, vertex-cut, 3D partitioning) make this worse, not better: they optimize only the first hop of communication and incur high preprocessing overhead.
Core Insight
GNN computation graphs are easy to partition cleanly (k-hop boundaries are well-defined), but features are not. Rather than moving features over the network, P3 avoids moving them at all: it distributes the features independently of the graph structure and computes partial activations of the compute-intensive first layer (denoted layer 1M, for its model-parallel portion) via intra-layer model parallelism across all machines. Only small partial activations, not features, cross the network. The data-parallel remainder of layer 1 (layer 1D) and layers 2 through K then proceed in standard data-parallel fashion. This pipelined push-pull execution eliminates feature movement and dramatically reduces network traffic.
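A minimal sketch of why this works (plain PyTorch, hypothetical sizes): when the feature matrix is sharded column-wise and the layer-1 weight row-wise, the per-machine partial products sum to exactly the full activation, and the tensors that must cross the network have the hidden dimension H rather than the feature dimension F.

```python
import torch

F, H, n_nodes, n_machines = 128, 32, 1000, 4  # hypothetical sizes

X = torch.randn(n_nodes, F)   # input features
W1 = torch.randn(F, H)        # layer-1 weight
full = X @ W1                 # what a single machine would compute

# Shard features along the feature dimension (F/N columns per machine)
# and W1 along its rows; each machine forms a partial activation locally.
shard = F // n_machines
partials = [
    X[:, i * shard:(i + 1) * shard] @ W1[i * shard:(i + 1) * shard, :]
    for i in range(n_machines)
]

# Summing the partials recovers the full layer-1 output, so only
# (n_nodes x H) partial activations need to be reduced over the network
# instead of (n_nodes x F) raw features being pulled.
assert torch.allclose(sum(partials), full, rtol=1e-4, atol=1e-4)
```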
Method
P3 introduces four main techniques:
1. Independent Random Hash Partitioning: The graph topology and input features are partitioned independently via random hashing: topology by node ID, features along the feature dimension (F/N feature dimensions per machine for N machines). This provides load balance without preprocessing overhead and enables model parallelism on the feature dimension.
2. Push-Pull Parallelism: Execution proceeds in two phases for each minibatch (see the distributed sketch after this list):
- Push (layer 1M, model parallel): P3 pushes the layer-1 graph structure to all machines. Each machine computes partial activations for layer 1 using the F/N feature dimensions it owns locally. The partial activations are then aggregated via a reduce operation (a ReduceScatter pattern).
- Pull (layers 1D through K, data parallel): The aggregated activations are passed through the remaining layers in standard data-parallel fashion, with gradient AllReduce at each layer boundary.
3. Pipelining (inspired by PipeDream): P3 overlaps the model-parallel forward pass of minibatch m+1 with the data-parallel forward pass of minibatch m using a static 2-forward, 2-backward schedule. Pipeline delay is bounded at 3 (weight staleness bound w_{t+1} = w_t − α·∇f(w_{t−3})). Pipelining recovers 30–50% additional performance.
4. Caching: Since graph structure and features are partitioned independently, P3 can cache either independently on machines with spare memory. Caching up to 4 partitions of both graph and features reduces epoch time by up to 1.7×.
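A hedged sketch of one push-pull step using torch.distributed directly (an NCCL backend is assumed for reduce_scatter; names like feature shard and w1_shard are illustrative, and the backward collective at the 1M/1D boundary is elided): every rank computes a partial layer-1 activation for the whole minibatch, a ReduceScatter leaves each rank with the summed activations for its 1/N share of the rows, and layers 1D..K then run data-parallel with gradient AllReduce.

```python
import torch
import torch.distributed as dist

def push_pull_step(frontier_feats_shard, w1_shard, model_tail, optimizer):
    """One hypothetical P3-style minibatch step on one rank.

    frontier_feats_shard: (B, F/N) local slice of frontier features
    w1_shard:             (F/N, H) local slice of the layer-1 weight
    model_tail:           layers 1D..K, replicated on every rank
    """
    world = dist.get_world_size()

    # Push phase (layer 1M, model parallel): this rank computes a
    # partial activation for the *whole* minibatch from its feature slice.
    partial = frontier_feats_shard @ w1_shard            # (B, H)

    # ReduceScatter: each rank receives the summed activations for its
    # 1/N share of the minibatch rows. Only (B, H) activations move,
    # never (B, F) features. Requires the NCCL backend.
    out = torch.empty(partial.shape[0] // world, partial.shape[1])
    dist.reduce_scatter_tensor(out, partial.contiguous())

    # Pull phase (layers 1D..K, data parallel) on the local share,
    # followed by the usual gradient AllReduce over the replicated tail.
    loss = model_tail(out).sum()                         # placeholder loss
    loss.backward()
    for p in model_tail.parameters():
        dist.all_reduce(p.grad)
        p.grad /= world
    optimizer.step()
    optimizer.zero_grad()
```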
P-TAGS API: P3 exposes six functions: partition, scatter, gather, transform, sync, apply — allowing developers to implement new GNN architectures that benefit from P3's execution strategy.
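The paper describes this API only at a high level, so the following is a hypothetical sketch of how a GraphSAGE-style layer might be expressed against a P-TAGS-like interface (partition is omitted; every name and signature here is illustrative, not P3's actual API):

```python
import torch

def reduce_partials(t):
    # Placeholder for the cross-machine reduction of partial activations;
    # a real runtime would issue the ReduceScatter collective here.
    return t

class SAGELayerSketch:
    """Hypothetical GraphSAGE layer against a P-TAGS-like interface."""

    def __init__(self, in_dim, out_dim):
        self.w = torch.randn(in_dim, out_dim)
        self.b = torch.zeros(out_dim)

    def scatter(self, src_feat):
        # Ship the (sliced) source-node state along each edge.
        return src_feat

    def gather(self, messages):
        # Aggregate incoming messages, e.g. mean over neighbors.
        return messages.mean(dim=0)

    def transform(self, agg):
        # Dense transform; on layer 1M this consumes a feature slice
        # and therefore yields only a *partial* activation.
        return agg @ self.w

    def sync(self, partial):
        # Marks tensors needing global reduction across machines (the
        # developer annotation the paper requires for non-linear layers).
        return reduce_partials(partial)

    def apply(self, reduced):
        # Final per-node update after synchronization.
        return torch.relu(reduced + self.b)
```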
Key Results
| Graph | Model | DGL (s/epoch) | P3 (s/epoch) | Speedup |
|---|---|---|---|---|
| OGB-Product | SGCN | 4.535 | 1.019 | 4.45× |
| OGB-Product | GAT | 5.067 | 1.912 | 2.65× |
| | SGCN | 22.264 | 4.102 | 5.43× |
| | GraphSAGE | 23.936 | 6.298 | 3.80× |
| UK-2006-05 | GraphSAGE (no sampling) | 151.3 | 21.3 | 7.09× |
- P3 outperforms ROC (prior SOTA) by up to 2.2× on large graphs; benefits increase with graph size.
- Pipelining adds 30–50% speedup on top of base P3.
- Caching adds up to 1.7× speedup beyond base P3.
- P3 achieves near-linear scaling: doubling machines doubles throughput (DGL throughput is essentially flat with more machines).
- GPU utilization: P3 keeps GPUs busy ~85% of the time; DGL keeps them busy ~20%.
- Same accuracy as DGL: P3 reaches 76.2% on OGB-Product in 61.65s vs. DGL's 236.37s.
Limitations
- P3's benefit depends on the activation size being smaller than the feature size. When hidden dimensions approach or exceed the feature dimension, P3's advantage shrinks and can become a penalty (Figure 15); see the back-of-envelope sketch after this list. The crossover point is graph- and architecture-dependent.
- Pipelining introduces weight staleness of 3 steps. For large GNN models (future architectures), this may hurt convergence.
- The model-parallel phase (layer 1M) requires that the layer transformation be aggregatable from partial results. Non-linear GAT layers require developer annotation to identify which tensors need global synchronization (§3.5). This places burden on the developer.
- Evaluation uses 10 Gbps Ethernet. On InfiniBand (100 Gbps), the feature-movement bottleneck is proportionally smaller and P3's advantage is reduced relative to Ethernet-connected clusters.
- Scaling beyond 4 nodes (16 GPUs total) was not evaluated. At larger machine counts, the pull of partial activations from all machines grows linearly, potentially limiting scalability.
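A back-of-envelope sketch of the crossover in the first limitation, under a deliberately crude toy model (the real break-even depends on fan-out, sampling, and partitioning, none of which this captures):

```python
# Hypothetical per-frontier-node communication volumes, in elements;
# constants such as fan-out and bytes/element cancel in the ratio.
def dgl_volume(F):
    return F                       # pull the raw feature vector

def p3_volume(H, n_machines):
    return H * (n_machines - 1)    # rough cost of reducing partials

F, n = 128, 4
for H in (16, 32, 48, 64):
    print(H, p3_volume(H, n) / dgl_volume(F))
# ratios: 0.375, 0.75, 1.125, 1.5 -> under this toy model P3's
# advantage flips once H*(n-1) exceeds F, i.e. roughly H > F/(n-1).
```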
Relevance to DynamICCL
P3 introduces a distinct communication pattern from standard DP training: intra-layer partial activation reduction (a ReduceScatter-like collective over small activation tensors) combined with standard data-parallel gradient AllReduce for layers 2 through K. This is directly relevant to DynamICCL in two ways.
First, P3's partial activation sync (the sync API call) is a small-message ReduceScatter fired once per minibatch at the model-parallel/data-parallel switch point. This message is a different size class than the gradient AllReduce at the end of backward. DynamICCL's Config Agent must distinguish these and apply different NCCL configurations — the small activation sync benefits from LL/LL128 protocols, while the gradient AllReduce may benefit from ring or tree algorithms depending on message size.
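A hypothetical sketch of such a size-class rule table (DynamICCL's actual Config Agent interface is not described here; NCCL_PROTO and NCCL_ALGO are real NCCL environment variables, but the thresholds and per-collective selection below are illustrative):

```python
import os

# Hypothetical size-class -> NCCL tuning map. NCCL reads NCCL_PROTO /
# NCCL_ALGO at communicator init, so a real agent would keep one
# pre-initialized communicator (or process group) per config rather
# than flipping env vars between calls.
RULES = [
    # (max message bytes, config)
    (256 * 1024,   {"NCCL_PROTO": "LL128",  "NCCL_ALGO": "Tree"}),  # activation sync
    (float("inf"), {"NCCL_PROTO": "Simple", "NCCL_ALGO": "Ring"}),  # gradient AllReduce
]

def config_for(message_bytes):
    for ceiling, cfg in RULES:
        if message_bytes <= ceiling:
            return cfg

def apply_config(cfg):
    # Must run before the corresponding communicator is created.
    os.environ.update(cfg)

# e.g. a (1024, 32) fp32 activation sync vs. a 100M-parameter gradient:
print(config_for(1024 * 32 * 4))        # 128 KB  -> LL128/Tree
print(config_for(100_000_000 * 4))      # 400 MB  -> Simple/Ring
```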
Second, P3's evaluation uses 10 Gbps Ethernet, a commodity-Ethernet class comparable to Chameleon Cloud (1 GbE). On bandwidth-constrained clusters, P3's elimination of feature movement is a direct complement to DynamICCL's optimization of the residual communication. The two systems target orthogonal reduction strategies: P3 reduces communication volume; DynamICCL optimizes the execution of the residual volume that remains.