Brief Summary: P3

Full title: P3: Distributed Deep Graph Learning at Scale
Authors: Swapnil Gandhi, Anand Padmanabha Iyer (Microsoft Research)
Year: 2021
Venue: OSDI '21


Problem

Training Graph Neural Networks (GNNs) on large real-world graphs (billions of nodes and edges) in a distributed setting is dominated by network communication rather than computation. In existing frameworks such as DGL, GPUs sit blocked on communication roughly 80% of the time. The cause is that GNNs must pull k-hop neighborhood features from remote machines to construct each node's computation graph, and those features are large (hundreds to thousands of dimensions per node), so feature movement across the network becomes the dominant cost. Standard distributed graph processing techniques (METIS, vertex-cut, 3D partitioning) make this worse, not better: they optimize only the first hop of communication and incur high preprocessing overhead.

Core Insight

GNN computation graphs are easy to partition cleanly (k-hop boundaries are well-defined), but features are not. Rather than moving features over the network, P3 avoids moving them at all: it distributes the features independently of the graph structure and computes the compute-intensive first layer via intra-layer model parallelism across all machines, so each machine produces partial activations of layer 1 from its local feature slice. Only the small partial activations, not the features, cross the network. Layers 2 through K then proceed in standard data-parallel fashion. This pipelined push-pull execution eliminates feature movement and dramatically reduces network traffic.
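
A minimal sketch of this idea, assuming 4 machines and illustrative tensor shapes (none of which come from the paper):

```python
# Minimal sketch of P3-style intra-layer model parallelism for layer 1.
# Machine count, shapes, and the simulated "machines" are illustrative
# assumptions, not P3's code. The point: splitting the feature dimension F
# across N machines lets each machine compute a partial activation from its
# local feature slice; summing the partials reproduces the full X @ W, so
# only small hidden-dimension-sized partial activations cross the network.
import numpy as np

N = 4                                  # machines (assumed)
num_nodes, F, hidden = 8, 128, 16      # illustrative sizes
rng = np.random.default_rng(0)

X = rng.normal(size=(num_nodes, F))    # input features, F-dimensional
W = rng.normal(size=(F, hidden))       # layer-1 weights

# Each "machine" owns F/N feature columns and the matching weight rows.
col_slices = np.array_split(np.arange(F), N)
partials = [X[:, s] @ W[s, :] for s in col_slices]   # local partial activations

# Aggregating the partials (a ReduceScatter/AllReduce over small tensors in
# P3's sync step) equals the full first-layer pre-activation.
assert np.allclose(sum(partials), X @ W)
```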

Method

P3 introduces four main techniques:

1. Independent Random Hash Partitioning: Graph topology and input features are partitioned independently using random hashing: topology by node ID, features along the feature dimension (F/N features per machine for N machines). This provides load balance without preprocessing overhead and enables model parallelism over the feature dimension (see the partitioning sketch after this list).

2. Push-Pull Parallelism: Execution proceeds in two phases for each minibatch. First, each machine gathers only graph structure (never features) from remote machines to build the computation graphs. Then the layer-1 computation graph is pushed to all machines, each of which computes partial activations over its local feature slice (intra-layer model parallelism); the small partial activations are pulled back and aggregated at the owning machine, and layers 2 through K proceed data-parallel.

3. Pipelining (inspired by PipeDream): P3 overlaps the model-parallel forward pass of minibatch m+1 with the data-parallel forward pass of minibatch m using a static 2-forward, 2-backward schedule. Pipeline delay is bounded at 3 (weight staleness bound w_{t+1} = w_t − α·∇f(w_{t−3})). Pipelining recovers 30–50% additional performance.

4. Caching: Since graph structure and features are partitioned independently, P3 can cache either independently on machines with spare memory. Caching up to 4 partitions of both graph and features reduces epoch time by up to 1.7×.
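
The independent partitioning in technique 1 can be pictured with a short sketch; the machine count, the mod-N hash, and the data layout below are illustrative assumptions rather than P3's implementation:

```python
# Minimal sketch of independent random-hash partitioning (technique 1).
# Machine count, the mod-N hash, and the data layout are assumptions.
from typing import Dict, List, Tuple

N = 4  # machines (assumed)

def partition_topology(edges: List[Tuple[int, int]]) -> Dict[int, List[Tuple[int, int]]]:
    """Assign each edge to the machine owning its destination node (hash of node ID)."""
    parts: Dict[int, List[Tuple[int, int]]] = {m: [] for m in range(N)}
    for src, dst in edges:
        parts[dst % N].append((src, dst))      # simple hash: node ID mod N
    return parts

def partition_features(num_features: int) -> Dict[int, range]:
    """Slice the feature dimension so each machine holds roughly F/N features."""
    per = (num_features + N - 1) // N
    return {m: range(m * per, min((m + 1) * per, num_features)) for m in range(N)}

# Topology is split by node ID, features by feature index; the two
# partitionings are independent, which is what lets P3 cache either one
# separately and run model parallelism over the feature dimension.
topology_parts = partition_topology([(0, 1), (2, 1), (3, 0), (1, 2)])
feature_parts = partition_features(num_features=128)
```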

P-TAGS API: P3 exposes six functions: partition, scatter, gather, transform, sync, apply — allowing developers to implement new GNN architectures that benefit from P3's execution strategy.
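
The paper defines the semantics of these six functions, but this note does not record their signatures, so the sketch below is only a hypothetical illustration of how one layer might be composed from calls with those names. The StubContext class, every signature, and the call ordering are assumptions:

```python
# Hypothetical illustration of composing a GNN layer from the six P-TAGS
# function names. Only the names come from the paper; the stub context,
# all signatures, and the call ordering are assumptions for illustration.

class StubContext:
    """Placeholder standing in for P3's execution context (assumed shape)."""
    def partition(self, graph, features): ...
    def scatter(self, h): ...
    def gather(self, reduce="sum"): ...
    def transform(self, h, weight): ...
    def sync(self, partial): ...
    def apply(self, h, activation="relu"): ...

def first_layer(ctx: StubContext, features, weight):
    # Layer 1: local partial activation over this machine's feature slice,
    # then sync() aggregates the small partials across machines (the
    # model-parallel/data-parallel switch point).
    partial = ctx.transform(features, weight)
    h = ctx.sync(partial)
    ctx.scatter(h)                       # move activations along edges
    agg = ctx.gather(reduce="sum")       # per-node neighborhood aggregation
    return ctx.apply(agg)
```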

Key Results

| Graph       | Model                   | DGL (s/epoch) | P3 (s/epoch) | Speedup |
|-------------|-------------------------|---------------|--------------|---------|
| OGB-Product | SGCN                    | 4.535         | 1.019        | 4.45×   |
| OGB-Product | GAT                     | 5.067         | 1.912        | 2.65×   |
| Facebook    | SGCN                    | 22.264        | 4.102        | 5.43×   |
| Facebook    | GraphSAGE               | 23.936        | 6.298        | 3.80×   |
| UK-2006-05  | GraphSAGE (no sampling) | 151.3         | 21.3         | 7.09×   |

Limitations

Relevance to DynamICCL

P3 introduces a distinct communication pattern from standard DP training: intra-layer partial activation reduction (a ReduceScatter-like collective over small activation tensors) combined with standard data-parallel gradient AllReduce for layers 2 through K. This is directly relevant to DynamICCL in two ways.

First, P3's partial activation sync (the sync API call) is a small-message ReduceScatter fired once per minibatch at the model-parallel/data-parallel switch point. This message is a different size class than the gradient AllReduce at the end of backward. DynamICCL's Config Agent must distinguish these and apply different NCCL configurations — the small activation sync benefits from LL/LL128 protocols while the gradient AllReduce may benefit from ring or tree algorithms depending on message size.
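
A minimal sketch of the kind of size-class dispatch this implies; the thresholds, the classify helper, and the returned settings policy are illustrative assumptions (NCCL_PROTO and NCCL_ALGO are real NCCL environment variables, but DynamICCL's actual mechanism is not described here):

```python
# Minimal sketch of size-class dispatch for per-collective NCCL configuration.
# Thresholds and the policy below are assumptions; NCCL_PROTO / NCCL_ALGO are
# real NCCL environment variables, but how a Config Agent would apply them
# per collective is not specified in this note.

def classify(payload_bytes: int) -> dict:
    """Map a collective's payload size to a suggested NCCL configuration."""
    SMALL = 1 << 20      # 1 MiB boundary between size classes (assumed)
    HUGE = 256 << 20     # boundary where tree tends to beat ring (assumed)
    if payload_bytes < SMALL:
        # Small, latency-bound collective (e.g. P3's partial-activation sync):
        # low-latency LL/LL128 protocols, as noted above.
        return {"NCCL_PROTO": "LL128"}
    # Large, bandwidth-bound gradient AllReduce: pick ring or tree by size.
    return {"NCCL_ALGO": "Ring" if payload_bytes < HUGE else "Tree"}

# Example size classes (shapes assumed): a batch x hidden fp32 activation sync
# versus a 50M-parameter fp32 gradient AllReduce.
activation_sync_cfg = classify(4 * 1024 * 128)
gradient_allreduce_cfg = classify(4 * 50_000_000)
```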

Second, P3's evaluation uses 10 Gbps Ethernet, commodity Ethernet of the same general class as Chameleon Cloud's 1 GbE (though an order of magnitude faster). On bandwidth-constrained clusters, P3's elimination of feature movement directly complements DynamICCL's optimization of the residual communication. The two systems target orthogonal reduction strategies: P3 reduces communication volume; DynamICCL optimizes the execution of the residual volume that remains.