P3: Distributed Deep Graph Learning at Scale — Detailed Summary
Authors: Swapnil Gandhi, Anand Padmanabha Iyer (Microsoft Research)
Venue: USENIX OSDI 2021
URL: https://www.usenix.org/conference/osdi21/presentation/gandhi
Abstract
P3 (Pipelined Push-Pull) is a distributed system for training Graph Neural Networks (GNNs) on large, real-world graphs. The core claim is that scalability challenges in GNN training are fundamentally different from those in classical DNN training, and that standard techniques — intelligent graph partitioning, data parallelism, model parallelism — all fail when applied directly to GNNs. P3 proposes: (1) independent random hash partitioning of graph structure and features, (2) a hybrid push-pull parallelism execution strategy combining intra-layer model parallelism with data parallelism, (3) pipelining to overlap computation and communication, and (4) a greedy caching mechanism. Together these techniques deliver up to 7x speedup over DGL and up to 2.2x over ROC on graphs with billions of nodes and edges.
1. Motivation and Problem Statement
GNNs learn low-dimensional embeddings of graph nodes by combining node feature information with graph structure through iterative neighborhood aggregation. The k-hop computation graph for each node can be exponentially large (neighborhood explosion), and feature vectors per node are dense and large (100–1000s of dimensions). In distributed GNN training this creates a compound bottleneck:
- The computation graph for a minibatch must be generated by pulling k-hop neighborhood data, including features, from remote machines. For a 2-layer GNN with fan-out (25, 10), layer-0 has 188,339 nodes; pulling their 100-dimensional features costs ~71.84 MB per minibatch. GPUs sit idle up to 80% of training time waiting on this communication in DGL.
- Standard graph partitioning (METIS, edge-cut, vertex-cut, 3D) optimizes only the first-hop communication, provides diminishing returns as the number of layers grows, and incurs heavy preprocessing time and memory overhead (METIS: 4264s partitioning, 63 GB memory for OGB-Product).
- Alternative parallelism strategies (pure data parallelism, pure model parallelism) either fail to help or introduce additional intra-layer communication, making GPU underutilization worse.
The key empirical observation: GPUs spend ~80% of time blocked on network I/O in DGL, compared to ~15% in P3.
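The 71.84 MB figure is easy to sanity-check. A quick back-of-the-envelope in Python, assuming 4-byte float32 features (the node count and dimensions come from the paper's OGB-Product example):

```python
layer0_nodes = 188_339                   # layer-0 nodes for fan-out (25, 10)
feature_dim = 100                        # float32 features, 4 bytes each
volume = layer0_nodes * feature_dim * 4  # bytes pulled per minibatch
print(volume / 2**20)                    # ~71.84 MiB
```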
2. Background
2.1 Graph Neural Networks
A GNN with k layers computes node embeddings via repeated neighborhood aggregation:
h^(k)_{N(v)} = AGGREGATE^(k)({ h^(k-1)_u : u in N(v) })
h^(k)_v = sigma( W_k * COMBINE^(k)( h^(k-1)_v, h^(k)_{N(v)} ) )
Representative architectures include GraphSAGE (inductive, sampling-based), GCN (spectral), GAT (attention-based), SGCN (simplified GCN). All share the dependency structure: each node's embedding depends on its multi-hop neighborhood.
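To make the recurrence concrete, here is a minimal single-machine NumPy sketch of one GraphSAGE-style layer with mean aggregation; gnn_layer, the toy graph, and the dimensions are illustrative, not from the paper:

```python
import numpy as np

def gnn_layer(h, neighbors, W):
    """One GNN layer: mean-AGGREGATE over neighbors, COMBINE by
    concatenation, then a linear transform and ReLU (sigma)."""
    h_agg = np.zeros_like(h)
    for v, nbrs in neighbors.items():
        if nbrs:
            h_agg[v] = h[nbrs].mean(axis=0)     # AGGREGATE^(k)
    h_cat = np.concatenate([h, h_agg], axis=1)  # COMBINE^(k)
    return np.maximum(h_cat @ W, 0.0)           # sigma(W_k * ...)

# Toy 4-node graph with input features F = 8 and hidden dim H = 4.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
h0 = np.random.default_rng(0).standard_normal((4, 8))
W1 = np.random.default_rng(1).standard_normal((16, 4))  # (2F, H)
print(gnn_layer(h0, neighbors, W1).shape)  # (4, 4)
```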
2.2 Existing Distributed GNN Frameworks
DGL (Deep Graph Library) combines distributed graph processing with DNN data parallelism:
- Partition graph + features across machines.
- For each minibatch, pull k-hop neighborhood + features (inter-machine communication).
- Build computation graph locally.
- Execute in data-parallel fashion on GPU.
This leads to the communication bottleneck described above.
2.3 Three Core Challenges
Challenge 1 — Communication from data dependencies. Features are the dominant traffic source (100–1000x larger than activations). Pulling features for k-hop neighbors floods the network.
Challenge 2 — Ineffectiveness of partitioning. Intelligent partitioning (METIS, vertex-cut) optimizes only the first hop. For a k-layer GNN, communication at hops 2..k is unaffected. Moreover, sophisticated schemes incur preprocessing overhead comparable to or exceeding total training time, and vertex-cut schemes inflate memory via replication.
Challenge 3 — GPU underutilization. Due to communication blocking, GPUs in DGL are active only ~20% of the time. Model parallelism is not a remedy because GNNs lack the clean pipeline structure of sequential DNN layers; intra-layer communication compounds the data-dependency communication.
3. P3 System Design
3.1 Independent Hash Partitioning
P3's foundational insight: partition graph topology and input features independently using a random hash partitioner.
- Graph nodes are assigned to machines via random hash. Edges are co-located with their destination node (1D partitioning). This is computationally trivial — no preprocessing, no routing tables.
- Input features with dimension F are sliced along the feature dimension. Each of N machines holds F/N features for every node. This ensures perfect load balance across the feature computation (the bottleneck layer) regardless of graph structure.
This independence is what makes the push-pull execution strategy feasible and what enables independent caching of structure vs. features.
Traditional partitioning:
Machine M_i owns: nodes_i + features_for(nodes_i)
-> pulling neighbors requires moving BOTH structure AND features
P3 independent partitioning:
Machine M_i owns: nodes_i (graph) AND F/N features of ALL nodes
-> structure pull: lightweight (graph is small)
-> feature computation: local (each machine computes its F/N slice)
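A minimal sketch of the two independent partitioners, assuming a NumPy feature matrix and an edge list; partition_structure and partition_features are illustrative helper names, not P3's API:

```python
import numpy as np

def partition_structure(edges, num_machines):
    """Random hash partitioning of topology: each edge is co-located
    with its destination node's owner (1D partitioning)."""
    owner = lambda node: hash(node) % num_machines  # any cheap stateless hash
    parts = [[] for _ in range(num_machines)]
    for src, dst in edges:
        parts[owner(dst)].append((src, dst))
    return parts

def partition_features(feat, num_machines):
    """Slice features along the feature dimension: machine i holds an
    F/N-column slice of EVERY node's feature vector."""
    return np.array_split(feat, num_machines, axis=1)

edges = [(0, 1), (0, 2), (2, 3), (3, 0)]
feat = np.arange(4 * 8, dtype=np.float32).reshape(4, 8)  # 4 nodes, F = 8
graph_parts = partition_structure(edges, num_machines=2)
feat_slices = partition_features(feat, num_machines=2)   # two (4, 4) slices
```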
3.2 Push-Pull Parallelism (Hybrid Execution)
P3 adopts a hybrid execution model that switches between intra-layer model parallelism (for the first, most compute-intensive layer) and data parallelism (for remaining layers).
3.2.1 Computation Graph Generation
At the start of each minibatch, P3 pulls only the k-hop graph structure (not features) for the nodes being embedded. Since graph structure is lightweight, this communication is minimal. If the entire graph fits in host memory on every machine (which is common), this step is entirely local.
3.2.2 Execution Flow
FORWARD PASS:
Step 1 [PUSH, Layer 1_M]: P3 pushes computation graph for layer 1
to ALL machines in the cluster.
Step 2 [Model Parallel, Layer 1_M]: Each machine computes
PARTIAL activations for layer 1 using its LOCAL F/N feature
slice. No cross-machine feature movement needed.
Step 3 [PULL + Reduce]: Machine owning each node pulls partial
activations from all other machines and reduces (sums) them.
Result: full activation for layer 1 at each node's home machine.
Step 4 [Data Parallel, Layer 1_D]: Non-linear ops, normalization, etc.
applied locally (data-parallel, no sync needed for element-wise ops).
Step 5 [Data Parallel, Layers 2..K]: Standard data-parallel execution
for the remaining layers.
BACKWARD PASS (mirror of forward):
Step 6 [Data Parallel, Layers K..2]: Standard data-parallel backward;
weight gradients are synchronized across machines at layer boundaries.
Step 7 [PUSH, Layer 1_D -> 1_M]: Error gradients pushed to all machines.
Step 8 [Model Parallel, Layer 1_M backward]: Local backward pass
for each machine's F/N feature slice.
Why this eliminates feature movement: P3 never moves features across the network. Instead it moves the much smaller partial activations (size: hidden_dim, not feature_dim) during the reduce step. For the OGB-Product experiment: feature pull = 71.84 MB vs. P3 partial activation transfer = 5 MB (3 machines * 24703 nodes * 16 hidden dims).
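The key identity behind Steps 2 and 3 is that a matrix product decomposes over column slices of the input. A single-process NumPy simulation (a sketch with toy dimensions, not P3 code) shows that summing the per-machine partial activations recovers the activation computed on the full features:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, H, nodes = 4, 100, 16, 8         # machines, feature dim, hidden dim, minibatch
X = rng.standard_normal((nodes, F))    # full input features
W = rng.standard_normal((F, H))        # layer-1 weight

# Step 2 (model parallel): machine i multiplies its F/N feature slice by
# the matching rows of W, producing a PARTIAL activation of width H.
partials = [x @ w for x, w in zip(np.array_split(X, N, axis=1),
                                  np.array_split(W, N, axis=0))]

# Step 3 (pull + reduce): the sum over machines equals the full
# pre-activation, and only (nodes x H) values move per machine.
assert np.allclose(sum(partials), X @ W)
```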
3.3 Pipelining
P3 uses a static 2-forward, 2-backward pipeline schedule inspired by PipeDream, scheduling 3 concurrent minibatches to overlap communication with computation.
Timeline (simplified, three minibatches in flight): each minibatch passes
through a model-parallel phase (M) and a data-parallel phase (D), i.e.
1_M, 1_D, 2_M, 2_D, 3_M, 3_D, ... The schedule interleaves phases of
different minibatches so that the push/pull communication of one
minibatch's M phase (e.g., 3_M) overlaps with the D-phase computation of
earlier minibatches.
The pipeline introduces bounded staleness: the gradient used to update a weight was computed from a weight version 3 steps old:
w_{t+1} = w_t - alpha * grad_f(w_{t-3})
This delay is fixed and bounded (unlike unbounded asynchronous SGD), matching the staleness structure of PipeDream. The paper argues this is acceptable and empirically shows accuracy parity with data-parallel training.
Pipelining yields 30–50% additional performance improvement over non-pipelined P3.
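To make the staleness semantics concrete, a toy sketch assuming a quadratic loss f(w) = 0.5 * ||w||^2; the stash list stands in for weight stashing, and all names here are illustrative:

```python
import numpy as np

alpha, delay = 0.1, 3
w = np.ones(4)
stash = [w.copy() for _ in range(delay + 1)]  # stashed weight versions

for t in range(100):
    g = stash[0]               # grad_f(w) = w for the quadratic loss,
                               # evaluated at the version 3 steps old
    w = w - alpha * g          # w_{t+1} = w_t - alpha * grad_f(w_{t-3})
    stash = stash[1:] + [w.copy()]

print(np.linalg.norm(w))       # shrinks toward 0 despite the fixed delay
```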
3.4 Caching
Because graph topology and features are partitioned independently, P3 can cache them independently when host memory permits. A greedy scheme finds the minimum number of machines needed to hold graph or feature partitions and creates copies on remaining machines, reducing communication further.
Caching delivers up to 1.7x additional speedup over non-cached P3. This is impossible in DGL because structure and features are co-partitioned — replicating one forces replication of the other.
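A sketch of the greedy placement, under the assumptions of uniform partition sizes and identical per-machine memory budgets; cache_plan and its parameters are hypothetical, not P3's API:

```python
import math

def cache_plan(data_size_gb, spare_mem_gb, num_machines):
    """Find the minimum number of machines k whose combined spare host
    memory holds all partitions, then replicate the data onto the
    remaining machines to cut remote pulls."""
    k = math.ceil(data_size_gb / spare_mem_gb)
    if k > num_machines:
        return None               # does not fit; fall back to no caching
    return {"machines_per_copy": k, "num_copies": num_machines // k}

# E.g., 120 GB of features, 50 GB spare per machine, 8 machines:
print(cache_plan(120, 50, 8))     # {'machines_per_copy': 3, 'num_copies': 2}
```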
3.5 P-TAGS API
P3 exposes a six-function API for developers to implement new GNN architectures:
| Function | Description |
|---|---|
| partition(graph, feat, topo_part_fn, ft_part_fn) | Independently partition topology and features |
| scatter(graph, feat, udf) -> msg | Generate messages per edge combining src/edge/dst features |
| gather(msg, udf) -> a_ft | Aggregate neighborhood messages at each vertex |
| transform(v_ft, a_ft) -> t_ft | Compute partial representation (requires sync) |
| sync(t_ft, op='sum') -> sync_ft | Accumulate partial representations via all-reduce |
| apply(v_ft, sync_ft) -> ft | Compute final vertex representation |
Listing 1 in the paper shows GraphSAGE implemented in ~15 lines using this API. The API is general enough to express GCN, GAT, SGCN, and GraphSAGE.
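A rough sketch of how a GraphSAGE layer decomposes onto these six calls. The stand-in implementations below are single-machine illustrations of the data flow only, not the paper's Listing 1 and not DGL/P3 code; in P3, scatter/gather/transform run under model parallelism on each machine's feature slice and sync is a cross-machine reduce:

```python
import numpy as np

def scatter(edges, feat):            # message per edge = source feature
    return [(dst, feat[src]) for src, dst in edges]

def gather(msgs, num_nodes):         # mean-aggregate messages per vertex
    a_ft = np.zeros((num_nodes, msgs[0][1].shape[0]))
    cnt = np.zeros(num_nodes)
    for dst, m in msgs:
        a_ft[dst] += m
        cnt[dst] += 1
    return a_ft / np.maximum(cnt, 1)[:, None]

def transform(v_ft, a_ft, W):        # partial representation (pre-sync)
    return np.concatenate([v_ft, a_ft], axis=1) @ W

def sync(t_ft):                      # single-machine stand-in; real P3
    return t_ft                      # all-reduces partials across machines

def apply_fn(sync_ft):               # final vertex representation
    return np.maximum(sync_ft, 0.0)

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
feat = np.random.default_rng(0).standard_normal((3, 8))
W = np.random.default_rng(1).standard_normal((16, 4))
out = apply_fn(sync(transform(feat, gather(scatter(edges, feat), 3), W)))
print(out.shape)                     # (3, 4)
```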
3.6 Implementation
P3 is built on DGL (Deep Graph Library) with PyTorch as the neural network backend. Key implementation changes to DGL:
- Replaced co-located graph+feature partitioning with independent partitioning.
- DGL's KVStore was extended to accumulate partial activations via RPC and copy them to device memory for the trainer.
- Minibatches are assigned unique IDs and placed in a work queue; the trainer uses a 2F2B static schedule.
- PyTorch DistributedDataParallel handles weight gradient synchronization.
Hardware: 4 nodes, each with 1x 12-core Intel Xeon E5-2690v4 CPU, 441 GB RAM, 4x NVIDIA Tesla P100 16 GB GPUs, connected via 10 Gbps Ethernet.
4. Key Algorithm / Formulation
The embedding computation graph size for a k-layer GNN with fan-out (f_1, f_2, ..., f_k) is:
|Layer 0 nodes| = prod_{i=1}^{k} f_i (exponential in k)
P3's partial activation size transferred per node (at the reduce step) = hidden_dim H, independent of feature size F. Communication savings scale as F/H, which is large when features are high-dimensional and GNN models are relatively shallow (as is typical: H is 16–128, F is 100–256+).
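Plugging in the paper's OGB-Product numbers makes the ratio concrete (the node counts are the paper's; the byte arithmetic assumes float32):

```python
F, H = 100, 16
feature_pull = 188_339 * F * 4     # bytes DGL pulls (~71.84 MiB of features)
partial_tx = 3 * 24_703 * H * 4    # bytes P3 reduces (~4.7 MB of activations)
print(feature_pull / partial_tx)   # ~15.9x less traffic per minibatch
```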
The staleness-bounded weight update from pipelining:
w_{t+1} = w_t - alpha * grad_f(w_{t-3})
where the delay of 3 corresponds to 2 forward + 1 backward phase in the pipeline.
5. Evaluation
5.1 Datasets
| Graph | Nodes | Edges | Features |
|---|---|---|---|
| OGB-Product | 2.4M | 123.7M | 100 |
| OGB-Paper | 111M | 1.6B | 128 |
| UK-2006-05 | 77.7M | 2.9B | 256 |
| UK-Union | 133.6M | 5.5B | 256 |
| Facebook | 30.7M | 10B | 256 |
5.2 Baselines
- DGL with METIS partitioner (default) and hash partitioner
- ROC (Jia et al., MLSys 2020) — online-partitioner-based distributed GNN system
5.3 Overall Performance (vs. DGL)
P3 consistently outperforms DGL across all graphs and all GNN models (SGCN, GCN, GraphSAGE, GAT):
| Graph | Best DGL Epoch (s) | Best P3 Epoch (s) | Speedup |
|---|---|---|---|
| OGB-Product (SGCN) | 4.535 | 1.019 | 4.45x |
| OGB-Paper (SGCN) | 9.059 | 2.230 | 4.06x |
| UK-2006-05 (SGCN) | 6.435 | 1.481 | 4.34x |
| UK-Union (SGCN) | 11.472 | 2.379 | 4.82x |
| Facebook (SGCN) | 22.264 | 4.102 | 5.43x |
Without sampling (full-batch), DGL cannot train UK-Union and Facebook at all (OOM). P3 achieves up to 7.69x speedup.
Breakdown of epoch time shows that DAG creation time (the computation graph generation phase) is the dominant term in DGL epoch time; P3 reduces it by 4–15x.
5.4 Impact of Partitioning Strategy
P3's random hash partitioner outperforms all DGL partitioning schemes including METIS (best intelligent scheme). METIS incurs 4264s preprocessing overhead for OGB-Product — often exceeding total training time. Vertex-cut schemes (RandomVertexCut, GRID) balloon memory to 185+ GB. P3 hash partitioning: no preprocessing cost, minimal memory overhead.
5.5 Impact of Layers
As GNN depth increases from 2 to 4 layers, P3's advantage grows: DGL epoch time grows super-linearly (computation graph size explodes), while P3 epoch time grows much more slowly. At 4 layers, P3 outperforms DGL by up to 6.07x.
5.6 Impact of Features
As feature dimension increases from 16 to 512, DGL epoch time grows proportionally (it must pull more features). P3 epoch time grows much more slowly (only activations of fixed hidden size are transferred). At 512 features, P3 outperforms DGL by 4.77x.
5.7 Caching
When memory permits caching graph and/or feature partitions on multiple machines, the speedup over DGL grows from 3.6x (Table 4) to 5.23x.
5.8 Pipelining
Pipelining adds 30–50% performance improvement over non-pipelined P3.
5.9 GPU Utilization
DGL: GPU busy ~20% of time (blocked on network ~80%). P3: GPU busy ~85% of time.
5.10 Scaling
P3 exhibits near-linear strong scaling: throughput approximately doubles when machine count doubles. DGL throughput remains flat with additional machines due to network bottleneck.
5.11 Accuracy
On OGB-Product with GraphSAGE, both DGL and P3 achieve ~76.2% accuracy after ~50 epochs. P3 completes this training in 61.65s; DGL takes 236.37s with hash partitioner, 126.39s with METIS (including partitioning overhead).
5.12 Comparison with ROC
Without sampling (ROC does not support sampling), P3 outperforms ROC by up to 2.2x on OGB-Paper and UK-2006-05. ROC's benefits come from its online partitioner but it still moves features during training; as graph size grows, P3's advantage increases.
6. Limitations
Hidden dimension assumption. P3's benefit depends on the GNN's hidden dimension H being smaller than the feature dimension F. When H approaches F, partial activations are no longer smaller than features, and P3 loses its advantage. It can become strictly worse than DGL at large H. The paper does not propose an automatic mechanism to detect this crossover and switch strategies.
Static pipeline schedule. The 2F2B static schedule may not be optimal for all configurations. A profiling-based adaptive scheduler is deferred to future work.
Memory overhead from weight stashing. Bounded-staleness pipelining requires maintaining multiple weight versions in memory. For current GNN models (few layers, small parameters) this is negligible, but the paper acknowledges it could become relevant for larger GNN models.
Model parallelism scaling. In P3, each machine must pull partial activations from all other machines during the reduce step; this communication grows linearly with the number of machines, limiting scaling at very large cluster sizes.
Homogeneous cluster assumption. The caching strategy assumes homogeneous machines. Heterogeneous environments may require a more sophisticated caching policy.
ROC comparison restricted to full-batch training. The comparison with ROC covers only GCN without sampling (full-batch) because ROC did not support sampling at the time of evaluation.
Evaluation on 10 Gbps Ethernet. High-bandwidth interconnects (InfiniBand) would reduce feature-movement overhead in DGL/ROC, potentially narrowing P3's advantage.
7. Related Work
- DGL (Wang et al., 2019): de-facto distributed GNN framework, uses METIS + data parallelism; baseline in this paper.
- ROC (Jia et al., MLSys 2020): online-partitioner-based distributed GNN training; leverages NVLink + InfiniBand; still moves features.
- PaGraph (Lin et al., SoCC 2020): single-machine multi-GPU, focuses on computation-aware caching; orthogonal.
- NeuGraph (Ma et al., USENIX ATC 2019): single-machine multi-GPU, stream scheduler; orthogonal.
- GPipe (Huang et al., NeurIPS 2019): pipeline parallelism for DNNs; P3 borrows its pipelining idea.
- PipeDream (Narayanan et al., SOSP 2019): bounded-staleness pipeline; P3's staleness model follows this work.
- PyTorch-Geometric (Fey & Lenssen, 2019): single-machine GNN framework.
8. Relevance to DynamICCL
P3 is indirectly relevant to DynamICCL. The paper does not address NCCL, AllReduce, or collective communication tuning. However, several conceptual connections exist:
Communication volume as the primary optimization target. P3's core thesis — that reducing inter-node communication volume is more impactful than sophisticated partitioning or scheduling — mirrors DynamICCL's motivation: that NCCL's default collective configuration is communication-suboptimal and RL-based tuning of algorithm/protocol/channel choices can reduce effective communication time.
Push-pull vs. collective semantics. P3's partial-activation reduce step is semantically an AllReduce (sum) across all machines. The efficiency of this step depends on the same parameters DynamICCL tunes: algorithm choice (ring vs. tree), protocol, channel count. In a P3 deployment on an HPC GPU cluster, DynamICCL could potentially optimize the AllReduce calls P3 issues during the sync step.
Network congestion interplay. P3 relies heavily on Ethernet (10 Gbps). The congestion patterns from P3's push-pull communication — bursty partial-activation transfers followed by quiescent compute phases — are exactly the kind of traffic DynamICCL's Agent-1 (LSTM+CUSUM congestion detector) is designed to detect and respond to.
Feature dimension vs. hidden dimension tradeoff. P3's limitation at large hidden dimensions maps to the scenario where AllReduce traffic volume grows (larger tensors), making DynamICCL's ability to select lower-overhead collective configurations (e.g., tree over ring for small message counts) more valuable.
In summary, P3 is a consumer of collective communication infrastructure. DynamICCL operates one level below P3 in the software stack and could serve as an enabling optimization layer for P3's distributed training workload.
Citation
@inproceedings{gandhi2021p3,
title = {P3: Distributed Deep Graph Learning at Scale},
author = {Gandhi, Swapnil and Iyer, Anand Padmanabha},
booktitle = {15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21)},
pages = {551--568},
year = {2021},
publisher = {USENIX Association}
}