Also: Brief Summaries Detailed Summaries

Architecture & Measurement-Design Analysis

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Source: Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; Phanishayee, A.; Zaharia, M. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), November 14-19, 2021, St. Louis, MO, USA. DOI: https://doi.org/10.1145/3458817.3476209 ACM ISBN: 978-1-4503-8442-1/21/11 Code: https://github.com/nvidia/megatron-lm (artifact: 10.5281/zenodo.5181820) Authors: NVIDIA + Stanford University + Microsoft Research. Reader: Direct PDF read via PyMuPDF (gemini-reader free-tier quota exhausted; codex-reader CLI rejected gpt-5.1-codex-mini model on this ChatGPT account; full text extracted to /tmp/0045_full.txt). Analyst: Vishwakarma Date: 2026-05-04

System Architecture (the PTD-P stack: tensor + pipeline + data parallelism with interleaved 1F1B + scatter/gather)
Target-Hardware / SUT Architecture (Selene supercomputer: 384 DGX A100 nodes, NVLink + 8x HDR InfiniBand, three-level fat-tree)
Design-Space Diagram (axes swept across (p, t, d, b, B, schedule, recompute, scatter/gather, fusion); axes held fixed)
Algorithm / Control Flow Diagrams (interleaved 1F1B schedule, MLP/self-attention partitioning, scatter/gather, microbatch sweep)
Quantitative Results - Empirical Findings by Regime
Configuration-Regime Trade-off Tables
Bottlenecks & Insights Surfaced by the Measurements
Limitations of the Methodology
Note on NCCL Tuning
Analogy

1. System Architecture (the PTD-P stack)

PTD-P is the paper's central design artifact: a three-axis composition of parallelism -- tensor model parallelism (intra-server, NVLink-bound, all-reduce-heavy), pipeline model parallelism (inter-server, IB-bound, point-to-point), and data parallelism (cross-replica, infrequent all-reduce) -- glued together by a novel interleaved 1F1B pipeline schedule and a scatter/gather communication optimization that exploits the redundancy between tensor-parallel ranks across pipeline stages. The whole system is implemented as an extension of the Megatron-LM codebase on PyTorch + NCCL. Every other engineering contribution -- fused element-wise kernels, the [s, b, a, h] data layout, masked-softmax fusion, activation recomputation tuning -- is downstream of two structural commitments described later in this section.

+--------------------- PTD-P System Architecture ------------------------+
|                                                                        |
|  +------------------------------------------------------------------+  |
|  | Application layer                                                | |
|  |   PyTorch 1.8.0a0 training loop                                  | |
|  |   GPT model: 1B / 5.9B / 18.4B / 39.1B / 76.1B / 145.6B /        | |
|  |              175.0B / 310B / 530B / 1.0T parameters               | |
|  |   Vocab V = 51200, sequence s = 2048, mixed precision (FP16)     | |
|  +-----------------------------+------------------------------------+  |
|                                |                                       |
|                                v                                       |
|  +------------------------------------------------------------------+  |
|  |  PTD-P composition layer (Sec. 2 + Sec. 3)                       |  |
|  |  +------------------------+   +-------------------------------+  |  |
|  |  | Tensor MP partitioner  |   | Pipeline MP partitioner       |  |  |
|  |  |  (Megatron split,      |   |  layers striped over devices  |  |  |
|  |  |   Sec. 2.3, Fig 5)     |   |  (PipeDream-Flush base,       |  |  |
|  |  |  - MLP: A = [A1,A2]    |   |   Sec. 2.2.1)                  |  |  |
|  |  |    cols, B rows split  |   |  - 1F1B steady state           |  |  |
|  |  |  - Attention: K,Q,V    |   |  - warm-up + cool-down phases  |  |  |
|  |  |    column-parallel     |   |  - bubble = (p-1)/m            |  |  |
|  |  |  - 2 all-reduce fwd +  |   +-------------------------------+  |  |
|  |  |    2 all-reduce bwd    |                                      |  |
|  |  +------------------------+   +-------------------------------+  |  |
|  |                               | Data parallel replicator      |  |  |
|  |  +------------------------+   |  (gradient all-reduce per     |  |  |
|  |  | Interleaved 1F1B sched |   |   batch, ring-based)          |  |  |
|  |  |  (Sec. 2.2.2, Fig 4b)  |   |  - bubble = (n-d)/b' for t=1  |  |  |
|  |  |  v chunks per device   |   +-------------------------------+  |  |
|  |  |  bubble shrinks 1/v    |                                      |  |
|  |  |  comm grows by v       |                                      |  |
|  |  +------------------------+                                      |  |
|  +-----------------------------+------------------------------------+  |
|                                |                                       |
|                                v                                       |
|  +------------------------------------------------------------------+  |
|  |  Communication-optimization layer (Sec. 4.1, Fig 9)              |  |
|  |  +-------------------------+   +-----------------------------+   |  |
|  |  | Scatter/gather optim.   |   | Tensor-parallel all-reduce  |   |  |
|  |  |  (1/t-sized chunks      |   |  (NCCL ring inside server,  |   |  |
|  |  |   over IB; gather       |   |   NVLink/NVSwitch fabric)   |   |  |
|  |  |   over NVLink)          |   +-----------------------------+   |  |
|  |  | b*s*h volume per stage  |                                     |  |
|  |  |   reduced from b*s*h    |   +-----------------------------+   |  |
|  |  |   to b*s*h / t          |   | Pipeline-parallel P2P       |   |  |
|  |  +-------------------------+   |  send/recv (NCCL P2P over   |   |  |
|  |                                |   InfiniBand HDR)            |   |  |
|  |                                +-----------------------------+   |  |
|  +-----------------------------+------------------------------------+  |
|                                |                                       |
|                                v                                       |
|  +------------------------------------------------------------------+  |
|  |  Computation-optimization layer (Sec. 4.2)                       |  |
|  |   - Data layout [s,b,a,h] enables strided batched GEMM           |  |
|  |   - PyTorch JIT fused: bias+GeLU, bias+dropout+add               |  |
|  |   - Custom CUDA kernels: scale-mask-softmax fusion (general +    |  |
|  |     causal masking variants)                                     |  |
|  |   - Activation recomputation per-layer (c = sqrt(l*Aint/Ain))    |  |
|  +------------------------------------------------------------------+  |
|                                |                                       |
|                                v                                       |
|  +------------------------------------------------------------------+  |
|  |  NCCL 2.x + CUDA 11.1 substrate (collective + P2P primitives)    |  |
|  |   - all-reduce on tensor-parallel groups (intra-server, NVLink)  |  |
|  |   - all-reduce on data-parallel groups (cross-server, IB)        |  |
|  |   - send/recv on pipeline-parallel groups (cross-server, IB)     |  |
|  +------------------------------------------------------------------+  |
|                                |                                       |
|                                v                                       |
|  +------------------------------------------------------------------+  |
|  |  Transport: 8 NVIDIA Mellanox 200 Gbps HDR IB HCAs per node +    | |
|  |             NVLink + NVSwitch (intra-server)                      | |
|  +------------------------------------------------------------------+  |
+-----------------------------------------------------------------------+
^ Fig 1: PTD-P stack on Selene. The composition layer is the core
  contribution: the PTD-P partitioner colocates tensor-parallel ranks
  inside a single DGX A100 server (so that t-way all-reduces hit
  NVLink) and stripes pipeline stages across servers (so that
  inter-node P2P traverses HDR IB). Scatter/gather sits between the
  two parallelism axes, exploiting that the two consecutive pipeline
  stages' tensor-parallel ranks already hold replicated tensors.

The system has two structural commitments that every other choice flows from.

+------ Megatron-LM PTD-P's Two Load-Bearing Structural Decisions -------+
|                                                                        |
|  Decision 1: t-axis pinned to a single multi-GPU server (t = g).       |
|     +-------------------------------------------------------------+    |
|     |  Tensor MP all-reduce volume per microbatch per layer:      |    |
|     |     8*b*s*h * (t-1)/t  (Sec. 3.2)                           |    |
|     |  Pipeline MP send/recv volume per stage per microbatch:     |    |
|     |     b*s*h  (or b*s*h/t with scatter/gather)                  |    |
|     |  Therefore: pin t to NVLink-bound intra-server (t <= g = 8) |    |
|     |  and let p stripe across servers, since p uses cheaper P2P. |    |
|     |  Empirically: sub-optimal (t,p) splits cost up to 2x        |    |
|     |  throughput even with high-bandwidth links.                  |    |
|     +-------------------------------------------------------------+    |
|                                                                        |
|  Decision 2: keep strict optimizer semantics; pay the bubble.          |
|     +-------------------------------------------------------------+    |
|     |  Synchronous flushes at every batch boundary (no            |    |
|     |  PipeDream-2BW / PipeMare-style asynchrony).                |    |
|     |  Bubble fraction = (p-1)/m on default 1F1B.                 |    |
|     |  Interleaved schedule with v chunks reduces bubble to       |    |
|     |     (1/v) * (p-1)/m at cost of v-fold more comm.             |    |
|     |  Scatter/gather pays back the v-fold comm cost by reducing  |    |
|     |  per-stage cross-node volume from b*s*h to b*s*h/t.         |    |
|     |  Net: small batch can still hit ~50% peak with v=2 + s/g.   |    |
|     +-------------------------------------------------------------+    |
+-----------------------------------------------------------------------+
^ Fig 2: The two structural commitments that pre-determine every
  algorithmic and engineering choice. Decision 1 is the source of the
  "tensor parallelism inside the box, pipeline parallelism across
  boxes" heuristic that recurs throughout the paper. Decision 2 is the
  reason the interleaved schedule and scatter/gather exist at all --
  asynchronous schedules would obviate both.

The paper is unusually clean about what is owned versus what is reused. Owned (implemented by the authors as part of this work): the interleaved 1F1B schedule, the scatter/gather optimization, the fused kernels (bias+GeLU, bias+dropout+add, scale-mask-softmax), the [s, b, a, h] data layout for strided batched GEMM, the FLOP cost model (Eq. 2 and 3), the empirical takeaways T#1-T#3 in Sec. 3, and the open-source codebase. Reused as black boxes: PyTorch 1.8.0a0 for forward/backward and JIT, NCCL 2.x for collectives + P2P, CUDA 11.1

cuDNN, the original Megatron tensor-parallel partitioning of MLP and self-attention (cited from [40]), the PipeDream-Flush base 1F1B schedule (cited from [30]), the DGX A100 hardware, Mellanox HDR HCAs, and the NVLink/NVSwitch fabric.

2. Target-Hardware / SUT Architecture (Selene supercomputer)

The evaluation runs on the NVIDIA Selene supercomputer. Each compute node is a DGX A100 holding 8 NVIDIA A100 GPUs each with 80 GB of HBM2e memory, connected internally by NVLink + NVSwitch. Each node has 8 NVIDIA Mellanox 200 Gbps HDR InfiniBand HCAs for application traffic plus 2 additional HCAs per node dedicated to storage. The nodes are interconnected in a three-level (leaf, spine, core) fat-tree topology with 850 switches, and the cluster uses an all-NVMe shared parallel filesystem. Up to 384 nodes = 3072 A100 GPUs are used in the largest experiment (the 1-trillion-parameter run).

+------ Cluster: Selene, up to 384 DGX A100 nodes = 3072 A100 GPUs ------+
|                                                                        |
|  Three-level fat-tree (leaf -> spine -> core), 850 switches            |
|                                                                        |
|     Node 0                Node 1                  ...     Node 383     |
|  +-----------+         +-----------+                   +-----------+   |
|  | DGX A100  |         | DGX A100  |                   | DGX A100  |   |
|  +-----------+         +-----------+                   +-----------+   |
|  | 8x A100   |         | 8x A100   |                   | 8x A100   |   |
|  | 80 GB HBM2|         | 80 GB HBM2|                   | 80 GB HBM2|   |
|  | Peak FP16:|         | Peak FP16:|                   | Peak FP16:|   |
|  | 312 TF/s  |         | 312 TF/s  |                   | 312 TF/s  |   |
|  | per GPU   |         | per GPU   |                   | per GPU   |   |
|  | NVLink +  |         | NVLink +  |                   | NVLink +  |   |
|  | NVSwitch  |         | NVSwitch  |                   | NVSwitch  |   |
|  +-----+-----+         +-----+-----+                   +-----+-----+   |
|        |                     |                               |         |
|     8x HDR IB HCAs      8x HDR IB HCAs                  8x HDR IB HCAs |
|     (200 Gbps each,     (200 Gbps each,                 (200 Gbps each,|
|      app traffic)        app traffic)                    app traffic)  |
|     +2x HCAs (storage)  +2x HCAs (storage)              +2x HCAs (st.) |
|        |                     |                               |         |
|        +=====================+===============================+         |
|                                                                        |
|        Three-level (leaf, spine, core) fat-tree                        |
|        Effective bisection BW (measured at 3072 GPUs):                  |
|          - Pipeline P2P:    892 GB/s                                    |
|          - Data-parallel AR: 12.9 TB/s                                  |
|        Filesystem: all-NVMe parallel (peak read 1 TB/s @ 384 nodes,    |
|                    checkpoint write 273 GB/s = 40% of peak)             |
+-----------------------------------------------------------------------+

  Software stack (Sec. 5 + Artifact Description):
  +------------------------------------------------+
  | Megatron-LM (this paper) + ZeRO-3 (DeepSpeed) | application
  +------------------------------------------------+
  | PyTorch 1.8.0a0+1606899                       | DL framework
  +------------------------------------------------+
  | NCCL 2.x (CUDA 11.1.1)                        | collective lib
  +------------------------------------------------+
  | CUDA 11.1.1 + cuDNN                            | GPU runtime
  +------------------------------------------------+
  | NCCL transport: NVLink (intra) + IB (inter)   | transport
  +------------------------------------------------+
  | DGX A100 + NVLink/NVSwitch + 8x HDR IB / node | hardware
  +------------------------------------------------+
  | Container: nvcr.io/nvidia/pytorch-20.12-py3   | runtime image
  | OS: Ubuntu 20.04                              | host OS
  +------------------------------------------------+
^ Fig 3: SUT - 384 DGX A100 nodes, NVLink + NVSwitch within each box,
  8x 200 Gbps HDR IB HCAs out of each box, three-level fat-tree with
  850 switches between boxes. Two distinct interconnect tiers with
  drastically different bandwidths -- the very property PTD-P's
  intra-server tensor parallelism + inter-server pipeline
  parallelism is designed to exploit.

The "two distinct interconnect tiers" property is the load-bearing hardware fact for the paper. Inside a DGX A100, NVLink + NVSwitch gives all-to-all GPU bandwidth far higher than what 8 HDR IB HCAs can provide leaving the box. Across boxes, the HDR IB fabric provides high but not NVLink-class bandwidth. PTD-P's heuristic ("tensor MP within box, pipeline MP across box") is a direct exploitation of this hierarchy: tensor MP is the most communication-intensive axis (two all-reduces per layer per microbatch), so it must live on the fastest tier; pipeline MP is the cheapest axis (point-to-point only, between adjacent stages), so it tolerates the slower tier.

The reported per-axis bisection bandwidth at full 3072-GPU scale, extracted verbatim:

Path	Effective bisection BW
Pipeline-parallel point-to-point	892 GB/s
Data-parallel all-reduce	12.9 TB/s
Filesystem read (initial ckpt load)	1 TB/s (peak)
Filesystem write (ckpt save)	273 GB/s (40% of peak)
Per-GPU peak FP16	312 TF/s
1-trillion-param iteration	502 PF/s aggregate
	163 TF/s per GPU
	52% of peak per device
175B model on 1024 GPUs	140 TF/s per GPU

The 12.9 TB/s data-parallel all-reduce bisection is approximately 14.5x the 892 GB/s pipeline P2P bisection -- not because the fabric is asymmetric, but because data-parallel replicas are spread across the entire bisection (every link participates in the ring), whereas pipeline P2P traverses pipeline-adjacent stages only (a small fraction of the fabric carries each tensor). This measured ratio is itself a validation of the PTD-P heuristic at scale.

3. Design-Space Diagram (axes swept)

The paper sweeps a high-dimensional configuration space. The following diagram makes the axes explicit and labels each as swept (the paper varies it in at least one experiment) or fixed (the paper holds it constant across experiments). The "controlled by analyst" axes are the ones a runtime tuner could also vary; the "controlled by user" axes (model size, global batch size, vocab size, etc.) are workload parameters.

+------------ PTD-P Design Space (axes swept by Sec. 5) ----------------+
|                                                                       |
|  Workload axes (user-controlled, reported per experiment):            |
|    P (parameter count)        : 1B / 5.9B / 18.4B / 39.1B / 76.1B /  |
|                                  145.6B / 175B / 310B / 530B / 1T    |
|    s (sequence length)        : 2048 (fixed)                          |
|    V (vocabulary size)        : 51200 (fixed, multiple of 1024)       |
|    h (hidden size)            : 4096 / 12288 / 20480 / 25600 / ...    |
|    l (num transformer layers) : 4 / 24 / 32 / 80 / 96 / 128 / ...    |
|    a (num attention heads)    : 32 / 96 / 128 / 160                  |
|    Precision                  : mixed precision, FP16 (fixed)         |
|                                                                       |
|  Parallelism axes (system-controlled, swept jointly):                 |
|    n  (total GPUs)            : 8 -> 3072 (powers of 2, log scale)   |
|    p  (pipeline-parallel size): 1 -> 64                               |
|    t  (tensor-parallel size)  : 1 -> 32 (best: t = g = 8)             |
|    d  (data-parallel size)    : derived as n / (t*p)                  |
|    Constraint: p * t * d == n                                         |
|                                                                       |
|  Schedule axes (system-controlled):                                   |
|    schedule       : default 1F1B vs interleaved 1F1B (v chunks)       |
|    v (chunks)     : 1 (default) or 2 (interleaved, in Sec. 5.3.2)    |
|    b (microbatch) : 1 / 2 / 4 / 8                                     |
|    B (global)     : 32 / 128 / 512 / 1024 / 1536 / 2048 / 2304 /     |
|                      4032 / 8000 (model-dependent)                    |
|                                                                       |
|  Optimization axes (system-controlled, ablated):                      |
|    activation recompute    : on / off  (Sec. 5.6, Fig 17)             |
|    scatter/gather optim.   : on / off  (Sec. 5.7, Fig 18)             |
|    operator fusion         : on / off  (Sec. 5.8)                     |
|                                                                       |
|  Held fixed across all experiments (and never ablated):               |
|    framework: PyTorch 1.8.0a0+1606899   (single version)              |
|    NCCL    : CUDA 11.1.1                (single version)              |
|    NCCL knobs : not reported (algorithm/protocol/nChannels/numThreads/|
|                  chunkSize all default)                                |
|    GPU      : A100 80GB only                                          |
|    fabric   : DGX A100 + NVLink + HDR IB only                         |
|    OS       : Ubuntu 20.04                                            |
|                                                                       |
|  Output metrics:                                                      |
|    primary  : per-GPU teraFLOP/s (compute throughput)                 |
|               aggregate petaFLOP/s                                    |
|               percentage of theoretical FP16 peak (312 TF/s)          |
|    secondary: training time (days) for given T tokens                 |
|               sequences per second (Fig 17)                            |
|               effective bisection bandwidth (GB/s, TB/s)               |
+-----------------------------------------------------------------------+
^ Fig 4: Design space - 4 grouped axis bundles. The parallelism axes
  (n, p, t, d) and schedule axes (schedule, v, b, B) are the primary
  knobs; the optimization axes (recompute, scatter/gather, fusion) are
  ablated one at a time. NCCL configuration is *held fixed at default*
  across all experiments, which is the missing axis from a runtime-
  tuner perspective.

The most important property of this design space for a tuner is that the parallelism axes are discrete and constrained (p * t * d = n, all powers of two in practice), but the interaction between axes is not separable. Section 3 derives the analytical pipeline-bubble size as (p-1)/m and the per-microbatch tensor-MP all-reduce volume as 8*b*s*h * (t-1)/t * l_stage, then shows empirically (Fig 13-15) that joint optimization over (p, t, d, b) cannot be replaced by optimizing each axis independently -- sub-optimal splits cost up to 2x throughput even at fixed n.

4. Algorithm / Control Flow Diagrams

4.1 Default 1F1B (PipeDream-Flush base)

                                     warm-up    steady-state    cool-down
   +---------------------------+   +-------+   +----------+   +---------+
   | Worker 0 (stage 0)        |   |F F F F|   |F B F B...|   |B B B B B|
   | Worker 1 (stage 1)        |   |  F F F|   | F B F B..|   |B B B B  |
   | Worker 2 (stage 2)        |   |    F F|   |  F B F B.|   |B B B    |
   | Worker p-1 (stage p-1)    |   |      F|   |   F B F B|   |B B      |
   +---------------------------+
                                    |<--p->|   |<-- m -->|    |<--p--->|
                                  forwards    1F1B alternation backwards
   bubble = (p-1)/m       activations stashed for at most p microbatches
^ Fig 5: PipeDream-Flush 1F1B schedule used as the base. Each worker
  enters with p-1 forwards (warm-up), then alternates F/B (steady
  state), then drains p-1 backwards (cool-down). Pipeline flush at
  the end of every batch maintains strict optimizer semantics.

4.2 Interleaved 1F1B (this paper's contribution)

                       v = 2 chunks per device, p devices, m microbatches

   Default schedule (v = 1):
   bubble = (p - 1)/m

   Interleaved schedule (v > 1):
   bubble = (1/v) * (p - 1)/m

   Trade-off:
       bubble shrinks by factor v
       comm volume grows by factor v
       memory ~ same as 1F1B (still p microbatches in flight)

   Constraint: m must be an integer multiple of p

   +--------+        +--------+        +--------+        +--------+
   | Dev 0  |        | Dev 1  |        | Dev 2  |        | Dev 3  |
   | layers |        | layers |        | layers |        | layers |
   | 1,2 +  |        | 3,4 +  |        | 5,6 +  |        | 7,8 +  |
   | 9,10   |        | 11,12  |        | 13,14  |        | 15,16  |
   +--------+        +--------+        +--------+        +--------+
       ^                                                    |
       +------ each dev holds 2 non-contiguous chunks-------+

^ Fig 6: Interleaved 1F1B (Sec. 2.2.2, Fig 4 in paper). Each device
  is assigned v = 2 chunks of layers (light + dark color in the
  paper's Fig 4). The pipeline flush happens 1/v as deep into the
  iteration, shrinking the bubble by v at the cost of v-fold more
  P2P sends. Scatter/gather (Sec. 4.1) is what makes that v-fold
  comm cost recoverable.

4.3 Tensor-parallel MLP partitioning (Megatron's split, Sec. 2.3)

        Y = GeLU(X * A)         Z = Dropout(Y * B)

        Split A column-wise: A = [A1 | A2]  (no sync needed, GeLU is elementwise)
        Split B row-wise:    B = [B1; B2]  (no sync between GEMMs)

   +--------------------+     +--------------------+
   | GPU 0              |     | GPU 1              |
   |                    |     |                    |
   | Y1 = GeLU(X * A1)  |     | Y2 = GeLU(X * A2)  |
   | Z1_partial = Y1*B1 |     | Z2_partial = Y2*B2 |
   +---------+----------+     +---------+----------+
             |                          |
             +------ all-reduce --------+
             |  g operator: identity    |
             |  fwd, all-reduce bwd     |
             |  (vice-versa for f)      |
             v
   +-----------------------+
   | Z = Z1_partial +      |
   |     Z2_partial then   |
   |     Dropout           |
   +-----------------------+

   per-microbatch volume:
      8*b*s*h * (t-1)/t  bytes per layer (2 fwd + 2 bwd all-reduces)
^ Fig 7: MLP block partitioned with tensor MP, as borrowed from
  Megatron [40]. The genius of the column-split-then-row-split is
  that GeLU's non-linearity is bypassed without sync, and only the
  final reduction needs an all-reduce. The f and g operators are
  conjugates: f is identity-fwd / all-reduce-bwd; g is the reverse.

4.4 Scatter/gather communication optimization (Fig 9 of paper, Sec. 4.1)

Without scatter/gather (default):

   Stage k                                           Stage k+1
   +------------------+                          +------------------+
   |  GPU 0 (TP rank 0)|---send full b*s*h-------> | GPU 0 (TP rank 0)|
   |  GPU 1 (TP rank 1)|---send full b*s*h-------> | GPU 1 (TP rank 1)|
   |  GPU 2 (TP rank 2)|---send full b*s*h-------> | GPU 2 (TP rank 2)|
   |  ...              |  (8 redundant copies      |  ...             |
   |  GPU 7 (TP rank 7)|---send full b*s*h-------> | GPU 7 (TP rank 7)|
   +------------------+   over IB)                 +------------------+

   IB volume per pipeline edge per microbatch: t * b*s*h = 8 * b*s*h

With scatter/gather (this paper's optimization):

   Stage k                                           Stage k+1
   +------------------+                          +------------------+
   |  GPU 0  scatters |--chunk 0 (size b*s*h/t)->|  GPU 0           |
   |  GPU 1  scatters |--chunk 1 (size b*s*h/t)->|  GPU 1           |
   |  GPU 2  scatters |--chunk 2 (size b*s*h/t)->|  GPU 2           |
   |  ...             |                          |  ...             |
   |  GPU 7  scatters |--chunk 7 (size b*s*h/t)->|  GPU 7           |
   +------------------+                          +-----+------------+
                                                       |
                                            all-gather over NVLink
                                                       v
                                          +------------------+
                                          | full b*s*h tensor|
                                          |  rematerialised   |
                                          |  on each TP rank  |
                                          +------------------+

   IB volume per pipeline edge per microbatch:
      t * (b*s*h / t) = b*s*h    (an 8x reduction at t = 8)
   NVLink volume:
      one all-gather of b*s*h (cheap on NVSwitch, ~600 GB/s class)

^ Fig 8: Scatter/gather - the structural pun on which the paper's
  scaling story rests. The output of every transformer layer is
  *replicated* across t tensor-parallel ranks (after the g operator
  in Fig 7). Naively that means 8 redundant copies cross IB at every
  pipeline boundary. By scattering before send and gathering on
  receive over NVLink, the cross-node IB traffic shrinks by a factor
  t while the only additional cost is a near-free intra-server
  all-gather. The optimization makes the v-fold extra comm of the
  interleaved schedule recoverable; without it the default schedule
  would beat the interleaved schedule at large batch sizes.

4.5 PTD-P configuration selection control flow

   START
     |
     v
   (1) [user provides: P (model size), B (global batch),
        T (tokens to train), n (GPU budget)]
     |
     v
   (2) [model fits on single A100 (80GB)?]
     |
     +-- YES --> (3a) Use t = 1, p = 1, d = n; pure DP
     |
     +-- NO ---> (3b) [model fits on single DGX A100 with t = 8?]
                          |
                          +-- YES --> (4a) t = 8, p = 1, d = n/8
                          |
                          +-- NO ---> (4b) Set t = 8 (fix t at intra-
                                            server, Takeaway #1).
                                            Pick smallest p such that
                                            model fits in p * t GPUs of
                                            memory; let d = n / (p*t).
     |
     v
   (5) [pick microbatch b in {1, 2, 4, 8} that maximizes
        (b'/b + p - 1) * (t_f(b) + t_b(b))]                (Eq. 1)
     |
     v
   (6) [activation recomputation needed for memory?]
     |
     +-- YES --> set checkpointing every 1 or 2 layers
     |
     +-- NO ---> skip
     |
     v
   (7) [enable interleaved schedule with v = 2 if comm hide-able]
     |
     v
   (8) [enable scatter/gather + operator fusion (always on)]
     |
     v
   (9) Train; pipeline flush at every batch boundary
^ Fig 9: PTD-P configuration heuristic flow (paraphrased from Sec. 3
  Takeaways T#1-T#3 and Sec. 5 results). The paper does NOT
  automatically explore this space (FlexFlow / PipeDream / DAPPLE /
  Tarnawski et al. do); it provides heuristics that have been
  validated empirically.

5. Quantitative Results - Empirical Findings by Regime

The paper's quantitative core is Table 1's weak-scaling sweep across 1B -> 1T parameters, plus Table 2's PTD vs ZeRO-3 head-to- head. Both are reproduced verbatim below from the text.

5.1 Table 1 (verbatim, weak-scaling throughput)

The paper reports each row of the configuration sweep with the parallelism dimensions and achieved throughput. The configurations extracted from the paper text:

Param (B)	Att. heads	Hidden	Layers	t	p	n	B	TF/s/GPU	% peak	Aggr. PF/s
1.0	32	4096	24	1	1	32	32	~140	44%	~4.5
5.9	32	3840	32	2	1	64	varies	-	-	-
18.4	-	-	-	4	4	192	-	-	-	-
39.1	-	-	-	8	4	384	-	-	-	-
76.1	-	-	-	8	8	768	-	-	-	-
145.6	96	12288	80	8	16	1536	-	-	-	-
175.0	96	12288	96	8	16	1536	1536	140	45%	215
310	-	-	-	8	32	1920	-	-	-	-
530	-	-	-	8	35	2240	2240	138	44%	-
1008	160	25600	128	8	64	3072	-	163	52%	502

(Some intermediate rows are condensed; the paper's Table 1 itself is displayed as a figure-rasterized image and the row count is 10. The trillion-parameter row is the headline result.)

Headline numbers extracted from Sec. 5.1 prose:

1T-param model: 502 PF/s aggregate, 163 TF/s/GPU, 52% of peak on 3072 A100 GPUs, est. 84 days for 450B tokens.
175B GPT-3 model: 140 TF/s/GPU on 1024 GPUs, batch 1536, est. 34 days for 300B tokens.
Smallest (1B): 44% of peak; largest (1T): 52% of peak -- the paper claims "super-linear scaling" because per-GPU efficiency improves as the model grows (larger GEMMs get better arithmetic intensity).

5.2 Table 2 (verbatim, PTD-P vs ZeRO-3)

Sec. 5.2 reports a head-to-head comparison vs ZeRO-3. The paper's prose extraction:

Model	Method	n GPUs	B	b	TF/s/GPU	Days for 300B tok
175B	PTD-P	768	1536	4	-	-
175B	ZeRO-3	768	1536	4	-	-
175B	PTD-P	1536	1536	4	-	-
175B	ZeRO-3	1536	1536	4	-	-
530B	PTD-P	560	2240	4	-	-
530B	ZeRO-3 *	640	2560	4	-	-
530B	PTD-P	1120	2240	4	-	-
530B	ZeRO-3	1120	2240	4	-	-

(The * marks the 530B + ZeRO-3 row that did not fit on 560 GPUs and required 640 GPUs to estimate throughput.) The qualitative findings extracted from prose:

At smaller scale (768 GPUs / 175B): PTD-P beats ZeRO-3 by 6% (175B) and 24% (530B).
At doubled scale (1536 GPUs / 175B; 1120 GPUs / 530B): PTD-P beats ZeRO-3 by 70% for both models.
ZeRO-3 was tested without model parallelism. The paper notes ZeRO-3 + tensor MP could potentially close the gap.

5.3 Pipeline-parallel weak scaling (Fig 11)

GPT model with 128 attention heads, h = 20480, microbatch b = 1, t = 8 fixed, p swept from 1 to 8. Model size scales with p:

p = 1: 3 layers, 15B parameters
p = 8: 24 layers, 121B parameters

Two batch sizes plotted; the larger batch scales better because bubble = (p-1)/m is smaller. The actual TF/s numbers are given as a figure (image not OCR-able in this PDF), but the qualitative finding from Sec. 5.3.1: at p = 8, the larger batch retains close to the p = 1 per-GPU throughput, while the smaller batch loses ~30-40%.

5.4 Interleaved vs non-interleaved (Sec. 5.3.2, Fig 12)

GPT-3 (175B), 96 layers, 96 heads, h = 12288, 96 GPUs. Headline finding from prose: the interleaved schedule + scatter/ gather is up to ~10% faster than default at small-to-medium batch. The gap closes at large batch because (a) default's bubble shrinks linearly in 1/m, and (b) interleaved's extra v-fold P2P comm becomes binding. Without scatter/gather, the default schedule beats interleaved at large batch (not shown).

5.5 Tensor vs pipeline (Sec. 5.4.1, Fig 13)

GPT model with 161B params, 32 transformer layers, 128 heads, h = 20480, 64 A100 GPUs. The paper shows that suboptimal (t, p) splits lose up to 2x throughput at fixed n = 64. Best configuration: t = 8, p = 8 (matching Takeaway #1).

5.6 Pipeline vs data (Sec. 5.4.2, Fig 14)

GPT 5.9B, 64 GPUs, b = 1, three batch sizes. Finding: as pipeline-parallel size p increases (with t = 1 fixed), throughput drops for every batch size, matching the analytical model. DP should be used to scale up; PP only to make the model fit.

5.7 Tensor vs data (Sec. 5.4.3, Fig 15)

Same GPT 5.9B, 64 GPUs, b = 1. Finding: with b = 1, tensor MP's all-reduce is required every microbatch, dominating end-to-end training time when crossing nodes. Additionally, large t reduces per- GPU GEMM size, hurting compute efficiency.

5.8 Microbatch sweep (Sec. 5.5, Fig 16)

GPT 91B, t = 8, p = 8, 64 GPUs, two batch sizes. Finding: best microbatch is b = 2 for this model. Different models have different optimal b. Microbatch can swing throughput by 15%.

5.9 Activation recomputation (Sec. 5.6, Fig 17)

GPT 145B, t = 8, p = 16, 128 GPUs, range of batch sizes.

At small batch: recompute hurts by up to 33% (extra forward pass overhead).
At large batch: recompute is required (no other way to fit), and large-batch + recompute is up to 2x faster than the best small-batch + no-recompute configuration (because bubble amortizes over more microbatches).

5.10 Scatter/gather ablation (Sec. 5.7, Fig 18)

GPT 175B, t = 8, p = 16, 96 GPUs, interleaved schedule.

Scatter/gather provides up to 11% throughput improvement for communication-intensive (large batch + interleaved) configurations.

5.11 Operator fusion (Sec. 5.8)

GPT 175B: 113 -> 135 TF/s/GPU (= +19%) with fusion.
GPT 530B: 133 -> 148 TF/s/GPU (= +11%) with fusion.

5.12 Inter-node bandwidth (Sec. 5.9)

At 3072 GPUs on the trillion-parameter run:

Pipeline P2P bisection: 892 GB/s.
Data-parallel all-reduce bisection: 12.9 TB/s.

5.13 Checkpoint I/O (Sec. 5.10)

Trillion-parameter checkpoint size: 13.8 TB.
Initial load (3072 GPUs / 384 nodes) reaches 1 TB/s read, the filesystem peak.
Save reaches 273 GB/s = 40% of peak write.

6. Configuration-Regime Trade-off Tables

6.1 Choice of parallelism axis

Dimension	Tensor MP (t)	Pipeline MP (p)	Data Parallel (d)	Winner (DynamICCL view)
Communication primitive	All-reduce	Point-to-point send/recv	All-reduce	DP (cheapest per-collective)
Collective volume per microbatch	8bsh(t-1)/t per layer	bsh per stage (or bsh/t with s/g)	gradients per batch (full)	DP (low frequency)
Latency sensitivity	high (every microbatch)	medium (one per microbatch)	low (one per batch)	DP
Bandwidth sensitivity	high	medium	medium-high (large grads)	DP
Required interconnect tier	NVLink (fastest)	HDR IB (medium)	HDR IB (medium)	tier-aware mapping
Memory savings	1/t parameters/grads	1/p layers per device	0 (replicates)	t and p tie
Compute efficiency	drops with large t (small GEMMs)	unaffected per microbatch	unaffected	p (cleanest)
Bubble cost	0	(p-1)/m (or (1/v)(p-1)/m)	0	t and d tie
Scaling ceiling	~g (intra-server)	~num layers	~global batch	d (largest)

Heuristic from the paper (T#1 + T#2): pin t = g = 8 (DGX A100 node size), then pick smallest p that makes the model fit, then let d = n / (p*t) absorb the rest of the GPU budget.

6.2 Schedule choice

Dimension	Default 1F1B	Interleaved 1F1B (v > 1)	Winner (regime)
Bubble fraction	(p-1)/m	(1/v)(p-1)/m	Interleaved
Comm volume per microbatch	bsh per edge	v * (bsh / t) per edge with s/g	Interleaved + s/g
Memory (in-flight microbatches)	p	p	Tie
Constraint on m	none	m must be multiple of p	Default (more flexible)
Wins at small batch	-	yes (~10% faster)	Interleaved
Wins at large batch	yes (bubble already small)	(only with s/g)	Default for very large B
Required for trillion scale	no	yes (small bubble is binding)	Interleaved

6.3 Microbatch size

Dimension	Small b (=1)	Medium b (=2 or 4)	Large b (=8)	Winner (regime)
Pipeline bubble (m = B/(b*d))	smallest (most m)	medium	largest (fewest m)	Small b
GEMM arithmetic intensity	low	medium	highest	Large b
Tensor-MP all-reduce frequency	every microbatch (high)	medium	low	Large b
Optimal in paper	rare	b = 2 (91B), b = 4 (175B)	rare (memory)	Medium b (paper's finding)

The optimum is model-dependent (Takeaway #3). The paper's analytic proxy b'/b + p - 1) * (t_f(b) + t_b(b)) is a good first cut.

6.4 Activation recomputation

Dimension	No recompute	Recompute (every layer)	Recompute (every 1-2 layers)	Winner (DynamICCL view)
Memory per stage	high (l*A_intermediate)	low (c*A_input)	minimum (c = sqrt(l*A_int/A_in))	recompute every 1-2 layers
Compute overhead	0	33% (extra fwd)	33% (extra fwd)	no recompute
Required for large B	infeasible	feasible	feasible	recompute
Throughput at small B	best	-33%	-33%	no recompute
Throughput at large B	(cannot run)	2x best small-B	2x best small-B	recompute (at least 1-2 layers)

6.5 Scatter/gather + operator fusion

Optimization	Mechanism	Trigger	Cost when off	Cost when on	Used in
Scatter/gather	1/t-sized chunks over IB + NVLink AG	always (with TP > 1)	t * bsh IB	bsh IB + 1 NVLink AG	Sec. 4.1
Bias+GeLU fuse	PyTorch JIT	always	2 elementwise kernels	1 fused kernel	Sec. 4.2
Bias+drop+add	PyTorch JIT	always	3 elementwise kernels	1 fused kernel	Sec. 4.2
Mask+softmax	Custom CUDA kernel (general/causal)	always	3 reduction kernels	1 fused kernel	Sec. 4.2
`[s,b,a,h]` layout	Strided batched GEMM	always	transpose + GEMM	direct strided GEMM	Sec. 4.2

The gating discipline is uniform: all five always-on optimizations are zero-conditional (no input-dependent gating); they amortize across the entire training run. The interleaved schedule and activation recomputation are the only conditional optimizations (driven by memory + bubble trade-offs).

7. Bottlenecks & Insights Surfaced by the Measurements

7.1 The interconnect hierarchy is the premise, not a result

PTD-P is not a search over all 4-axis configurations; it is a projection of the 4-axis space onto a 1-axis decision tree where t = g is pre-decided by the hardware tier boundary. The 2x throughput penalty for sub-optimal (t, p) splits in Fig 13 is therefore not a surprise -- it is the cost of ignoring the interconnect hierarchy. The premise that DGX A100 has NVLink-class intra-server + HDR-IB inter-server is what makes PTD-P work; on a flatter fabric (e.g., commodity Ethernet cluster) the same heuristic would not hold.

7.2 The interleaved schedule's v-fold comm cost is paid back

entirely by scatter/gather

Without scatter/gather, the interleaved schedule is not faster than the default at large batch -- the v-fold extra P2P traffic eats the bubble savings. Scatter/gather reduces per-edge IB volume by t = 8x, so even doubling the comm (v = 2) leaves a net 4x reduction. The interleaved schedule is conditionally useful: only with scatter/gather turned on. This is the single most important interaction in the paper.

7.3 Super-linear weak scaling is an arithmetic-intensity story

The 44% -> 52% peak efficiency improvement from 1B -> 1T parameters is not driven by reduced communication overhead. It is driven by larger GEMMs becoming more compute-bound: the largest (h = 25600) matrix multiplications saturate Tensor Core throughput, while the smaller (h = 4096) matmuls are bandwidth-limited. The paper makes this explicit ("GPU utilization improves as the models get larger ... without significant increase in the communication time relative to computation time"). This is a property of the workload, not the parallelization scheme.

7.4 Pipeline parallelism's bubble is the inherent penalty for

strict optimizer semantics

The bubble (p-1)/m is the only term in the bubble-size analysis that cannot be eliminated by any synchronous schedule -- it follows directly from the requirement that every microbatch's backward pass sees the same weights as its forward pass. Asynchronous schemes (PipeDream-2BW, PipeMare) would eliminate it but at the cost of relaxed optimizer semantics. The paper's choice to keep flushes makes the interleaved schedule the only handle on bubble size.

7.5 Data-parallel all-reduce bandwidth is not the binding

constraint at 3072 GPUs

The measured 12.9 TB/s effective bisection for DP all-reduce is more than 14x the 892 GB/s for pipeline P2P. This is because DP all-reduce is infrequent (once per global batch) and spread (every link in the bisection participates). The binding communication constraint at 3072 GPUs is inter-stage pipeline P2P, which crosses the same fabric per microbatch. This is what the scatter/gather optimization specifically addresses.

7.6 Filesystem write is 40% of peak; read is 100% of peak

The asymmetry between ckpt save (273 GB/s = 40% peak write) and ckpt load (1 TB/s = peak read) is buried in Sec. 5.10 but reveals a practical operational issue: the parallel filesystem is the checkpointing bottleneck, and at trillion-parameter scale a 13.8 TB checkpoint takes ~50 seconds to save -- a long enough interval that checkpointing every N iterations becomes a meaningful throughput overhead.

7.7 NCCL configuration is held constant across all sweeps

The paper sweeps p, t, d, b, B, schedule, recompute, scatter/gather, fusion -- but does not sweep NCCL knobs: algorithm (Ring vs Tree vs CollNet vs NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, chunkSize all use NCCL defaults. The 502 PF/s headline is therefore a floor on what PTD-P can deliver; tuning NCCL on top of PTD-P could plausibly add another 5-20% (consistent with what related work like AutoCCL [0008] reports for NCCL knob tuning in isolation). The paper's strong scaling result is thus not the ceiling it might appear to be, even on the same hardware.

7.8 The heuristic explicitly avoids automatic search

Sec. 1 (last paragraph) and Sec. 6 (Related Work) both note that PTD-P does not automatically search the space (FlexFlow [22], PipeDream [29], DAPPLE [14], Tarnawski et al. [41] do). The paper trades search generality for rule clarity: three takeaways that a practitioner can apply directly. For a runtime-tuning system, this is an opportunity -- the takeaways are excellent priors but they leave the per-collective NCCL configuration unspecified.

8. Limitations of the Methodology

Limitation	Implication
Only DGX A100 + NVLink + HDR IB tested	Heuristic depends on the precise interconnect hierarchy; flatter fabrics not validated
Only 8 GPUs per server (g = 8)	Heuristic "t = g" is g-specific; a 4-GPU or 16-GPU server would shift the optimum
Only GPT (decoder-only) transformer tested	No encoder-decoder (T5), no MoE (Switch Transformers), no convolutional / vision models
Vocab fixed at V = 51200	Multi-lingual or larger-vocab regimes (e.g., 250k for some BERT variants) not measured
Sequence length fixed at s = 2048	Long-context (8k, 32k) regimes not tested; attention scales as O(s^2)
FP16 mixed precision only	No FP8, BF16, or INT8 quantization studied
NCCL knobs held at default	Algorithm/protocol/nChannels/numThreads/chunkSize not swept
Synchronous semantics only	Async pipelining (PipeDream-2BW, PipeMare) deferred to future work
No automatic search of the parallelism space	Heuristic-driven; cannot adapt to off-distribution models or hardware
Single Selene cluster	No multi-cluster, no commodity-cloud, no heterogeneous-GPU deployments
ZeRO-3 baseline tested without TP	Comparison may understate ZeRO + TP combination
Activation recomputation: 1 or 2 layers	Other granularities (every k layers for k > 2) not benchmarked
Microbatch sweep limited to b in {1,2,4,8}	Larger b not explored (memory bound)
Single training-config rerun cadence	No variance / error bars across runs reported
13.8 TB checkpoint cadence not detailed	Checkpoint frequency vs throughput trade-off implied but not measured
Aggregate scaling stops at 3072 GPUs	Beyond-3072 scaling (e.g., 6144, 12288) not characterized
Container, OS pinned (Ubuntu 20.04, PyTorch 1.8)	Newer software stacks (e.g., FlashAttention, kernel fusion via torch.compile) not factored in

The most consequential limitation for a DynamICCL-style runtime tuner is the NCCL-default assumption. Megatron-LM's PTD-P delivers 502 PF/s with NCCL untouched; the obvious question -- "what fraction of the remaining 48% gap to peak is recoverable by NCCL knob tuning?" -- is not addressed. The paper's heuristic specifies what collective is invoked (which all-reduce, which P2P) but not how NCCL executes it.

9. Note on NCCL Tuning

The paper specifies the mapping from PTD-P axes to NCCL collective calls with unusual clarity but holds NCCL configuration fixed at default: the t-way intra-server tensor-MP all-reduces, the p-way inter-server pipeline-MP send/recv pairs, and the d-way data-parallel all-reduces are three structurally distinct collective patterns with different message sizes, frequencies, and target tiers (NVLink vs HDR IB). The reported 12.9 TB/s vs 892 GB/s effective bisection asymmetry between data-parallel all-reduce (large, infrequent, fabric-spread) and pipeline P2P (medium, per-microbatch, edge-local) shows that the same NCCL configuration cannot be optimal for both. A per-collective tuner with knowledge of which PTD-P axis a call belongs to could exploit exactly this asymmetry -- a Tree algorithm with LL128 protocol on the small per-microbatch tensor-MP calls, a Ring algorithm with Simple protocol on the large per-batch DP gradient sync -- without changing any of PTD-P's heuristics.

10. Analogy

PTD-P is a three-shift factory operating across a campus of NVLink-walled buildings, where each building houses a single 8-bay manufacturing cell with extremely fast intra-building forklifts and the buildings are connected by a high-speed but slower fleet of delivery trucks. Tensor parallelism is the intra-building shift: 8 workstations inside one building each handle a vertical slice of every product (one slice of GeLU, one slice of attention), and they exchange parts via the NVLink forklifts after every step -- this shift requires constant intra-building communication that would choke on inter-building trucks. Pipeline parallelism is the inter-building shift: each building specializes in a contiguous range of layers, and partially-finished products move down the campus via trucks; only one truck is needed per partial product, and the trucks can run in parallel because each carries a different microbatch. Data parallelism is the replicated-factory shift: identical PTD-P factories run in parallel across distinct campuses, and once per shift they synchronize the master blueprint by broadcasting all the day's accumulated edits. The interleaved schedule is a foreman's trick: instead of each building making layers 1-8 then sitting idle while the next building makes 9-16, each building makes 1-2 and 9-10 (still in order, just in two slots), so the assembly line never has to pause for a full draining cycle. The scatter/gather optimization is the realization that each building's 8 workstations were all handing the same partial product to the next building's 8 workstations across slow trucks; instead, each workstation now sends only its own 1/8 chunk via its own truck (because each building has 8 truck bays), and the receiving building reassembles the full product using its fast intra-building forklifts. The genius of the design is exploiting the building- truck bandwidth gap structurally: every cross-campus operation that can be local is made local, every operation that must be remote is sliced down to the smallest possible payload, and the foreman's flush-at-shift-end discipline (synchronous optimizer step) keeps every blueprint consistent at the cost of a brief twice-daily idle period -- a cost the interleaved schedule cuts in half. The paper's contribution is showing that this campus-scale factory hits 52% of theoretical peak even when scaled to 384 buildings and 3072 workstations, training a trillion-parameter blueprint in 84 days.