Detailed Summary: Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Citation: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang. Case Western Reserve University / Rutgers University. arXiv:2509.22832v1, September 26, 2025.

PDF: 0013_GPU_Perf_modeling_LLM.pdf


Abstract

The paper proposes a hybrid bottom-up framework for predicting end-to-end training time of LLMs distributed across HPC GPU clusters. It decomposes LLMs into fundamental computational operators (GEMMs, normalization, attention, communication collectives), trains lightweight hardware-aware regression models (RandomForest / XGBoost) per operator type using small empirical samples, and integrates predictions into an analytical pipeline timeline. The framework achieves 4.98% average prediction error on Perlmutter (A100-SXM4, 128 GPUs) and 9.38% on Vista (GH200, 128 GPUs) for models up to 20B parameters. It runs entirely on CPUs, eliminating the need for on-cluster profiling during prediction.


1. Motivation

LLMs require 10x-100x more compute per model generation. Training GPT-4 or LLaMA-3 requires on the order of 10^25 FLOPs spread across tens of thousands of GPUs for months. System operators must make high-stakes resource allocation decisions before training begins. An accurate, cheap performance predictor would allow exploration of parallelism strategies (DP/MP/PP) and hardware configurations without committing to costly trial runs.

Existing approaches have gaps: learned models require costly on-hardware sampling, while purely analytical models miss hardware-specific behavior such as library auto-tuning (Section 6 surveys both lines of work).

The three fundamental challenges identified are: (C1) heterogeneous GPU architectures with nonlinear, hardware-specific performance (A100 vs. GH200), (C2) diverse computational patterns within transformers (compute-bound GEMM vs. memory-bound LayerNorm vs. attention), and (C3) complex interactions between parallelism strategies and communication overhead on multi-tier interconnects.


2. Background

2.1 GPU Architecture Considerations

Modern NVIDIA GPUs contain Tensor Cores within Streaming Multiprocessors (SMs) plus hierarchical memory (L1 per-SM, L2 shared, HBM global). The A100 supports TF32/BF16; the H100 adds FP8 and doubles CUDA core density per sub-partition. Library-level auto-tuning in cuBLAS and cuDNN selects different kernels depending on matrix dimensions, creating step-like, non-monotonic performance curves that purely analytical models cannot capture.

2.2 Transformer Computational Patterns

Each transformer block contains: (a) Multi-Head Attention (MHA) with diverse GEMM shapes determined by batch size, sequence length, hidden dimension, and head count; (b) Feed-Forward / MLP layers that expand hidden dimension by 4x; (c) Normalization layers (LayerNorm, RMSNorm) that are memory-bandwidth limited rather than compute-bound. These heterogeneous bottlenecks make a single unified performance model insufficient.
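The compute-bound vs. memory-bound contrast can be made concrete with a rough FLOP count. The sketch below is an illustration using GPT-20B-like dimensions from Section 4, not the paper's model; the helper name `gemm_flops` is ours.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    # One multiply and one add per output element per reduction step: 2*m*k*n
    return 2 * m * k * n

# GPT-20B-like dimensions (micro-batch 4, sequence 2048, hidden 6144)
b, l, d = 4, 2048, 6144
qkv_flops = gemm_flops(b * l, d, 3 * d)   # Linear1 / QKV projection GEMM
mlp_flops = gemm_flops(b * l, d, 4 * d)   # MLP expansion (4x hidden dimension)

# LayerNorm touches the same activation but does only O(1) FLOPs per element,
# so it is limited by HBM bandwidth rather than Tensor Core throughput.
ln_elems = b * l * d
print(qkv_flops, mlp_flops, ln_elems)
```

The QKV GEMM performs trillions of FLOPs over the same activation that LayerNorm merely streams through memory, which is why a single unified model cannot fit both regimes.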

2.3 Distributed Training Strategies

Three primary parallelism dimensions are considered: data parallelism (DP), which replicates the model and synchronizes gradients via AllReduce; tensor model parallelism (MP), which splits individual layers across GPUs and requires intra-layer AllReduce; and pipeline parallelism (PP), which partitions layers into stages connected by P2P transfers.

These three can be combined as "3D parallelism." The paper uses GPT-NeoX with DeepSpeed + Megatron-LM as the implementation framework.


3. Methodology

3.1 Performance Data Collection

Operators are profiled in isolation via PyTorch profiler (1 µs recording resolution, CUDA event timing). Profiling protocol: 10-iteration warmup with a saturating input, followed by 10 measurement iterations. Final measurement is the mean of the sorted median 5 samples to remove outliers.
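The outlier-rejection rule can be sketched as follows, assuming "mean of the sorted median 5 samples" means averaging the five central values of the ten sorted measurements (our reading; the function name is ours, not the paper's code):

```python
def aggregate(samples: list[float]) -> float:
    """Mean of the 5 central values of the sorted samples, discarding
    outliers at both tails of the 10 measurement iterations."""
    s = sorted(samples)
    mid = len(s) // 2
    window = s[mid - 2: mid + 3]   # 5 central values for 10 samples
    return sum(window) / len(window)

# 10 measurement iterations with two outliers (e.g., scheduler jitter)
times = [1.02, 0.98, 1.00, 5.00, 1.01, 0.99, 1.03, 0.97, 1.00, 0.40]
print(aggregate(times))   # ~1.004: both the 5.00 and 0.40 outliers are dropped
```

This trimmed mean keeps the estimate stable even when a single iteration is perturbed by system noise.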

Communication operators are benchmarked across layouts (varying node count, GPU count per node) to capture topology effects. Sampling spans for key operators:

Parameter Range
Model parallelism degree 1 to 16 (powers of 2)
Batch size 4 to 8
Attention heads 16 to 80 (step 8)
Sequence length 1024 to 5120 (step 512)
Hidden dimension 2048 to 8192 (step 512)
MP AllReduce count 2.09e7 to 1.34e8 elements, 2 to 8 processes
DP AllReduce / AllGather count 1.34e8 to 1.20e9 elements, 2 to 8 processes
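The sampling spans above can be enumerated programmatically. The sketch below treats each span as inclusive and assumes the hidden-dimension upper bound is 8192 (the nearest step-512 value); variable names are ours.

```python
from itertools import product

mp_degrees  = [2 ** i for i in range(5)]           # 1, 2, 4, 8, 16
batch_sizes = list(range(4, 9))                    # 4..8
heads       = list(range(16, 81, 8))               # 16..80, step 8
seq_lens    = list(range(1024, 5121, 512))         # 1024..5120, step 512
hidden_dims = list(range(2048, 8193, 512))         # 2048..8192, step 512

# Full cross-product; in practice the paper samples rather than
# exhaustively profiling every combination.
grid = list(product(mp_degrees, batch_sizes, heads, seq_lens, hidden_dims))
print(len(grid))   # 5 * 5 * 9 * 9 * 13 = 26325
```

Even this modest parameter space yields tens of thousands of combinations, which is why targeted sampling (rather than exhaustive profiling) matters.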

3.2 Workload Representation

Each operator is characterized by a feature vector capturing model dimensions and distributed training parameters. Key features include the micro-batch size b, sequence length l, hidden dimension d, attention head count h, vocabulary size v, the model-parallel degree |mp|, and the node and per-node GPU counts.

Full feature-to-operator mapping (Table I in paper):

Operator Feature Vector
Embedding [bl, v/|mp|, d]
LayerNorm / RMSNorm [b, l, d]
Linear1 (QKV projection) [bl, d, 3d/|mp|]
Flash Attention [b, l, h/|mp|, d/h]
MP AllReduce [bld, |nodes|, |GPUs/node|]
DP AllReduce / AllGather [|entries|, |nodes|, |GPUs/node|]
PP P2P [bld/|mp|, |nodes|, |GPUs/node|]
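A minimal sketch of how the Table I mapping could be materialized from a model configuration follows; the function and dictionary keys are our own, and integer division stands in for the paper's exact partitioning arithmetic.

```python
def feature_vectors(b, l, d, h, v, mp, nodes, gpus_per_node, entries):
    """Per-operator feature vectors following Table I.
    b: micro-batch, l: sequence length, d: hidden dim, h: attention heads,
    v: vocab size, mp: model-parallel degree, entries: gradient element count."""
    return {
        "embedding":       [b * l, v // mp, d],
        "layernorm":       [b, l, d],
        "linear1_qkv":     [b * l, d, 3 * d // mp],
        "flash_attention": [b, l, h // mp, d // h],
        "mp_allreduce":    [b * l * d, nodes, gpus_per_node],
        "dp_allreduce":    [entries, nodes, gpus_per_node],
        "pp_p2p":          [b * l * d // mp, nodes, gpus_per_node],
    }

fv = feature_vectors(b=4, l=2048, d=4096, h=32, v=50304, mp=2,
                     nodes=8, gpus_per_node=4, entries=134_217_728)
print(fv["linear1_qkv"])     # [8192, 4096, 6144]
print(fv["flash_attention"]) # [4, 2048, 16, 128]
```

Each vector is exactly what the corresponding per-operator regressor consumes at prediction time.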

3.3 Per-Operator Regressors

RandomForest and XGBoost are selected for their ability to capture non-linear, piecewise performance behaviors (hardware thresholds from auto-tuning, memory hierarchy transitions, NCCL algorithm switching). 80/20 train/validation split; regressor type and hyperparameters selected by minimizing validation error. The final regressor is retrained on the full dataset.

Tree-based models outperform neural networks on this tabular, relatively small dataset because of: (a) interpretability (useful for debugging), (b) robustness to outliers from system jitter, and (c) fast inference (critical for CPU-only deployment).
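The selection protocol (80/20 split, pick the lowest-validation-error regressor, retrain on everything) can be sketched with stand-in models. The paper uses RandomForest and XGBoost; to keep this sketch dependency-free we substitute two toy regressors (a global mean and a 1-nearest-neighbor lookup, the latter piecewise like a tree), keeping only the selection logic. All names are ours.

```python
import random

def mean_model(train_x, train_y):
    mu = sum(train_y) / len(train_y)
    return lambda x: mu

def nn_model(train_x, train_y):
    # 1-nearest-neighbor on a scalar feature: crude, but piecewise-constant,
    # so it can follow step-like auto-tuning thresholds.
    pairs = sorted(zip(train_x, train_y))
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

def select_regressor(xs, ys, builders, seed=0):
    """80/20 train/validation split; keep the builder with the lowest
    validation MAE, then retrain it on the full dataset (Sec. 3.3)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    tr, va = idx[:cut], idx[cut:]
    def mae(model):
        return sum(abs(model(xs[i]) - ys[i]) for i in va) / len(va)
    best = min(builders,
               key=lambda b: mae(b([xs[i] for i in tr], [ys[i] for i in tr])))
    return best(xs, ys)   # final regressor retrained on everything

# Step-like latency data (kernel switch at x = 50): the piecewise model wins
xs = list(range(100))
ys = [1.0 if x < 50 else 3.0 for x in xs]
model = select_regressor(xs, ys, [mean_model, nn_model])
print(model(10), model(90))   # 1.0 3.0
```

The step in the toy data mimics the cuBLAS/NCCL threshold behavior that makes piecewise learners the right fit here.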

3.4 End-to-End Timeline Modeling (1F1B Pipeline)

Pipeline parallelism introduces the 1F1B (One-Forward-One-Backward) schedule where each stage executes backward propagation as early as possible. The total runtime formula is:

Runtime = (#MicroBatches - 1 + #PipelineStages) × (MaxFwd + MaxBwd)
          + FirstStageGradientSynchronization
          + MaxUpdate

Where:

MaxFwd / MaxBwd: the maximum per-stage forward / backward micro-batch time across pipeline stages.
FirstStageGradientSynchronization: the DP AllReduce cost for the first pipeline stage, which cannot be overlapped with backward propagation (later stages overlap theirs).
MaxUpdate: the maximum per-stage parameter-update cost (optimizer step + DP AllGather).

Pipeline partitioning formulas:

First stage encoders  = floor((#encoders + 5) / #PP_stages) - 2
Middle stage encoders = floor((#encoders + 5) / #PP_stages)
Last stage encoders   = floor((#encoders + 5) / #PP_stages) - 3

Vocabulary size alignment (for memory access efficiency):

divisibility_factor = 128 × num_MP_partitions
vocab_size = floor(original_vocab_size / divisibility_factor) × divisibility_factor
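The three formulas above combine into a small estimator. This is a direct transcription of the formulas as stated (including the floor-based vocabulary alignment), with our own function names, not the paper's implementation.

```python
def pipeline_runtime(micro_batches, pp_stages, max_fwd, max_bwd,
                     first_stage_grad_sync, max_update):
    """1F1B total-runtime formula from Sec. 3.4."""
    return ((micro_batches - 1 + pp_stages) * (max_fwd + max_bwd)
            + first_stage_grad_sync + max_update)

def stage_encoder_counts(num_encoders, pp_stages):
    """Encoders assigned to first/middle/last stages; the +5 offsets
    the non-encoder blocks (embedding, head, etc.)."""
    base = (num_encoders + 5) // pp_stages
    return base - 2, base, base - 3

def aligned_vocab_size(original_vocab_size, num_mp_partitions):
    factor = 128 * num_mp_partitions
    return (original_vocab_size // factor) * factor

print(pipeline_runtime(16, 4, 1.0, 2.0, 0.5, 0.3))  # (16-1+4)*3.0 + 0.8 = 57.8
print(stage_encoder_counts(44, 4))                   # base = 49 // 4 = 12
print(aligned_vocab_size(50257, 2))                  # 50176 (multiple of 256)
```

Plugging in a GPT-20B-style run (44 encoders, 4 pipeline stages) shows the first and last stages carrying fewer encoders to compensate for the non-encoder blocks they host.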

4. Evaluation Setup

4.1 Target Models

Config GPT-20B LLaMA-13B Llemma-7B
Hidden dim 6144 5120 4096
Sequence length 2048 2048 4096
Attention heads 64 40 32
Encoder layers 44 40 32
Fused Softmax Yes Yes No
Flash Attention No No Yes
Norm type Basic LayerNorm RMSNorm RMSNorm
Micro-batch size 4 4 4
Iters/Update 16 16 8

4.2 HPC Clusters

Feature Perlmutter TACC Vista
CPU AMD EPYC 7763 (64 cores) NVIDIA Grace (72 cores)
GPU A100-SXM4 (40 GB HBM2) GH200 (96 GB HBM3)
GPUs/node 4 1
Intra-node NVLink 3.0 (600 GB/s) NVLink-C2C (900 GB/s)
Inter-node Slingshot-10 (4×50 Gb/s) NDR InfiniBand (400 Gb/s)
Scale Up to 32 nodes (128 A100) Up to 128 nodes (128 GH200)

The key architectural difference is that Perlmutter has 4 GPUs per node (enabling intra-node pre-reduction for MP_AllReduce), while Vista has 1 GPU per node (all collectives traverse the inter-node InfiniBand fabric, increasing jitter and prediction difficulty).


5. Key Results with Numbers

5.1 Performance Stability

Perlmutter shows high stability: all configurations exhibit <1% variation between minimum and average training times.

Vista shows significantly higher variability (5.21%–108.3% increase from minimum to average), with the worst cases at GPT-20B(4-8-4) (70.38%) and GPT-20B(8-4-4) (108.3%). Cause: the single-GPU-per-node design forces MP_AllReduce over inter-node InfiniBand, which is susceptible to congestion and scheduling jitter.
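For reference, the min-to-average variation metric quoted here is the percentage increase of the average over the minimum batch time; a one-line formulation (ours):

```python
def variation_pct(min_time: float, avg_time: float) -> float:
    """Percentage increase from the minimum to the average batch time."""
    return (avg_time - min_time) / min_time * 100.0

# An average batch time ~2.083x the minimum gives ~108.3% variation,
# the worst case reported on Vista.
print(round(variation_pct(1.0, 2.083), 1))   # 108.3
```

A value near 0% (as on Perlmutter) means nearly jitter-free runs, so targeting the minimum batch time loses little information there.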

5.2 Component-Level Prediction Errors (Table IX)

The framework uses the minimum batch time as the prediction target (to avoid modeling transient congestion effects).

Component Perlmutter range Vista range
Encoder Fwd -13.60% to -1.84% -14.37% to -2.24%
Encoder Bwd -11.93% to -1.55% -13.33% to -8.05%
Stage Fwd Max -13.41% to 0.53% -11.99% to 0.32%
Stage Bwd Max -12.49% to -1.66% -13.73% to -8.33%
DP AllReduce -31.21% to 5.06% -49.69% to -6.36%
DP AllGather -1.89% to 37.86% -52.47% to 18.68%
PP P2P -21.98% to 3.84% -36.26% to -1.65%
MP AllReduce -4.85% to 10.67% -7.66% to 2.33%

Key observations: compute components (encoder forward/backward, stage maxima) are predicted within roughly 1-15% error; the DP collectives show the largest individual errors (up to ~52% on Vista) but carry little weight in total runtime; MP AllReduce is the most accurately predicted collective, helped by its high invocation frequency and dense sampling coverage.

5.3 End-to-End Prediction Errors

Perlmutter: 4.98% average end-to-end prediction error, with the best case at 3.94% (GPT-20B, 4-8-4 configuration).

Vista: 9.38% average end-to-end prediction error, with a consistent underestimation trend across configurations.

Perlmutter's 4-8-4 configuration achieves the lowest error (3.94%), confirming balanced model parallelism is easiest to predict. Data-parallel-heavy configurations (4-4-8) and deep pipeline configurations (8-4-4) show larger errors due to more complex communication patterns. Vista shows a consistent underestimation trend attributed to unpredictable network jitter on the all-InfiniBand fabric.

5.4 Runtime Breakdown (Figure 3)

Compute dominates: encoder forward + backward passes and pipeline stage execution account for 70–95% of total runtime. Communication operations collectively account for under 10% (often under 5%), except in MP_AllReduce-heavy configurations (8-way model parallelism on Perlmutter, or all Vista configurations, where MP_AllReduce is invoked once or twice per encoder pass over InfiniBand).


6. Related Work

6.1 Analytical Communication Models

LogP (Culler 1993), BSP (Valiant 1990), and LogGOPS (Hoefler 2010) parametrize communication with latency, bandwidth, and synchronization costs. Effective for simple topologies but require calibrating 4–7+ parameters and struggle with vendor-specific library behavior. MPI communication optimization work (Thakur et al. 2005) follows this tradition.

6.2 GPU Kernel Analytical Models

IPP model (Hong and Kim 2009, 2010): 8.9% error for single GPU workloads. Boyer et al. (2013): extended with data transfer modeling (8% error). DeLTA (Lym et al. 2019): 7.9% error modeling arithmetic intensity and memory traffic, but requires low-level profiling unavailable without hardware. All of these are single-GPU and do not model distributed communication.

6.3 DNN-Specific Learned Models

Habitat (Yu et al. 2021): runtime-based GPU performance predictor via small-scale execution. DayDream (Zhu et al. 2020): dependency-graph modeling for training time estimation. Both require on-hardware sampling. DTS (Esposito 2022): simulation-based time estimation with substantial profiling cost.

6.4 Distributed LLM Performance Modeling

Kundu et al. (2024) (IISWC): analytical method for distributed LLM training with <10% error across 8–3072 GPUs for 22B–1008B parameter models. Most closely related to this work but their implementation is not open source, preventing direct comparison.


7. Limitations

Communication collectives are the weak point: DP AllReduce / AllGather and PP P2P show component-level errors of up to roughly 52%, driven by NCCL's runtime algorithm and protocol selection and by inter-node network jitter. The framework also assumes NCCL's default configuration, targets the minimum (congestion-free) batch time rather than the expected time, and is validated only up to 20B parameters and 128 GPUs per cluster.

8. Section-by-Section Paragraph Summaries

Section I — Introduction

LLMs now require 10^25 FLOPs for training, making pre-training performance prediction critical for HPC resource allocation. Existing methods are either too expensive (learned models with costly sampling) or too approximate (analytical models missing hardware specifics). The paper proposes hierarchical operator decomposition + targeted lightweight sampling as a middle path. Three contributions: operator-level decomposition, hardware-aware regression per operator type, end-to-end pipeline timeline integrating both.

Section II-A — Foundations: GPU Architecture

Modern GPUs have SM-level compute with hierarchical memory. A100/H100 differ in Tensor Core generation, FP8 support, and cache sizes. cuBLAS/cuDNN auto-tune kernel selection to input shape, creating discontinuous performance curves. Mixed precision (FP16, BF16, FP8) affects both compute and memory bandwidth.

Section II-A — Foundations: Transformer Patterns

MHA creates variable GEMM shapes depending on batch/sequence/head parameters. MLP layers are compute-bound. LayerNorm/RMSNorm are memory-bandwidth bound. Flash Attention fuses multiple steps to reduce memory overhead. These heterogeneous bottlenecks make a single performance model insufficient — motivating per-operator regressors.

Section II-A — Foundations: Distributed Training Strategies

DP, MP (tensor), and PP are described. DP requires AllReduce at gradient synchronization; tensor MP requires intra-layer AllReduce; PP requires P2P between stages and introduces pipeline bubbles. All three interact in 3D parallelism. Communication efficiency depends on multi-tier interconnects (NVLink intra-node, InfiniBand/Ethernet inter-node).

Section II-B — Problem Statement

Three challenges formalized. Challenge 1: heterogeneous hardware with nonlinear scaling. Challenge 2: diverse transformer operator patterns. Challenge 3: interplay of parallelism strategies and stochastic communication overhead. Standard analytical scaling laws are insufficient.

Section III-A — Data Collection Methodology

Micro-benchmark protocol: isolation profiling with PyTorch profiler, 10-iteration warmup, sorted median of 5 samples to reduce outlier impact. Communication operators benchmarked across GPU/node count ranges. Sampling strategy designed to cover high-impact configurations without exhaustive enumeration.

Section III-B — Regressors

RandomForest and XGBoost selected for handling tabular data, piecewise nonlinearities, and robustness to outliers. 80/20 train/validation split with hyperparameter selection by validation error minimization. Regressors capture auto-tuning discontinuities, memory hierarchy transitions, and NCCL algorithm switching thresholds.

Section III-C — Workload Representation

Feature vectors encode all relevant model and system parameters per operator type (Table I). Communication operators use [entries, nodes, GPUs/node] to capture topology effects. Pipeline stage roles differentiate first/middle/last stages for accurate parameter count and synchronization modeling.

Section III-D — Timeline Modeling

The 1F1B pipeline timeline formula accounts for micro-batch count, pipeline stages, maximum forward/backward time, DP AllReduce cost for first stage, and MaxUpdate (Optimizer + DP AllGather). Gradient synchronization for all but the first stage is overlapped with pipeline backward propagation. Pipeline stage encoder allocation formulas balance workload across stages by offsetting the 5 non-encoder blocks.

Section IV-A — Experimental Setup

Two HPC clusters evaluated: Perlmutter (4 A100s/node, NVLink3, Slingshot-10) and Vista (1 GH200/node, NVLink-C2C, NDR InfiniBand). Three LLMs tested: GPT-20B, LLaMA-13B, Llemma-7B. Multiple 3D parallelism configurations per model.

Section IV-B — Performance Stability

Perlmutter is highly stable (<1% min-to-average variation). Vista shows 5-108% variation, worst for GPT-20B(8-4-4) due to MP_AllReduce traversing InfiniBand exclusively. The framework targets minimum batch time to decouple prediction accuracy from network variability.

Section IV-C — Prediction Errors

Compute-dominated components (70-95% of runtime) are predicted with 1-15% error. Communication components have high individual errors (up to 52%) but low weight in total runtime. MP_AllReduce is accurate (<5% in most cases) due to high invocation frequency. Overall: 4.98% on Perlmutter, 9.38% on Vista. Balanced MP configurations (4-8-4) are easiest to predict; data-parallel-heavy configurations (4-4-8) are hardest.

Section V — Related Work

Surveys LogP/BSP (foundational communication models), bottom-up GPU models (IPP, DeLTA), top-down approaches (GCoM), DNN-specific models (DayDream, Habitat), and the closest work (Kundu et al. 2024 IISWC analytical LLM training model). Positions this paper as combining targeted empirical sampling with hierarchical analytical composition.

Section VI — Conclusion

4.98%/9.38% average prediction errors across two clusters. Communication operations (especially PP P2P on unified memory architectures) remain the hardest to predict and represent future work. Integration with job scheduling systems and emerging hardware architectures identified as next steps.


9. Relevance to DynamICCL

Moderate direct relevance — confirms the core DynamICCL problem.

1. NCCL communication is the unpredictable component. The paper reports up to 52% component-level prediction error for DP_AllReduce and PP_P2P, explicitly because NCCL's runtime algorithm selection (ring vs. tree vs. CollNet), protocol choices (LL/LL128/Simple), and sensitivity to network congestion introduce discontinuities that regression cannot capture. This is precisely the parameter space DynamICCL optimizes over. The paper's failure mode (predicting NCCL behavior) is DynamICCL's value proposition (controlling NCCL behavior).

2. Feature space for DynamICCL state design. The communication operator feature vectors in Table I — specifically [|entries|, |nodes|, |GPUs/nodes|] for DP_AllReduce/AllGather — provide a principled starting point for Agent-2's state vector. Message size, topology (node count, GPUs/node), and collective type are natural features.
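A hypothetical sketch of how Table I's communication features could seed such a state vector follows. This is speculative design, not part of the paper or of any existing DynamICCL code; all names (`CollectiveState`, field names) are ours.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class CollectiveState:
    """Hypothetical tuner state seeded by the Table I communication
    features: collective type, message size, and topology."""
    collective: str        # e.g. "allreduce", "allgather", "p2p"
    message_bytes: int     # payload size in bytes
    nodes: int             # participating node count
    gpus_per_node: int     # GPUs per node (topology tier)

    def as_features(self):
        # flat view suitable for feeding a policy; the collective type
        # would be one-hot encoded in a real pipeline
        return astuple(self)

# FP32 gradient AllReduce of 134M elements over 8 nodes x 4 GPUs
s = CollectiveState("allreduce", 4 * 134_217_728, 8, 4)
print(s.as_features())   # ('allreduce', 536870912, 8, 4)
```

The same four-ish features the paper found predictive for collective latency are natural inputs for a policy that chooses NCCL algorithm/protocol instead of predicting their cost.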

3. Parallelism-induced communication diversity. The paper demonstrates that MP degree significantly affects AllReduce communication volume and frequency: MP_AllReduce is invoked 1-2 times per encoder forward pass and 2 times per backward pass. DynamICCL's NCCL parameter selection must adapt to these different workload regimes, reinforcing the need for collective-type-aware and message-size-aware RL policy.

4. Vista GH200 + InfiniBand architecture parallel. GH200's single-GPU-per-node design forces all collectives onto inter-node InfiniBand, producing high variability (up to 108.3% min-to-average variation). DynamICCL operating on similar infrastructure faces the same jitter challenge, directly motivating Agent-1's (LSTM+CUSUM) congestion detection role: the agent needs to distinguish normal InfiniBand jitter from genuine sustained congestion.

5. CPU-only prediction framework. The framework runs entirely on CPUs, which is also a design goal for DynamICCL's tuner plugin (making parameter selection decisions at collective call time without GPU overhead). The framework's fast inference speed (tree-based regressors) aligns with the latency requirements of a real-time NCCL tuner.

One important gap: This paper assumes NCCL default configuration and models its output as a black box. DynamICCL explicitly replaces the black box with a controllable agent. The two works are therefore complementary: this paper shows why NCCL defaults are hard to predict; DynamICCL shows how to actively improve on them.