Detailed Summary: Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Citation: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang. Case Western Reserve University / Rutgers University. arXiv:2509.22832v1, September 26, 2025.
PDF: 0013_GPU_Perf_modeling_LLM.pdf
Abstract
The paper proposes a hybrid bottom-up framework for predicting end-to-end training time of LLMs distributed across HPC GPU clusters. It decomposes LLMs into fundamental computational operators (GEMMs, normalization, attention, communication collectives), trains lightweight hardware-aware regression models (RandomForest / XGBoost) per operator type using small empirical samples, and integrates predictions into an analytical pipeline timeline. The framework achieves 4.98% average prediction error on Perlmutter (A100-SXM4, 128 GPUs) and 9.38% on Vista (GH200, 128 GPUs) for models up to 20B parameters. It runs entirely on CPUs, eliminating the need for on-cluster profiling during prediction.
1. Motivation
Compute requirements grow by 10x-100x with each LLM model generation. Training GPT-4 or LLaMA-3 requires on the order of 10^25 FLOPs spread across tens of thousands of GPUs for months. System operators must make high-stakes resource allocation decisions before training begins. An accurate, cheap performance predictor would allow exploration of parallelism strategies (DP/MP/PP) and hardware configurations without committing to costly trial runs.
Existing approaches have gaps:
- Purely analytical models (LogP, BSP, LogGOPS) require calibrating many parameters and fail to capture vendor-specific auto-tuning in cuBLAS / NCCL, which creates discontinuous performance curves.
- Learned black-box models (DayDream, Habitat) require sampling actual GPU execution. Sampling 60 seconds of a 20B-parameter model on 128 A100s costs 2 node-hours; exploring 5-10 configurations consumes 5-10% of the total training budget.
- Memory-focused models (ZeRO, ZeRO-Infinity) address memory footprint but not runtime prediction.
The three fundamental challenges identified are: (C1) heterogeneous GPU architectures with nonlinear, hardware-specific performance (A100 vs. GH200), (C2) diverse computational patterns within transformers (compute-bound GEMM vs. memory-bound LayerNorm vs. attention), and (C3) complex interactions between parallelism strategies and communication overhead on multi-tier interconnects.
2. Background
2.1 GPU Architecture Considerations
Modern NVIDIA GPUs contain Tensor Cores within Streaming Multiprocessors (SMs) plus hierarchical memory (L1 per-SM, L2 shared, HBM global). The A100 supports TF32/BF16; the H100 adds FP8 and doubles CUDA core density per sub-partition. Library-level auto-tuning in cuBLAS and cuDNN selects different kernels depending on matrix dimensions, creating step-like, non-monotonic performance curves that purely analytical models cannot capture.
2.2 Transformer Computational Patterns
Each transformer block contains: (a) Multi-Head Attention (MHA) with diverse GEMM shapes determined by batch size, sequence length, hidden dimension, and head count; (b) Feed-Forward / MLP layers that expand hidden dimension by 4x; (c) Normalization layers (LayerNorm, RMSNorm) that are memory-bandwidth limited rather than compute-bound. These heterogeneous bottlenecks make a single unified performance model insufficient.
2.3 Distributed Training Strategies
Three primary parallelism dimensions are considered:
- Data Parallelism (DP): Replicated model, different data per replica. Gradients synchronized via AllReduce at the end of each backward pass. Communication cost scales linearly with model size and GPU count.
- Model Parallelism (MP) / Tensor Parallelism: Individual layers partitioned by splitting matrix dimensions (Megatron-LM style). Requires AllReduce to synchronize partial results within each layer forward/backward pass.
- Pipeline Parallelism (PP): Model layers segmented across devices; different micro-batches processed concurrently. Introduces pipeline bubbles and P2P communication between adjacent stages.
These three can be combined as "3D parallelism." The paper uses GPT-NeoX with DeepSpeed + Megatron-LM as the implementation framework.
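As a rough, back-of-envelope illustration of the traffic each dimension generates, the sketch below uses standard ring-AllReduce accounting and the activation shapes implied by Table I; it is not the paper's learned model, and all helper names are assumptions.

```python
# Back-of-envelope communication volumes per parallelism dimension.
# Illustrative only: standard ring-AllReduce accounting plus the activation
# shapes implied by Table I, not the paper's regression-based model.

def dp_allreduce_bytes(num_params: int, dp: int, bytes_per_grad: int = 2) -> float:
    """Gradient AllReduce per data-parallel sync: ~2(p-1)/p of the gradient buffer."""
    return 2 * (dp - 1) / dp * num_params * bytes_per_grad

def mp_allreduce_bytes(b: int, l: int, d: int, bytes_per_elem: int = 2) -> int:
    """Tensor-parallel AllReduce on the activation tensor [b, l, d]."""
    return b * l * d * bytes_per_elem

def pp_p2p_bytes(b: int, l: int, d: int, mp: int, bytes_per_elem: int = 2) -> int:
    """P2P between adjacent pipeline stages: the MP-sharded activation slice."""
    return (b * l * d // mp) * bytes_per_elem

# Example: GPT-20B-like shapes (b=4, l=2048, d=6144) with MP=4
# print(mp_allreduce_bytes(4, 2048, 6144) / 2**20, "MiB per MP AllReduce")
```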
3. Methodology
3.1 Performance Data Collection
Operators are profiled in isolation via the PyTorch profiler (1 µs recording resolution, CUDA event timing). Profiling protocol: 10 warmup iterations with a saturating input, followed by 10 measurement iterations. The reported latency is the mean of the middle 5 of the 10 sorted samples, which removes outliers.
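A minimal sketch of this timing protocol, assuming direct CUDA-event timing in PyTorch; the harness below is illustrative, not the authors' code.

```python
import torch

def benchmark_operator(op, *args, warmup: int = 10, iters: int = 10) -> float:
    """Time one CUDA operator in isolation; returns milliseconds."""
    for _ in range(warmup):                 # warmup with a saturating input
        op(*args)
    torch.cuda.synchronize()

    samples = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        op(*args)
        end.record()
        torch.cuda.synchronize()
        samples.append(start.elapsed_time(end))   # elapsed time in ms

    samples.sort()
    middle = samples[len(samples) // 2 - 2 : len(samples) // 2 + 3]  # middle 5 of 10
    return sum(middle) / len(middle)

# Example: a GEMM shaped like the QKV projection [bl, d] x [d, 3d/|mp|]
# x = torch.randn(4 * 2048, 6144, device="cuda", dtype=torch.bfloat16)
# w = torch.randn(6144, 3 * 6144 // 4, device="cuda", dtype=torch.bfloat16)
# t_ms = benchmark_operator(torch.matmul, x, w)
```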
Communication operators are benchmarked across layouts (varying node count, GPU count per node) to capture topology effects. Sampling spans for key operators:
| Parameter | Range |
|---|---|
| Model parallelism degree | 1 to 16 (powers of 2) |
| Batch size | 4 to 8 |
| Attention heads | 16 to 80 (step 8) |
| Sequence length | 1024 to 5120 (step 512) |
| Hidden dimension | 2048 to 8192 (step 512) |
| MP AllReduce count | 2.09e7 to 1.34e8 elements, 2 to 8 processes |
| DP AllReduce / AllGather count | 1.34e8 to 1.20e9 elements, 2 to 8 processes |
3.2 Workload Representation
Each operator is characterized by a feature vector capturing model dimensions and distributed training parameters. Key features include:
- b (batch size), l (sequence length), d (hidden dimension), h (attention heads), |mp| (model parallel degree)
- |entries|, |nodes|, |GPUs/nodes| for communication operators
- Pipeline stage role (first, middle, last), affecting activation propagation and synchronization
Full feature-to-operator mapping (Table I in paper):
| Operator | Feature Vector |
|---|---|
| Embedding | [bl, v/|mp|, d] |
| LayerNorm / RMSNorm | [b, l, d] |
| Linear1 (QKV projection) | [bl, d, 3d/|mp|] |
| Flash Attention | [b, l, h/|mp|, d/h] |
| MP AllReduce | [bld, |nodes|, |GPUs/nodes|] |
| DP AllReduce / AllGather | [|entries|, |nodes|, |GPUs/nodes|] |
| PP P2P | [bld/|mp|, |nodes|, |GPUs/nodes|] |
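A minimal sketch of how these feature vectors could be assembled from a model/parallelism configuration; the `Config` dataclass and function names are assumptions, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class Config:
    b: int              # micro-batch size
    l: int              # sequence length
    d: int              # hidden dimension
    h: int              # attention heads
    v: int              # vocabulary size (after MP alignment)
    mp: int             # model-parallel degree
    nodes: int          # node count
    gpus_per_node: int  # GPUs per node

def feature_vector(op: str, c: Config) -> list[int]:
    """Map an operator name to its Table I feature vector."""
    if op == "embedding":
        return [c.b * c.l, c.v // c.mp, c.d]
    if op in ("layernorm", "rmsnorm"):
        return [c.b, c.l, c.d]
    if op == "linear1_qkv":
        return [c.b * c.l, c.d, 3 * c.d // c.mp]
    if op == "flash_attention":
        return [c.b, c.l, c.h // c.mp, c.d // c.h]
    if op == "mp_allreduce":
        return [c.b * c.l * c.d, c.nodes, c.gpus_per_node]
    if op == "pp_p2p":
        return [c.b * c.l * c.d // c.mp, c.nodes, c.gpus_per_node]
    # DP AllReduce/AllGather use [|entries|, |nodes|, |GPUs/nodes|], where
    # |entries| is the per-stage parameter count and is not derivable from
    # the dimensions above alone.
    raise ValueError(f"unhandled operator: {op}")
```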
3.3 Per-Operator Regressors
RandomForest and XGBoost are selected for their ability to capture non-linear, piecewise performance behaviors (hardware thresholds from auto-tuning, memory hierarchy transitions, NCCL algorithm switching). 80/20 train/validation split; regressor type and hyperparameters selected by minimizing validation error. The final regressor is retrained on the full dataset.
Tree-based models outperform neural networks on this tabular, relatively small dataset because of: (a) interpretability (useful for debugging), (b) robustness to outliers from system jitter, and (c) fast inference (critical for CPU-only deployment).
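A hedged sketch of this selection procedure, assuming scikit-learn and XGBoost; the hyperparameters below are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def fit_operator_regressor(X: np.ndarray, y: np.ndarray):
    """X: feature vectors for one operator type; y: measured latencies (ms)."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    candidates = {
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "xgboost": XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1),
    }
    errors = {}
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        errors[name] = mean_absolute_percentage_error(y_val, model.predict(X_val))

    best = min(errors, key=errors.get)
    final = candidates[best].fit(X, y)   # retrain the winner on the full dataset
    return final, best, errors[best]
```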
3.4 End-to-End Timeline Modeling (1F1B Pipeline)
Pipeline parallelism introduces the 1F1B (One-Forward-One-Backward) schedule where each stage executes backward propagation as early as possible. The total runtime formula is:
Runtime = (#MicroBatches - 1 + #PipelineStages) × (MaxFwd + MaxBwd)
+ FirstStageGradientSynchronization
+ MaxUpdate
Where:
- MaxFwd, MaxBwd = maximum forward/backward pass time across all pipeline stages
- FirstStageGradientSynchronization = DP_AllReduce cost for the first stage (not hidden by the pipeline)
- MaxUpdate = max over stages of (Optimizer + DP_AllGather(#StageParams / |dp|))
- MP_AllReduce in the cross-entropy and optimizer steps is ignored (negligible communication volume)
- Gradient synchronization for all stages except the first is overlapped with backward propagation of earlier stages
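A minimal sketch of this timeline formula as code; the inputs are the per-component predictions from the operator regressors, and all names are illustrative.

```python
def predict_runtime(num_microbatches: int,
                    num_stages: int,
                    stage_fwd: list[float],          # predicted forward time per stage
                    stage_bwd: list[float],          # predicted backward time per stage
                    first_stage_dp_allreduce: float, # DP AllReduce of the first stage
                    stage_update: list[float]        # Optimizer + DP AllGather per stage
                    ) -> float:
    """1F1B timeline: (m - 1 + p) * (MaxFwd + MaxBwd) + first-stage sync + MaxUpdate."""
    steady_state = (num_microbatches - 1 + num_stages) * (max(stage_fwd) + max(stage_bwd))
    return steady_state + first_stage_dp_allreduce + max(stage_update)
```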
Pipeline partitioning formulas:
First stage encoders = floor((#encoders + 5) / #PP_stages) - 2
Middle stage encoders = floor((#encoders + 5) / #PP_stages)
Last stage encoders = floor((#encoders + 5) / #PP_stages) - 3
Vocabulary size alignment (for memory access efficiency):
divisibility_factor = 128 × num_MP_partitions
vocab_size = floor(original_vocab_size / divisibility_factor) × divisibility_factor
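The partitioning and alignment rules transcribed directly from the formulas above, with illustrative helper names:

```python
def encoders_per_stage(num_encoders: int, pp_stages: int) -> dict:
    """Encoder counts per pipeline stage; the +5 offsets the 5 non-encoder blocks."""
    base = (num_encoders + 5) // pp_stages
    return {"first": base - 2, "middle": base, "last": base - 3}

def aligned_vocab_size(original_vocab_size: int, num_mp_partitions: int) -> int:
    """Align vocabulary size to 128 x MP partitions for memory access efficiency."""
    factor = 128 * num_mp_partitions
    return (original_vocab_size // factor) * factor

# Example: GPT-20B with 44 encoders over 4 pipeline stages
# encoders_per_stage(44, 4) -> {"first": 10, "middle": 12, "last": 9}
```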
4. Evaluation Setup
4.1 Target Models
| Config | GPT-20B | LLaMA-13B | Llemma-7B |
|---|---|---|---|
| Hidden dim | 6144 | 5120 | 4096 |
| Sequence length | 2048 | 2048 | 4096 |
| Attention heads | 64 | 40 | 32 |
| Encoder layers | 44 | 40 | 32 |
| Fused Softmax | Yes | Yes | No |
| Flash Attention | No | No | Yes |
| Norm type | Basic LayerNorm | RMSNorm | RMSNorm |
| Micro-batch size | 4 | 4 | 4 |
| Iters/Update | 16 | 16 | 8 |
4.2 HPC Clusters
| Feature | Perlmutter | TACC Vista |
|---|---|---|
| CPU | AMD EPYC 7763 (64 cores) | NVIDIA Grace (72 cores) |
| GPU | A100-SXM4 (40 GB HBM2) | GH200 (96 GB HBM3) |
| GPUs/node | 4 | 1 |
| Intra-node | NVLink 3.0 (600 GB/s) | NVLink-C2C (900 GB/s) |
| Inter-node | Slingshot-10 (4×50 Gb/s) | NDR InfiniBand (400 Gb/s) |
| Scale | Up to 32 nodes (128 A100) | Up to 128 nodes (128 GH200) |
The key architectural difference is that Perlmutter has 4 GPUs per node (enabling intra-node pre-reduction for MP_AllReduce), while Vista has 1 GPU per node (all collectives traverse the inter-node InfiniBand fabric, increasing jitter and prediction difficulty).
5. Key Results with Numbers
5.1 Performance Stability
Perlmutter shows high stability: all configurations exhibit <1% variation between minimum and average training times.
Vista shows significantly higher variability (5.21%–108.3% increase from minimum to average), with the worst cases being GPT-20B(4-8-4) at 70.38% and GPT-20B(8-4-4) at 108.3%. Cause: the single-GPU-per-node design forces MP_AllReduce over inter-node InfiniBand, which is susceptible to congestion and scheduling jitter.
5.2 Component-Level Prediction Errors (Table IX)
The framework uses the minimum batch time as the prediction target (to avoid modeling transient congestion effects).
| Component | Perlmutter range | Vista range |
|---|---|---|
| Encoder Fwd | -13.60% to -1.84% | -14.37% to -2.24% |
| Encoder Bwd | -11.93% to -1.55% | -13.33% to -8.05% |
| Stage Fwd Max | -13.41% to 0.53% | -11.99% to 0.32% |
| Stage Bwd Max | -12.49% to -1.66% | -13.73% to -8.33% |
| DP AllReduce | -31.21% to 5.06% | -49.69% to -6.36% |
| DP AllGather | -1.89% to 37.86% | -52.47% to 18.68% |
| PP P2P | -21.98% to 3.84% | -36.26% to -1.65% |
| MP AllReduce | -4.85% to 10.67% | -7.66% to 2.33% |
Key observations:
- Encoder forward/backward and pipeline stage timings are predicted with good accuracy (mostly within 15%), and these account for 70–95% of total runtime.
- Communication operations (DP AllReduce, DP AllGather, PP P2P) have high individual errors (up to 52%) but contribute <5% of total runtime on most configurations, so their inaccuracy is amortized.
- MP AllReduce is accurate (mostly <5% error) due to high invocation frequency dampening network variability.
- PP P2P on Vista is particularly inaccurate (-32% to -36%), attributed to the single-GPU-per-node architecture causing all P2P to traverse InfiniBand with high jitter.
5.3 End-to-End Prediction Errors
Perlmutter:
- GPT-20B(4-4-8): -8.82%
- GPT-20B(4-8-4): 3.94%
- GPT-20B(8-4-4): -5.87%
- LLaMA-13B(4-8-2): -4.95%
- Llemma-7B(4-2-2): 1.30%
- Average absolute error: 4.98%
Vista:
- GPT-20B(4-4-8): -9.15%
- GPT-20B(4-8-4): -15.16%
- GPT-20B(8-4-4): -8.41%
- LLaMA-13B(4-8-2): -9.02%
- Llemma-7B(4-2-2): -5.18%
- Average absolute error: 9.38%
Among the GPT-20B configurations on Perlmutter, 4-8-4 achieves the lowest error (3.94%), confirming that balanced model parallelism is easiest to predict. Data-parallel-heavy configurations (4-4-8) and deep pipeline configurations (8-4-4) show larger errors due to more complex communication patterns. Vista shows a consistent underestimation trend attributed to unpredictable network jitter on the all-InfiniBand fabric.
5.4 Runtime Breakdown (Figure 3)
Compute dominates: encoder forward + backward passes and pipeline stage execution account for 70–95% of total runtime. Communication operations collectively account for <5–10%, except for MP_AllReduce-heavy configurations (8-way model parallelism on Perlmutter, or all Vista configs where MP_AllReduce is invoked once/twice per encoder pass over InfiniBand).
6. Related Work
6.1 Analytical Communication Models
LogP (Culler 1993), BSP (Valiant 1990), and LogGOPS (Hoefler 2010) parametrize communication with latency, bandwidth, and synchronization costs. Effective for simple topologies but require calibrating 4–7+ parameters and struggle with vendor-specific library behavior. MPI communication optimization work (Thakur et al. 2005) follows this tradition.
6.2 GPU Kernel Analytical Models
IPP model (Hong and Kim 2009, 2010): 8.9% error for single GPU workloads. Boyer et al. (2013): extended with data transfer modeling (8% error). DeLTA (Lym et al. 2019): 7.9% error modeling arithmetic intensity and memory traffic, but requires low-level profiling unavailable without hardware. All of these are single-GPU and do not model distributed communication.
6.3 DNN-Specific Learned Models
Habitat (Yu et al. 2021): runtime-based GPU performance predictor via small-scale execution. DayDream (Zhu et al. 2020): dependency-graph modeling for training time estimation. Both require on-hardware sampling. DTS (Esposito 2022): simulation-based time estimation with substantial profiling cost.
6.4 Distributed LLM Performance Modeling
Kundu et al. (2024) (IISWC): analytical method for distributed LLM training with <10% error across 8–3072 GPUs for 22B–1008B parameter models. Most closely related to this work but their implementation is not open source, preventing direct comparison.
7. Limitations
- Communication stochasticity: Communication operations remain the hardest component to predict accurately. NCCL's runtime algorithm selection, protocol choice, and sensitivity to network congestion and concurrent flows are not explicitly modeled.
- Specialized NCCL algorithms not modeled: CollNet, NVLS, PAT, and other non-default NCCL algorithms are not accounted for. The framework assumes NCCL operates under default configuration choices.
- No dynamic variability: Framework predicts expected (minimum) training time under stable conditions. Congestion events, GPU preemption, and OS jitter are not modeled.
- Framework dependency: Built on GPT-NeoX with DeepSpeed + Megatron-LM; generalization to other frameworks (Megatron-Core, FSDP) requires re-profiling.
- Underestimation bias on unified memory architectures: Vista (GH200) results show a consistent -5% to -15% underestimation trend due to unpredictable InfiniBand jitter that the model cannot capture from its minimum-latency training samples.
- No gradient checkpointing / activation recomputation modeling: Acknowledged as a limitation for memory-optimized training configurations.
8. Section-by-Section Paragraph Summaries
Section I — Introduction
LLMs now require 10^25 FLOPs for training, making pre-training performance prediction critical for HPC resource allocation. Existing methods are either too expensive (learned models with costly sampling) or too approximate (analytical models missing hardware specifics). The paper proposes hierarchical operator decomposition + targeted lightweight sampling as a middle path. Three contributions: operator-level decomposition, hardware-aware regression per operator type, end-to-end pipeline timeline integrating both.
Section II-A — Foundations: GPU Architecture
Modern GPUs have SM-level compute with hierarchical memory. A100/H100 differ in Tensor Core generation, FP8 support, and cache sizes. cuBLAS/cuDNN auto-tune kernel selection to input shape, creating discontinuous performance curves. Mixed precision (FP16, BF16, FP8) affects both compute and memory bandwidth.
Section II-A — Foundations: Transformer Patterns
MHA creates variable GEMM shapes depending on batch/sequence/head parameters. MLP layers are compute-bound. LayerNorm/RMSNorm are memory-bandwidth bound. Flash Attention fuses multiple steps to reduce memory overhead. These heterogeneous bottlenecks make a single performance model insufficient — motivating per-operator regressors.
Section II-A — Foundations: Distributed Training Strategies
DP, MP (tensor), and PP are described. DP requires AllReduce at gradient synchronization; tensor MP requires intra-layer AllReduce; PP requires P2P between stages and introduces pipeline bubbles. All three interact in 3D parallelism. Communication efficiency depends on multi-tier interconnects (NVLink intra-node, InfiniBand/Ethernet inter-node).
Section II-B — Problem Statement
Three challenges formalized. Challenge 1: heterogeneous hardware with nonlinear scaling. Challenge 2: diverse transformer operator patterns. Challenge 3: interplay of parallelism strategies and stochastic communication overhead. Standard analytical scaling laws are insufficient.
Section III-A — Data Collection Methodology
Micro-benchmark protocol: isolation profiling with the PyTorch profiler, 10-iteration warmup, and the mean of the middle 5 of 10 sorted samples to reduce outlier impact. Communication operators benchmarked across GPU/node count ranges. Sampling strategy designed to cover high-impact configurations without exhaustive enumeration.
Section III-B — Regressors
RandomForest and XGBoost selected for handling tabular data, piecewise nonlinearities, and robustness to outliers. 80/20 train/validation split with hyperparameter selection by validation error minimization. Regressors capture auto-tuning discontinuities, memory hierarchy transitions, and NCCL algorithm switching thresholds.
Section III-C — Workload Representation
Feature vectors encode all relevant model and system parameters per operator type (Table I). Communication operators use [entries, nodes, GPUs/node] to capture topology effects. Pipeline stage roles differentiate first/middle/last stages for accurate parameter count and synchronization modeling.
Section III-D — Timeline Modeling
The 1F1B pipeline timeline formula accounts for micro-batch count, pipeline stages, maximum forward/backward time, DP AllReduce cost for first stage, and MaxUpdate (Optimizer + DP AllGather). Gradient synchronization for all but the first stage is overlapped with pipeline backward propagation. Pipeline stage encoder allocation formulas balance workload across stages by offsetting the 5 non-encoder blocks.
Section IV-A — Experimental Setup
Two HPC clusters evaluated: Perlmutter (4 A100s/node, NVLink3, Slingshot-10) and Vista (1 GH200/node, NVLink-C2C, NDR InfiniBand). Three LLMs tested: GPT-20B, LLaMA-13B, Llemma-7B. Multiple 3D parallelism configurations per model.
Section IV-B — Performance Stability
Perlmutter is highly stable (<1% min-to-average variation). Vista shows 5-108% variation, worst for GPT-20B(8-4-4) due to MP_AllReduce traversing InfiniBand exclusively. The framework targets minimum batch time to decouple prediction accuracy from network variability.
Section IV-C — Prediction Errors
Compute-dominated components (70-95% of runtime) are predicted with 1-15% error. Communication components have high individual errors (up to 52%) but low weight in total runtime. MP_AllReduce is accurate (<5% in most cases) due to high invocation frequency. Overall: 4.98% on Perlmutter, 9.38% on Vista. Balanced MP configurations (4-8-4) are easiest to predict; data-parallel-heavy configurations (4-4-8) are hardest.
Section V — Related Work
Surveys LogP/BSP (foundational communication models), bottom-up GPU models (IPP, DeLTA), top-down approaches (GCoM), DNN-specific models (DayDream, Habitat), and the closest work (Kundu et al. 2024 IISWC analytical LLM training model). Positions this paper as combining targeted empirical sampling with hierarchical analytical composition.
Section VI — Conclusion
4.98%/9.38% average prediction errors across two clusters. Communication operations (especially PP P2P on unified memory architectures) remain the hardest to predict and represent future work. Integration with job scheduling systems and emerging hardware architectures identified as next steps.
9. Relevance to DynamICCL
Moderate direct relevance — confirms the core DynamICCL problem.
1. NCCL communication is the unpredictable component. The paper reports up to 52% component-level prediction error for DP_AllReduce and PP_P2P, explicitly because NCCL's runtime algorithm selection (ring vs. tree vs. CollNet), protocol choices (LL/LL128/Simple), and sensitivity to network congestion introduce discontinuities that regression cannot capture. This is precisely the parameter space DynamICCL optimizes over. The paper's failure mode (predicting NCCL behavior) is DynamICCL's value proposition (controlling NCCL behavior).
2. Feature space for DynamICCL state design. The communication operator feature vectors in Table I — specifically [|entries|, |nodes|, |GPUs/nodes|] for DP_AllReduce/AllGather — provide a principled starting point for Agent-2's state vector. Message size, topology (node count, GPUs/node), and collective type are natural features (see the sketch after this list).
3. Parallelism-induced communication diversity. The paper demonstrates that MP degree significantly affects AllReduce communication volume and frequency: MP_AllReduce is invoked 1-2 times per encoder forward pass and 2 times per backward pass. DynamICCL's NCCL parameter selection must adapt to these different workload regimes, reinforcing the need for collective-type-aware and message-size-aware RL policy.
4. Vista GH200 + InfiniBand architecture parallel. GH200's single-GPU-per-node design forces all collectives onto inter-node InfiniBand, producing high variability (up to 108.3% min-to-average variation). DynamICCL operating on similar infrastructure faces the same jitter challenge, directly motivating Agent-1's (LSTM+CUSUM) congestion detection role: the agent needs to distinguish normal InfiniBand jitter from genuine sustained congestion.
5. CPU-only prediction framework. The framework runs entirely on CPUs, which is also a design goal for DynamICCL's tuner plugin (making parameter selection decisions at collective call time without GPU overhead). The framework's fast inference speed (tree-based regressors) aligns with the latency requirements of a real-time NCCL tuner.
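A purely hypothetical sketch of such a per-collective state vector, inspired by the Table I communication features; the field names are invented here and are not part of DynamICCL or the paper.

```python
from dataclasses import dataclass

@dataclass
class CollectiveState:
    """Hypothetical per-collective state for a tuning agent."""
    collective: str        # "allreduce", "allgather", "p2p", ...
    message_bytes: int     # |entries| x dtype size
    num_nodes: int         # |nodes|
    gpus_per_node: int     # |GPUs/nodes|
```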
One important gap: This paper assumes NCCL default configuration and models its output as a black box. DynamICCL explicitly replaces the black box with a controllable agent. The two works are therefore complementary: this paper shows why NCCL defaults are hard to predict; DynamICCL shows how to actively improve on them.