Brief Summary: Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Citation: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang. Case Western Reserve University / Rutgers University. arXiv:2509.22832v1, September 26, 2025.
Problem
Predicting end-to-end training time for multi-billion-parameter LLMs distributed across hundreds of GPUs is challenging due to: (1) heterogeneous GPU architectures (A100 vs. GH200 exhibit non-linear performance scaling), (2) diverse computational patterns within transformer components (compute-bound GEMM vs. memory-bound normalization vs. attention), and (3) intricate interactions between computation and communication under 3D parallelism (data, model, pipeline) on multi-tier interconnects. Learned black-box models require prohibitively expensive sampling (a single ~60-second sample of a 20B model on 128 A100s costs roughly 2 GPU-hours). Purely analytical models cannot capture the hardware-specific non-linearities introduced by vendor auto-tuned libraries (cuBLAS, cuDNN) and NCCL.
Core Insight
LLM training runtime can be accurately predicted by operator-level decomposition: decompose the full LLM computational graph into fundamental operators (embedding, LayerNorm, GEMM, attention, communication primitives), build lightweight hardware-aware regression models (RandomForest/XGBoost) per operator type using minimal empirical sampling, and hierarchically aggregate operator predictions into end-to-end runtime using an analytical pipeline timeline model. This hybrid bottom-up approach captures hardware non-linearities through targeted sampling while maintaining analytical tractability for composing predictions.
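A minimal illustrative sketch of this bottom-up idea, assuming synthetic timing samples and made-up feature layouts (the paper trains on profiled measurements and the Table I features), showing per-operator regressors composed into a layer-level estimate:

```python
# Sketch only: feature layouts and synthetic timings are illustrative assumptions,
# not the paper's profiled data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_regressor(n_samples, n_features, synth_time_fn):
    """Fit one per-operator regressor on synthetic (features -> measured_ms) samples."""
    X = rng.uniform(1, 64, size=(n_samples, n_features))
    y = np.array([synth_time_fn(x) for x in X])   # stand-in for profiler measurements
    return RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One regressor per operator class; features here are [batch, seq_len/64, hidden/1024].
gemm_model = make_regressor(200, 3, lambda x: 0.02 * x[0] * x[1] * x[2])  # compute-bound
norm_model = make_regressor(200, 3, lambda x: 0.001 * x[0] * x[1])        # memory-bound

def predict_layer_ms(batch, seq_k, hidden_k):
    """Compose operator predictions bottom-up into one transformer-layer estimate."""
    feats = np.array([[batch, seq_k, hidden_k]])
    # Roughly 4 GEMMs (QKV, attention output, 2x MLP) and 2 LayerNorms per encoder layer.
    return 4 * gemm_model.predict(feats)[0] + 2 * norm_model.predict(feats)[0]

print(f"predicted layer time: {predict_layer_ms(8, 32, 8):.3f} ms")
```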
Method
Three-phase approach:
- Performance data collection: Micro-benchmark each fundamental operator in isolation (PyTorch profiler, 1 µs resolution; 10 warmup iterations + 10 measurement iterations). Collect samples across ranges of batch size, sequence length, hidden dimension, model-parallel degree, and GPU/node count.
- Per-operator regressors: Train RandomForest or XGBoost models per operator class (compute-bound, memory-bound, communication). Feature vectors encode a workload representation for each operator type (Table I). The regressors capture non-linear GPU performance patterns arising from auto-tuning, the memory hierarchy, and NCCL algorithm discontinuities.
- End-to-end timeline model: Integrate operator predictions into a pipeline-parallelism timeline. The key formula accounts for 1F1B pipeline stages, DP_AllReduce synchronization, optimizer updates, and MP_AllReduce in the parallel cross-entropy loss, all with overlap modeling (a hedged sketch of such a timeline follows this list).
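The paper's exact timeline formula is not reproduced here; the following is a hedged sketch of a generic 1F1B-style estimate (fill/drain plus steady state, with partially overlapped DP_AllReduce), where all constants and the overlap fraction are assumptions:

```python
# Generic 1F1B approximation, not the paper's formula:
#   t_step ≈ (m + p - 1) * (t_fwd + t_bwd) + exposed_comm + t_optimizer
def pipeline_step_ms(t_fwd, t_bwd, micro_batches, pp_stages,
                     t_dp_allreduce, t_optimizer, overlap_fraction=0.8):
    """Estimate one training-step time (ms) under 1F1B pipeline scheduling."""
    steady = (micro_batches + pp_stages - 1) * (t_fwd + t_bwd)   # fill/drain + steady state
    exposed_comm = (1.0 - overlap_fraction) * t_dp_allreduce     # AllReduce not hidden by backward
    return steady + exposed_comm + t_optimizer

# Example: per-microbatch fwd/bwd times would come from the operator regressors above.
print(pipeline_step_ms(t_fwd=12.0, t_bwd=24.0, micro_batches=16, pp_stages=4,
                       t_dp_allreduce=30.0, t_optimizer=5.0))
```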
Key Results
Evaluated on GPT-20B, LLaMA-13B, and Llemma-7B on two HPC clusters:
- Perlmutter (A100-SXM4, up to 128 GPUs, NVLink3 intra-node, Slingshot-10 inter-node): Average prediction error 4.98%.
- Vista (GH200, up to 128 GPUs, NVLink-C2C intra-node, InfiniBand 400 Gb/s inter-node): Average prediction error 9.38%.
- Framework runs entirely on CPUs — no GPU required for prediction.
- Communication operations (DP_AllReduce, AllGather, PP_P2P) show higher individual errors (up to 50% for some configurations), but they contribute <5% of total runtime, so even a 50% component-level error shifts the end-to-end estimate by only a few percent and overall accuracy remains within 5-15%.
- Smaller models (Llemma-7B, 16 GPUs): 1.3% and -5.18% error on Perlmutter and Vista, respectively.
- Underestimation trend on Vista (~-5% to -15%) attributed to GH200's single-GPU-per-node design forcing all collectives onto inter-node InfiniBand with higher network jitter.
Limitations
- Communication operations remain the hardest to predict (up to 50% component-level error) due to inherent stochasticity of real-world network congestion and NCCL's runtime decisions.
- PP_P2P operations on unified memory (GH200) are particularly difficult to model accurately.
- CollNet, NVLS, and other specialized NCCL algorithms are not modeled explicitly.
- Does not address dynamic runtime variability (network congestion, GPU contention) — predicts expected performance under stable conditions.
- Framework does not model NCCL configuration tuning (algorithm/protocol/nChannels selection); assumes NCCL defaults.
Relevance to DynamICCL
Moderate direct relevance. This paper models NCCL communication as a black box (using empirical regression). DynamICCL's goal is to actively optimize those same NCCL communication parameters. The connections are:
Communication as the hardest modeling component: The paper reports up to 50% prediction error for DP_AllReduce and PP_P2P operations, precisely because NCCL's runtime configuration choices (algorithm, protocol, nChannels) introduce performance discontinuities that are hard to capture with regression alone. This is exactly the optimization space DynamICCL operates in, and it confirms that NCCL behavior is unpredictable enough to warrant active RL-based tuning rather than purely predictive modeling.
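For concreteness, these are the knobs in question, exposed as real NCCL environment variables; the size-based selection rule below is an illustrative stand-in, not DynamICCL's actual policy:

```python
# NCCL_ALGO / NCCL_PROTO / NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS are real NCCL knobs;
# the heuristic thresholds here are hypothetical.
import os

def apply_nccl_config(algo: str, proto: str, nchannels: int) -> None:
    """Pin NCCL's algorithm/protocol/channel-count choices via environment variables.
    These must be set before the NCCL communicator is created
    (i.e. before torch.distributed.init_process_group)."""
    os.environ["NCCL_ALGO"] = algo            # e.g. "Ring" or "Tree"
    os.environ["NCCL_PROTO"] = proto          # e.g. "LL", "LL128", "Simple"
    os.environ["NCCL_MIN_NCHANNELS"] = str(nchannels)
    os.environ["NCCL_MAX_NCHANNELS"] = str(nchannels)

def pick_config(message_bytes: int):
    """Naive message-size heuristic (hypothetical thresholds, for illustration only)."""
    if message_bytes < 1 << 20:               # small message: latency-bound
        return ("Tree", "LL128", 4)
    return ("Ring", "Simple", 16)             # large message: bandwidth-bound

apply_nccl_config(*pick_config(256 << 20))    # e.g. a 256 MiB gradient bucket
```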
Feature space for state design: The paper's operator workload representations (Table I) — specifically the communication operator features [|entries|, |nodes|, |GPUs/node|] for DP_AllReduce — provide a principled basis for DynamICCL's state space. Message size (entries), node count, and GPUs-per-node are natural features for Agent-2's state vector.
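A hypothetical sketch of how those three features could seed Agent-2's state vector; the field names, normalization constants, and class name are assumptions, not from either work:

```python
# Hypothetical state encoding built from the paper's communication features
# [|entries|, |nodes|, |GPUs/node|]; normalization constants are assumed.
from dataclasses import dataclass
import numpy as np

@dataclass
class CommState:
    entries: int        # |entries|: message size in elements (paper's DP_AllReduce feature)
    nodes: int          # |nodes|
    gpus_per_node: int  # |GPUs/node|

    def to_vector(self) -> np.ndarray:
        """Log-scale the message size and normalize topology counts for an RL policy."""
        return np.array([
            np.log2(max(self.entries, 1)) / 32.0,   # roughly 0..1 for up to 2^32 elements
            self.nodes / 64.0,
            self.gpus_per_node / 8.0,
        ], dtype=np.float32)

print(CommState(entries=5 * 10**9, nodes=32, gpus_per_node=4).to_vector())
```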
Parallelism configuration context: The paper validates that the model-parallelism degree (mp) significantly affects AllReduce communication volume (MP_AllReduce is invoked once or twice per encoder-layer pass). DynamICCL's NCCL configuration must adapt to these different communication volumes, further supporting the need for message-size-aware parameter selection.
Vista cluster (GH200 + InfiniBand 400 Gb/s): The paper identifies that GH200's single-GPU-per-node design forces all collectives onto inter-node InfiniBand, increasing jitter and prediction difficulty. DynamICCL operating on similar HPC infrastructure would face the same jitter challenge — motivating Agent-1's (LSTM+CUSUM) congestion detection role.
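A minimal sketch of the detection step such a role implies, using a one-sided CUSUM over observed collective latencies; Agent-1's LSTM baseline predictor is omitted and replaced by a fixed expected latency, and all thresholds are assumed:

```python
# One-sided CUSUM for flagging sustained AllReduce latency inflation (jitter/congestion).
# The fixed baseline stands in for an LSTM-predicted latency; drift/threshold are assumptions.
class CusumDetector:
    def __init__(self, baseline_ms: float, drift: float = 0.5, threshold: float = 5.0):
        self.baseline = baseline_ms   # expected latency (would come from the LSTM predictor)
        self.drift = drift            # slack: ignore deviations smaller than this (ms)
        self.threshold = threshold    # alarm when accumulated excess exceeds this (ms)
        self.s = 0.0

    def update(self, observed_ms: float) -> bool:
        """Accumulate positive deviations; return True when congestion is suspected."""
        self.s = max(0.0, self.s + (observed_ms - self.baseline - self.drift))
        if self.s > self.threshold:
            self.s = 0.0              # reset after raising an alarm
            return True
        return False

det = CusumDetector(baseline_ms=30.0)
samples = [30.2, 29.8, 31.0, 35.5, 36.2, 37.0, 38.5]   # synthetic latencies with a jitter burst
print([det.update(x) for x in samples])
```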