Brief Summary: Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Citation: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang. Case Western Reserve University / Rutgers University. arXiv:2509.22832v1, September 26, 2025.
Problem
Predicting end-to-end training time for multi-billion-parameter LLMs distributed across hundreds of GPUs is challenging due to: (1) heterogeneous GPU architectures (A100 vs. GH200 exhibit non-linear performance scaling), (2) diverse computational patterns within transformer components (compute-bound GEMM vs. memory-bound normalization vs. attention), and (3) intricate interactions between computation and communication under 3D parallelism (data, model, pipeline) on multi-tier interconnects. Learned black-box models require prohibitively expensive sampling (a single ~60-second sample of a 20B model on 128 A100s costs roughly 2 GPU-hours). Purely analytical models cannot capture the hardware-specific non-linearities introduced by vendor auto-tuned libraries (cuBLAS, cuDNN) and NCCL.
Core Insight
LLM training runtime can be accurately predicted by operator-level decomposition: decompose the full LLM computational graph into fundamental operators (embedding, LayerNorm, GEMM, attention, communication primitives), build lightweight hardware-aware regression models (RandomForest/XGBoost) per operator type using minimal empirical sampling, and hierarchically aggregate operator predictions into end-to-end runtime using an analytical pipeline timeline model. This hybrid bottom-up approach captures hardware non-linearities through targeted sampling while maintaining analytical tractability for composing predictions.
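A minimal illustrative sketch of this bottom-up idea, assuming synthetic timing samples and made-up feature layouts (the paper trains on profiled measurements and the Table I features), showing per-operator regressors composed into a layer-level estimate:

```python
# Sketch only: feature layouts and synthetic timings are illustrative assumptions,
# not the paper's profiled data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_regressor(n_samples, n_features, synth_time_fn):
    """Fit one per-operator regressor on synthetic (features -> measured_ms) samples."""
    X = rng.uniform(1, 64, size=(n_samples, n_features))
    y = np.array([synth_time_fn(x) for x in X])   # stand-in for profiler measurements
    return RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One regressor per operator class; features here are [batch, seq_len/64, hidden/1024].
gemm_model = make_regressor(200, 3, lambda x: 0.02 * x[0] * x[1] * x[2])  # compute-bound
norm_model = make_regressor(200, 3, lambda x: 0.001 * x[0] * x[1])        # memory-bound

def predict_layer_ms(batch, seq_k, hidden_k):
    """Compose operator predictions bottom-up into one transformer-layer estimate."""
    feats = np.array([[batch, seq_k, hidden_k]])
    # Roughly 4 GEMMs (QKV, attention output, 2x MLP) and 2 LayerNorms per encoder layer.
    return 4 * gemm_model.predict(feats)[0] + 2 * norm_model.predict(feats)[0]

print(f"predicted layer time: {predict_layer_ms(8, 32, 8):.3f} ms")
```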
Method
Three-phase approach:
- Performance data collection: Micro-benchmark each fundamental operator in isolation (PyTorch profiler, 1 µs resolution; 10 warmup iterations + 10 measurement iterations). Collect samples across ranges of batch size, sequence length, hidden dimension, model-parallel degree, and GPU/node count.
- Per-operator regressors: Train RandomForest or XGBoost models per operator class (compute-bound, memory-bound, communication). Feature vectors encode a workload representation for each operator type (Table I). The regressors capture non-linear GPU performance patterns arising from auto-tuning, the memory hierarchy, and NCCL algorithm discontinuities.
- End-to-end timeline model: Integrate operator predictions into a pipeline-parallelism timeline. The key formula accounts for 1F1B pipeline stages, DP_AllReduce synchronization, optimizer updates, and MP_AllReduce in the parallel cross-entropy loss, all with overlap modeling (a hedged sketch of such a timeline follows this list).
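The paper's exact timeline formula is not reproduced here; the following is a hedged sketch of a generic 1F1B-style estimate (fill/drain plus steady state, with partially overlapped DP_AllReduce), where all constants and the overlap fraction are assumptions:

```python
# Generic 1F1B approximation, not the paper's formula:
#   t_step ≈ (m + p - 1) * (t_fwd + t_bwd) + exposed_comm + t_optimizer
def pipeline_step_ms(t_fwd, t_bwd, micro_batches, pp_stages,
                     t_dp_allreduce, t_optimizer, overlap_fraction=0.8):
    """Estimate one training-step time (ms) under 1F1B pipeline scheduling."""
    steady = (micro_batches + pp_stages - 1) * (t_fwd + t_bwd)   # fill/drain + steady state
    exposed_comm = (1.0 - overlap_fraction) * t_dp_allreduce     # AllReduce not hidden by backward
    return steady + exposed_comm + t_optimizer

# Example: per-microbatch fwd/bwd times would come from the operator regressors above.
print(pipeline_step_ms(t_fwd=12.0, t_bwd=24.0, micro_batches=16, pp_stages=4,
                       t_dp_allreduce=30.0, t_optimizer=5.0))
```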
Key Results
Evaluated on GPT-20B, LLaMA-13B, and Llemma-7B on two HPC clusters:
- Perlmutter (A100-SXM4, up to 128 GPUs, NVLink3 intra-node, Slingshot-10 inter-node): Average prediction error 4.98%.
- Vista (GH200, up to 128 GPUs, NVLink-C2C intra-node, InfiniBand 400 Gb/s inter-node): Average prediction error 9.38%.
- Framework runs entirely on CPUs — no GPU required for prediction.
- Communication operations (DP_AllReduce, AllGather, PP_P2P) show higher individual errors (up to 50% for some configurations), but they contribute <5% of total runtime, so even a 50% component-level error shifts the end-to-end estimate by only a few percent and overall accuracy remains within 5-15%.
- Smaller models (Llemma-7B, 16 GPUs): 1.3% and -5.18% error on Perlmutter and Vista, respectively.
- Underestimation trend on Vista (~-5% to -15%) attributed to GH200's single-GPU-per-node design forcing all collectives onto inter-node InfiniBand with higher network jitter.
Limitations
- Communication operations remain the hardest to predict (up to 50% component-level error) due to inherent stochasticity of real-world network congestion and NCCL's runtime decisions.
- PP_P2P operations on unified memory (GH200) are particularly difficult to model accurately.
- CollNet, NVLS, and other specialized NCCL algorithms are not modeled explicitly.
- Does not address dynamic runtime variability (network congestion, GPU contention) — predicts expected performance under stable conditions.
- Framework does not model NCCL configuration tuning (algorithm/protocol/nChannels selection); assumes NCCL defaults.
Relevance to DynamICCL
Moderate direct relevance. This paper models NCCL communication as a black box (using empirical regression). DynamICCL's goal is to actively optimize those same NCCL communication parameters. The connections are:
Communication as the hardest modeling component: The paper reports up to 50% prediction error for DP_AllReduce and PP_P2P operations, precisely because NCCL's runtime configuration choices (algorithm, protocol, nChannels) introduce performance discontinuities that are hard to capture with regression alone. This is exactly the optimization space DynamICCL operates in, and it confirms that NCCL behavior is unpredictable enough to warrant active RL-based tuning rather than purely predictive modeling.
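For concreteness, these are the knobs in question, exposed as real NCCL environment variables; the size-based selection rule below is an illustrative stand-in, not DynamICCL's actual policy:

```python
# NCCL_ALGO / NCCL_PROTO / NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS are real NCCL knobs;
# the heuristic thresholds here are hypothetical.
import os

def apply_nccl_config(algo: str, proto: str, nchannels: int) -> None:
    """Pin NCCL's algorithm/protocol/channel-count choices via environment variables.
    These must be set before the NCCL communicator is created
    (i.e. before torch.distributed.init_process_group)."""
    os.environ["NCCL_ALGO"] = algo            # e.g. "Ring" or "Tree"
    os.environ["NCCL_PROTO"] = proto          # e.g. "LL", "LL128", "Simple"
    os.environ["NCCL_MIN_NCHANNELS"] = str(nchannels)
    os.environ["NCCL_MAX_NCHANNELS"] = str(nchannels)

def pick_config(message_bytes: int):
    """Naive message-size heuristic (hypothetical thresholds, for illustration only)."""
    if message_bytes < 1 << 20:               # small message: latency-bound
        return ("Tree", "LL128", 4)
    return ("Ring", "Simple", 16)             # large message: bandwidth-bound

apply_nccl_config(*pick_config(256 << 20))    # e.g. a 256 MiB gradient bucket
```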
Feature space for state design: The paper's operator workload representations (Table I) — specifically the communication operator features [|entries|, |nodes|, |GPUs/node|] for DP_AllReduce — provide a principled basis for DynamICCL's state space. Message size (entries), node count, and GPUs-per-node are natural features for Agent-2's state vector.
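A hypothetical sketch of how those three features could seed Agent-2's state vector; the field names, normalization constants, and class name are assumptions, not from either work:

```python
# Hypothetical state encoding built from the paper's communication features
# [|entries|, |nodes|, |GPUs/node|]; normalization constants are assumed.
from dataclasses import dataclass
import numpy as np

@dataclass
class CommState:
    entries: int        # |entries|: message size in elements (paper's DP_AllReduce feature)
    nodes: int          # |nodes|
    gpus_per_node: int  # |GPUs/node|

    def to_vector(self) -> np.ndarray:
        """Log-scale the message size and normalize topology counts for an RL policy."""
        return np.array([
            np.log2(max(self.entries, 1)) / 32.0,   # roughly 0..1 for up to 2^32 elements
            self.nodes / 64.0,
            self.gpus_per_node / 8.0,
        ], dtype=np.float32)

print(CommState(entries=5 * 10**9, nodes=32, gpus_per_node=4).to_vector())
```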
Parallelism configuration context: The paper validates that the model-parallelism degree (mp) significantly affects AllReduce communication volume (MP_AllReduce is invoked once or twice per encoder-layer pass). DynamICCL's NCCL configuration must adapt to these different communication volumes, further supporting the need for message-size-aware parameter selection.
Vista cluster (GH200 + InfiniBand 400 Gb/s): The paper identifies that GH200's single-GPU-per-node design forces all collectives onto inter-node InfiniBand, increasing jitter and prediction difficulty. DynamICCL operating on similar HPC infrastructure would face the same jitter challenge — motivating Agent-1's (LSTM+CUSUM) congestion detection role.
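A minimal sketch of the detection step such a role implies, using a one-sided CUSUM over observed collective latencies; Agent-1's LSTM baseline predictor is omitted and replaced by a fixed expected latency, and all thresholds are assumed:

```python
# One-sided CUSUM for flagging sustained AllReduce latency inflation (jitter/congestion).
# The fixed baseline stands in for an LSTM-predicted latency; drift/threshold are assumptions.
class CusumDetector:
    def __init__(self, baseline_ms: float, drift: float = 0.5, threshold: float = 5.0):
        self.baseline = baseline_ms   # expected latency (would come from the LSTM predictor)
        self.drift = drift            # slack: ignore deviations smaller than this (ms)
        self.threshold = threshold    # alarm when accumulated excess exceeds this (ms)
        self.s = 0.0

    def update(self, observed_ms: float) -> bool:
        """Accumulate positive deviations; return True when congestion is suspected."""
        self.s = max(0.0, self.s + (observed_ms - self.baseline - self.drift))
        if self.s > self.threshold:
            self.s = 0.0              # reset after raising an alarm
            return True
        return False

det = CusumDetector(baseline_ms=30.0)
samples = [30.2, 29.8, 31.0, 35.5, 36.2, 37.0, 38.5]   # synthetic latencies with a jitter burst
print([det.update(x) for x in samples])
```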