Brief Summary: Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Citation: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang. Case Western Reserve University / Rutgers University. arXiv:2509.22832v1, September 26, 2025.


Problem

Predicting end-to-end training time for multi-billion-parameter LLMs distributed across hundreds of GPUs is challenging due to: (1) heterogeneous GPU architectures (A100 vs. GH200 cause non-linear performance scaling), (2) complex computational patterns in transformer components (compute-bound GEMM vs. memory-bound normalization vs. attention), and (3) intricate interactions between computation and communication under 3D parallelism (data, model, pipeline) over multi-tier interconnects. Learned black-box models require prohibitively expensive sampling (a single 60-second sample of a 20B model on 128 A100s costs about 2 node-hours). Purely analytical models cannot capture hardware-specific non-linearities introduced by vendor auto-tuning (cuBLAS, cuDNN) and NCCL.

Core Insight

LLM training runtime can be accurately predicted by operator-level decomposition: decompose the full LLM computational graph into fundamental operators (embedding, LayerNorm, GEMM, attention, communication primitives), build lightweight hardware-aware regression models (RandomForest/XGBoost) per operator type using minimal empirical sampling, and hierarchically aggregate operator predictions into end-to-end runtime using an analytical pipeline timeline model. This hybrid bottom-up approach captures hardware non-linearities through targeted sampling while maintaining analytical tractability for composing predictions.
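The hybrid bottom-up idea can be sketched in a few lines. This is a toy illustration, not the paper's code: a k-NN lookup over synthetic samples stands in for the per-operator RandomForest/XGBoost regressors, all numbers are fabricated, and `predict_op_time` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic micro-benchmark table for one operator type (e.g. GEMM):
# features = [batch, seq_len, hidden_dim]; target = measured runtime in us.
X = rng.integers(1, 64, size=(200, 3)).astype(float)
y = X.prod(axis=1) * 1e-3 + rng.normal(0.0, 0.1, 200)  # toy cost model, not real data

def predict_op_time(workload, X=X, y=y, k=5):
    """k-NN lookup as a lightweight stand-in for the per-operator regressor."""
    d = np.linalg.norm(X - np.asarray(workload, dtype=float), axis=1)
    return float(y[np.argsort(d)[:k]].mean())

# Hierarchical aggregation: a layer's predicted time is the sum of the
# predictions for the operators it contains (only one operator type shown here).
layer_time = predict_op_time([8, 32, 48]) + predict_op_time([8, 32, 96])
```

The point of the design is visible even in this sketch: the empirical model absorbs hardware non-linearities per operator, while the composition into a layer (and, in the paper, into a full pipeline timeline) stays purely analytical.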

Method

Three-phase approach:

  1. Performance data collection: Micro-benchmark each fundamental operator in isolation (PyTorch profiler, 1 µs resolution, 10 warmup iterations + 10 measurement iterations). Samples are collected across ranges of batch size, sequence length, hidden dimension, model-parallel degree, and GPU/node count.
  2. Per-operator regressors: Train RandomForest or XGBoost models per operator class (compute-bound, memory-bound, communication). Feature vectors encode workload representation for each operator type (Table I). Regressors capture non-linear GPU performance patterns from auto-tuning, memory hierarchy, and NCCL algorithm discontinuities.
  3. End-to-end timeline model: Integrate operator predictions into a pipeline-parallelism timeline. The timeline formula accounts for 1F1B pipeline stages, DP_AllReduce synchronization, optimizer updates, and MP_AllReduce in the parallel cross-entropy loss, with compute/communication overlap modeled throughout.
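A minimal timeline sketch of phase 3, assuming a standard 1F1B schedule with (m + p − 1) forward/backward slots per iteration; the function name, argument names, and the `overlap_frac` knob are illustrative, not taken from the paper:

```python
def pipeline_iteration_time(t_fwd, t_bwd, n_microbatches, n_stages,
                            t_dp_allreduce, t_optimizer, overlap_frac=0.0):
    """Hedged sketch of a 1F1B pipeline timeline model.

    Steady-state 1F1B occupies (m + p - 1) forward+backward slots per
    iteration; gradient AllReduce is partially overlapped with compute,
    and the optimizer step runs after synchronization.
    """
    compute = (n_microbatches + n_stages - 1) * (t_fwd + t_bwd)
    exposed_comm = t_dp_allreduce * (1.0 - overlap_frac)  # non-overlapped part
    return compute + exposed_comm + t_optimizer

# Example: 8 microbatches, 4 stages, 60% of the AllReduce hidden by overlap.
t = pipeline_iteration_time(2.0, 4.0, 8, 4, 5.0, 1.0, overlap_frac=0.6)
```

In the paper the per-slot times (`t_fwd`, `t_bwd`) and the communication terms would themselves come from the per-operator regressors, so the analytical formula composes learned predictions rather than first-principles FLOP counts.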

Key Results

Evaluated on GPT-20B, LLaMA-13B, and Llemma-7B across two HPC clusters.

Limitations


Relevance to DynamICCL

Moderate direct relevance. This paper models NCCL communication as a black box (using empirical regression). DynamICCL's goal is to actively optimize those same NCCL communication parameters. The connections are:

  1. Communication as the hardest modeling component: The paper reports up to 50% prediction error for DP_AllReduce and PP_P2P operations, precisely because NCCL's runtime configuration choices (algorithm, protocol, nChannels) introduce discontinuities that neither regression nor analytical models capture reliably. This is the exact optimization space DynamICCL operates in, and it confirms that NCCL behavior is hard enough to predict to warrant active RL-based tuning rather than offline predictive modeling.

  2. Feature space for state design: The paper's operator workload representations (Table I) — specifically the communication operator features [|entries|, |nodes|, |GPUs/node|] for DP_AllReduce — provide a principled basis for DynamICCL's state space. Message size (entries), node count, and GPUs-per-node are natural features for Agent-2's state vector.

  3. Parallelism configuration context: The paper validates that the model-parallelism degree (mp) significantly affects AllReduce communication volume (MP_AllReduce is invoked once or twice per encoder pass). DynamICCL's NCCL configuration must adapt to these differing communication volumes, further supporting the need for message-size-aware parameter selection.

  4. Vista cluster (GH200 + InfiniBand 400 Gb/s): The paper identifies that GH200's single-GPU-per-node design forces all collectives onto inter-node InfiniBand, increasing jitter and prediction difficulty. DynamICCL operating on similar HPC infrastructure would face the same jitter challenge — motivating Agent-1's (LSTM+CUSUM) congestion detection role.
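The communication-operator features from Table I translate directly into an RL state vector. The sketch below is hypothetical: the class and field names are mine, not from the paper or from DynamICCL.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommState:
    """Candidate state features for NCCL tuning, mirroring the paper's
    communication-operator workload representation [|entries|, |nodes|, |GPUs/node|]."""
    entries: int        # message size of the collective, in elements
    nodes: int          # number of participating nodes
    gpus_per_node: int  # GPUs per node (1 on GH200-class systems like Vista)

    def as_vector(self):
        # Flat float vector, ready to feed into an agent's policy network.
        return [float(self.entries), float(self.nodes), float(self.gpus_per_node)]

# Example: a DP_AllReduce over 20M gradient elements on 16 single-GPU GH200 nodes.
state = CommState(entries=20_000_000, nodes=16, gpus_per_node=1)
```

Because these same three features drive the paper's AllReduce regressor, a policy conditioned on them would be choosing NCCL parameters along exactly the axes where the paper observed the largest prediction discontinuities.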