Architecture & Design Analysis

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Source: Zhang, Zheng, Ganguly et al., Case Western Reserve / Rutgers, arXiv:2509.22832v1, Sep 2025


1. System Overview Block Diagram

┌──────────────────────────────────────────────────────────────────────┐
│          LLM Performance Prediction Framework (Bottom-Up)            │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                       Inputs                                 │   │
│  │  ┌──────────────────┐ ┌──────────────────┐ ┌─────────────┐  │   │
│  │  │  LLM Model Spec  │ │  GPU Architecture│ │  Distributed│  │   │
│  │  │  (d, l, h, mp,   │ │  (A100/H100/     │ │  Strategy   │  │   │
│  │  │   b, encoders,   │ │   GH200; NVLink/ │ │  (DP×MP×PP) │  │   │
│  │  │   vocab size v)  │ │   IB; GPUs/node) │ │             │  │   │
│  │  └────────┬─────────┘ └────────┬─────────┘ └──────┬──────┘  │   │
│  └───────────┼────────────────────┼──────────────────┼──────────┘   │
│              └────────────────────┴──────────────────┘               │
│                                   │                                  │
│                                   ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │         Phase 1: Operator-Level Decomposition                │   │
│  │                                                              │   │
│  │  Decompose transformer into fundamental operators:           │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │   │
│  │  │  Compute ops │ │  Memory ops  │ │  Communication ops   │ │   │
│  │  │  Linear1/2/3 │ │  LayerNorm   │ │  MP_AllReduce        │ │   │
│  │  │  Attention   │ │  RMSNorm     │ │  DP_AllReduce        │ │   │
│  │  │  Flash Attn  │ │  Embedding   │ │  DP_AllGather        │ │   │
│  │  │  Softmax     │ │  Activation  │ │  PP_P2P              │ │   │
│  │  └──────────────┘ └──────────────┘ └──────────────────────┘ │   │
│  └─────────────────────────────┬────────────────────────────────┘   │
│                                │                                     │
│                                ▼                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │         Phase 2: Per-Operator Regressor Fitting              │   │
│  │                                                              │   │
│  │  ┌───────────────────────────────────────────────────────┐  │   │
│  │  │  Micro-benchmark data collection (PyTorch profiler    │  │   │
│  │  │  at 1µs resolution, 10 warmup + 10 measurement iters) │  │   │
│  │  └──────────────────────────┬────────────────────────────┘  │   │
│  │                             │ per-operator (config, latency) │   │
│  │                             ▼                                │   │
│  │  RandomForest / XGBoost regressors (one per operator type)   │   │
│  │  Input: workload representation vector (Table I)             │   │
│  │  Output: predicted operator latency                          │   │
│  └─────────────────────────────┬────────────────────────────────┘   │
│                                │                                     │
│                                ▼                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │         Phase 3: End-to-End Timeline Integration             │   │
│  │                                                              │   │
│  │  Pipeline timeline model:                                    │   │
│  │  Runtime = (#MicroBatches-1 + #PPstages)                     │   │
│  │          × (Max_Fwd + Max_Bwd)                               │   │
│  │          + First_Stage_GradSync + Max_Update                 │   │
│  │                                                              │   │
│  │  Accounts for: PP bubble, DP gradient overlap,               │   │
│  │  MP communication, optimizer step                            │   │
│  └─────────────────────────────┬────────────────────────────────┘   │
│                                │                                     │
│                                ▼                                     │
│                   ┌────────────────────────┐                        │
│                   │  Runtime Prediction    │                        │
│                   │  (seconds per batch)   │                        │
│                   │  Error: 4.98% (A100)   │                        │
│                   │         9.38% (GH200)  │                        │
│                   └────────────────────────┘                        │
└──────────────────────────────────────────────────────────────────────┘
▲ Fig 1: Bottom-up performance modeling — decompose LLM into
         operators, fit per-operator regressors, integrate into
         pipeline timeline for end-to-end runtime prediction

Interpretation. The bottom-up decomposition design makes the framework architecture-portable: adding support for a new GPU (e.g., H200) only requires re-collecting per-operator microbenchmarks on that hardware and re-fitting the same regressor structure. The pipeline timeline formula in Phase 3 is the only analytical component — it composes pre-measured operator latencies into a schedule prediction without simulating the full execution.


2. Key Architecture Diagram — Operator Decomposition & Regressor Structure

┌──────────────────────────────────────────────────────────────────┐
│         Transformer Operator Taxonomy for Performance Modeling   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Category A: Compute-Bound (GEMM-dominated)              │   │
│  │  Operators: Linear1, Linear2, Linear3, Linear4,          │   │
│  │             QK^T, ·V, Flash Attention, Final_Linear      │   │
│  │  Bottleneck: Tensor Core throughput (TFLOPS)             │   │
│  │  Key features: b, l, d, h, mp (matrix dimensions)        │   │
│  │  Regressor: captures nonlinear cuBLAS auto-tune steps    │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Category B: Memory-Bound (elementwise)                  │   │
│  │  Operators: LayerNorm, RMSNorm, RoPE, Activation,        │   │
│  │             Fillmask, Softmax, Embedding                 │   │
│  │  Bottleneck: HBM bandwidth (GB/s)                        │   │
│  │  Key features: b, l, d (element count determines BW)     │   │
│  │  Regressor: linear in element count, catches caching     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Category C: Communication-Bound (network-dependent)     │   │
│  │  Operators: MP_AllReduce, DP_AllReduce, DP_AllGather,    │   │
│  │             PP_P2P, Parallel_CrossEntropy                │   │
│  │  Bottleneck: NVLink/IB bandwidth, NCCL algo overhead     │   │
│  │  Key features: [entries, nodes, GPUs/node] — topology    │   │
│  │  Regressor: highest prediction error (up to 50%          │   │
│  │             for PP_P2P) due to network stochasticity     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  Regressor architecture (same template, per operator):          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Input: workload vector (Table I features for op type)   │   │
│  │         + hardware descriptor (interconnect, GPU count)  │   │
│  │         + parallelism descriptor (mp, dp, pp degrees)    │   │
│  │                    │                                     │   │
│  │                    ▼                                     │   │
│  │  RandomForest or XGBoost                                 │   │
│  │  (selected by min validation error, 80/20 split)         │   │
│  │                    │                                     │   │
│  │                    ▼                                     │   │
│  │  Output: predicted operator latency (seconds)            │   │
│  └──────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘
▲ Fig 2: Operator taxonomy — compute, memory, communication;
         each category needs distinct features and regressor behavior

3. Control Flow & Data Flow Diagrams

3a. Control Flow — Framework Usage (Prediction Query)

  START: query runtime(model_spec, gpu_arch, parallelism_config)
    │
    ▼
① [Compute vocab_size alignment:
   divisibility_factor = 128 × num_MP_partitions
   vocab_size = ⌈orig_vocab/divisibility_factor⌉ × divisibility_factor]
    │
    ▼
② [Compute pipeline stage encoder allocation:
   first_stage = ⌈(#encoders+5)/#PP_stages⌉ - 2
   middle_stage = ⌈(#encoders+5)/#PP_stages⌉
   last_stage  = ⌈(#encoders+5)/#PP_stages⌉ - 3]
    │
    ▼
③ [For each operator in decomposed transformer graph:]
    │
    ├── build feature vector from (b, l, d, h, mp, nodes, GPUs/node)
    │
    └── query per-operator regressor → t_op
    │
    ▼
④ [Aggregate to stage-level times:]
    │
    ├── Stage_Fwd_Max = max over all PP stages of Σ fwd t_op
    ├── Stage_Bwd_Max = max over all PP stages of Σ bwd t_op
    ├── First_Stage_GradSync = DP_AllReduce(#Stage_Parameters)
    └── Max_Update = max(Optimizer + DP_AllGather(#Stage_Params/|dp|))
    │
    ▼
⑤ [Apply 1F1B pipeline timeline formula (Eq. 7):]
    │
    Runtime = (#MicroBatches - 1 + #PPstages)
            × (Stage_Fwd_Max + Stage_Bwd_Max)
            + First_Stage_GradSync
            + Max_Update
    │
    ▼
  OUTPUT: predicted seconds per training batch
▲ Fig 3: Control flow for a single runtime prediction query —
         operator regressor lookups feed the 1F1B timeline formula
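The query control flow above can be sketched in a few lines of Python. This is a minimal sketch: steps ① and ② are the formulas from Fig 3, and step ⑤ is Eq. 7; the function names and the assumption that per-stage operator latencies have already been summed (step ③/④ via the regressors) are illustrative, not from the paper.

```python
import math

def align_vocab(orig_vocab: int, num_mp_partitions: int) -> int:
    """Step 1: pad vocab size up to a multiple of 128 x MP partitions."""
    factor = 128 * num_mp_partitions
    return math.ceil(orig_vocab / factor) * factor

def stage_encoder_counts(num_encoders: int, num_pp_stages: int):
    """Step 2: encoder allocation for first / middle / last PP stages."""
    per_stage = math.ceil((num_encoders + 5) / num_pp_stages)
    return per_stage - 2, per_stage, per_stage - 3

def predict_runtime(stage_fwd, stage_bwd, first_stage_grad_sync,
                    stage_updates, num_micro_batches: int) -> float:
    """Steps 4-5: aggregate stage times and apply the 1F1B formula (Eq. 7).

    stage_fwd / stage_bwd: per-stage summed forward/backward operator
    latencies in seconds, already obtained from the regressors (step 3).
    stage_updates: per-stage optimizer + DP_AllGather time.
    """
    num_stages = len(stage_fwd)
    return ((num_micro_batches - 1 + num_stages)
            * (max(stage_fwd) + max(stage_bwd))
            + first_stage_grad_sync     # first-stage DP_AllReduce, not hidden
            + max(stage_updates))       # Max_Update across stages
```

For example, 4 stages with 0.1 s forward and 0.2 s backward each, 8 micro-batches, 0.05 s gradient sync, and 0.01 s updates gives (8-1+4) x 0.3 + 0.05 + 0.01 = 3.36 s per batch.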

3b. Data Flow — Microbenchmark Collection Pipeline

  Target Operator (e.g., Linear1)
       │
       ▼
  ① [Source code extraction + profiler-based isolation]
       │  (operators run in isolation, no kernel overlap)
       ▼
  ② [PyTorch Profiler (1µs resolution, CUDA event recording)]
       │
       │  10 warmup iterations (saturate GPU capability)
       │  10 measurement iterations
       │  → select median of sorted top-5 samples
       ▼
  ③ [Parse profiler output: GPU runtime =
     max(end_time of associated kernels) -
     min(start_time of associated kernels)]
       │
       ▼
  ④ [Record: (feature_vector, measured_latency) tuple]
       │  feature_vector from Table I (e.g., [bl, 3d/mp] for Linear1)
       ▼
  ⑤ [Dataset: 80% training / 20% validation split]
       │
       ▼
  ⑥ [Fit RandomForest or XGBoost;
     select by min validation error;
     retrain on full dataset with selected model]
       │
       ▼
  Regressor ready for runtime queries
▲ Fig 4: Microbenchmark collection and regressor fitting data flow
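The measurement protocol in steps ②-④ can be sketched as below. The timing backend is abstracted into a `time_once` callable (in the paper this is the PyTorch profiler with CUDA event recording at 1µs resolution); that interface, and the reading of "top-5" as the five fastest samples (consistent with the minimum-cost target), are illustrative assumptions.

```python
from statistics import median
from typing import Callable, List, Tuple

def measure_operator(time_once: Callable[[], float],
                     warmup: int = 10, iters: int = 10) -> float:
    """Fig 4 protocol: warm up, measure, take the median of the top-5
    samples. `time_once` runs the operator in isolation and returns its
    GPU runtime in seconds; in the paper this is
    max(kernel end times) - min(kernel start times) from the profiler.
    """
    for _ in range(warmup):           # saturate GPU clocks / caches
        time_once()
    samples = sorted(time_once() for _ in range(iters))
    return median(samples[:5])        # top-5 = five fastest samples

def collect_sample(feature_vector: List[float],
                   time_once: Callable[[], float]) -> Tuple[list, float]:
    """Step 4: one (feature_vector, measured_latency) training tuple."""
    return feature_vector, measure_operator(time_once)
```

The resulting tuples form the 80/20 train/validation dataset of step ⑤.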

3c. Data Flow — 1F1B Pipeline Timeline Model

  Time axis ──────────────────────────────────────────────►

  Stage 1  │ F1 │ F2 │ F3 │ F4 │ B4 │    │    │ B1 │ UPD│
  Stage 2  │    │ F1 │ F2 │ F3 │ F4 │ B4 │ B3 │ B2 │ B1 │
  Stage 3  │    │    │ F1 │ F2 │ F3 │ F4 │ B4 │ B3 │ B2 │
  Stage 4  │    │    │    │ F1 │ F2 │ F3 │ F4 │ B4 │ B3 │
                │◄──────── steady state ────────►│
                │ #PPstages-1 bubble steps        │

  Key overlaps modeled:
  ┌────────────────────────────────────────────────────────┐
  │  DP gradient sync (AllReduce) overlaps with backward   │
  │  propagation of EARLIER pipeline stages               │
  │  EXCEPT Stage 1 (first_stage_grad_sync not hidden)     │
  │                                                        │
  │  Optimizer step overlaps across stages:               │
  │  Max_Update = max(Optimizer + DP_AllGather per stage)  │
  │                                                        │
  │  MP_AllReduce within encoder: invoked 1-2x per fwd,   │
  │  2x per bwd → amortized across many invocations       │
  └────────────────────────────────────────────────────────┘

  Component time proportions (GPT-20B, 4-4-8 Perlmutter):
  encoder_fwd: 30%  encoder_bwd: 44%  dp_allreduce: 25%
  mp_allreduce: 4%  pp_p2p: 0.2%
▲ Fig 5: 1F1B pipeline data flow with DP overlap — gradient sync
         hidden by prior-stage backward, except at first stage

4. Design Trade-off Analysis

Each decision below lists Alternative A, Alternative B (the choice made by this framework), the winner, and the rationale.

Modeling granularity
  A: Black-box end-to-end sampling (run 60s of training)
  B: Operator-level decomposition + lightweight probing
  Winner: B. End-to-end sampling on 128 A100s costs 2 node-hours per configuration; operator-level probing runs entirely on CPU after a one-time benchmark collection.

Regressor type
  A: Analytical roofline model
  B: Tree-based learned regressors (RF/XGBoost)
  Winner: B. A roofline model cannot capture cuBLAS/cuDNN auto-tuning discontinuities, mixed-precision effects, or kernel-switching thresholds; RF/XGBoost capture piecewise-linear behavior natively.

Communication operator modeling
  A: Analytical alpha-beta model (LogP/LogGOPS)
  B: Empirical microbenchmarks + regression
  Winner: B. LogGOPS requires 7+ parameters for inter-node communication alone; the empirical model captures NCCL internal behavior, congestion effects, and topology-dependent performance directly.

Pipeline modeling
  A: Simulate the full execution timeline
  B: Analytical 1F1B formula with measured operator inputs
  Winner: B. Full simulation requires discrete-event modeling of all NCCL primitives; the 1F1B formula captures the dominant pipeline structure with O(1) computation using measured stage times.

Feature representation for comms
  A: Message size only
  B: [entries, nodes, GPUs/node] topology vector
  Winner: B. MP_AllReduce performance depends on the number of nodes (inter-node hops), GPUs/node (NVLink vs. PCIe), and total data volume simultaneously; a scalar message size misses topology effects.

Data collection strategy
  A: Dense grid sampling (expensive)
  B: Three-pronged: micro-bench + parameter exploration + interpolation
  Winner: B. Dense grids at high config counts are computationally infeasible; strategic sampling at boundaries plus interpolation achieves comparable coverage at 10x lower collection cost.

Prediction target
  A: Mean training time
  B: Minimum training batch cost
  Winner: B. The minimum cost is more stable (less sensitive to transient network jitter); the mean is dominated by outlier congestion events, making it harder to learn accurately.

Hardware generalization
  A: Single-platform model
  B: Re-fit per platform with the same regressor architecture
  Winner: B. A100 and GH200 have fundamentally different memory systems (HBM2 vs. HBM3), interconnects (NVLink 3.0 vs. C2C), and auto-tuning libraries; a single model cannot generalize.

For DynamICCL, alternative B wins in every row. The operator-level decomposition with per-operator tree-based regression is exactly how DynamICCL should build its NCCL performance model: collect microbenchmarks per (collective_type, algorithm, protocol, nChannels, msg_size_bin, topology), fit a tree-based regressor, and use it to predict config performance without running full training jobs.


5. What to Borrow for DynamICCL

5.1 Operator-Level Decomposition → Config-Level Decomposition

This paper decomposes LLM training into fundamental operators and models each independently. DynamICCL should apply the same decomposition to NCCL configurations: rather than learning a single Q-function over the full (algo, proto, nChannels, numThreads, msg_size, topology) space, decompose the problem into independent sub-models, each covering one component of the configuration space.

These sub-models then compose into an overall config quality prediction, just as the paper's per-operator regressors compose into an end-to-end runtime. The decomposition also allows reusing sub-models across NCCL collective types (AllReduce, AllGather, ReduceScatter) that share the same underlying Ring or Tree algorithm.

5.2 Tree-Based Regressors for NCCL Config Performance

The paper's key insight is that tree-based models (RF/XGBoost) outperform neural networks and analytical models for GPU performance prediction because they naturally capture piecewise-linear behavior (hardware thresholds, auto-tuning discontinuities). NCCL performance exhibits exactly this structure: there are sharp transitions at protocol boundaries (LL→LL128 at ~64 KiB), channel count thresholds (nChannels vs. SM availability), and algorithm transitions. DynamICCL's offline performance model (used to pre-train the Config Agent) should use XGBoost regressors rather than neural networks for NCCL config performance prediction.

Concrete action: Train an XGBoost regressor with features [algo_id, proto_id, nChannels, numThreads, log2(msg_size), num_ranks, topology_class] predicting collective_latency_ms. Use this as a surrogate environment for offline Config Agent training — the same role that the operator regressors play in this paper's framework.
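The feature row for that regressor can be sketched as below. The categorical id maps (`ALGO_IDS`, `PROTO_IDS`, `TOPO_IDS`) are illustrative assumptions, not DynamICCL's actual encodings; the resulting rows would be consumed by any gradient-boosting library, e.g. `xgboost.XGBRegressor().fit(X, y)` with y = collective_latency_ms.

```python
import math

# Illustrative categorical encodings; the real id maps are a design choice.
ALGO_IDS = {"ring": 0, "tree": 1, "collnet": 2}
PROTO_IDS = {"ll": 0, "ll128": 1, "simple": 2}
TOPO_IDS = {"single_node_nvlink": 0, "multi_node_ib": 1, "multi_node_roce": 2}

def nccl_config_features(algo: str, proto: str, n_channels: int,
                         num_threads: int, msg_size_bytes: int,
                         num_ranks: int, topology: str) -> list:
    """Build the [algo_id, proto_id, nChannels, numThreads,
    log2(msg_size), num_ranks, topology_class] row from 5.2."""
    return [
        ALGO_IDS[algo],
        PROTO_IDS[proto],
        n_channels,
        num_threads,
        math.log2(msg_size_bytes),  # log scale: protocol/algorithm switch
        num_ranks,                  # points are roughly uniform in log2
        TOPO_IDS[topology],
    ]
```

Using log2(msg_size) rather than raw bytes lets the tree splits land directly on the protocol boundaries (e.g. the ~64 KiB LL to LL128 transition) instead of wasting depth on a heavily skewed raw feature.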

5.3 Communication Operators Have Highest Prediction Error — Design Implication

The paper reports that PP_P2P has up to 50% prediction error, while compute-bound operators achieve sub-3% error, because communication performance is stochastic under real network conditions. For DynamICCL, this means: (a) the Config Agent's reward signal from real NCCL measurements will be noisy, and (b) offline pre-training on a static performance model will have higher bias for communication-heavy configs. The agent should therefore use a higher learning rate for its communication-related state features during online fine-tuning, and the Trigger Agent's CUSUM should use higher detection thresholds (i.e., lower sensitivity) on communication signals, so that transient network noise is not mistaken for genuine congestion.

Design implication: Set the CUSUM detection threshold λ for communication-latency signals 2x higher than for compute-latency signals, reflecting the paper's observation that communication variability is roughly 10x higher than compute variability on multi-node clusters.
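A minimal one-sided CUSUM sketch under that weighting follows; the class interface and parameter names are illustrative, not DynamICCL's actual API.

```python
class CusumDetector:
    """One-sided CUSUM on a latency signal: accumulate positive
    deviations above baseline + drift slack k, trigger when the
    statistic exceeds the detection threshold lam."""

    def __init__(self, baseline: float, k: float, lam: float):
        self.baseline = baseline
        self.k = k            # allowed drift (slack) per sample
        self.lam = lam        # detection threshold
        self.s = 0.0          # running CUSUM statistic

    def update(self, latency: float) -> bool:
        self.s = max(0.0, self.s + (latency - self.baseline) - self.k)
        return self.s > self.lam

# Per 5.3: communication signals get a 2x higher threshold than compute,
# so noisy NCCL latencies need more sustained evidence to trigger.
def make_detectors(base_comm: float, base_compute: float,
                   k: float, lam: float) -> dict:
    return {
        "comm": CusumDetector(base_comm, k, 2.0 * lam),
        "compute": CusumDetector(base_compute, k, lam),
    }
```

A single latency spike only raises `s` by its excess over the baseline; triggering requires the excess to persist across samples, which is exactly the noise robustness the 2x threshold buys for communication signals.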

5.4 Workload Representation Vector as DynamICCL State Space Blueprint

Table I of this paper defines the exact feature vector used for each operator type. For DynamICCL's state space, the communication-operator features are directly applicable: each of MP_AllReduce, DP_AllReduce, DP_AllGather, and PP_P2P is described by an [entries, nodes, GPUs/node] triple that captures data volume and topology simultaneously (Fig 2, Category C).

Concrete action: Add these four feature triples to DynamICCL's state vector, one set per active collective type in the current training step. This gives the Config Agent topology-aware context about the communication workload before it selects a config.

5.5 Minimum-Cost Prediction Target for Stable Reward Signal

The paper uses minimum training batch cost (not mean) as the prediction target because minimum cost is more stable under network jitter. DynamICCL's reward function should similarly target the minimum collective latency achievable with the chosen config (measured as the p10 latency over a rolling window of 10 recent executions), not the mean. The CUSUM change detector in the Trigger Agent should also track the running minimum latency as its baseline, triggering only when the current minimum rises above the historical minimum by more than the detection threshold Δ.

Concrete design: Change DynamICCL's reward from R = -mean(latency_window) to R = -p10(latency_window) where p10 is the 10th percentile over the last 10 executions of the same collective. This aligns with the paper's empirical finding that minimum-based targets produce more accurate and stable models.
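The reward change can be sketched with a nearest-rank percentile over a rolling window; the window size of 10 comes from the text above, while the class interface is an illustrative assumption.

```python
import math
from collections import deque

def p10(values) -> float:
    """Nearest-rank 10th percentile. With a 10-sample window this
    equals the minimum, matching the paper's minimum-cost target."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.10 * len(ordered)))
    return ordered[rank - 1]

class CollectiveReward:
    """R = -p10(latency_window) over the last `window` executions of
    the same collective (5.5 replaces -mean with -p10)."""

    def __init__(self, window: int = 10):
        self.latencies = deque(maxlen=window)

    def observe(self, latency_ms: float) -> float:
        self.latencies.append(latency_ms)
        return -p10(self.latencies)
```

Because the deque is bounded, an old congestion outlier ages out after `window` executions, so the reward tracks the best latency the current config can achieve rather than its worst recent luck.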

5.6 Pipeline Stage Role Awareness for Config Agent

The paper observes that pipeline stage role (first, middle, last) significantly affects performance: first stage holds embedding parameters (larger), last stage holds loss computation (additional AllReduce for cross-entropy). DynamICCL should include a pipeline_stage_role feature in its state vector for configurations involving pipeline-parallel training. An AllReduce at the first pipeline stage (gradient sync for embedding) requires different optimal config than one at a middle stage (smaller gradient tensor), and the Config Agent should learn this distinction.

State feature addition: Add pipeline_stage_role ∈ {first, middle, last, none} as a 4-class one-hot feature. This allows the Config Agent to recognize that first-stage collectives typically involve larger tensors (vocabulary embedding gradients) and prefer higher nChannels than middle-stage collectives.
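A minimal encoding sketch for that feature; the class ordering is an arbitrary illustrative choice.

```python
STAGE_ROLES = ("first", "middle", "last", "none")

def encode_stage_role(role: str) -> list:
    """4-class one-hot for the pipeline_stage_role state feature (5.6)."""
    if role not in STAGE_ROLES:
        raise ValueError(f"unknown pipeline stage role: {role}")
    return [1.0 if role == r else 0.0 for r in STAGE_ROLES]
```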