C4: Enhancing Large-Scale AI Training Efficiency with C-driven Diagnose and Communication-driven Performance — Detailed Summary

Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Chang Zhou, Dennis Cai, Yuan Xie, Binzhang Fu | Alibaba Group; HKUST | 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


I. Introduction


II. Understanding the Challenges in Operational AI Clusters

A. New Challenges

B. Pain Points and Their Influences on Training Jobs

C. The Quantitative Analyses of Job Crashes

Table I (4,096-GPU job, 40 crashes / month)

Failure category      | Share | "Local" share
CUDA Error            | 12.5% | 100%
ECC / NVLink Error    | 27.5% | 100%
NCCL Timeout          | 20.0% | 75%
ACK Timeout           | 27.5% | 81.8%
Network Error / Other | 12.5% | 40%
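
The percentages in Table I can be turned back into crash counts to see how much of the month's failures were local overall. The following is a minimal sketch of that arithmetic, not something taken from the paper: the names `TABLE_I`, `crash_counts`, and `overall_local_share` are illustrative, and the rounding assumes each percentage corresponds to a whole number of the 40 crashes.

```python
# Rows copied from Table I: (cause, share of all crashes in %, "local" share in %).
TABLE_I = [
    ("CUDA Error", 12.5, 100.0),
    ("ECC / NVLink Error", 27.5, 100.0),
    ("NCCL Timeout", 20.0, 75.0),
    ("ACK Timeout", 27.5, 81.8),
    ("Network Error / Other", 12.5, 40.0),
]
TOTAL_CRASHES = 40  # crashes per month, from the table caption


def crash_counts(rows, total):
    """Convert percentage shares into (all crashes, local crashes) per cause.

    Assumption: each percentage maps to a whole number of crashes, so
    rounding only removes the 81.8% ~ 9/11 display error.
    """
    counts = {}
    for cause, share_pct, local_pct in rows:
        n = round(total * share_pct / 100)
        n_local = round(n * local_pct / 100)
        counts[cause] = (n, n_local)
    return counts


counts = crash_counts(TABLE_I, TOTAL_CRASHES)
local_total = sum(n_local for _, n_local in counts.values())
overall_local_share = 100 * local_total / TOTAL_CRASHES

for cause, (n, n_local) in counts.items():
    print(f"{cause:<24} {n:>2} crashes, {n_local:>2} local")
print(f"Overall local share: {overall_local_share:.1f}%")
```

Under these assumptions the rows correspond to 5, 11, 8, 11, and 5 crashes, of which 33 are local, so the table implies an overall local share of 82.5%.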

D. The Quantitative Analyses of Runtime Slowdowns


III. Design

A. Mitigating Error-induced Downtime

B. Mitigating Communication Cost


IV. Evaluations

A. Setup

Table II

Configuration item   | Value
C4D evaluation model | GPT-175B
C4P evaluation models | GPT-22B, Llama-13B, GPT-175B
Frameworks           | Megatron-LM, DeepSpeed
Per-node compute     | 8 x NVIDIA H800
Per-node NIC         | 8 x BlueField-3, 200 Gbps x 2 (bonded)
Network              | 3-Tier Clos, Fat-Tree, 1:1 oversubscription

B. Results

1) C4D Effectiveness

Table III

Downtime component    | Jun 2023 | Dec 2023 | Reduction
Total                 | 31.19%   | 1.16%    | 27x
Post-Checkpoint       | 7.53%    | 0.23%    | 33x
Detection             | 3.41%    | 0.05%    | 68x
Diagnosis & Isolation | 19.65%   | 0.73%    | 27x
-- ECC / NVLink       | 8.34%    | 0.20%    | 42x
-- CUDA               | 4.19%    | 0.10%    | 42x
-- CCL Timeout        | 3.00%    | 0.23%    | 13x
-- ACK Timeout        | 1.80%    | 0.10%    | 18x
Re-initialization     | 0.60%    | 0.15%    | 4x
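
The Reduction column is the June figure divided by the December figure, rounded to a whole number. The sketch below recomputes it from the table values and checks that the four top-level components add up to the Total row. The names `TABLE_III`, `TOP_LEVEL`, and `reduction_factor` are illustrative and do not come from the paper.

```python
# Downtime percentages copied from Table III: (Jun 2023, Dec 2023).
TABLE_III = {
    "Total": (31.19, 1.16),
    "Post-Checkpoint": (7.53, 0.23),
    "Detection": (3.41, 0.05),
    "Diagnosis & Isolation": (19.65, 0.73),
    "ECC / NVLink": (8.34, 0.20),
    "CUDA": (4.19, 0.10),
    "CCL Timeout": (3.00, 0.23),
    "ACK Timeout": (1.80, 0.10),
    "Re-initialization": (0.60, 0.15),
}
# Components that the table lists directly under Total.
TOP_LEVEL = ["Post-Checkpoint", "Detection", "Diagnosis & Isolation", "Re-initialization"]


def reduction_factor(before_pct, after_pct):
    """Ratio of the earlier downtime share to the later one."""
    return before_pct / after_pct


factors = {name: round(reduction_factor(b, a)) for name, (b, a) in TABLE_III.items()}

jun_sum = sum(TABLE_III[name][0] for name in TOP_LEVEL)
dec_sum = sum(TABLE_III[name][1] for name in TOP_LEVEL)

for name, factor in factors.items():
    print(f"{name:<22} {factor}x")
print(f"Top-level sum, Jun 2023: {jun_sum:.2f}%  Dec 2023: {dec_sum:.2f}%")
```

The recomputed factors match the Reduction column, and the top-level components sum to 31.19% and 1.16%, the same as the Total row.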

2) C4P Effectiveness



VI. Conclusion


Tables (verbatim from the paper)

Table I — Crash-cause distribution, 4,096-GPU job, 1 month

Cause                 | Share | Local share
CUDA Error            | 12.5% | 100%
ECC / NVLink Error    | 27.5% | 100%
NCCL Timeout          | 20.0% | 75%
ACK Timeout           | 27.5% | 81.8%
Network Error / Other | 12.5% | 40%

Table II — Configurations

Item         | Value
C4D model    | GPT-175B
C4P models   | GPT-22B, Llama-13B, GPT-175B
Frameworks   | Megatron-LM, DeepSpeed
GPUs / node  | 8 x NVIDIA H800
NICs / node  | 8 x BlueField-3 (200 Gbps x 2 bonded)
Network      | 3-Tier Clos, Fat-Tree, 1:1 oversubscription

Table III — Error-induced downtime, Jun vs Dec 2023

Component             | Jun 2023 | Dec 2023
Total Downtime        | 31.19%   | 1.16%
Post-Checkpoint       | 7.53%    | 0.23%
Detection             | 3.41%    | 0.05%
Diagnosis & Isolation | 19.65%   | 0.73%
ECC / NVLink          | 8.34%    | 0.20%
CUDA                  | 4.19%    | 0.10%
CCL Timeout           | 3.00%    | 0.23%
ACK Timeout           | 1.80%    | 0.10%
Re-initialization     | 0.60%    | 0.15%

Named Methods, Subsystems, and Acronyms


Equations


Note on NCCL Tuning

C4D reuses the BSP synchronization barrier as a free diagnostic anchor: per-rank timestamps captured at the communicator, operation, and transport layers expose which collective algorithm, and which ring or tree branch, is the slow path on a given iteration. In particular, C4D exploits the receiver-driven dependency chain of ring AllReduce to walk backward to the straggling rank. An NCCL-tuning loop could therefore piggyback on the same instrumentation (timestamps plus delay matrix) to detect when a chosen algorithm/protocol pairing yields sub-line-rate throughput. The paper's own observation that bonded-port AllReduce throughput climbed from under 240 Gbps to about 360 Gbps once flow paths were explicitly assigned (that is, from under 60% to about 90% of the 400 Gbps that two bonded 200 Gbps ports provide, per Table II) illustrates that, on commodity Ethernet, NCCL configuration alone cannot recover throughput without coordinated path placement.
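
The backward walk along the ring's dependency chain can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not C4D's actual algorithm: the function name `find_ring_straggler`, the `recv_wait_ms` input format, and the fixed threshold are all hypothetical. It assumes a single straggler, that rank i receives from rank (i - 1) mod n, and that the straggler is the one rank that did not itself wait on its predecessor while the ranks downstream of it did.

```python
from typing import List, Optional


def find_ring_straggler(recv_wait_ms: List[float], threshold_ms: float) -> Optional[int]:
    """Return the rank suspected of stalling a ring, or None if undecidable.

    recv_wait_ms[i] is how long rank i was blocked waiting for data from its
    ring predecessor, rank (i - 1) % n (hypothetical input format).
    """
    n = len(recv_wait_ms)
    blocked = [wait > threshold_ms for wait in recv_wait_ms]
    if not any(blocked) or all(blocked):
        # Nobody waited, or everybody did: no single rank stands out.
        return None

    # Start at any blocked rank and follow the receive dependency backward
    # until reaching a rank that was not itself waiting on its predecessor.
    rank = blocked.index(True)
    for _ in range(n):
        predecessor = (rank - 1) % n
        if not blocked[predecessor]:
            return predecessor
        rank = predecessor
    return None


# Rank 3 arrives late, so it never waits; every other rank waits on the chain.
waits = [40.0, 35.0, 30.0, 1.0, 60.0, 55.0, 50.0, 45.0]
print(find_ring_straggler(waits, threshold_ms=10.0))
```

A tuning loop of the kind described above would need more than this, for example per-iteration thresholds and handling of multiple slow ranks; the sketch only shows why receive-side wait times are enough to point at one sender.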