Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng,
Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan,
Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li,
Jiamang Wang, Rui Men, Jianwei Zhang, Chang Zhou, Dennis Cai, Yuan Xie,
Binzhang Fu | Alibaba Group; HKUST | 2025 IEEE International Symposium
on High-Performance Computer Architecture (HPCA 2025)
Per-section summary organized by paper headings. Each section
includes paragraph-level bullet points and exact quantitative results
where the paper provides them.
Abstract
- Modern LLM training requires distributed computation across
thousands of GPUs, and overall efficiency is far below ideal because of
two recurring issues: hardware errors that crash entire jobs and network
traffic collisions that slow steady-state communication.
- Distributed training is fundamentally synchronization-bound: each
iteration exchanges gradients, parameters, and activations across all
workers, so any single faulty node halts the cluster, and the time spent
identifying the faulty node is pure waste.
- The authors propose C4 (Calibrating Collective Communication
over Converged Ethernet), a communication-driven solution built
on two insights: (1) data-parallel training has homogeneous, periodic
synchronization, so anomalies leave detectable timing syndromes inside
collective communication; (2) collective communication is dominated by
long-lived "elephant" flows whose schedule is predictable, enabling
cluster-scale traffic engineering rather than reactive ECMP.
- Production deployment in a hyperscale Alibaba Cloud cluster lifts
system efficiency from ~30% to ~45%, with a 30% reduction in
error-induced overhead and a 15% reduction in communication cost on real
LLM jobs.
I. Introduction
- LLMs (GPT, Llama and the broader generative-AI family) have
transformed ML and AI applications, and training one (e.g., a
175B-parameter model) routinely consumes roughly two months on a
1,000-GPU cluster.
- Two compounding sources of suboptimal GPU utilization motivate the
work: (1) recovery from system errors can occupy more than 30% of the
job's wall-clock lifespan, and (2) collective-communication cost rises
with scale and erodes parallel speedup.
- High error rates on the latest accelerators are unavoidable. Under
Bulk Synchronous Parallel (BSP) semantics, a single
node fault manifests as a cluster-wide failure, after which extensive
manual diagnosis is needed to find the actual root cause.
- In multi-tenant shared clusters, traffic engineering is
intrinsically harder than on dedicated training pods: bandwidth
competition, asymmetric link utilization, and "downlink" hotspots all
degrade per-flow throughput in ways that conventional ECMP-based load
balancing cannot fix.
- The paper introduces C4, which exploits the
homogeneous running rhythm of distributed workers to perform real-time
fault detection by comparing per-rank timing at synchronization points,
and which uses the predictability of training flows to plan
collision-free paths.
- C4 also performs cluster-level path management to minimize
collisions among elephant flows belonging to different jobs, addressing
the multi-tenant shortfall of intra-job-only schemes.
- Contributions claimed: (i) extension of the Alibaba
Collective Communication Library (ACCL) with status-monitoring
and path-control hooks; (ii) C4D, a real-time
hardware-error detection subsystem with fast recovery; (iii)
C4P, a global traffic-engineering subsystem; (iv)
production-scale results: ~30% error-overhead reduction and ~15%
throughput improvement on real training jobs.
II.
Understanding the Challenges in Operational AI Clusters
A. New Challenges
- Cloud providers now operate dedicated AI clusters whose workload mix
and SLOs differ sharply from traditional cloud serving.
- Synchronization sensitivity: with tens of thousands of GPUs running
combinations of tensor (TP), data (DP), pipeline (PP), and expert (EP)
parallelism, a fault on a single worker stalls every other worker.
- High-end GPUs exhibit elevated error rates relative to commodity
servers, and recovery from a job crash is both costly and slow.
- Customers pay a premium for capacity and demand peak utilization,
since time-to-market for foundation models and capital cost per
accelerator are both extreme.
B. Pain
Points and Their Influences on Training Jobs
- Figure 1 taxonomy of pain points: (1)
computing-node anomalies (CUDA errors, NVLink errors, GPU offline /
downgrade events), (2) network failures (link/port faults), (3) traffic
collisions, (4) storage-side I/O hangs and slowdowns, (5) software
issues such as buggy user code or Python garbage-collection pauses.
- Figure 2 stratifies the impact: critical failures
cause job crashes and full restarts, while runtime slowdowns split into
communication-related and non-communication-related tails. Both reduce
utilization but require different mitigations.
C. The Quantitative
Analyses of Job Crashes
- The authors instrument a 4,096-GPU job over one month and observe 40
crashes, summarized in Table I (crash-cause
distribution).
- The user-visible symptom is almost always a generic "NCCL Error";
because synchronization failures cascade, a single root cause produces
many timeout reports (e.g., NCCL error code 12), making post-mortem
diagnosis expensive.
- 82.5% of crashes are localized to specific nodes/devices, dominated
by manufacturing defects on next-generation GPUs.
- Manual root-cause identification routinely takes hours to days even
for experienced operators.
- Roughly 30% of total job time is spent on detection, diagnosis, and
restart; an automated tool that compresses this from minutes-per-event
to seconds-per-event therefore unlocks significant capacity.
| CUDA Error |
12.5% |
100% |
| ECC / NVLink Error |
27.5% |
100% |
| NCCL Timeout |
20.0% |
75% |
| ACK Timeout |
27.5% |
81.8% |
| Network Error / Other |
12.5% |
40% |
D. The
Quantitative Analyses of Runtime Slowdowns
- The cluster uses RDMA-capable NICs with dual ports,
Fat-Tree topology, and 1:1 oversubscription, but
slowdowns still arise inside the network layer during steady-state
training.
- Traffic collisions among concurrent flows inflate completion time,
and dual-port NICs become a serialization point when load balancing
fails to spread bytes across the two physical ports.
- Figure 3: as scale grows, the gap between achieved
and ideal throughput widens; for a 22B-parameter model at 512 GPUs,
effective performance falls 30% below the ideal projection.
- AI clusters are populated by a small number of long-lived "elephant"
flows, in stark contrast to the many short "mice" flows characteristic
of conventional cloud workloads. This concentration enables global
traffic planning.
- Maximizing GPU utilization therefore requires two complementary
capabilities — fast fault detection plus disciplined traffic engineering
— which together motivate the C4D / C4P split.
III. Design
A. Mitigating Error-induced
Downtime
- Two complementary attack surfaces: (1) reducing failure rates
(thermal control via DVFS and improved cooling), and (2) tolerating
failures when they happen.
- The authors compare classical online fault tolerance (redundant
compute, erasure coding, triple replication) against offline/checkpoint
methods and choose a hybrid strategy.
- Hybrid backup pool: 64 backup GPUs across 8 servers are reserved per
1,024 active GPUs, sized so that failover does not perturb steady-state
performance.
- C4D (C4 Diagnose): lightweight per-node agents
monitor workers and report status to a master that aggregates evidence,
isolates faulty components, and restarts the job from the most recent
checkpoint.
- Figures 4 and 5: BSP synchronization points are
used as anchors — every iteration is a known checkpoint at which all
ranks should arrive within bounded skew, so deviations become syndromes.
The system has three pieces: enhanced ACCL, a C4D master, and
C4a (C4 Agent) on each node.
- Figure 6: the ACCL enhancement instruments three
layers — communicator, operation, and transport — capturing operation
type, algorithm, and microsecond-level timestamps via refined CUDA
kernels. This visibility is what makes per-rank timing comparison
possible.
- C4D distinguishes four anomaly classes: communication
hang, non-communication hang,
communication slow, and non-communication
slow.
- Figure 7 — Communication Slow Detection: per-rank
delays are mapped into a 2D source-vs-destination matrix; large entries
identify the specific link, NIC port, or Rx/Tx direction where
congestion or degradation is concentrated.
- Non-Communication Slow Detection: the
receiver-driven dependency chain inherent to ring-based collectives is
exploited as a probe — by walking the chain backward, the system
pinpoints ranks whose local computation or data-loader is the actual
straggler.
B. Mitigating Communication
Cost
- Collective-communication efficiency is the second axis; small
per-step improvements compound across millions of iterations.
- The objective is to minimize the resources demanded by each
one-to-one exchange and to map the resulting traffic pattern onto
network resources without contention.
- Two structural levers: shrink the network diameter (NVLink
intra-node) and apply topology-aware scheduling so that ranks talking
most frequently land on adjacent network positions.
- Prior work on intra-job traffic engineering [13, 14] avoids
within-job collisions, but multi-tenant clusters require cross-job
coordination because independent jobs cannot see each other's flow
schedules.
- Figure 8 — C4P (C4 Performance): a cluster-scale
traffic-engineering layer dedicated to elephant flows generated by
training jobs.
- The C4P master spans tenants and jobs, records
every allocated connection, and issues path-allocation requests to ACCL.
Its functional responsibilities are (a) avoid faulty links surfaced by
C4D, (b) balance RDMA Queue Pairs across the bonded dual-port NIC, and
(c) adapt allocations dynamically as the network topology changes.
- Working flow: path-probing identifies which source
port lands on which physical path; the master then ensures that traffic
from one NIC is split evenly between its two ports and that the
aggregate spine-layer load is spread across spine switches rather than
concentrated.
IV. Evaluations
A. Setup
- Two evaluation tracks share the same hardware envelope: a subset
testbed of 16 nodes / 128 GPUs across 8 leaf switches for controlled
experiments, and full production cluster traces for large-scale
effectiveness numbers.
- Each node hosts 8x NVIDIA H800 GPUs and 8x
BlueField-3 NICs, with each NIC's two 200 Gbps ports
bonded into a 400 Gbps logical link. Network is a 3-Tier Clos / Fat-Tree
with 1:1 oversubscription.
| C4D evaluation model |
GPT-175B |
| C4P evaluation models |
GPT-22B, Llama-13B, GPT-175B |
| Frameworks |
Megatron-LM, DeepSpeed |
| Per-node compute |
8 x NVIDIA H800 |
| Per-node NIC |
8 x BlueField-3, 200 Gbps x 2 (bonded) |
| Network |
3-Tier Clos, Fat-Tree, 1:1 oversubscription |
B. Results
1) C4D Effectiveness
- Evaluated on a 2,400-GPU production GPT-175B job.
- Total error-induced downtime falls roughly 30x, from 31.19%
in June 2023 to 1.16% in December 2023.
- The June 2023 baseline was dominated by manual diagnosis and
post-checkpoint rework; both shrink dramatically once C4D is
online.
- Diagnosis-and-isolation contribution drops 27x (19.65% to 0.73%);
10-minute-cadence checkpointing collapses post-checkpoint downtime 33x
(7.53% to 0.23%).
- GPU-defect-related downtime (ECC, NVLink, CUDA categories combined)
falls 41.8x year-on-year.
| Total |
31.19% |
1.16% |
27x |
| Post-Checkpoint |
7.53% |
0.23% |
33x |
| Detection |
3.41% |
0.05% |
68x |
| Diagnosis & Isolation |
19.65% |
0.73% |
27x |
| -- ECC / NVLink |
8.34% |
0.20% |
42x |
| -- CUDA |
4.19% |
0.10% |
42x |
| -- CCL Timeout |
3.00% |
0.23% |
13x |
| -- ACK Timeout |
1.80% |
0.10% |
18x |
| Re-initialization |
0.60% |
0.15% |
4x |
2) C4P Effectiveness
- Balancing traffic between bonded ports (Figure 9):
a single AllReduce previously achieved <240 Gbps because ECMP failed
to spread bytes across the two ports of a bonded NIC. With C4P
designating per-flow paths, throughput rises to ~360 Gbps — a roughly
50% gain toward the 400 Gbps line rate.
- Balancing among multiple jobs: 8 simultaneous
AllReduce benchmarks evaluate cross-tenant fairness.
- Figure 10a (1:1 oversubscription): average
throughput improves by 70.3% under C4P.
- Figure 10b (2:1 oversubscription): 65.55%
improvement, showing the technique remains effective when bisection
bandwidth is halved.
- Figure 11: under contention, the network ECN signal
records ~15,000 Congestion Notification Packets (CNPs) per second per
port, quantifying how much head-of-line blocking C4P relieves.
- Tolerance to dynamic link failures (Figure 12):
with one link forced down, baseline ECMP achieves 185.76 Gbps while C4P
maintains 301.46 Gbps — a 62.3% gain, because C4P can re-pin flows away
from the failed link rather than relying on stochastic hashing.
Figure 13 confirms near-optimal switch-port utilization
across the surviving fabric under C4P control.
- Real-life jobs (Figure 14):
- Job 1 (GPT-22B): +15.95% throughput.
- Job 2 (Llama-7B / 13B class): +14.1% throughput.
- Job 3 (GPT-175B): minimal improvement, attributed to Gradient
Accumulation steps GA = 16, which dilutes per-iteration
communication relative to compute and leaves little headroom for network
optimization to recover.
- The headline 15% communication-cost reduction and the lift from ~30%
to ~45% system efficiency reported in the abstract come from combining
the C4D downtime savings with the C4P throughput gains.
- Stability: existing data-center fault-management
infrastructure operates at coarser granularity than is needed for LLM
training; per-iteration synchronization windows are micro-second-scale
and conventional health checks miss them entirely.
- Performance: prior performance work targets
training architectures, intra-server interconnects (NVLink, PCIe), and
adaptive routing / packet spraying inside the fabric. C4 is
complementary, addressing the cross-job coordination gap.
- Limitations: C4D cannot detect errors that occur
during the initialization phase before any collective communication has
been issued — its diagnostic signal is timing-based and requires that
collectives be running. C4P depends on detailed knowledge of the network
topology and is therefore tightly coupled to a specific cluster
build-out; portability requires re-encoding topology metadata.
VI. Conclusion
- C4 is a communication-centric solution for LLM training
clusters.
- It cuts recovery cost via rapid online fault detection and immediate
isolation, leveraging the BSP synchronization rhythm as a free,
always-on probe.
- It uses the predictable elephant-flow structure of training traffic
to perform precise cluster-scale path management, reducing contention
among co-located jobs.
- End-to-end production impact: roughly 30% reduction in error-induced
overhead and roughly 15% throughput improvement.
Tables (verbatim from the
paper)
Table I
— Crash-cause distribution, 4,096-GPU job, 1 month
| CUDA Error |
12.5% |
100% |
| ECC / NVLink Error |
27.5% |
100% |
| NCCL Timeout |
20.0% |
75% |
| ACK Timeout |
27.5% |
81.8% |
| Network Error / Other |
12.5% |
40% |
Table II — Configurations
| C4D model |
GPT-175B |
| C4P models |
GPT-22B, Llama-13B, GPT-175B |
| Frameworks |
Megatron-LM, DeepSpeed |
| GPUs / node |
8 x NVIDIA H800 |
| NICs / node |
8 x BlueField-3 (200 Gbps x 2 bonded) |
| Network |
3-Tier Clos, Fat-Tree, 1:1 oversubscription |
Table III —
Error-induced downtime, Jun vs Dec 2023
| Total Downtime |
31.19% |
1.16% |
| Post-Checkpoint |
7.53% |
0.23% |
| Detection |
3.41% |
0.05% |
| Diagnosis & Isolation |
19.65% |
0.73% |
| ECC / NVLink |
8.34% |
0.20% |
| CUDA |
4.19% |
0.10% |
| CCL Timeout |
3.00% |
0.23% |
| ACK Timeout |
1.80% |
0.10% |
| Re-initialization |
0.60% |
0.15% |
Named Methods,
Subsystems, and Acronyms
- C4 — Calibrating Collective Communication over
Converged Ethernet.
- C4D — C4 Diagnose: hardware-error detection
subsystem.
- C4P — C4 Performance: global traffic-engineering
subsystem.
- C4a — C4 Agent: per-node monitor that streams
timing syndromes to the C4D master.
- ACCL — Alibaba Collective Communication Library,
extended in this work with status-monitoring and path-control hooks at
the communicator, operation, and transport layers.
- BSP — Bulk Synchronous Parallel (the iteration
model whose synchronization barrier is repurposed as a free diagnostic
anchor).
- ECMP — Equal-Cost Multi-Path; the default
load-balancing scheme whose hash collisions C4P replaces with explicit
path allocation.
- CNP — Congestion Notification Packet; the ECN-style
feedback used to quantify network contention (~15,000 / sec / port under
load).
- GA — Gradient Accumulation; setting GA = 16
explains why GPT-175B saw less C4P benefit (compute-bound at the
iteration boundary).
- TP / DP / PP / EP — Tensor / Data / Pipeline /
Expert parallelism.
- DVFS — Dynamic Voltage and Frequency Scaling, used
as a thermal lever to lower failure rates.
- EC — Erasure Coding (rejected baseline for online
fault tolerance).
- QP — RDMA Queue Pair, the unit of path-balancing in
C4P.
Equations
- The paper introduces no numbered mathematical equations. The system
reasons in terms of per-rank timing measurements and the resulting
source-vs-destination delay matrix used by C4D for slow-link
localization.
Note on NCCL Tuning
C4D reuses the BSP synchronization barrier as a free diagnostic
anchor — per-rank timestamps captured at the communicator, operation,
and transport layers expose which collective algorithm and which
ring/tree branch is actually the slow path on any given iteration. The
receiver-driven dependency chain of ring AllReduce in particular is
exploited to walk backward to the straggling rank, which means an
NCCL-tuning loop could piggyback on the same instrumentation (timestamps
+ delay matrix) to detect when a chosen algorithm/protocol pairing is
yielding sub-line-rate throughput. The paper's own observation that
bonded-port AllReduce climbed from <240 Gbps to ~360 Gbps once flow
paths were explicitly assigned illustrates that, on commodity Ethernet,
NCCL configuration alone cannot recover throughput without coordinated
path placement.