Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud | ACM SIGCOMM '24, August 4-8, 2024, Sydney, NSW, Australia | DOI: 10.1145/3651890.3672239
Problem
In multi-tenant DLT (deep learning training) clusters running large-language- model and legacy workloads, jobs co-execute and contend for shared communication links — both inter-host network paths (where ECMP hashing causes collisions) and intra-host PCIe links. Profiling Alibaba Cloud's in-production 2,000+-GPU / 5,000+-job cluster over two weeks in Aug 2023 showed 36.3% of jobs (occupying 51% GPUs) suffer from communication contention, and a representative GPT+BERT co-location degraded GPU utilization by 9.5% (GPT iter time grew 11.0% from 1.53s to 1.70s; throughput fell 9.9% / 7.7%). Existing job schedulers treat communication as a blackbox; general co-flow schedulers (Coflow, Varys, Sincronia) ignore DLT specifics; intra-job schedulers (TACCL, NCCL, ByteScheduler) ignore inter-job effects; the only inter-job scheduler CASSINI applies static time offsets that cannot adapt to dynamic traffic. There is no scheduler that maximizes overall GPU utilization under inter-job contention.
Core Insight
Maximizing cluster GPU utilization is NP-Complete (reducible to maximum multi-commodity flow), but the authors prove (Theorem 1) that as time grows unbounded, GPU utilization is asymptotically equal to the cumulative GPU intensity I_j = W_j / t_j (compute-flops-per-iteration over slowest-link transmission time) carried over the bottleneck link — so prioritizing high- intensity flows is provably equivalent to maximizing utilization.
Method
Crux is a communication scheduler with three coordinated mechanisms applied on top of any job scheduler:
- GPU intensity-based path selection (§4.1). Iterate jobs from highest to lowest GPU intensity; for each, pick the least-congested ECMP path among candidates so high-intensity jobs do not collide.
- Priority assignment with DLT correction factor (§4.2). P_j = k_j · I_j (Equation 3). The reference job (most network traffic) gets k=1; other jobs' k_j are computed by comparing iteration length and computation- communication overlap so that prioritizing higher-P jobs strictly raises utilization (e.g., shorter-iteration job gets bumped up — Example 1 shows 41.7% vs 37.5% utilization).
- Priority compression to limited levels (§4.3). NICs and switches support only ~8 priority levels. Build a Communication Contention DAG D = <V^D, E^D> with nodes = jobs, directed edges = "shares a link with higher-priority neighbor", edge weight w_e = I_{j_higher}. Compressing N priorities to K levels = Max K-Cut on DAG D (NP-hard). Algorithm 1 approximates by sampling m=10 random topological orders via BFS; for each order, an O(n^2) DP solves Max K-Cut over the sequence (Equation 4 with Quadrangle-Inequality optimization), and the global maximum across orders is returned. Theorems 2-3 prove the surjection between sequence K-Cuts and DAG K-Cuts.
The system (Crux Daemon + Crux Transport, 7K LoC, integrated with
PyTorch / TensorFlow / X-DeepLearning via the CoCoLib library) probes
paths via INT and varied UDP-source-port packets, measures W_j / t_j by
hardware counters under unique highest-priority profiling, estimates
iteration period via Fourier Transform, and applies decisions via
ibv_modify_qp for RoCEv2 traffic and PCIe-link semaphores
for intra-host. Re-scheduling on job arrival/completion takes <1
minute and uses <0.01% network bandwidth.
Experimental Setup
| Component | Value |
|---|---|
| Real-world testbed hosts | 12 |
| GPUs per host | 8 × NVIDIA A100 |
| Total GPUs | 96 |
| RDMA NICs per host | 4 × 200 Gbps |
| Inter-host topology | 6 ToR × 8×100 Gbps + 8 Aggregation × 2×100 Gbps |
| Intra-host | NVLink + PCIe 4.0 x16 + 100 Gbps NIC |
| Workloads | ResNet, BERT, GPT (small/medium/large) |
| Production-trace sim cluster | 2,000+ GPUs, 5,000+ jobs (Alibaba Lingjun, Aug 2023) |
| Sim topologies | 2-layer Clos (173 ToR + 16 Agg); 3-layer double-sided (6 ToR + 12 Agg + 32 core) |
| Sim communication model | alpha-beta with link delay + data-size/bandwidth |
| Sim priority levels | 8 |
| Sim model count | 11 (BERT, GPT, ResNet, NMT, Multi-Interests + 2 in-house: CTR, transformer-NLP) |
| Baselines | Sincronia (general co-flow), TACCL* (intra-job, adapted), CASSINI (inter-job), HiveD + Muri (job schedulers) |
| Crux variants | Crux-PA, Crux-PS-PA, Crux-full |
| Public dataset | github.com/alibaba/alibaba-lingjun-dataset-2023 |
Headline Quantitative Results
96-GPU testbed (real-world):
- Network-path contention (32-GPU GPT + multiple 8-GPU BERT): GPU utilization +8.3% to +12.9%; GPT JCT -11% to -25%; BERT JCT 0% to +3%.
- 4-job mix (48-GPU GPT + 2×8GPU ResNet + 2×16GPU BERT): GPU utilization +13.9%; GPT JCT -18%; BERT JCT -15%; ResNet JCT +2%.
- PCIe contention (BERT vs ResNet variants): GPU utilization +9.5% to +14.8%; BERT JCT -7% to -33%; ResNet JCT +1% to +3%.
2,000-GPU trace simulation:
- Two-layer Clos GPU utilization: Sincronia 0.39, TACCL* 0.49, CASSINI 0.48, Crux-PA 0.51, Crux-PS-PA 0.62, Crux-full 0.62 (+13% to +23% over baselines).
- Double-sided GPU utilization: Sincronia 0.43, TACCL* 0.46, CASSINI 0.45, Crux-PA 0.48, Crux-PS-PA 0.51, Crux-full 0.50 (+4% to +7% over baselines).
- Crux-PS-PA over Crux-PA: +97% network utilization from path selection.
- Crux-full vs Crux-PS-PA: nearly identical — compressing 5,000 jobs to 8 priority levels has near-zero side-effect.
Stacked with job schedulers (Figure 25):
- None=0.39, None+Crux=0.62, Muri=0.59, Muri+Crux=0.73 (+14%), HiveD=0.64, HiveD+Crux=0.75 (+11%).
Microbenchmark vs optimal (1,500 small cases, ≤20 hosts each):
- Path selection: 97.69%; priority assignment: 97.24%; priority compression: 97.12%.
Operating costs:
- 7K LoC; <0.01% bandwidth overhead; <1 min reschedule per job arrival/completion. A 10% utilization gain on a 10,000-GPU cluster (8-GPU servers @ 40/hr)saves * *120,000/day**.
Limitations
- The priority-assignment correction factor k_j assumes continuous compute/ comm with a simple overlap pattern (Figure 12); real DLT jobs interleave many kernels with partial overlap. Authors argue overlap ratio dominates.
- Correction factor uses a single reference job (most network traffic). Enumerating all contending pairs and weighted-averaging would be more optimal but is exponential and requires fine-grained monitoring — impractical for deployment.
- Storage traffic, intra-job collective algorithm choice, and timing of each job's communication can perturb t_j; mitigated in modern compute/storage- separated clusters but not eliminated.
- Crux is topology-independent by design — gains less than topology-aware joint optimization could.
- Lowest-priority jobs see throughput drop up to 55.5% (no full starvation since DLT traffic is bursty); fairness is acknowledged as a trade-off but not explicitly bounded.
- Evaluation uses Clos and double-sided multi-tier topologies; Torus and other architectures are noted as adaptable but not measured.
Open Problems
- Better correction factors that enumerate or weight multiple contending reference jobs without exponential cost.
- Joint topology-aware optimization (path selection + priority + topology layout) for tighter optimality bounds.
- Fairness-aware extensions (Pareto-frontier between GPU intensity and recent throughput decrease).
- Modeling fine-grained kernel-level compute/communication interleaving in the GPU-intensity correction factor.
- Adapting Crux's GPU-intensity formulation to non-Clos topologies (e.g., Torus) and validating empirically.
Note on NCCL Tuning
Crux explicitly cites NCCL as one of the intra-job communication schedulers that is unaware of inter-job contention, and Section 7.1 notes that "specific collective communication algorithms" (which in NCCL means algorithm/protocol/chunk choices) are an unmodeled factor that perturbs t_j. The implication for any per-collective NCCL tuner is concrete: its choices are observed by Crux only as t_j on the bottleneck link, so tuning that shrinks t_j directly raises a job's GPU-intensity rank and gains scheduling priority — but tuning that varies t_j across iterations destabilizes Crux's correction factor and hurts the global schedule. NCCL configuration tuning should therefore optimize both for low and stable t_j when running under a Crux-style inter-job scheduler.