Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud | ACM SIGCOMM '24, August 4-8, 2024, Sydney, NSW, Australia | DOI: 10.1145/3651890.3672239

Problem

In multi-tenant DLT (deep learning training) clusters running large-language- model and legacy workloads, jobs co-execute and contend for shared communication links — both inter-host network paths (where ECMP hashing causes collisions) and intra-host PCIe links. Profiling Alibaba Cloud's in-production 2,000+-GPU / 5,000+-job cluster over two weeks in Aug 2023 showed 36.3% of jobs (occupying 51% GPUs) suffer from communication contention, and a representative GPT+BERT co-location degraded GPU utilization by 9.5% (GPT iter time grew 11.0% from 1.53s to 1.70s; throughput fell 9.9% / 7.7%). Existing job schedulers treat communication as a blackbox; general co-flow schedulers (Coflow, Varys, Sincronia) ignore DLT specifics; intra-job schedulers (TACCL, NCCL, ByteScheduler) ignore inter-job effects; the only inter-job scheduler CASSINI applies static time offsets that cannot adapt to dynamic traffic. There is no scheduler that maximizes overall GPU utilization under inter-job contention.

Core Insight

Maximizing cluster GPU utilization is NP-Complete (reducible to maximum multi-commodity flow), but the authors prove (Theorem 1) that as time grows unbounded, GPU utilization is asymptotically equal to the cumulative GPU intensity I_j = W_j / t_j (compute-flops-per-iteration over slowest-link transmission time) carried over the bottleneck link — so prioritizing high- intensity flows is provably equivalent to maximizing utilization.

Method

Crux is a communication scheduler with three coordinated mechanisms applied on top of any job scheduler:

GPU intensity-based path selection (§4.1). Iterate jobs from highest to lowest GPU intensity; for each, pick the least-congested ECMP path among candidates so high-intensity jobs do not collide.
Priority assignment with DLT correction factor (§4.2). P_j = k_j · I_j (Equation 3). The reference job (most network traffic) gets k=1; other jobs' k_j are computed by comparing iteration length and computation- communication overlap so that prioritizing higher-P jobs strictly raises utilization (e.g., shorter-iteration job gets bumped up — Example 1 shows 41.7% vs 37.5% utilization).
Priority compression to limited levels (§4.3). NICs and switches support only ~8 priority levels. Build a Communication Contention DAG D = <V^D, E^D> with nodes = jobs, directed edges = "shares a link with higher-priority neighbor", edge weight w_e = I_{j_higher}. Compressing N priorities to K levels = Max K-Cut on DAG D (NP-hard). Algorithm 1 approximates by sampling m=10 random topological orders via BFS; for each order, an O(n^2) DP solves Max K-Cut over the sequence (Equation 4 with Quadrangle-Inequality optimization), and the global maximum across orders is returned. Theorems 2-3 prove the surjection between sequence K-Cuts and DAG K-Cuts.

The system (Crux Daemon + Crux Transport, 7K LoC, integrated with PyTorch / TensorFlow / X-DeepLearning via the CoCoLib library) probes paths via INT and varied UDP-source-port packets, measures W_j / t_j by hardware counters under unique highest-priority profiling, estimates iteration period via Fourier Transform, and applies decisions via ibv_modify_qp for RoCEv2 traffic and PCIe-link semaphores for intra-host. Re-scheduling on job arrival/completion takes <1 minute and uses <0.01% network bandwidth.

Experimental Setup

Component	Value
Real-world testbed hosts	12
GPUs per host	8 × NVIDIA A100
Total GPUs	96
RDMA NICs per host	4 × 200 Gbps
Inter-host topology	6 ToR × 8×100 Gbps + 8 Aggregation × 2×100 Gbps
Intra-host	NVLink + PCIe 4.0 x16 + 100 Gbps NIC
Workloads	ResNet, BERT, GPT (small/medium/large)
Production-trace sim cluster	2,000+ GPUs, 5,000+ jobs (Alibaba Lingjun, Aug 2023)
Sim topologies	2-layer Clos (173 ToR + 16 Agg); 3-layer double-sided (6 ToR + 12 Agg + 32 core)
Sim communication model	alpha-beta with link delay + data-size/bandwidth
Sim priority levels	8
Sim model count	11 (BERT, GPT, ResNet, NMT, Multi-Interests + 2 in-house: CTR, transformer-NLP)
Baselines	Sincronia (general co-flow), TACCL* (intra-job, adapted), CASSINI (inter-job), HiveD + Muri (job schedulers)
Crux variants	Crux-PA, Crux-PS-PA, Crux-full
Public dataset	github.com/alibaba/alibaba-lingjun-dataset-2023

Headline Quantitative Results

96-GPU testbed (real-world):

Network-path contention (32-GPU GPT + multiple 8-GPU BERT): GPU utilization +8.3% to +12.9%; GPT JCT -11% to -25%; BERT JCT 0% to +3%.
4-job mix (48-GPU GPT + 2×8GPU ResNet + 2×16GPU BERT): GPU utilization +13.9%; GPT JCT -18%; BERT JCT -15%; ResNet JCT +2%.
PCIe contention (BERT vs ResNet variants): GPU utilization +9.5% to +14.8%; BERT JCT -7% to -33%; ResNet JCT +1% to +3%.

2,000-GPU trace simulation:

Two-layer Clos GPU utilization: Sincronia 0.39, TACCL* 0.49, CASSINI 0.48, Crux-PA 0.51, Crux-PS-PA 0.62, Crux-full 0.62 (+13% to +23% over baselines).
Double-sided GPU utilization: Sincronia 0.43, TACCL* 0.46, CASSINI 0.45, Crux-PA 0.48, Crux-PS-PA 0.51, Crux-full 0.50 (+4% to +7% over baselines).
Crux-PS-PA over Crux-PA: +97% network utilization from path selection.
Crux-full vs Crux-PS-PA: nearly identical — compressing 5,000 jobs to 8 priority levels has near-zero side-effect.

Stacked with job schedulers (Figure 25):

None=0.39, None+Crux=0.62, Muri=0.59, Muri+Crux=0.73 (+14%), HiveD=0.64, HiveD+Crux=0.75 (+11%).

Microbenchmark vs optimal (1,500 small cases, ≤20 hosts each):

Path selection: 97.69%; priority assignment: 97.24%; priority compression: 97.12%.

Operating costs:

7K LoC; <0.01% bandwidth overhead; <1 min reschedule per job arrival/completion. A 10% utilization gain on a 10,000-GPU cluster (8-GPU servers @ 40/hr)saves * *120,000/day**.

Limitations

The priority-assignment correction factor k_j assumes continuous compute/ comm with a simple overlap pattern (Figure 12); real DLT jobs interleave many kernels with partial overlap. Authors argue overlap ratio dominates.
Correction factor uses a single reference job (most network traffic). Enumerating all contending pairs and weighted-averaging would be more optimal but is exponential and requires fine-grained monitoring — impractical for deployment.
Storage traffic, intra-job collective algorithm choice, and timing of each job's communication can perturb t_j; mitigated in modern compute/storage- separated clusters but not eliminated.
Crux is topology-independent by design — gains less than topology-aware joint optimization could.
Lowest-priority jobs see throughput drop up to 55.5% (no full starvation since DLT traffic is bursty); fairness is acknowledged as a trade-off but not explicitly bounded.
Evaluation uses Clos and double-sided multi-tier topologies; Torus and other architectures are noted as adaptable but not measured.

Open Problems

Better correction factors that enumerate or weight multiple contending reference jobs without exponential cost.
Joint topology-aware optimization (path selection + priority + topology layout) for tighter optimality bounds.
Fairness-aware extensions (Pareto-frontier between GPU intensity and recent throughput decrease).
Modeling fine-grained kernel-level compute/communication interleaving in the GPU-intensity correction factor.
Adapting Crux's GPU-intensity formulation to non-Clos topologies (e.g., Torus) and validating empirically.

Note on NCCL Tuning

Crux explicitly cites NCCL as one of the intra-job communication schedulers that is unaware of inter-job contention, and Section 7.1 notes that "specific collective communication algorithms" (which in NCCL means algorithm/protocol/chunk choices) are an unmodeled factor that perturbs t_j. The implication for any per-collective NCCL tuner is concrete: its choices are observed by Crux only as t_j on the bottleneck link, so tuning that shrinks t_j directly raises a job's GPU-intensity rank and gains scheduling priority — but tuning that varies t_j across iterations destabilizes Crux's correction factor and hurts the global schedule. NCCL configuration tuning should therefore optimize both for low and stable t_j when running under a Crux-style inter-job scheduler.