Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud | ACM SIGCOMM '24, August 4-8, 2024, Sydney, NSW, Australia | DOI: 10.1145/3651890.3672239


Problem

In multi-tenant DLT (deep learning training) clusters running large-language- model and legacy workloads, jobs co-execute and contend for shared communication links — both inter-host network paths (where ECMP hashing causes collisions) and intra-host PCIe links. Profiling Alibaba Cloud's in-production 2,000+-GPU / 5,000+-job cluster over two weeks in Aug 2023 showed 36.3% of jobs (occupying 51% GPUs) suffer from communication contention, and a representative GPT+BERT co-location degraded GPU utilization by 9.5% (GPT iter time grew 11.0% from 1.53s to 1.70s; throughput fell 9.9% / 7.7%). Existing job schedulers treat communication as a blackbox; general co-flow schedulers (Coflow, Varys, Sincronia) ignore DLT specifics; intra-job schedulers (TACCL, NCCL, ByteScheduler) ignore inter-job effects; the only inter-job scheduler CASSINI applies static time offsets that cannot adapt to dynamic traffic. There is no scheduler that maximizes overall GPU utilization under inter-job contention.


Core Insight

Maximizing cluster GPU utilization is NP-Complete (reducible to maximum multi-commodity flow), but the authors prove (Theorem 1) that as time grows unbounded, GPU utilization is asymptotically equal to the cumulative GPU intensity I_j = W_j / t_j (compute-flops-per-iteration over slowest-link transmission time) carried over the bottleneck link — so prioritizing high- intensity flows is provably equivalent to maximizing utilization.


Method

Crux is a communication scheduler with three coordinated mechanisms applied on top of any job scheduler:

  1. GPU intensity-based path selection (§4.1). Iterate jobs from highest to lowest GPU intensity; for each, pick the least-congested ECMP path among candidates so high-intensity jobs do not collide.
  2. Priority assignment with DLT correction factor (§4.2). P_j = k_j · I_j (Equation 3). The reference job (most network traffic) gets k=1; other jobs' k_j are computed by comparing iteration length and computation- communication overlap so that prioritizing higher-P jobs strictly raises utilization (e.g., shorter-iteration job gets bumped up — Example 1 shows 41.7% vs 37.5% utilization).
  3. Priority compression to limited levels (§4.3). NICs and switches support only ~8 priority levels. Build a Communication Contention DAG D = <V^D, E^D> with nodes = jobs, directed edges = "shares a link with higher-priority neighbor", edge weight w_e = I_{j_higher}. Compressing N priorities to K levels = Max K-Cut on DAG D (NP-hard). Algorithm 1 approximates by sampling m=10 random topological orders via BFS; for each order, an O(n^2) DP solves Max K-Cut over the sequence (Equation 4 with Quadrangle-Inequality optimization), and the global maximum across orders is returned. Theorems 2-3 prove the surjection between sequence K-Cuts and DAG K-Cuts.

The system (Crux Daemon + Crux Transport, 7K LoC, integrated with PyTorch / TensorFlow / X-DeepLearning via the CoCoLib library) probes paths via INT and varied UDP-source-port packets, measures W_j / t_j by hardware counters under unique highest-priority profiling, estimates iteration period via Fourier Transform, and applies decisions via ibv_modify_qp for RoCEv2 traffic and PCIe-link semaphores for intra-host. Re-scheduling on job arrival/completion takes <1 minute and uses <0.01% network bandwidth.


Experimental Setup

Component Value
Real-world testbed hosts 12
GPUs per host 8 × NVIDIA A100
Total GPUs 96
RDMA NICs per host 4 × 200 Gbps
Inter-host topology 6 ToR × 8×100 Gbps + 8 Aggregation × 2×100 Gbps
Intra-host NVLink + PCIe 4.0 x16 + 100 Gbps NIC
Workloads ResNet, BERT, GPT (small/medium/large)
Production-trace sim cluster 2,000+ GPUs, 5,000+ jobs (Alibaba Lingjun, Aug 2023)
Sim topologies 2-layer Clos (173 ToR + 16 Agg); 3-layer double-sided (6 ToR + 12 Agg + 32 core)
Sim communication model alpha-beta with link delay + data-size/bandwidth
Sim priority levels 8
Sim model count 11 (BERT, GPT, ResNet, NMT, Multi-Interests + 2 in-house: CTR, transformer-NLP)
Baselines Sincronia (general co-flow), TACCL* (intra-job, adapted), CASSINI (inter-job), HiveD + Muri (job schedulers)
Crux variants Crux-PA, Crux-PS-PA, Crux-full
Public dataset github.com/alibaba/alibaba-lingjun-dataset-2023

Headline Quantitative Results

96-GPU testbed (real-world):

2,000-GPU trace simulation:

Stacked with job schedulers (Figure 25):

Microbenchmark vs optimal (1,500 small cases, ≤20 hosts each):

Operating costs:


Limitations


Open Problems

Note on NCCL Tuning

Crux explicitly cites NCCL as one of the intra-job communication schedulers that is unaware of inter-job contention, and Section 7.1 notes that "specific collective communication algorithms" (which in NCCL means algorithm/protocol/chunk choices) are an unmodeled factor that perturbs t_j. The implication for any per-collective NCCL tuner is concrete: its choices are observed by Crux only as t_j on the bottleneck link, so tuning that shrinks t_j directly raises a job's GPU-intensity rank and gains scheduling priority — but tuning that varies t_j across iterations destabilizes Crux's correction factor and hurts the global schedule. NCCL configuration tuning should therefore optimize both for low and stable t_j when running under a Crux-style inter-job scheduler.