Crux: GPU-Efficient Communication Scheduling for Deep Learning Training — Detailed Summary

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud | ACM SIGCOMM '24, August 4-8, 2024, Sydney, NSW, Australia | DOI: 10.1145/3651890.3672239

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

Deep learning training (DLT), e.g., large language model (LLM) training, has become one of the most important services in multi-tenant cloud computing.
A study of in-production DLT jobs reveals that communication contention among different DLT jobs seriously influences overall GPU computation utilization and degrades training cluster efficiency.
The paper introduces Crux, a communication scheduler that aims to maximize GPU computation utilization by mitigating inter-job communication contention.
Maximizing GPU utilization for DLT is NP-Complete; the authors formulate and prove a novel theorem that lets them reduce the problem to GPU-intensity-aware communication scheduling.
Crux prioritizes flows from DLT jobs with high GPU computation intensity to reduce potential contention.
A 96-GPU testbed shows Crux improves GPU computation utilization by 8.3% to 14.8%.
Large-scale production trace-based simulation shows Crux increases GPU computation utilization by up to 23% versus alternatives including Sincronia, TACCL, and CASSINI.

1. Introduction

DL has empowered numerous NLP businesses such as Microsoft 365 Copilot, Firefly, and GitHub Copilot; DLT jobs run on large-scale clusters with tens of thousands of GPUs at AWS, Azure, GCP, and Alibaba Cloud.
DLT jobs co-execute on shared GPU clusters; GPU schedulers allocate GPUs to jobs (illustrated in Figure 1).
As a production DLT service provider, the authors' goal is to maximize GPU utilization, which directly affects training throughput and cluster profit (training cost and customer expense).
Co-executing jobs in the same GPU cluster increases per-job training time versus running alone, with decreased GPU utilization. Example: GPU utilization drops by 9.5% when co-executing a GPT job and a BERT job, which on a 1,000-GPU cluster wastes 95 GPUs (millions of dollars).
Profiling the entire DLT lifecycle (Section 2.2) shows the performance interference between DLT jobs mainly stems from communication contention — when synchronizing parameters/gradients/optimizers among GPUs in the same host or different hosts.

State of the art (Figure 2 taxonomy):

Job scheduler category (Mari, Gandiva, Themis, AntMan, Shockwave, hive): most schedulers treat communication (inter-host network and intra-host PCIe/NVLink) as a blackbox and focus on computation contention. A few consider communication contention but cannot completely avoid it given unpredictable job arrivals/sizes. Crux focuses on solving communication contention assuming an existing job schedule, hence is orthogonal to job schedulers.
Communication scheduler category — addresses the co-flow problem of determining flow priorities and selecting paths for multiple flows.
- General co-flow schedulers (Coflow, Varys, Sincronia) do not account for DLT-specific characteristics (iterative, bursty data traffic, computation-communication overlap).
- Most current DLT communication schedulers (TACCL, NCCL, SYNDICATE, ByteScheduler, MXDAG, Echelon, CadentFlow, Lina) solve the co-flow problem within a single DLT job; they fail to consider inter-job contention.
- CASSINI is an inter-job communication scheduler that proactively reduces contention by predicting each job's traffic pattern and applying a time-dimension offset; however, traffic patterns are affected by the job itself and other concurrent jobs, so simply shifting jobs based on pattern predictions cannot eliminate contention.

Crux contributions:

Contribution 1: Analysis of the multi-tenant production training cluster reveals that 36.3% of DLT jobs may experience communication contention, causing significant GPU wastage. The dataset is publicly released at https://github.com/alibaba/alibaba-lingjun-dataset-2023. The authors argue inter-job communication scheduling is necessary to improve GPU utilization.
Contribution 2: Transform the NP-Complete problem of maximizing GPU utilization into a GPU-intensity-aware communication scheduling problem. Crux contains: (1) a path-selection algorithm that mitigates contention by choosing the least-congested path for jobs with higher GPU intensity; (2) a priority-assignment algorithm that considers DLT characteristics such as multiple iterations and computation-communication overlap; (3) an efficient priority-compression algorithm that adapts to limited priority levels on practical NICs and switches.
Contribution 3: On a 96-GPU testbed (12 hosts × 8 NVIDIA A100), Crux delivers up to 14.8% GPU utilization improvement and 33% end-to-end improvement with real-world models (GPT, BERT, ResNet) (Section 6.2). Production trace simulation (2,000+ GPUs, 5,000+ jobs) shows Crux improves GPU utilization by 5% to 23% versus state-of-the-art schedulers (Sincronia, CASSINI, TACCL) under various network architectures (Section 6.3).

2. Background and Motivation

2.1 Background: Multi-Tenant DLT Clusters

A DLT cluster (Figure 1) contains a storage layer, hosts equipped with many GPUs/CPUs, and a network (typically a multi-layer Clos) connecting them. Each host holds multiple (e.g., eight) GPUs and NICs; GPUs within hosts are connected by proprietary high-bandwidth links such as NVLink and PCIe fabrics with PCIe switches (PCIeSw).
DLT job traffic typically traverses NVLinks, PCIe, and network links.
DLT jobs run for many iterations to compute a pre-trained model. To fit larger models, parallelism strategies (data parallelism, pipeline parallelism, tensor parallelism) distribute compute across multiple GPUs.
In each iteration, GPUs synchronize parameters/gradients/optimizers via collective operations (AllReduce, Send/Recv, ReduceScatter, AllGather, AllToAll).

2.2 Inter-Job Communication Contention Seriously Degrades GPU Utilization

The authors studied DLT jobs over two weeks in August 2023 on one of their in-production DLT clusters with 2,000+ GPUs in three-layer Clos, serving 5,000+ jobs. Models include LLMs and legacy models such as ResNet and BERT.
Figure 4 shows over 10% of jobs (GPT variants) occupy a minimum of 128 GPUs; the largest job uses up to 512 GPUs.
Figure 5: peak concurrent jobs exceeds 30 (1,000+ GPUs) over the two-week period; the cluster's job scheduler intuitively allocates GPUs in the same host or under the same switch to a job.

Prevalence of inter-job communication contention:

Figure 6: 36.3% of jobs (occupying 51% GPUs) may suffer from communication contention during their lifecycle.
Only a minority of contention occurs on intra-host PCIe links (Figure 3(b)); most occurs on network forwarding paths (Figure 3(a)).
This is because network switches typically use ECMP hash-forwarding by default, which inevitably leads to hash collisions and network link contention.

Why communication contention generally exists:

Better scheduling and increased capacity ease but cannot eliminate contention. Dynamic job arrivals and varying sizes cause resource fragmentation; jobs sharing pods compete for inter-pod links; even with 1:1 oversubscription, hash collisions over uneven traffic produce contention.

Impact of inter-job communication contention on GPU utilization:

The authors sampled a representative pair of co-executing jobs in contention and reproduced their training:
- GPT variant at 64 GPUs distributed across 8 hosts (H1-H8) under two ToR switches (4 hosts each, all 8 GPUs/host).
- BERT at 16 GPUs across 4 hosts (H9-H12) under the same two ToR switches (4 GPUs/host).
They contend on links between ToR switches and aggregation switches.
Result (Figure 7): launching BERT alongside GPT increases GPT's iteration time by 11.0% (1.53s → 1.70s); throughput of GPT and BERT drops 9.9% and 7.7% respectively; GPU utilization drops 9.5%.
Footnote 1: GPT-3 architecture parameters from [13], reduced to 24 transformer layers (from 96) and hidden size 1024 (from 12288) for cost.

2.3 Goal: Optimizing GPU Utilization

For a global DLT training provider, the cluster goal is to optimize overall GPU utilization — more computation per unit time means more profit. A 10% improvement on a 10,000-GPU cluster, assuming an 8-GPU server at 40/hour, saves * *120,000 per day**.
Prior work optimizes job completion time (JCT) which is equivalent to optimizing utilization in single-job cases. In multi-DLT scenarios, naively optimizing JCT can reduce GPU utilization.
Figure 8 example: when two jobs contend for the same link, two scheduling methods can yield the same average JCT but very different overall GPU utilization. Jobs with greater GPU workload have greater impact on overall GPU utilization and should be scheduled with higher priorities. This may introduce unfairness, but the side effect is mild (Section 7).

2.4 Optimizing GPU Utilization via Scheduling Inter-Job Communication Contention

Communication contention significantly degrades both training performance and GPU utilization. The authors argue an efficient inter-job communication scheduler is crucial. Crux schedules from two aspects:
Path selection: select paths for each job based on traffic patterns and topologies. Communication within hosts typically uses the nearest NIC or NVLink directly, so path selection there is unnecessary.
Priority assignment: when contention is inevitable, prioritize jobs with larger impact on GPU utilization.

3. Methodology and System Overview

3.1 Problem and Goal

A GPU cluster is a graph G = <V, E> where V is the set of GPUs and E is the set of links (network and intra-host); each edge e has bandwidth B_e.
J denotes the set of iterative DLT jobs. Each job j ∈ J occupies some GPUs. In each iteration, j generates compute workload W_j (in flops) distributed across its GPUs and a certain amount of network traffic. M_{j,e} is the traffic produced by j on link e per iteration.
Computation and communication may overlap, especially for large-scale jobs.

Definition 1 (GPU Utilization). In a given time period T, if GPU v processes computation workload L_v, the overall GPU utilization U_T of the cluster is:

U_T = sum_{v in V} L_v        (Equation 1)

The goal is to maximize U_T by carefully scheduling each job's data flow (path selection and priority assignment) to avoid contention.

3.2 Deriving GPU Utilization Optimization to Flow Scheduling: A Single Link Case

Maximizing U_T is a more complex version of the maximum multi-commodity flow problem (NP-Complete). Footnote 2: in max multi-commodity flow each link has a bandwidth constraint and one path must be chosen per pair of source/destination to transfer arbitrary data. Crux's problem reduces strictly to it (each job has only one source-destination pair on the bottleneck link with the same compute-comm ratio), so Crux's problem is not easier and remains NP-Complete. Each job is also iterative with computation behavior (not just transmission), adding difficulty.
The authors introduce a new concept, GPU intensity, to reflect a job's impact on GPU utilization.

Definition 2 (GPU intensity). For job j:

I_j = W_j / t_j                (Equation 2)

where t_j is the maximum time a job's traffic takes to traverse any link e in E, calculated as t_j = max_{e in E} (M_{j,e} / B_e).

Insight box: regardless of how communication and computation are overlapped, job j is always able to do W_j computation after t_j communication. Prioritizing higher-I_j traffic unblocks more computation, improving GPU utilization.
Single-link analysis: On a link e_0 with bandwidth B_{e_0} (other links set to infinite bandwidth), define h(t) as the job whose traffic occupies e_0 at time t, with h(t) = ∅ for idle. Define f(t) = I_{h(t)} if h(t) ≠ ∅ else 0; F_T = ∫_T f(t) dt. F_T is the total GPU intensity transmitted by e_0 over T.

Theorem 1. lim_{|T| → ∞} F_T / U_T = 1.

Proven in Appendix A. The intuition: in long durations, maximizing U_T is equivalent to maximizing F_T (the sum of GPU intensity transmitted by the link). Therefore, jobs with higher GPU intensity matter more in communication scheduling.

3.3 Extending to Networks: Crux Overview

Extending Theorem 1 to complex topologies:

When multiple DLT jobs run concurrently, the communication throughput of each contending job is determined by the bandwidth of the most congested bottleneck link. If we treat all other links as infinite, the topology reduces to the single-link case in §3.2, so Theorem 1's conclusion still holds.

Utilizing Theorem 1 in a real DLT cluster:

Ideal scheduling keeps the network transmitting data of the most GPU-intensive job at 100% time. But each DLT job only sends data on part of network paths and part of the time; ECMP supports limited path selection by 5-tuple; DSCP supports only a few priority levels.

Crux's three techniques (Figure 10):

GPU intensity-based path selection (§4.1) — Crux selects different paths for GPU-intensive jobs to avoid contention; more tolerant of contention among low-intensity jobs.
Priority assignment with DLT characteristics (§4.2) — directly assigning priorities by GPU intensity may produce uneven network overload over time (sometimes multiple jobs contend, sometimes the network is idle). Crux analyzes how DLT iteration/overlap behavior changes overload patterns and presents a novel mathematical model to fine-tune priorities.
Compressing priorities to limited levels (§4.3) — NICs/switches typically support no more than 8 levels; Crux compresses globally unique priorities to limited levels with minimal performance loss by selectively compressing unimportant differences.

4. Crux Design

4.1 GPU Intensity-Based Path Selection

Modern DLT clusters use ECMP hashing on redundant links between switches; hash collisions become unavoidable as concurrent jobs grow.
Why GPU intensity matters in path selection: high-intensity jobs are more dominant for cluster utilization. The path selection idea is to ensure high-intensity jobs are not affected by communication contention. When high- and low-intensity jobs compete for the same link, high-intensity jobs receive higher priority (assigned in §4.2); the main challenge is to avoid contention between high-intensity jobs.
Crux's approach: GPU-intensity-based path selection. For multiple DLT jobs, Crux makes path selection starting from the most GPU-intensive job to the least. For each job, Crux selects the least congested path among all available options at that moment. This selects distinct paths for GPU-intensive jobs from each other. Crux refreshes path selection on dynamic job arrivals/completions (§5).

4.2 Priority Assignment

Goal: assign a unique priority P_j to each job j such that: (1) If P_{j_1} > P_{j_2}, prioritizing j_1 should yield higher overall GPU utilization. (2) If P_{j_1} = P_{j_2}, prioritizing either should yield the same GPU utilization.

Why GPU intensity should be fine-tuned:

DLT jobs differ in iteration time and computation-communication overlap behavior; assigning priorities purely by GPU intensity may yield uneven network overload (similar to CASSINI's observation).

Two motivating examples:

Example 1 (Figure 11) — iteration time: Job 1 (W_1 = 10 Gflops, t_1 = 2s) vs Job 2 (W_2 = 5 Gflops, t_2 = 1s) compete for one link, both 10 GPUs. Both have equal GPU intensity by Equation 2. If Job 1 is prioritized: GPUs of Job 1 are idle 6 s, GPUs of Job 2 idle 9 s, overall utilization 37.5%. If Job 2 is prioritized: utilization reaches 41.7%, because prioritizing the shorter-iteration job better utilizes bandwidth to transmit more data, raising GPU utilization.
Example 2 (Figure 12) — computation-communication overlap: Job 1 (W_1 = 10 Gflops, t_1 = 1s, 2 GPUs, 4s compute) vs Job 2 (W_2 = 30 Gflops, t_2 = 3s, 12 GPUs, 2s compute) compete for one link. They have equal GPU intensity. Assuming a job starts communication after finishing 50% computation of one iteration: prioritizing Job 1 leaves 12 GPUs of Job 2 idle for 7 s; prioritizing Job 2 leaves 2 GPUs of Job 1 idle 6 s. Communication delay of ≤1 s for Job 1 doesn't hurt because its communication can be fully overlapped by computation; Job 2 is more sensitive to communication delay since its computation cannot fully overlap.

Modeling DLT characteristics in priority assignment:

Define a correction factor k_j for each job j, bridging GPU intensity and priority:

P_j = k_j * I_j                (Equation 3)

To compute k_j, Crux picks the job with the most network traffic in the cluster as the reference job (most likely to contend with others), and sets k_{reference} = 1. Correction factors of other jobs are computed by comparing with the reference job.
Worked Example 1 follow-up: Job 1 is the reference (P_{j_1} = I_{j_1}); given iteration time, prioritize each and compare. Prioritizing Job 1 lets the network carry Job 1's data 6 s and Job 2's data 3 s; conversely 4 s and 6 s. To equalize compute overload, Job 1 should pass 3s/2s = 1.5x Job 2's data (so I_{j_1} = 1.5 I_{j_2}), giving k_{j_2} = 1.5 k_{j_1} = 1.5. Correction factors are different when picking a different reference; the difference is the iteration length, computation-communication overlap, and possibly other aspects, which can be folded into k.

4.3 Priority Compression

Real-world clusters' NICs and switches support limited priority levels (typically 8 physical levels); some are reserved for TCP, instruction transmission, or proactive diagnosis. So multiple jobs end up sharing the same priority, leading to random communication contention.
Compression challenge: minimize GPU utilization loss caused by such contention.

Example (Figure 13): 4 jobs with decreasing priorities mapped to 2 levels (high=1, low=0). Sincronia assigns Job 1 to high, others to low. Varys uses a balanced split: Job 1-2 high, Job 3-4 low. Both are reasonable absent further information. With knowledge of which jobs share links and how intensely they contend, the optimal compression assigns Job 1,3 to high and Job 2,4 to low.

Crux's approach — DAG model and Max K-Cut formulation:

Define DAG D = <V^D, E^D>. Each node ∈ V^D represents a job in J. For any two jobs j_1, j_2 sharing network links, with j_1 assigned higher priority, there is an edge e ∈ E^D from j_1 to j_2. Edge weight w_e (or w_{j_1, j_2}) = I_{j_1}, since the loss of j_1 losing GPU utilization is proportional to its GPU intensity.
The DAG is called the Communication Contention DAG (Figure 14).
Optimization goal: minimize GPU utilization loss during compression. Given a valid priority compression for J, the corresponding node set V^D is partitioned into K subsets. For each edge e, if both nodes are in the same subset, GPU utilization is lost by w_e. The optimal partition maximizes the total weight of links whose nodes are NOT in the same subset — equivalent to maximizing the link cut by K-partitioning. Hence the problem is the Max K-Cut problem in DAGs.

Algorithm: approximating Max K-Cut for DAG.

Approximation idea: solve the problem in a constrained solution space defined by topological orderings of the DAG, then expand back to the original space.
Suppose {a_n} = {a_1, ..., a_n} is a topological order of DAG D where n = |V^D|. Partition {a_n} into K consecutive subsequences B_1 = {a_1, ..., a_{i_1}}, ..., B_K = {a_{i_{K-1}+1}, ..., a_n}. By Theorems 2 and 3 (Appendix B), Max K-Cut over sequence {a_n} is a surjection to the K-Cut of DAG D.
Compared with Max K-Cut of D, finding it for sequence {a_n} is much easier. Define f(i, k) as the max k-Cut for {a_1, ..., a_i}:

f(i, k) = max_{1 <= j < i} [ f(j, k-1) + C_{j, i} ]    (Equation 4)

where C_{j, i} = sum of edge weights from {a_1, ..., a_j} to {a_{j+1}, ..., a_i} in DAG D.

f(n, K) is the Max K-Cut for {a_1, ..., a_n} and can be solved by dynamic programming. f has nK states; constant matrix C precomputed in O(n^2); f's state conversion is O(n) but optimizable to O(1) by Quadrangle Inequalities (proven monotonic). Therefore, given a topological sequence, K-Cut is computed in O(n^2).
Since each topological order has its own optimal K-Cut (a constrained K-Cut), Crux approximates the global Max K-Cut for the DAG by sampling enough topological order sequences of D.

Algorithm 1 — Priority Compression (pseudocode summary):

Input: DAG D = <V^D, E^D>. Output: K-Cut for D.
For cases 1 to m (in practice m = 10 random topological orders via BFS):
- Generate random topological order {a_1, ..., a_n}.
- Preprocess matrix C: S_{i,k} = S_{i-1,k} + S_{i, k-1} - S_{i-1, k-1} + w_{a_i, a_k}; C_{i, j} = S_{i, j} - S_{i, i}.
- Compute f(i, k) by DP: f(i, k) = max_{g(i-1, k) <= j < i} [f(j, k-1) + C_{j, i}]; record argmax in g; build Cut(i, k) = Cut(g(i, k), k-1) + {a_{g(i, k)+1}, ..., a_i}.
- If f(n, K) > MaxCut: update MaxCut and OutputCut.

4.4 Effectiveness Validation

Microbenchmark constructed since the algorithm's optimality is hard to prove mathematically.
Simulation settings: 1,500 small-scale cases. Each case ≤20 hosts (8 GPUs each), 2-layer Clos with 2-4 ToRs and 2 aggregation switches. 5 different DLT jobs running in the cluster, assigned to 3 priority levels.
Baselines: TACCL* (intra-job; the authors implement TACCL's specification for inter-job using the least-congested-link insight and longest-distance prioritization). Sincronia and Varys for priority assignment. Sincronia for compression. Optimal solution applied to other two mechanisms when ablating one.
Results (Figure 16): CDF of error rate from optimal. Crux provides 97.69%, 97.24%, and 97.12% performance in path selection (§4.1), priority assignment (§4.2), and priority compression (§4.3) compared with optimal. Each part is much closer to optimal than alternatives.
Limitations of Crux algorithms discussed in §7.1.

5. Implementation

Crux is 7K lines of code for widely-used DLT frameworks (PyTorch, TensorFlow, X-DeepLearning).
The converged communication library (CoCoLib) supports RoCEv2, TCP, etc., and provides collective communication APIs (e.g., AllReduce) and Send/Receive API satisfying various DLT models. Jobs use Crux by replacing original communication libraries with CoCoLib.
Crux Daemon (CD) and Crux Transport (CT) are two core modules deployed in each host. CD runs as a daemon collecting information and making scheduling decisions. CT executes the decisions. For efficiency, in each job only a leader CD makes scheduling decisions and synchronizes with others. In real-world deployment, Crux uses <0.01% network bandwidth for these synchronizations and decisions.
Crux requires path information (intra-host and inter-host), then executes scheduling decisions for each job — without modifying DLT model code.

Path information probing:

For intra-host (PCIe paths), Crux uses Intel PCM and AMD uProf.
For inter-host paths, Crux collects path information between each pair of hosts using probing packets. Since GPU clusters use ECMP forwarding via five-tuple hashing, Crux finds a suitable 16-bit UDP source port for each candidate path by sending probes with varied source ports until all candidate paths can be reached. Crux employs INT (in-network telemetry, widely deployed in current switches) to insert per-hop information into probing packets. INT is not imperative; path info can also be derived via hash-algorithm calculation.

Job information measurement:

Crux requires W_j and t_j to compute GPU intensity (Equation 2).
To ensure accurate measurement, Crux assigns a unique highest priority to a job during profiling, preventing contention.
Hardware monitoring (GPU, NIC, PCIe) measures computation and communication overloads. For computation: directly sum GPU overload during a fixed monitoring period (e.g., 30s). For communication: sum the duration of data transfers. Both overloads divided by iteration count yield W_j and t_j.
Iteration duration estimated by Fourier Transform — convert communication from time domain to frequency domain to estimate iteration duration.

Execute scheduling decisions:

On new job arrival, Crux probes/measures the new job and reassigns paths and priorities for all existing jobs. CTs invoke ibv_modify_qp to set UDP source ports of RoCEv2 connections; packets are delivered via certain paths based on ECMP. CTs set IP traffic classes corresponding to assigned priority levels via ibv_modify_qp to enter different packet queues of NIC/switches.
For intra-host priority assignment, CDs maintain semaphores for PCIe links and block PCIe communication of lower-priority jobs when higher-priority jobs use PCIe links. CTs access the semaphores via inter-process shared memory.
Jobs rescheduled or finished trigger Crux to treat as new status and reassign paths/priorities. Profiling and re-scheduling overhead takes less than one minute per job arrival/completion (DLT job durations range from hours to months), considered negligible.

6. Evaluation

Three results to be shown:

96-GPU testbed: Crux improves GPU utilization by up to 14.8% and reduces up to 33% JCT for GPU-intensive jobs.
2,000-GPU production trace simulation: Crux improves GPU utilization by 4% to 23% versus state-of-the-art.
Crux is compatible with state-of-the-art job schedulers and further improves GPU utilization and DLT performance by 11% to 23%.

6.1 Experiment Setup

Real-world testbed (Figure 18):

12 hosts × 8 NVIDIA A100 GPUs = 96 GPUs. Each host has 4×200Gbps RDMA NICs.
Each host connects to one ToR switch via four links (every two GPUs use one switch via a shared link, e.g., GPU 0&1 to switch 1 link 1, GPU 2&3 to switch 2 link 2). Inter-host topology: 6 ToR switches × 8×100Gbps + 8 Aggregation switches × 2×100Gbps. Intra-host: NVLinks plus PCIe 4.0 x16 fabric and NIC at 100 Gbps.

Simulated GPU cluster:

Simulator implements GPU cluster networks for computation and communication of multiple jobs. Computation simulated by directly running models on actual GPUs to obtain per-iteration time. Communication simulated using the alpha-beta model considering both physical link delay and data-size/ bandwidth delay. Simulator assumes 8 priority levels.
Two topologies for trace simulation:
- Double-sided topology: 6 ToR + 12 aggregation + 32 core switches; each host connects to two ToRs via 8 links — actual topology used in trace.
- Two-layer Clos topology: 173 ToR + 16 aggregation switches; each host connects to one ToR.

Performance metrics:

Overall GPU utilization is the optimization goal; also reports per-job JCT (inversely proportional to throughput).

6.2 Real-World Evaluation

Real-world co-locations: Each job runs ResNet, BERT, or GPT (small, medium, large). Standalone training treated as the ideal.

Contention on network paths (Figure 19):

Co-locate a 32-GPU GPT job with multiple 8-GPU BERT jobs.
As contention increases, GPU utilization decreases (without Crux).
With Crux, overall GPU utilization improves by 8.3% to 12.9%, close to the ideal case.
Per-job: GPT gets 11% to 25% shorter JCT at the expense of small BERT JCT increase of 0% to 3%. BERT increase is mitigated through effective path selection. Net outcome favors GPU utilization since GPT is more important.

Co-location of 48GPU GPT + 2×8GPU ResNet + 2×16GPU BERT (Figure 20):

GPT has highest GPU utilization, ResNet lowest.
With Crux: GPT's JCT decreases by 18%, BERT's by 15%, ResNet's increases by 2% to reserve network bandwidth for the other models.
Overall GPU utilization increases by 13.9% (Figure 20(a)).

Contention on PCIe (Figure 21, Figure 22):

Two PCIe contention cases:
- Case 1: 16GPU BERT + multiple 4GPU ResNet jobs (Figure 21).
- Case 2: 8GPU ResNet + BERT with varying GPU counts (8, 16, 24) (Figure 22).
Both cases: Crux yields higher GPU utilization (+9.5% to +14.8%), close to ideal.
BERT's JCT decreases significantly (-7% to -33%) due to its high GPU intensity; ResNet's JCT slightly increases (+1% to +3%).

6.3 Trace-based Simulation

Compare Crux with: Sincronia (general co-flow), TACCL* (intra-job, §4.4), CASSINI (inter-job).
Crux variants: Crux-PA (priority assignment only), Crux-PS-PA (path selection + priority assignment), Crux-full (path selection + priority assignment + priority compression).
11 different models evaluated: 5 open-source (BERT, GPT, ResNet, NMT, Multi-Interests), plus two in-house models for Click-Through-Rate and transformer-based NLP.

Performance under different cluster configurations:

Figure 23: Two-layer Clos network and three-layer double-sided network. Compared to baselines, Crux improves GPU utilization by 13% to 23% and by 4% to 7%, respectively. Bar values:
- Clos: Sincronia 0.39, TACCL* 0.49, CASSINI 0.48, Crux-PA 0.51, Crux-PS-PA 0.62, Crux-full 0.62.
- Double-sided: Sincronia 0.43, TACCL* 0.46, CASSINI 0.45, Crux-PA 0.48, Crux-PS-PA 0.51, Crux-full 0.50.

GPU intensity in network vs. GPU utilization (Figure 24):

Figure 24 shows real-time distribution of jobs' GPU intensity transmitted in the network across baselines and Crux variants. Darker color = higher GPU intensity transmission.
Comparing Sincronia (Figure 24(a)), TACCL* (b), CASSINI (c), Crux-PA (d): Crux-PA has darker colors → prioritizes GPU-intensive jobs during contention. On the first day, baselines' colors are generally lighter; Crux improvements are 26%, 14%, 5% respectively in average GPU utilization.
Comparing Crux-PA (d) and Crux-PS-PA (e): Crux-PS-PA's non-white area is much larger, indicating better-utilized network paths from path selection; the increase yields 97% increase in network utilization.
Comparing Crux-PS-PA (e) and Crux-full (f): GPU intensity distributions are highly consistent — compressing 5,000 jobs to 8 priorities introduces nearly no side effect.

6.4 Working Together with Job Schedulers

Crux is orthogonal to job schedulers and can complement them. Two schedulers evaluated: HiveD (allocates GPUs by physical affinity to minimize fragmentation) and Muri (schedules by reducing idle links).
Without job scheduling ("None"), Muri and HiveD improve GPU utilization by 20% and 25% respectively. Together with Crux, GPU utilization further improves by 14% (Muri+Crux) and 11% (HiveD+Crux).
Bar values (Figure 25): None=0.39, None+Crux=0.62, Muri=0.59, Muri+Crux= 0.73, HiveD=0.64, HiveD+Crux=0.75.
Even with state-of-the-art job schedulers, communication contention remains — leaving room for Crux to add gain.

7. Discussions

7.1 Limitations in Crux Algorithm

Definition of GPU intensity helps greatly but a gap to optimal remains. Crux makes several simplifications.
First — overlap simplification (§4.2): Crux assumes computation and communication are continuous in one iteration with simple overlap pattern (Figure 12). Real DLT jobs may interleave many comm/compute kernels with partial overlap. The most important factor is the overlap ratio, and jobs with less overlap are more sensitive to communication latency, so the assumption is sufficient.
Second — single reference job for correction factor: Crux uses only the job with most network traffic as reference. A better solution would enumerate all combinations of contending jobs, compute correction factors for each, and weighted-average them by contention frequency. This requires exponential complexity and fine-grained network monitoring — impractical for deployment. Crux's simplification trades modest optimality for deployability.
Third — other influencing factors: Communication performance is also affected by storage traffic, timing of each job's communication, and specific collective communication algorithms. Storage traffic may vary across iterations and affect t_j. Modern GPU clusters typically use compute/storage separation, limiting storage interference. Evaluation shows Crux still makes scheduling close to optimal.

7.2 Job Fairness in Scheduling

Crux optimizes utilization, which means low-intensity jobs (typically with fewer GPUs) may be sacrificed for higher-intensity jobs (more GPUs). The authors argue GPU-intensive jobs should be treated more carefully because customers pay more for them, so unfairness is outweighed by utilization gains.
Evaluation: although some jobs lose performance under Crux, no job is starved. Lowest-priority jobs experience a 55.5% throughput decrease, not a complete halt — DLT traffic patterns are periodic and bursty so links are idle a significant fraction of time, allowing low-priority jobs to communicate during idle intervals.
Crux can be extended to consider fairness, e.g., compute weighted average of GPU intensity and recent throughput decrease per job for the final priority assignment, or compute Pareto-optimal frontier.

7.3 Adaptability to Other Topologies

Crux's GPU-intensity-based scheduling is independent of topology. It can be adapted to any topology, though topology-independent design may sacrifice some optimality available to topology-aware joint optimization.
Clos and double-sided topologies are representative of multi-layered multi-tenant architectures. Less common topologies (e.g., Torus) face the same multi-tenant communication contention problem; Crux can potentially apply.

Job scheduler: A series of works (Mari, Gandiva, Themis, AntMan, HiveD, Shockwave, hive, [20, 21, 30, 31, 33, 46, 57, 58, 63, 64]) target overall training performance by GPU allocation. Crux is orthogonal — targets high-level inter-job arrangement; can also work together with them.

Communication scheduler: Three categories.

General co-flow schedulers: Coflow [22], Orchestra [24], Varys [25], Weaver [35], PIAS [17], Amoeba [60], Sincronia [16], Aalo [23], Karuna [19], Baraat [26]. Average completion time / deadline scheduling. Unaware of how different co-flows affect end-to-end DLT performance and GPU utilization.
Intra-job communication schedulers: ByteScheduler [50], EchelonFlow [48], Lina [42], CadentFlow [38], SYNDICATE [47], MXDAG [55], BGL [45], Janus [29], ACCL [28], TACCL [53] — allocate priorities or paths to flows in a single DLT job, optimizing scheduling decisions but unaware of inter-job contention. Crux is orthogonal.
Inter-job: CASSINI [51] — proactively reuses network links over time by applying a time-dimension offset per job. However, dynamic networks and jobs affect actual traffic patterns, so merely setting time offsets cannot eliminate contention.

9. Conclusion

Crux is a communication scheduler that tackles inter-job communication contention and optimizes cluster GPU utilization.
Proposes the GPU intensity concept, reducing GPU utilization optimization to a flow optimization problem.
Three techniques: GPU-intensity-based path selection, priority assignment, and priority compression.
Evaluation: improves cluster GPU utilization by up to 23% versus state-of-the-art.
Ethics: this work raises no ethical issues.
Acknowledgements: thank shepherd Sujata Banerjee and SIGCOMM reviewers; Yu Zhou for the earliest implementation; Xin Jin and Minlan Yu for feedback; Ennan Zhai is the corresponding author.

Appendix A — Proof of Theorem 1

For job j at time t, define g_j(t) = I_{h(t)} if h(t) = j, else 0; G_{j,T} = ∫T g_j(t) dt; F_T = sum_j G{j,T}.
S_j = transmission time on link e_0 by job j. G_{j,T} = I_j * S_j.
N_j = S_j / t_j is the number of iterations transmittable for job j (not necessarily integer). Actual N'_j (integer) satisfies |N_j - N'_j| ≤ 1 (i.e., -1 ≤ N'_j - N_j ≤ +1, Inequality 5; Inequality 6 substitutes S_j/t_j).
Cluster computation workload U_T = sum_v L_v = sum_j W_j N'_j (Equation 7).
Substituting bounds (Inequality 8): F_T - sum W_j ≤ U_T ≤ F_T + sum W_j (Inequality 9). Dividing by U_T: 1 - sum W_j / U_T ≤ F_T / U_T ≤ 1 + sum W_j / U_T (Inequality 10).
As |T| → ∞, U_T → ∞ since jobs always run; both bounds → 1, so lim F_T / U_T = 1 (Equation 11). QED.

Appendix B — Proof of Theorems in §4.3

Theorem 2: Any K-Cut of sequence {a_n} is a valid K-Cut for DAG D. Proof: by definition of topological ordering there cannot exist edge from a_j to a_i for i < j. Therefore, no edges from B_j to B_i for i < j; the partition forms a valid K-Cut.
Theorem 3: For any valid K-Cut of D, there exists at least one topological order sequence {a_n} with a K-Cut on it equivalent to that K-Cut. Proof sketch: given K-Cut subsets V_1, ..., V_K, build sub-DAG D_i with topological order b_{i,1}, ..., b_{i, |V_i|}; concatenate to form a topological ordering of D whose partition matches.

Quantitative Tables Summary

Metric	Value
Production cluster size	2,000+ GPUs, 5,000+ jobs
Two-week peak concurrent jobs	30+ (1,000+ GPUs)
Largest single job	512 GPUs
Jobs with min 128 GPUs	>10% of jobs (GPT variants)
Jobs at risk of communication contention	36.3% (51% GPUs)
GPT iter time increase from BERT contention	11.0% (1.53s → 1.70s)
GPT throughput decrease	9.9%
BERT throughput decrease	7.7%
GPU utilization decrease	9.5%
Daily savings at 10% GPU util gain on 10K cluster	$120,000
Crux LOC	7K
Crux scheduling overhead	<0.01% bandwidth, <1 min per arrival
Microbenchmark — Crux path selection vs optimal	97.69%
Microbenchmark — Crux priority assignment vs optimal	97.24%
Microbenchmark — Crux priority compression vs optimal	97.12%
Testbed — GPU util gain (network contention)	+8.3% to +12.9%
Testbed — GPT JCT improvement	-11% to -25%
Testbed — BERT JCT change	0% to +3%
Testbed (4-job) — GPU util gain	+13.9%
Testbed (4-job) — GPT JCT	-18%
Testbed (4-job) — BERT JCT	-15%
Testbed (4-job) — ResNet JCT	+2%
PCIe contention — GPU util gain	+9.5% to +14.8%
PCIe contention — BERT JCT	-7% to -33%
PCIe contention — ResNet JCT	+1% to +3%
Trace sim Clos — Crux GPU util	0.62
Trace sim Clos — vs baselines	+13% to +23%
Trace sim Double-sided — vs baselines	+4% to +7%
Trace sim — Crux network util gain (PS-PA vs PA)	+97%
Lowest-priority job throughput decrease	55.5% (no halt)
Crux + Muri GPU util gain	+14% over Muri alone
Crux + HiveD GPU util gain	+11% over HiveD alone