Crux: GPU-Efficient Communication Scheduling for Deep Learning Training — Detailed Summary

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai | Alibaba Cloud | ACM SIGCOMM '24, August 4-8, 2024, Sydney, NSW, Australia | DOI: 10.1145/3651890.3672239

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction

State of the art (Figure 2 taxonomy):

Crux contributions:


2. Background and Motivation

2.1 Background: Multi-Tenant DLT Clusters

2.2 Inter-Job Communication Contention Seriously Degrades GPU Utilization

Prevalence of inter-job communication contention:

Why communication contention generally exists:

Impact of inter-job communication contention on GPU utilization:

2.3 Goal: Optimizing GPU Utilization

2.4 Optimizing GPU Utilization via Scheduling Inter-Job Communication Contention


3. Methodology and System Overview

3.1 Problem and Goal

Definition 1 (GPU Utilization). In a given time period T, if GPU v processes computation workload L_v, the overall GPU utilization U_T of the cluster is:

U_T = sum_{v in V} L_v        (Equation 1)

Definition 2 (GPU intensity). For job j:

I_j = W_j / t_j                (Equation 2)

where t_j is the maximum time a job's traffic takes to traverse any link e in E, calculated as t_j = max_{e in E} (M_{j,e} / B_e).

Theorem 1. lim_{|T| → ∞} F_T / U_T = 1.

3.3 Extending to Networks: Crux Overview

Extending Theorem 1 to complex topologies:

Utilizing Theorem 1 in a real DLT cluster:

Crux's three techniques (Figure 10):


4. Crux Design

4.1 GPU Intensity-Based Path Selection

4.2 Priority Assignment

Why GPU intensity should be fine-tuned:

Two motivating examples:

Modeling DLT characteristics in priority assignment:

P_j = k_j * I_j                (Equation 3)

4.3 Priority Compression

Example (Figure 13): 4 jobs with decreasing priorities mapped to 2 levels (high=1, low=0). Sincronia assigns Job 1 to high, others to low. Varys uses a balanced split: Job 1-2 high, Job 3-4 low. Both are reasonable absent further information. With knowledge of which jobs share links and how intensely they contend, the optimal compression assigns Job 1,3 to high and Job 2,4 to low.

Crux's approach — DAG model and Max K-Cut formulation:

Algorithm: approximating Max K-Cut for DAG.

f(i, k) = max_{1 <= j < i} [ f(j, k-1) + C_{j, i} ]    (Equation 4)

where C_{j, i} = sum of edge weights from {a_1, ..., a_j} to {a_{j+1}, ..., a_i} in DAG D.

Algorithm 1 — Priority Compression (pseudocode summary):

4.4 Effectiveness Validation


5. Implementation

Path information probing:

Job information measurement:

Execute scheduling decisions:


6. Evaluation

Three results to be shown:

6.1 Experiment Setup

Real-world testbed (Figure 18):

Simulated GPU cluster:

Performance metrics:

6.2 Real-World Evaluation

Real-world co-locations: Each job runs ResNet, BERT, or GPT (small, medium, large). Standalone training treated as the ideal.

Contention on network paths (Figure 19):

Co-location of 48GPU GPT + 2×8GPU ResNet + 2×16GPU BERT (Figure 20):

Contention on PCIe (Figure 21, Figure 22):

6.3 Trace-based Simulation

Performance under different cluster configurations:

GPU intensity in network vs. GPU utilization (Figure 24):

6.4 Working Together with Job Schedulers


7. Discussions

7.1 Limitations in Crux Algorithm

7.2 Job Fairness in Scheduling

7.3 Adaptability to Other Topologies


Job scheduler: A series of works (Mari, Gandiva, Themis, AntMan, HiveD, Shockwave, hive, [20, 21, 30, 31, 33, 46, 57, 58, 63, 64]) target overall training performance by GPU allocation. Crux is orthogonal — targets high-level inter-job arrangement; can also work together with them.

Communication scheduler: Three categories.


9. Conclusion


Appendix A — Proof of Theorem 1

Appendix B — Proof of Theorems in §4.3


Quantitative Tables Summary

Metric Value
Production cluster size 2,000+ GPUs, 5,000+ jobs
Two-week peak concurrent jobs 30+ (1,000+ GPUs)
Largest single job 512 GPUs
Jobs with min 128 GPUs >10% of jobs (GPT variants)
Jobs at risk of communication contention 36.3% (51% GPUs)
GPT iter time increase from BERT contention 11.0% (1.53s → 1.70s)
GPT throughput decrease 9.9%
BERT throughput decrease 7.7%
GPU utilization decrease 9.5%
Daily savings at 10% GPU util gain on 10K cluster $120,000
Crux LOC 7K
Crux scheduling overhead <0.01% bandwidth, <1 min per arrival
Microbenchmark — Crux path selection vs optimal 97.69%
Microbenchmark — Crux priority assignment vs optimal 97.24%
Microbenchmark — Crux priority compression vs optimal 97.12%
Testbed — GPU util gain (network contention) +8.3% to +12.9%
Testbed — GPT JCT improvement -11% to -25%
Testbed — BERT JCT change 0% to +3%
Testbed (4-job) — GPU util gain +13.9%
Testbed (4-job) — GPT JCT -18%
Testbed (4-job) — BERT JCT -15%
Testbed (4-job) — ResNet JCT +2%
PCIe contention — GPU util gain +9.5% to +14.8%
PCIe contention — BERT JCT -7% to -33%
PCIe contention — ResNet JCT +1% to +3%
Trace sim Clos — Crux GPU util 0.62
Trace sim Clos — vs baselines +13% to +23%
Trace sim Double-sided — vs baselines +4% to +7%
Trace sim — Crux network util gain (PS-PA vs PA) +97%
Lowest-priority job throughput decrease 55.5% (no halt)
Crux + Muri GPU util gain +14% over Muri alone
Crux + HiveD GPU util gain +11% over HiveD alone