Detailed Summary: Immediate Communication for Distributed AI Tasks (DistFuse)

Citation: Jihao Xin, Seongjong Bae, KyoungSoo Park, Marco Canini, Changho Hwang. KAUST / Seoul National University / Microsoft Research. HotInfra 2024. (Workshop at SC24)

PDF: 0014_Immediate_.Comm_Dist_tasks_GPU.pdf

Abstract

DistFuse introduces fine-grained, tile-level computation-communication overlap for distributed GeMM + AllReduce operations in LLM inference. Unlike prior work that overlaps independent operations across layers, DistFuse handles the case where communication is data-dependent on computation (the GeMM output feeds directly into AllReduce). By triggering tile-wise AllReduce immediately when each GeMM output tile is ready, DistFuse achieves 44.3% reduction in AllReduce latency on Llama 3-70B attention blocks. A vLLM integration yields 5.2%–11.1% end-to-end inference speedup on Llama 3-70B even without overlap, purely from improved kernel locality.

1. Motivation

1.1 The Communication Bottleneck in Tensor Parallelism

Modern LLM inference with tensor parallelism distributes each layer's weight matrix across multiple GPUs. Each GPU computes a partial GeMM result, then an AllReduce aggregates these partial results across all GPUs before the next layer can proceed. In Llama 3-70B with 8-way tensor parallelism, this pattern occurs twice per attention block (Attention Score and Down Projection layers) and once per MLP block.

The scaling problem is severe: scaling Llama 2-13B from 8 to 1,024 GPUs reduces Model FLOPs Utilization (MFU) from 47% to 4%. An empirical measurement on Llama 2-7B shows communication can account for up to 80% of execution time at scale.

1.2 Why Existing Overlap Strategies Fail

Standard computation-communication overlap techniques work by identifying independent operators — for example, overlapping the gradient AllReduce of layer N with the backward pass of layer N-1. These "inter-operator" methods fail for the GeMM-AllReduce pair in tensor parallelism because:

The AllReduce output is needed immediately by the next operation in the same layer.
The GeMM and AllReduce are causally dependent: the AllReduce input is the GeMM output.
No amount of scheduling or graph reordering can make them independent.

Compilation-based methods (TVM, Triton, Welder) and scheduler-based methods (SYNDICATE, CoCoNet, T3) do not solve this specific case. CoCoNet requires a DSL and custom scheduler; T3 requires specialized hardware. DistFuse addresses this without modifying the training framework or requiring hardware changes.

2. Background

2.1 LLM Architecture — Llama 3-70B Example

Llama 3-70B runs 80 transformer layers. Each layer contains:

Q, K, V projections (column-parallel GeMM, followed by AllGather in Megatron-style TP)
Attention score computation (QK^T and Score×V)
Output projection O = Score × W_o (row-parallel GeMM, followed by AllReduce)
Two MLP layers: Up/Gated projection (column-parallel) and Down projection (row-parallel + AllReduce)

The Down Projection GeMM in Llama 3-70B has dimensions [M=1024, N=8192, K=28672], creating a substantial AllReduce on the K dimension partial sums.

2.2 GPU Communication Model

GPUs can transfer data between each other via: (1) NVLink for intra-node (direct peer-to-peer memory access at 600 GB/s on A100), (2) GPUDirect RDMA for inter-node (NIC directly accesses GPU memory, bypassing CPU and host memory). NCCL implements collective communication but uses contiguous buffers and sequential execution, making it unsuitable for tile-by-tile communication. CUDA streams allow multiple independent kernel sequences to execute concurrently on the GPU.

3. System Design

3.1 Core Idea: Immediate Communication

DistFuse observes that GeMM hardware execution is not perfectly SIMD-synchronous: though all SMs launch simultaneously, each SM completes its assigned tiles at slightly different times. The output tiles of a GeMM are therefore produced incrementally across SMs, not all at once.

Standard execution: Wait for all GeMM tiles to complete → launch AllReduce on the full output buffer.

DistFuse execution: As each GeMM tile completes on its SM, immediately start AllReduce for that tile's data → the remaining GeMM tiles continue executing on other SMs in parallel with the tile-wise AllReduce.

Standard:
  GeMM tile 1 ─┐
  GeMM tile 2  ├─ all complete ──► AllReduce (full buffer)
  GeMM tile 3 ─┘

DistFuse:
  GeMM tile 1 ──► AllReduce tile 1 ─┐
  GeMM tile 2 ──► AllReduce tile 2  ├─ all complete ──► next op
  GeMM tile 3 ──► AllReduce tile 3 ─┘
         ↑ overlap

3.2 Challenge 1: Granularity Selection

Why tiles? Element-level communication would maximally hide latency but introduces prohibitive per-element communication overhead and requires rewriting the GeMM kernel from scratch. Tile-level communication is the coarsest granularity that still allows overlapping, and tiles are the natural execution unit of GeMM in off-the-shelf libraries.

Tile-wise All-reduce implementation: NCCL cannot handle non-contiguous, tile-sized buffers — it requires a single contiguous input buffer for the full AllReduce. DistFuse implements a new tile-wise communication library using GPUDirect P2P access that:

Splits the AllReduce into multiple collective calls, one per tile.
Handles non-contiguous tile memory layouts.
Aligns tile widths to the A100 cache line size (128 bytes) to prevent cache coherence overhead.
Uses CUTLASS's tile-wise GeMM kernel as the computation backend.

Tiling and SM allocation: For an A100 with 108 SMs, overlapping occurs only when more than 108 CTAs (Cooperative Thread Arrays) are launched (otherwise tiles finish one per SM with no opportunity to pipeline). DistFuse uses one dedicated CTA per tile for large GeMMs and applies Split-K to break tiles into multiple CTAs for better SM utilization with smaller GeMMs.

The tile-wise AllReduce is shown to achieve performance on par with NCCL's optimized AllReduce for contiguous buffers (~6% overhead in current prototype, attributed to implementation inefficiency not a fundamental limitation).

3.3 Challenge 2: Hardware Scheduling

The problem: Even when GeMM tiles and AllReduce tiles can logically execute in parallel (different hardware units: Tensor Cores vs. RDMA NIC), fusing them into a single CUDA kernel does not guarantee overlap. Experiments reveal that the compiler/hardware scheduler may serialize them despite explicit attempts at parallelism.

Experimental evidence (Figure 4): Two variants are tested:

Case 1: CTA executes TensorOps first, then LdOps (memory reads for communication). Result: near-ideal overlap observed (1062 µs, close to theoretical 1062 µs ideal).
Case 2: CTA executes LdOps first, then TensorOps. Result: no overlap (1350 µs, vs. ideal 1087 µs). The two operations serialize despite having no data dependencies and using different hardware resources.

This behavior is consistent across V100, A100, and H100, and is attributed to an undocumented compiler/hardware scheduler behavior transparent to the programmer (verified by inspecting compiled SASS assembly code).

Solution: Two-stream explicit synchronization. DistFuse launches GeMM and AllReduce in two separate CUDA streams sharing a flag buffer allocated in GPU memory:

Stream 1 (GeMM):
  compute tile i → set flag[i] in shared buffer

Stream 2 (AllReduce):
  busy-wait until flag[i] is set → execute AllReduce for tile i

The flag buffer is allocated in GPU memory accessible to both streams. The AllReduce stream busy-waits on each tile flag before proceeding. This explicit barrier approach consistently achieves the desired overlap for all tested hardware and cases.

4. Evaluation

4.1 Setup

Hardware: ASUS ESC N4A-E11 server, Ubuntu 22.04, CUDA 12.1, PyTorch 2.1.2, 4 NVIDIA A100-40GB GPUs connected via NVLink (P2P enabled).
Model: Llama 3-70B inference, using Meta's original parallelism strategy (row-parallel GeMM + AllReduce for Down Projection and O-projection).
Baseline: PyTorch standard GeMM + NCCL AllReduce.

4.2 Standalone Layer Results (Table 1)

Layer	GeMM dimensions [M,N,K]	PyTorch total	DistFuse w/o overlap	DistFuse w/ overlap	Speedup vs. PyTorch
Attention Score	[1024, 8192, 8192]	387.0 µs	403.4 µs (+4.2%)	333.8 µs	+15.9%
Down Projection	[1024, 8192, 28672]	764.5 µs	771.1 µs (+0.9%)	621.6 µs	+23.0%

The "DistFuse w/o overlap" row shows the baseline cost of DistFuse's tile-wise kernels without enabling the pipeline overlap. The ~2–4% overhead relative to PyTorch is an implementation inefficiency. Enabling overlap achieves 20.5% speedup in the first layer and effectively hides 44.3% of AllReduce communication latency overall.

Communication accounts for 8.43%–25.05% of per-block inference latency. The Down Projection layer has a larger K dimension (28,672 vs. 8,192) and therefore a longer GeMM, providing more overlap opportunity.

4.3 vLLM Integration

DistFuse's tile-wise AllReduce kernel is integrated into vLLM (replacing the standard NCCL AllReduce) without implementing the overlap yet:

End-to-end speedup: 5.2%–11.1% on Llama 3-70B inference.
Source of speedup (without overlap): vLLM accesses AllReduce buffer linearly after GeMM, requiring global memory reads. DistFuse's tile-wise kernel preserves spatial locality from the GeMM computation, reducing L2 cache misses.
Lower bound: This 5.2%–11.1% speedup is described as a conservative lower bound; full overlap integration is expected to produce additional improvements.

5. Limitations

1. Single-node scope: All experiments use NVLink P2P communication within a single 4-GPU node. Multi-node extension would require tile-wise RDMA, where each AllReduce tile must be communicated over InfiniBand. The latency and bandwidth characteristics of RDMA are very different from NVLink, and implementing tile-wise RDMA with acceptable overhead is a significant engineering challenge.

2. Dependency on explicit synchronization hack: The two-stream, flag-based synchronization is a workaround for an undocumented hardware scheduler behavior. Future GPU architectures or compiler versions could break this approach or make it unnecessary.

3. Manual configuration required: Tile size, CTA allocation, and stream configuration currently require manual tuning per model and GPU. An automated compiler or runtime pass to identify overlapping opportunities is identified as future work.

4. Incomplete vLLM integration: The overlap feature is not yet integrated into vLLM. Only the standalone tile-wise AllReduce replacement is evaluated end-to-end; the full DistFuse with GeMM-AllReduce pipeline overlap has not been measured end-to-end.

5. Training not evaluated: The paper focuses on inference. Training would require handling gradient AllReduce during backward passes, which has different memory layout requirements (gradients are computed in reverse order and may not be contiguous in the same tile pattern as forward activations).

6. NVLink bottleneck not fully exposed: Llama 3 inference on a single node with NVLink is not the most communication-bottlenecked scenario. The benefit of DistFuse would be substantially larger in a multi-node, InfiniBand-connected setting where AllReduce latency is much higher.

CoCoNet (ASPLOS 2022): Breaking computation and communication abstraction barrier. Implements fused computation-communication kernels using a DSL. Requires an additional scheduler and custom DSL code. Most closely related to DistFuse but requires more user effort.

T3 (arXiv 2024): Transparent tracking and triggering for fine-grained overlap of compute and collectives. Depends on specialized hardware mechanisms (tracking and triggering hardware built into the accelerator). DistFuse achieves similar goals with standard GPU hardware.

TensorRT-LLM: NVIDIA's LLM inference framework implements custom AllReduce kernels that outperform NCCL for small messages within a single node. Motivated the same performance gap that DistFuse targets, but TensorRT-LLM's solution is not open-source or general-purpose.

NCCL: Used as the baseline AllReduce implementation. The paper notes that "customizing AllReduce with NCCL is challenging due to its latency and complexity" — motivating DistFuse's custom tile-wise communication library.

ACCO (arXiv 2024): Accumulate while you Communicate, overlapping gradient accumulation with AllReduce. This is an inter-operator approach and does not help for the GeMM-AllReduce dependency case.

7. Section-by-Section Paragraph Summaries

Section 1 — Introduction

Introduces the communication bottleneck in distributed LLM inference (up to 80% of execution time). Differentiates DistFuse from inter-operator overlap methods: DistFuse handles the case where computation and communication are data-dependent. The core claim is that immediate tile-level communication can hide up to 44.3% of AllReduce latency in Llama 3-70B. vLLM integration shows 5.2%–11.1% speedup even without the overlap feature enabled.

Section 2 — Background

Describes Llama 3-70B architecture with GeMM-dominated compute pattern. Explains inter-operator overlapping (computation-communication independence required) and why it fails for GeMM + AllReduce. Summarizes GPU communication: NVLink for intra-node, GPUDirect RDMA for inter-node. Notes that NCCL is high-latency for this use case and not easily customizable.

Section 3 — Immediate Communication

Defines the tile as the communication granularity. Describes the tile-wise GeMM + AllReduce pipeline where each SM completes its GeMM tile and begins its AllReduce phase while the next tile's GeMM begins. Explains the two challenges (granularity selection and hardware scheduling) and their solutions. Provides Figure 3 showing the pipeline structure per-SM and Figure 4 showing the compiler/hardware scheduling non-determinism.

Section 3.1 — Granularity

Why tile granularity is chosen. Details of the tile-wise AllReduce library using GPUDirect P2P. Cache line alignment for A100. CUTLASS adoption for tile-wise GeMM. Split-K for SM load balancing with small tiles.

Section 3.2 — Hardware Scheduling

Case 1 vs. Case 2 experiment demonstrating non-deterministic overlap behavior in fused kernels. Two-stream explicit flag synchronization solution. Confirmed to work consistently across V100, A100, H100.

Section 3.3 — Preliminary Evaluation

Table 1 results. 44.3% AllReduce latency reduction in attention blocks. vLLM integration results (5.2%–11.1% end-to-end speedup). Discussion of why improvement occurs even without overlap (spatial locality in tile-wise AllReduce).

Section 4 — Discussion and Future Work

Multi-node extension planned. Automation of overlap opportunity detection and tile size selection as future work. Plans to integrate full overlap feature into vLLM. Potential for instruction-level parallelism at even finer granularity.

8. Relevance to DynamICCL

Low-to-moderate direct relevance. Addresses an adjacent problem at a different abstraction level.

DistFuse and DynamICCL both target the communication overhead in distributed LLM training/inference, but from orthogonal angles:

1. NCCL's inflexibility motivates both. DistFuse explicitly states that "customizing AllReduce with NCCL is challenging due to its latency and complexity" — which is the same observation that led to DynamICCL's use of the NCCL tuner plugin API. DistFuse responds by bypassing NCCL with custom kernels; DynamICCL responds by exposing NCCL's parameter space to an RL agent.

2. Tile size / chunkSize analogy. DistFuse's tile size selection (coarser tiles → fewer sync points but less overlap benefit; finer tiles → more overlap but more communication overhead) is structurally analogous to NCCL's chunkSize parameter, which controls the granularity of the ring/tree pipeline. DynamICCL's Agent-2 could in principle tune chunkSize to analogously trade overlap benefit against pipeline overhead.

3. Communication latency decomposition. DistFuse's measurement that AllReduce accounts for 8–25% of per-block Llama 3-70B inference latency (in a favorable NVLink scenario) provides an empirical lower bound on the communication fraction DynamICCL's optimization is targeting. In multi-node RDMA environments, this fraction will be substantially higher.

4. Complementary, not competing. DistFuse reduces AllReduce latency through tile-level overlap at the kernel level. DynamICCL reduces AllReduce time through optimal collective algorithm and parameter selection at the library level. Both optimizations could in principle be applied simultaneously to the same workload.

5. No applicability to DynamICCL's core mechanism. DistFuse does not address congestion detection, RL-based parameter selection, or collective algorithm choice. The paper does not discuss NCCL's internal algorithms (ring/tree/CollNet/NVLS), protocols (LL/LL128/Simple), or channel count. These are DynamICCL's direct levers.