Brief Summary: Immediate Communication for Distributed AI Tasks (DistFuse)

Citation: Jihao Xin, Seongjong Bae, KyoungSoo Park, Marco Canini, Changho Hwang. KAUST / Seoul National University / Microsoft Research. HotInfra 2024.

PDF: 0014_Immediate_.Comm_Dist_tasks_GPU.pdf

Problem

Distributed LLM inference with tensor parallelism requires each GPU to run a GeMM followed by an AllReduce to sum partial results across GPUs. These operations are on the critical path (they cannot be overlapped with independent work), and the AllReduce accounts for 8.43%–25.05% of inference latency per attention block in Llama 3-70B. Existing inter-operator overlapping techniques (pipelining independent computation with independent communication across layers) do not help here because the AllReduce is data-dependent on the GeMM result — the GeMM must finish before AllReduce can begin. This is the "dependency" problem that prior work sidesteps.

Core Insight

GPUs do not execute SIMD operations synchronously: with tile-based GeMM, individual output tiles are computed sequentially across SMs. DistFuse exploits this by starting tile-level AllReduce communication as soon as each tile is ready, rather than waiting for all GeMM tiles to complete. This fine-grained intra-operator overlapping hides AllReduce latency behind the tail of the GeMM computation — even when a data dependency exists between them.

Method

Two key challenges and solutions:

Challenge 1 — Granularity: The smallest natural granularity is a GeMM tile. DistFuse implements a new tile-wise All-reduce library based on GPUDirect P2P access (NCCL uses contiguous buffers and cannot handle non-contiguous, tile-by-tile communication). Tile widths are aligned to the GPU cache line size (128 bytes for A100) to minimize access overhead. This splits one AllReduce into multiple tile-sized collective calls, each triggered immediately when the corresponding GeMM tile completes.

Challenge 2 — Hardware Scheduling: Naively fusing GeMM and AllReduce into a single kernel fails in practice: the GPU compiler/hardware scheduler does not reliably overlap TensorCore operations with global memory reads, even when there are no data dependencies between them. DistFuse uses an explicit synchronization barrier per tile and launches GeMM and AllReduce in two separate CUDA streams sharing a flag buffer. The GeMM sets a flag when a tile completes; the AllReduce busy-waits on that flag. This explicit flag-based synchronization consistently produces the desired overlap across V100, A100, and H100.

Key Results

Evaluated on Llama 3-70B inference, single-node 4×A100-40GB with NVLink (P2P connected):

Layer	Total (PyTorch)	DistFuse (no overlap)	DistFuse (with overlap)	Speedup
Attention Score [1024,8192,8192]	387.0 µs	403.4 µs	333.8 µs	15.9% faster than PyTorch
Down Projection [1024,8192,28672]	764.5 µs	771.1 µs	621.6 µs	23.0% faster than PyTorch

44.3% reduction in AllReduce communication latency in the attention blocks.
Tile-wise AllReduce overhead: ~6% over NCCL's AllReduce (an implementation artifact, not fundamental).
Integration with vLLM (kernel only, no overlap yet): 5.2%–11.1% end-to-end latency speedup on Llama 3-70B inference, attributed to better spatial locality and cache reuse from tile-wise AllReduce.

Limitations

Single-node only: Evaluated only on NVLink-connected intra-node GPUs. Multi-node extension would require tile-wise communication over RDMA, which is far more complex.
Early-stage prototype: Requires manual configuration; does not automatically detect overlapping opportunities or select tile sizes.
AllReduce only on inference critical path demonstrated: Training extension is stated as future work.
Compiler opacity: The need for explicit flag-based synchronization is a workaround for an undocumented compiler behavior. The root cause is in the GPU hardware scheduler; DistFuse cannot eliminate this fundamental reliance on explicit barriers.
Small communication window: The inference scenario (NVLink, 4 GPUs, small batch) is not the most communication-bottlenecked scenario; training at scale over InfiniBand would provide a much larger speedup potential.

Relevance to DynamICCL

Low-to-moderate direct relevance. Conceptually adjacent, not directly applicable.

DistFuse targets intra-operator computation-communication overlap at the kernel level — a fundamentally different optimization axis from DynamICCL's collective algorithm and parameter selection. The connections are:

Motivation alignment: DistFuse and DynamICCL both address the same root observation — communication is on the critical path of LLM workloads. DistFuse's measurement that AllReduce accounts for 8–25% of inference latency per attention block motivates the importance of NCCL optimization more broadly.
NCCL as the target: DistFuse notes that "customizing the All-reduce kernel with NCCL has been challenging due to its latency and complexity" — this is precisely the pain point that DynamICCL addresses through the NCCL tuner plugin API. DistFuse chose to bypass NCCL with a custom kernel; DynamICCL works within the NCCL framework.
Tile size / chunk size analogy: DistFuse's tile size selection trade-off (fine-grained triggers communication overhead; coarse-grained reduces overlap benefit) is analogous to NCCL's chunkSize / numPipeOps knob, which DynamICCL's Agent-2 could potentially tune in the future.
No overlap between agents: DistFuse does not perform congestion detection or RL-based parameter selection. DynamICCL does not perform intra-kernel GeMM-AllReduce overlap. The two systems are complementary if applied to the same cluster.