Collective Communication Performance Evaluation for Distributed Deep Learning Training — Detailed Summary

Sookwang Lee (ETRI), Jaehwan Lee (Korea Aerospace University) | Applied Sciences (MDPI), Vol. 14, 5100 | 2024

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points capturing what is in that section, with emphasis on workloads, knobs, and regimes where defaults underperform.


Abstract


1. Introduction


2. Background

Distributed paradigms covered:

Five collective primitives summarized: Broadcast, Gather, AllGather, Reduce, AllReduce — with diagrams of the data-flow patterns.
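
The data-flow patterns of the five primitives can be sketched in plain Python, with each rank's buffer modelled as a list. This is a semantic illustration only, not how NCCL, MPI, or GLOO implement them. The function names and the list-of-lists representation are assumptions made for this sketch.

```python
from functools import reduce
from operator import add
from typing import Callable, List, Optional

Rank = List[float]  # one rank's local buffer (hypothetical representation)


def broadcast(buffers: List[Rank], root: int = 0) -> List[Rank]:
    """Every rank ends up with a copy of the root's buffer."""
    return [list(buffers[root]) for _ in buffers]


def gather(buffers: List[Rank], root: int = 0) -> List[Optional[Rank]]:
    """Only the root receives all buffers, concatenated in rank order."""
    joined = [x for buf in buffers for x in buf]
    return [joined if r == root else None for r in range(len(buffers))]


def all_gather(buffers: List[Rank]) -> List[Rank]:
    """Every rank receives all buffers, concatenated in rank order."""
    joined = [x for buf in buffers for x in buf]
    return [list(joined) for _ in buffers]


def reduce_to_root(
    buffers: List[Rank], root: int = 0, op: Callable = add
) -> List[Optional[Rank]]:
    """Only the root receives the element-wise reduction across ranks."""
    reduced = [reduce(op, column) for column in zip(*buffers)]
    return [reduced if r == root else None for r in range(len(buffers))]


def all_reduce(buffers: List[Rank], op: Callable = add) -> List[Rank]:
    """Every rank receives the element-wise reduction across ranks."""
    reduced = [reduce(op, column) for column in zip(*buffers)]
    return [list(reduced) for _ in buffers]
```

With four ranks holding `[1, 2]`, `[3, 4]`, `[5, 6]`, `[7, 8]`, `all_reduce` leaves the element-wise sum on every rank, while `reduce_to_root` leaves it on the root only.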

Library landscape:


3. Architecture Comparison


4. Experimental Setup

Hardware (Table 1):

Component | Specification
GPUs | 4x NVIDIA GeForce RTX 3080, 12 GiB each
GPU interconnect | PCIe Gen3, 16 GB/s bidirectional
CPU | Intel Core i9-10900, 10 cores
Memory | 32 GB DDR4 @ 2933 MHz
Network | Single-node only (no inter-node)

Software stack:

Component | Version
NVIDIA driver | 515.48
CUDA | 11.3
NCCL | 2.4
OpenMPI | 4.1.4 (and CUDA-aware)
MPICH | 3.3
Docker | 20.10.18
Singularity | (version not specified)
PyTorch | 2.0.1

Deployment regimes:

Workload axes:


5. Linux Shell Experiments

Broadcast (1 GB):

Gather (1 GB):

AllReduce (1 GB):

Scaling 1 to 4 GPUs:


6. PyTorch Experiments

Parameter-Server paradigm (ResNet-18, CIFAR-10):

Ring-AllReduce paradigm (ResNet-18, CIFAR-10):

Training-time delta:


7. Summary of Results

Best/worst pairings (Tables 11–12):

Primitive | Best (regime, library) | Worst (regime, library)
Broadcast | Bare-metal NCCL | Cross-Docker NCCL (+213%)
Gather | Bare-metal CUDA-aware OpenMPI | Cross-Docker GLOO (high spread)
AllReduce | Bare-metal NCCL | Cross-Docker MPI/GLOO
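
To make the percentage in the Broadcast row concrete, the helper below converts between a relative increase and a slowdown factor. It assumes the figure is a relative increase in completion time over a baseline measurement. That reading is an assumption of this sketch, and the timings in the usage note are hypothetical, not values from the paper.

```python
def slowdown_factor(percent_increase: float) -> float:
    """Convert a relative increase in time (in percent) to a multiplier."""
    return 1.0 + percent_increase / 100.0


def percent_increase(baseline_s: float, measured_s: float) -> float:
    """Relative increase of measured time over baseline time, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (measured_s - baseline_s) / baseline_s * 100.0
```

Under that reading, +213% corresponds to `slowdown_factor(213.0)`, roughly 3.13 times the baseline time. A hypothetical 1.00 s baseline would become about 3.13 s.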


9. Conclusions


Tables and Figures of Interest


Configurations and Knobs Varied (compact view)

Axis | Values
Library | MPICH, OpenMPI, CUDA-aware OpenMPI, GLOO, NCCL 2.4
Primitive | Broadcast, Gather, AllReduce
GPU count | 1, 2, 3, 4
Message size | 1 GB (micro); ResNet-18 gradients (macro)
Deployment regime | Bare-metal, Singularity, Single-Docker, Cross-Docker
DDL paradigm | Parameter Server, Ring AllReduce
Model | ResNet-18 (CIFAR-10, batch 32, 10 epochs)
NCCL knobs exposed | None (defaults only; no sweep of NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_NCHANNELS)
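
For contrast with the defaults-only setup, a knob sweep of the kind the paper omits could be enumerated as below. This is a minimal sketch, not the paper's method. The candidate values are illustrative assumptions and should be checked against the NCCL documentation for the version in use. The channel count is pinned through the NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS pair, which recent NCCL releases document, rather than through a single NCCL_NCHANNELS variable.

```python
from itertools import product
from typing import Dict, Iterator, List

# Candidate values are assumptions for illustration only.
ALGOS: List[str] = ["Ring", "Tree"]
PROTOS: List[str] = ["LL", "LL128", "Simple"]
NTHREADS: List[str] = ["64", "128", "256", "512"]
NCHANNELS: List[str] = ["1", "2", "4", "8"]


def iter_configs() -> Iterator[Dict[str, str]]:
    """Yield one environment-variable mapping per point of the sweep grid."""
    for algo, proto, threads, channels in product(
        ALGOS, PROTOS, NTHREADS, NCHANNELS
    ):
        yield {
            "NCCL_ALGO": algo,
            "NCCL_PROTO": proto,
            "NCCL_NTHREADS": threads,
            # Setting min == max pins the channel count to one value.
            "NCCL_MIN_NCHANNELS": channels,
            "NCCL_MAX_NCHANNELS": channels,
        }
```

Each mapping would typically be merged into the environment of a freshly launched benchmark process, since NCCL reads these variables when it initializes. The full grid here has 2 x 3 x 4 x 4 = 96 points per primitive and regime.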

Relevance to DynamICCL — Mapping Table

DynamICCL is an RL-based NCCL configuration optimizer. Its Agent-2 picks an (algorithm, protocol, nChannels, numThreads) tuple for each collective. The findings of this paper map onto DynamICCL's state design, reward design, and exploration strategy as follows:

Paper finding | DynamICCL implication
Best library/regime pairing flips across deployment topology | Add a topology descriptor (bare-metal / Singularity / single-container / cross-container, plus PCIe/NVLink path) to the state vector; without it, the policy will average over regimes and miss optima.
NCCL Broadcast +213% in cross-Docker | Treat container-isolation regime as a high-leverage exploration axis; expect large regret if Agent-2 is trained only on bare-metal traces and deployed in containers.
NCCL AllReduce -345% vs. MPI in favorable regime | The action space may need to include "library family" as a coarse discrete dimension above (algo, proto, nChannels, numThreads), or DynamICCL must explicitly assume NCCL and accept that it is bounded above by NCCL's intrinsic envelope.
Per-primitive best library differs (Broadcast/Gather/AllReduce) | Per-collective action selection is the correct granularity; the paper's measurements indicate that a single global config is suboptimal.
1 GB micro-benchmark + ResNet-18 macro both used | DynamICCL's training corpus should include both isolated micro-collectives (for transfer to new collective sizes) and end-to-end DDL traces (to capture overlap with compute).
cudaMemcpy time = 0% for NCCL vs. 16-20% for MPI | When DynamICCL's reward uses end-to-end wall time, this overhead is automatically captured; an algorithmic-bandwidth proxy would hide it. This is an argument for a wall-time reward.
Defaults under-tested under virtualization | Defaults are not a strong baseline globally; even simple exploration in cross-container regimes should show large gains, supporting the RL-vs-static-default comparison DynamICCL plans to publish.
GLOO/MPI degrade more gracefully than NCCL under virtualization | DynamICCL should consider a catastrophic-fallback action: if measured collective time exceeds an envelope, retry with a safer config rather than committing to the worst-case outcome.
No NCCL knob sweep performed in paper | Direct gap DynamICCL fills: this paper sets the macro library context; DynamICCL operates one level deeper inside NCCL itself, which the paper explicitly does not explore.
Single-node only, RTX 3080, NCCL 2.4 | DynamICCL's Chameleon-Cloud bare-metal setup with InfiniBand and modern NCCL (2.18+) operates in a regime this paper does not cover; the cross-container insight transfers, but the absolute numbers do not.
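
The topology-descriptor suggestion in the first row can be illustrated with a small state-encoding sketch. Everything here is hypothetical: the function names, the feature layout, and the normalization constants are assumptions for illustration and are not taken from the paper or from DynamICCL's actual implementation.

```python
import math
from typing import List, Sequence

# Vocabularies follow the regimes and link types named in this summary.
REGIMES: Sequence[str] = ("bare-metal", "singularity", "single-docker", "cross-docker")
LINKS: Sequence[str] = ("pcie", "nvlink")

MAX_GPUS = 4          # assumption: matches the 1-4 GPU range studied
MAX_LOG2_BYTES = 40   # assumption: arbitrary normalization ceiling


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a one-hot encoding of value over vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_state(regime: str, link: str, n_gpus: int, message_bytes: int) -> List[float]:
    """Concatenate topology one-hots with normalized size features."""
    return (
        one_hot(regime, REGIMES)
        + one_hot(link, LINKS)
        + [n_gpus / MAX_GPUS, math.log2(message_bytes) / MAX_LOG2_BYTES]
    )
```

Encoding the regime explicitly lets a policy condition on it, so a cross-Docker trace and a bare-metal trace with the same GPU count and message size map to different states instead of being averaged together.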

Synthesis: This paper is most useful to DynamICCL as a measurement-grade prior on regime-dependence rather than as a methodology to inherit. It gives concrete evidence that (a) deployment topology must be a state feature, (b) per-collective decisions are the right granularity, (c) end-to-end wall-time reward is necessary to capture host-staging overheads, and (d) there is large, reproducible regret to be reclaimed in non-bare-metal regimes — which is exactly the exploration target where DynamICCL's RL approach should pay off most.
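
Point (c) can be made concrete with a small comparison of the two reward signals. This is a sketch under stated assumptions: the function names are hypothetical, the timings in the usage note are illustrative rather than measured, and "staging" stands for host-side copy time such as the cudaMemcpy overhead noted in the mapping table.

```python
def algorithmic_bandwidth(message_bytes: int, collective_s: float) -> float:
    """Bytes per second over the collective itself; ignores host staging."""
    if collective_s <= 0:
        raise ValueError("collective time must be positive")
    return message_bytes / collective_s


def wall_time_reward(collective_s: float, staging_s: float) -> float:
    """Negative end-to-end time, so that a faster run earns a higher reward."""
    return -(collective_s + staging_s)
```

Suppose two hypothetical runs move 2**30 bytes with the same 0.10 s collective time, but one adds 0.02 s of staging (about 17% of its 0.12 s wall time). `algorithmic_bandwidth` returns the same value for both runs, while `wall_time_reward` gives -0.10 versus -0.12 and therefore separates them.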