Collective Communication Performance Evaluation for Distributed Deep Learning Training

Sookwang Lee (ETRI), Jaehwan Lee (Korea Aerospace University) | Applied Sciences (MDPI), Vol. 14, 5100 | 2024

Problem

Distributed Deep Learning (DDL) training scales through data-parallel, model-parallel, and parameter-server paradigms, and its iteration time is increasingly dominated by collective communication rather than compute. Modern clusters routinely deploy training inside virtualized environments (Docker, Singularity) for reproducibility and multi-tenant scheduling, yet prior benchmarking of collective libraries (MPI, GLOO, NCCL) has focused on bare-metal systems and largely ignored intra-node virtualization topology. As a result, practitioners pick a library/backend by reputation rather than measurement, and can incur iteration-time penalties exceeding 2x relative to the best configuration available on the same hardware.

Core Insight

The "best" collective library is not absolute but environment-dependent: NCCL is fastest on bare-metal All-Reduce because of GPU-direct paths, but loses that advantage sharply when GPUs are isolated across separate Docker containers on the same node, while GLOO and MPI degrade more gracefully under the same virtualization. Library selection must therefore be co-decided with the deployment topology (bare-metal / Singularity / single-container / cross-container) and the DDL paradigm (PS vs. Ring All-Reduce).

Method

The authors build two test harnesses on a single node with 4x NVIDIA RTX 3080 GPUs (PCIe Gen3, 16 GB/s), Intel i9-10900, CUDA 11.3, NCCL 2.4, OpenMPI 4.1.4, MPICH 3.3, and Docker/Singularity:

A C++ micro-benchmark, run from the Linux shell, that measures Broadcast, Gather, and AllReduce on 1 GB tensors with a sub-routine breakdown (cudaMemcpy time vs. pure communication time).
A PyTorch 2.0.1 macro-benchmark training ResNet-18 on CIFAR-10 (batch 32, 10 epochs) under both Parameter-Server and Ring-AllReduce paradigms.
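
The sub-routine breakdown in the micro-benchmark can be illustrated with a minimal timing sketch. This is not the authors' harness (theirs is C++ and its source is not reproduced here); the stage names and the two sleep-based stand-in functions are hypothetical placeholders for a host-side copy and a communication call.

```python
import time
from typing import Callable, Dict


def time_stages(stages: Dict[str, Callable[[], None]], repeats: int = 5) -> Dict[str, float]:
    """Return the mean wall-clock seconds spent in each named stage."""
    totals = {name: 0.0 for name in stages}
    for _ in range(repeats):
        for name, fn in stages.items():
            start = time.perf_counter()
            fn()  # for real GPU work, synchronize the device before reading the clock
            totals[name] += time.perf_counter() - start
    return {name: total / repeats for name, total in totals.items()}


def stage_fractions(mean_times: Dict[str, float]) -> Dict[str, float]:
    """Convert per-stage mean times into fractions of the end-to-end time."""
    total = sum(mean_times.values())
    return {name: t / total for name, t in mean_times.items()}


# Hypothetical stand-ins: replace with the real copy and collective calls.
def fake_host_copy() -> None:
    time.sleep(0.002)


def fake_communication() -> None:
    time.sleep(0.008)


means = time_stages({"host_copy": fake_host_copy, "communication": fake_communication})
fractions = stage_fractions(means)
for name in means:
    print(f"{name}: {means[name] * 1e3:.2f} ms ({fractions[name]:.0%} of total)")
```

Reporting fractions alongside absolute times is what makes a statement such as "a given share of AllReduce time is spent in host-side copies" checkable for each backend.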

Each experiment is replicated across four deployment regimes (bare-metal, Singularity, single-Docker with all 4 GPUs in one container, and cross-Docker with each GPU in its own container) and across five backends (MPICH, OpenMPI, CUDA-aware OpenMPI, GLOO, NCCL). GPU count is swept from 1 to 4. The reported metrics are end-to-end latency, per-stage cudaMemcpy/MPI_Send time, training wall time, and scaling efficiency.
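
To get a feel for the size of this experiment matrix, the sketch below enumerates the full cross product for the micro-benchmark. It assumes every integer GPU count from 1 to 4 and every combination is run, which the summary above does not confirm; the lowercase labels are illustrative, not identifiers from the paper.

```python
from itertools import product

REGIMES = ["bare-metal", "singularity", "single-docker", "cross-docker"]
BACKENDS = ["mpich", "openmpi", "cuda-aware-openmpi", "gloo", "nccl"]
COLLECTIVES = ["broadcast", "gather", "allreduce"]
GPU_COUNTS = [1, 2, 3, 4]  # assumption: every integer count in the 1-to-4 sweep

# One dictionary per micro-benchmark cell in the (assumed) full cross product.
grid = [
    {"regime": r, "backend": b, "collective": c, "gpus": n}
    for r, b, c, n in product(REGIMES, BACKENDS, COLLECTIVES, GPU_COUNTS)
]

print(len(grid))  # 4 * 5 * 3 * 4 = 240
```

Some cells are degenerate in practice (for example, with a single GPU the single-Docker and cross-Docker regimes coincide), so the number of distinct measurements may be smaller.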

Results

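NCCL's AllReduce execution time on bare metal and single-Docker is reported as up to 345% lower than that of MPI/GLOO; since a time cannot drop by more than 100%, the figure is best read as a ratio of roughly 3.45x between the libraries.
NCCL Broadcast latency is 213% higher in cross-Docker than in single-Docker, a regime where defaults silently underperform.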
4-GPU AllReduce in cross-Docker takes 2.20s vs. 1.19s on bare metal (~85% slowdown) for NCCL.
GLOO shows 36% lower Gather latency in single-Docker than in cross-Docker, but never beats NCCL on AllReduce.
CUDA-aware OpenMPI Gather (2.22s on bare metal) beats standard MPI (2.38–2.56s) by avoiding host-staged copies.
For Parameter-Server paradigms on bare metal/Singularity, MPI is preferable because NCCL requires the parameter-server process to hold a GPU of its own, which then contributes no training compute.
NCCL spends 0% of AllReduce time in host-side cudaMemcpy; MPI spends ~16–20% there.
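
The percentages in this list mix two conventions ("X% slower" and "X% lower"), so it helps to check the arithmetic against the absolute timings quoted above. The sketch below uses only the 4-GPU AllReduce and Gather times from this list, plus a 3.45x ratio as one possible reading of the 345% figure; the function names are illustrative.

```python
def slowdown_pct(slow_s: float, fast_s: float) -> float:
    """Percent by which slow_s exceeds fast_s (can be above 100)."""
    return (slow_s / fast_s - 1.0) * 100.0


def reduction_pct(slow_s: float, fast_s: float) -> float:
    """Percent by which fast_s is below slow_s (always under 100)."""
    return (1.0 - fast_s / slow_s) * 100.0


def reduction_pct_from_ratio(ratio: float) -> float:
    """Percent reduction in time implied by a slow/fast ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


# NCCL 4-GPU AllReduce: cross-Docker 2.20 s vs. bare metal 1.19 s.
print(f"{slowdown_pct(2.20, 1.19):.1f}")        # 84.9

# CUDA-aware OpenMPI Gather (2.22 s) vs. standard MPI (2.38 s to 2.56 s).
print(f"{reduction_pct(2.38, 2.22):.1f}")       # 6.7
print(f"{reduction_pct(2.56, 2.22):.1f}")       # 13.3

# A 3.45x ratio corresponds to a time reduction of about 71%.
print(f"{reduction_pct_from_ratio(3.45):.1f}")  # 71.0
```

The first result matches the ~85% slowdown stated above. The last shows why percentages above 100 only make sense as ratios or as "X% higher" statements, never as "X% lower".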

Limitations

Single-node only; no inter-node InfiniBand/RoCE/Ethernet sweep.
Single GPU type (RTX 3080, PCIe Gen3) — no NVLink/NVSwitch coverage, no A100/H100 data, no multi-rail networks.
NCCL 2.4 is several years behind current production releases; tree algorithms, LL/LL128 protocol selection, and the tuner-plugin API are not exercised.
Only ResNet-18/CIFAR-10 — no transformer, no large-batch, no message-size sweep beyond the fixed 1 GB micro-benchmark.
No exposure of NCCL knobs (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_MIN_NCHANNELS/NCCL_MAX_NCHANNELS); defaults are taken as-is.
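
A follow-up study that does expose these knobs could start from an environment-variable grid like the one sketched below. The value lists are illustrative choices, not settings from the paper; the channel count is pinned by setting NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS to the same value, and the set of accepted values depends on the NCCL version, so check the NCCL documentation for the release in use.

```python
import os
from itertools import product
from typing import Dict, Iterator, Mapping, Optional

ALGOS = ["Ring", "Tree"]
PROTOS = ["LL", "LL128", "Simple"]
NTHREADS = [64, 128, 256, 512]
NCHANNELS = [1, 2, 4, 8]  # illustrative values, not taken from the paper


def nccl_env_grid() -> Iterator[Dict[str, str]]:
    """Yield one set of NCCL environment overrides per knob combination."""
    for algo, proto, nthreads, nch in product(ALGOS, PROTOS, NTHREADS, NCHANNELS):
        yield {
            "NCCL_ALGO": algo,
            "NCCL_PROTO": proto,
            "NCCL_NTHREADS": str(nthreads),
            # Setting min and max to the same value pins the channel count.
            "NCCL_MIN_NCHANNELS": str(nch),
            "NCCL_MAX_NCHANNELS": str(nch),
        }


def launch_env(overrides: Mapping[str, str],
               base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Merge overrides into a copy of the base environment (default: os.environ)."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


configs = list(nccl_env_grid())
print(len(configs))  # 2 * 3 * 4 * 4 = 96
```

Each merged environment can be passed to a benchmark process through the `env` argument of `subprocess.run`, so the knobs are fixed before NCCL initializes in the child process.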

Relevance to DynamICCL

This paper is a measurement-grade reminder that NCCL's "default" performance is a function of deployment regime, not just hardware, and that the regimes where defaults underperform are reproducible and large in magnitude. That is exactly the regret signal an RL agent like DynamICCL's Agent-2 should exploit.

Concrete implications for reward design and exploration:

Environment as state feature: Pensieve-style state is incomplete for DynamICCL — virtualization topology (bare-metal vs. single-container vs. cross-container) and PCIe/NVLink path information must be in the state vector because the optimal action flips across these regimes.
Reward = wall-clock latency, not throughput proxies: the paper's AllReduce gap (reported as 345%) and Broadcast gap (reported as 213%) are end-to-end latencies. A DynamICCL reward built on cudaMemcpy or algorithmic-bandwidth proxies would miss the cross-container penalty entirely.
Defaults are a strong baseline only on bare-metal: cross-Docker is a regime where any reasonable exploration policy will rapidly find improvements; the paper provides ground-truth lower bounds DynamICCL can sanity-check against.
Per-collective decisions matter: AllReduce, Gather, Broadcast each have a different best library/regime pair. DynamICCL's per-collective action selection is exactly the right granularity.
Workloads to add to DynamICCL's training corpus: 1 GB AllReduce, PS-paradigm Broadcast/Gather, ResNet-18 iteration traces, and single-container/cross-container topologies — these regimes round out the bare-metal-heavy datasets typical in NCCL tuning work.
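
The state and reward implications above can be made concrete with a small sketch. This is a hypothetical encoding, not DynamICCL's actual state layout: the class name, the vocabularies, the normalization constants, and the choice of relative improvement over the default latency as the reward are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

REGIMES = ("bare-metal", "singularity", "single-docker", "cross-docker")
COLLECTIVES = ("broadcast", "gather", "allreduce")
INTERCONNECTS = ("pcie", "nvlink")
MAX_GPUS = 4             # assumption: single 4-GPU node, as in the paper
MAX_LOG2_BYTES = 40.0    # assumption: normalizes messages up to 1 TiB


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode value as a one-hot list over vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if v == value else 0.0 for v in vocabulary]


@dataclass(frozen=True)
class CollectiveContext:
    regime: str
    collective: str
    interconnect: str
    num_gpus: int
    message_bytes: int

    def to_state(self) -> List[float]:
        """Flatten the context into a fixed-length feature vector."""
        return (
            one_hot(self.regime, REGIMES)
            + one_hot(self.collective, COLLECTIVES)
            + one_hot(self.interconnect, INTERCONNECTS)
            + [self.num_gpus / MAX_GPUS,
               math.log2(self.message_bytes) / MAX_LOG2_BYTES]
        )


def reward(latency_s: float, default_latency_s: float) -> float:
    """Relative wall-clock improvement over the default configuration."""
    return (default_latency_s - latency_s) / default_latency_s


# 1 GB is taken as 2**30 bytes here.
ctx = CollectiveContext("cross-docker", "allreduce", "pcie", 4, 2**30)
state = ctx.to_state()
print(len(state))  # 4 + 3 + 2 + 2 = 11

# Hypothetical: a policy that recovered bare-metal latency (1.19 s)
# where the cross-Docker default takes 2.20 s.
print(round(reward(1.19, 2.20), 3))  # 0.459
```

Putting the deployment regime in the state as an explicit one-hot feature lets a single policy learn different actions for cross-container and bare-metal runs, and a reward defined on measured wall-clock latency captures the cross-container penalty directly.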