Collective Communication Performance Evaluation for Distributed Deep Learning Training

Sookwang Lee (ETRI), Jaehwan Lee (Korea Aerospace University) | Applied Sciences (MDPI), Vol. 14, 5100 | 2024


Problem

Distributed Deep Learning (DDL) training scales through data-parallel, model-parallel, and parameter-server (PS) paradigms, and its iteration time is increasingly dominated by collective communication rather than compute. Modern clusters routinely deploy training inside virtualized environments (Docker, Singularity) for reproducibility and multi-tenant scheduling, yet prior benchmarking of collective libraries (MPI, GLOO, NCCL) has focused on bare-metal deployments and has largely ignored intra-node virtualization topology. As a result, practitioners pick a library or backend by reputation rather than by measurement, and incur iteration-time penalties that can exceed 2x relative to the best configuration available on the same hardware.


Core Insight

The "best" collective library is not absolute but environment-dependent: NCCL is fastest on bare-metal All-Reduce because of GPU-direct paths, but loses that advantage sharply when GPUs are isolated across separate Docker containers on the same node, while GLOO and MPI degrade more gracefully under the same virtualization. Library selection must therefore be co-decided with the deployment topology (bare-metal / Singularity / single-container / cross-container) and the DDL paradigm (parameter server vs. Ring All-Reduce).
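Selecting by measurement rather than reputation can be sketched as a lookup over benchmark results. The function below is a minimal illustration, not the authors' tooling: `pick_backends` and the dictionary layout are hypothetical, and the regime, backend, and latency values in `example` are abstract placeholders rather than figures from the paper.

```python
def pick_backends(latency_ms):
    """Map each (regime, collective) pair to its lowest-latency backend.

    latency_ms: dict keyed by (regime, collective, backend) -> latency in ms.
    This key layout is an assumption made for illustration.
    """
    best = {}
    for (regime, collective, backend), ms in latency_ms.items():
        key = (regime, collective)
        # Keep the backend with the smallest latency seen so far for this key.
        if key not in best or ms < best[key][1]:
            best[key] = (backend, ms)
    return {key: backend for key, (backend, _) in best.items()}


# Placeholder values, NOT taken from the paper: they only exercise the lookup.
example = {
    ("regime-a", "all_reduce", "backend-x"): 1.0,
    ("regime-a", "all_reduce", "backend-y"): 2.0,
    ("regime-b", "all_reduce", "backend-x"): 5.0,
    ("regime-b", "all_reduce", "backend-y"): 3.0,
}
print(pick_backends(example))
```

Keying the result on both regime and collective reflects the point above: the winning backend is allowed to differ from one deployment topology to the next.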


Method

The authors build two test harnesses on a single node with 4x NVIDIA RTX 3080 GPUs (PCIe Gen3, 16 GB/s), an Intel i9-10900 CPU, CUDA 11.3, NCCL 2.4, OpenMPI 4.1.4, MPICH 3.3, and Docker/Singularity.

Each experiment is replicated across four deployment regimes (bare-metal, Singularity, single-Docker with all 4 GPUs in one container, and cross-Docker with each GPU in its own container) and across five backends (MPICH, OpenMPI, CUDA-aware OpenMPI, GLOO, NCCL). GPU count is swept from 1 to 4. The reported metrics are end-to-end latency, per-stage cudaMemcpy/MPI_Send time, training wall time, and scaling efficiency.
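The sweep described above can be enumerated in a few lines. This is a sketch under two assumptions that the summary does not confirm: that the full cross product of regimes, backends, and GPU counts is run, and that scaling efficiency means speedup over a single GPU divided by the GPU count. The identifier strings are illustrative labels only.

```python
from itertools import product

# Regimes, backends, and GPU counts as listed in the text above.
REGIMES = ["bare-metal", "singularity", "single-docker", "cross-docker"]
BACKENDS = ["mpich", "openmpi", "cuda-aware-openmpi", "gloo", "nccl"]
GPU_COUNTS = [1, 2, 3, 4]


def experiment_grid():
    """Return every (regime, backend, gpu_count) combination in the sweep."""
    return list(product(REGIMES, BACKENDS, GPU_COUNTS))


def scaling_efficiency(t_single, t_multi, n_gpus):
    """Assumed definition: (t_single / t_multi) / n_gpus; 1.0 is ideal scaling."""
    if t_single <= 0 or t_multi <= 0 or n_gpus < 1:
        raise ValueError("times must be positive and n_gpus must be >= 1")
    return (t_single / t_multi) / n_gpus


grid = experiment_grid()
print(len(grid))  # 4 regimes * 5 backends * 4 GPU counts = 80
```

Under this definition, a run that takes 100 s on one GPU and 50 s on four GPUs has an efficiency of 0.5.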


Results


Limitations


Relevance to DynamICCL

This paper is a measurement-backed reminder that NCCL's "default" performance is a function of deployment regime, not just hardware, and that the regimes where defaults underperform are reproducible and large in magnitude. That gap is exactly the regret signal an RL agent like DynamICCL's Agent-2 should exploit.

Concrete implications for reward design and exploration:

  1. Environment as state feature: Pensieve-style state is incomplete for DynamICCL. Virtualization topology (bare-metal vs. single-container vs. cross-container) and PCIe/NVLink path information must be in the state vector because the optimal action flips across these regimes.
  2. Reward = wall-clock latency, not throughput proxies: the paper's 3.45x AllReduce gap and 2.13x Broadcast gap are end-to-end latencies. A DynamICCL reward built on cudaMemcpy or algorithmic-bandwidth proxies would miss the cross-container penalty entirely.
  3. Defaults are a strong baseline only on bare-metal: cross-Docker is a regime where any reasonable exploration policy will rapidly find improvements; the paper provides ground-truth lower bounds DynamICCL can sanity-check against.
  4. Per-collective decisions matter: AllReduce, Gather, Broadcast each have a different best library/regime pair. DynamICCL's per-collective action selection is exactly the right granularity.
  5. Workloads to add to DynamICCL's training corpus: 1 GB AllReduce, PS-paradigm Broadcast/Gather, ResNet-18 iteration traces, and single-container/cross-container topologies. These regimes round out the bare-metal-heavy datasets typical in NCCL tuning work.
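
Implications 1 and 2 can be made concrete with a small encoding sketch. Everything here is hypothetical: `encode_state`, `reward`, and `slowdown_vs_best` are illustrative names, not DynamICCL's actual interface, and the feature layout (regime one-hot, collective one-hot, log2 of message size) is one assumed choice. PCIe/NVLink path features are omitted for brevity.

```python
import math

# Deployment regimes and collectives named in this summary.
REGIMES = ["bare-metal", "singularity", "single-docker", "cross-docker"]
COLLECTIVES = ["all_reduce", "broadcast", "gather"]


def one_hot(value, choices):
    """Return a one-hot list marking the position of value in choices."""
    if value not in choices:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if value == c else 0.0 for c in choices]


def encode_state(regime, collective, message_bytes):
    """Concatenate regime one-hot, collective one-hot, and log2 message size."""
    if message_bytes <= 0:
        raise ValueError("message_bytes must be positive")
    return (one_hot(regime, REGIMES)
            + one_hot(collective, COLLECTIVES)
            + [math.log2(message_bytes)])


def reward(wall_clock_latency_s):
    """Negative end-to-end latency, so a faster collective earns more reward."""
    return -wall_clock_latency_s


def slowdown_vs_best(default_latency_s, best_latency_s):
    """Ratio of default latency to the best measured latency (1.0 = no regret)."""
    return default_latency_s / best_latency_s


state = encode_state("cross-docker", "all_reduce", 2**30)  # 1 GiB AllReduce
print(len(state), state[-1])
```

Putting the regime directly into the state lets a policy learn different actions for bare-metal and cross-container runs, and `slowdown_vs_best` expresses a gap in the same ratio form as the latency gaps the paper reports.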