Collective Communication Performance Evaluation for Distributed Deep Learning Training — Detailed Summary

Sookwang Lee (ETRI), Jaehwan Lee (Korea Aerospace University) | Applied Sciences (MDPI), Vol. 14, 5100 | 2024

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points capturing what is in that section, with emphasis on workloads, knobs, and regimes where defaults underperform.

Abstract

Improper choice of collective communication library degrades DDL training performance because communication time grows superlinearly with cluster size while compute per device is fixed.
The paper compares MPI (MPICH and OpenMPI, including CUDA-aware variants), GLOO, and NCCL across bare-metal, Singularity, single-Docker, and cross-Docker (multi-container) deployments on a single node.
Headline numbers: NCCL Broadcast latency is 213% higher in cross-Docker than single-Docker; GLOO Gather latency is 36% lower in single-Docker than cross-Docker; NCCL AllReduce execution time is up to 345% lower than MPI/GLOO in favorable environments.
The contribution is a decision rubric for picking a library/backend pair given deployment topology and DDL paradigm.

1. Introduction

DDL has become essential because state-of-the-art models (e.g., GPT-3 at 175B parameters) cannot be trained on a single device within useful budgets.
Communication is now the dominant bottleneck in DDL: at scale, collective-communication wall time can exceed compute wall time by more than 10x.
Cloud and HPC operators increasingly run DDL workloads inside containers (Docker, Singularity) for reproducibility, isolation, and scheduling, but these layers introduce communication overheads that have not been systematically benchmarked.
The paper's gap claim: prior collective-communication evaluation focused on bare-metal HPC and did not isolate the effect of intra-node container topology on NCCL/MPI/GLOO.
The paper's contributions: (i) library-level micro-benchmarks across four deployment regimes, (ii) a PyTorch end-to-end study using both Parameter-Server and Ring-AllReduce paradigms, (iii) explicit best/worst pairings per primitive.

2. Background

Distributed paradigms covered:

Parameter Server (PS): centralized aggregator pattern; worker GPUs send gradients, server averages and broadcasts new weights. Implemented with Broadcast + Gather.
Ring All-Reduce: decentralized; every GPU sends/receives in a ring, bandwidth-optimal in N for tensor size T (each GPU moves 2(N-1)/N * T).

Five collective primitives summarized: Broadcast, Gather, AllGather, Reduce, AllReduce — with diagrams of the data-flow patterns.

Library landscape:

MPI (MPICH, OpenMPI): host-memory-oriented; CUDA-aware variants register GPU buffers but still mediate via host where direct paths are absent.
GLOO: Facebook's PyTorch-native CPU/GPU library; simpler than NCCL, sometimes more stable across virtualization.
NCCL: NVIDIA's GPU-resident library using CUDA IPC, NVLink, and GPU-direct paths; the de facto choice for AllReduce on NVIDIA GPUs.

3. Architecture Comparison

Walks through the call-graph of each library in two harnesses: a Linux shell (C++ direct calls) and PyTorch (Python wrappers).
Highlights cudaMemcpy hot-spots in MPI/GLOO control flow versus NCCL's GPU-resident path that bypasses host staging.
Explains how Docker's per-container CUDA context forces additional copies in cross-container deployments — the mechanistic root of the cross-Docker penalty observed later.
The "single-Docker" regime keeps all GPUs under one CUDA context and one IPC namespace; the "cross-Docker" regime fragments these, breaking CUDA IPC fast-paths NCCL relies on.

4. Experimental Setup

Hardware (Table 1):

Component	Specification
GPUs	4x NVIDIA GeForce RTX 3080, 12 GiB each
GPU interconnect	PCIe Gen3, 16 GB/s bidirectional
CPU	Intel Core i9-10900, 10 cores
Memory	32 GB DDR4 @ 2933 MHz
Network	Single-node only (no inter-node)

Software stack:

Component	Version
NVIDIA driver	515.48
CUDA	11.3
NCCL	2.4
OpenMPI	4.1.4 (and CUDA-aware)
MPICH	3.3
Docker	20.10.18
Singularity	(version not specified)
PyTorch	2.0.1

Deployment regimes:

Bare metal.
Singularity (single image, all GPUs).
Single-Docker (one container, all 4 GPUs).
Cross-Docker (4 containers, 1 GPU each).

Workload axes:

Linux-shell micro-benchmarks: 1 GB random tensors; Broadcast, Gather, AllReduce; GPU count swept 1 to 4.
PyTorch macro-benchmark: ResNet-18 on CIFAR-10, batch size 32, 10 epochs; PS and Ring-AllReduce paradigms.

5. Linux Shell Experiments

Broadcast (1 GB):

NCCL on bare-metal/single-Docker: lowest latency due to CUDA IPC path.
NCCL on cross-Docker: latency rises 213% vs. single-Docker — IPC fast-path is broken across container CUDA contexts.
MPI is more stable across regimes but slower than NCCL when NCCL's fast-path is intact.
GLOO sits between, with smaller spread across regimes than NCCL.

Gather (1 GB):

CUDA-aware OpenMPI on bare-metal: 2.22s, beats standard MPI at 2.38–2.56s.
GLOO Gather in single-Docker: 36% lower latency than cross-Docker.
For Gather, NCCL's lead over MPI shrinks compared to AllReduce.

AllReduce (1 GB):

NCCL achieves up to 345% lower execution time than MPI/GLOO on bare-metal and single-Docker.
NCCL 4-GPU AllReduce: 1.19s on bare-metal, ~2.20s in cross-Docker (~85% slowdown).
Time breakdown (Table 4): NCCL spends 0% of AllReduce time on host-side cudaMemcpy; MPI spends ~16-20% there. This explains the gap on bare-metal and explains why CUDA-aware MPI narrows it.

Scaling 1 to 4 GPUs:

NCCL scales near-ideally on bare-metal/single-Docker.
In cross-Docker, NCCL's scaling curve flattens — the per-container isolation cost grows with rank count.

6. PyTorch Experiments

Parameter-Server paradigm (ResNet-18, CIFAR-10):

Implemented as Broadcast (weights) + Gather (gradients).
MPI is the recommended backend on bare-metal/Singularity because NCCL forces an extra GPU as the "server" rank; that GPU is otherwise idle.
Cross-Docker hurts both libraries; MPI degrades less.

Ring-AllReduce paradigm (ResNet-18, CIFAR-10):

NCCL is consistently fastest on bare-metal and single-Docker.
In cross-Docker, NCCL's advantage shrinks substantially; GLOO becomes competitive at low GPU count.
Iteration time is dominated by AllReduce of gradient tensors at the end of each backward pass — exactly the regime where NCCL's bare-metal lead is largest.

Training-time delta:

The micro-benchmark gap (3.45x AllReduce) translates into measurable end-to-end training-time differences across regimes; the paper presents these as bar charts (Figures 13–18) rather than aggregate tables.

7. Summary of Results

Best/worst pairings (Tables 11–12):

Primitive	Best (regime, library)	Worst (regime, library)
Broadcast	Bare-metal NCCL	Cross-Docker NCCL (+213%)
Gather	Bare-metal CUDA-aware OpenMPI	Cross-Docker GLOO (high spread)
AllReduce	Bare-metal NCCL	Cross-Docker MPI/GLOO

Cross-Docker is a uniformly bad regime for collectives whose fast paths depend on CUDA IPC.
Singularity is closer to bare-metal than Docker because it does not fragment the CUDA context across ranks.
CUDA-aware MPI partially closes the NCCL gap but never overtakes it on AllReduce.

Compares to OSU micro-benchmarks (MPI-only, no NCCL/GLOO, no containers) and Horovod-timeline-based studies (framework-level only, no per-primitive isolation).
Notes prior NCCL studies have focused on inter-node InfiniBand fabrics (e.g., Demystifying NCCL) and have not characterized intra-node container effects.

9. Conclusions

Recommendation matrix:
- PS paradigm + bare-metal/Singularity: use MPI.
- Ring-AllReduce + bare-metal/single-Docker: use NCCL.
- Cross-Docker: re-evaluate; NCCL's fast-paths are broken; consider consolidating GPUs into a single container or switching to MPI/GLOO.
Future work: extend to inter-node InfiniBand/RoCE; study network-topology effects on virtualized layers; cover larger transformer models.

Tables and Figures of Interest

Table 1 — Hardware specification (RTX 3080, PCIe Gen3, 16 GB/s).
Table 4 — Time breakdown of AllReduce showing NCCL = 0% cudaMemcpy vs. MPI ~16–20% cudaMemcpy.
Tables 11–12 — Best/worst (library, regime) pairings per primitive.
Figures 13–18 — Bar charts showing the cross-Docker latency spike for NCCL across Broadcast/Gather/AllReduce.

Configurations and Knobs Varied (compact view)

Axis	Values
Library	MPICH, OpenMPI, CUDA-aware OpenMPI, GLOO, NCCL 2.4
Primitive	Broadcast, Gather, AllReduce
GPU count	1, 2, 3, 4
Message size	1 GB (micro); ResNet-18 gradients (macro)
Deployment regime	Bare-metal, Singularity, Single-Docker, Cross-Docker
DDL paradigm	Parameter Server, Ring AllReduce
Model	ResNet-18 (CIFAR-10, batch 32, 10 epochs)
NCCL knobs exposed	None (defaults only — no NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_NCHANNELS sweep)

Relevance to DynamICCL — Mapping Table

DynamICCL is an RL-based NCCL configuration optimizer. Agent-2 picks per-collective (algorithm, protocol, nChannels, numThreads). The ideas in this paper map onto DynamICCL's reward-design and exploration strategy as follows:

Paper finding	DynamICCL implication
Best library/regime pairing flips across deployment topology	Add a topology descriptor (bare-metal / Singularity / single-container / cross-container, plus PCIe/NVLink path) to the state vector; without it, the policy will average over regimes and miss optima.
NCCL Broadcast +213% in cross-Docker	Treat container-isolation regime as a high-leverage exploration axis; expect large regret if Agent-2 is trained only on bare-metal traces and deployed in containers.
NCCL AllReduce -345% vs. MPI in favorable regime	The action space may need to include "library family" as a coarse discrete dimension above (algo, proto, nChannels, numThreads), or DynamICCL must explicitly assume NCCL and accept that it is bounded above by NCCL's intrinsic envelope.
Per-primitive best library differs (Broadcast/Gather/AllReduce)	Per-collective action selection is the correct granularity; one global config is provably suboptimal.
1 GB micro-benchmark + ResNet-18 macro both used	DynamICCL's training corpus should include both isolated micro-collectives (for transfer to new collective sizes) and end-to-end DDL traces (to capture overlap with compute).
cudaMemcpy time = 0% for NCCL vs. 16-20% for MPI	When DynamICCL's reward uses end-to-end wall time, this overhead is automatically captured; an algorithmic-bandwidth proxy would hide it. Argument for wall-time reward.
Defaults under-tested vs. virtualization	Defaults are not a strong baseline globally; even simple exploration in cross-container regimes will show large gains, validating the RL-vs-static-default comparison DynamICCL plans to publish.
GLOO/MPI degrade more gracefully than NCCL under virtualization	DynamICCL should consider catastrophic-fallback action: if measured collective time exceeds an envelope, retry with a safer config rather than committing the worst-case.
No NCCL knob sweep performed in paper	Direct gap DynamICCL fills: this paper sets the macro library context; DynamICCL operates one level deeper inside NCCL itself, which the paper explicitly does not explore.
Single-node only, RTX 3080, NCCL 2.4	DynamICCL's Chameleon-Cloud bare-metal setup with InfiniBand and modern NCCL (2.18+) operates in a regime this paper does not cover; the cross-container insight transfers, but the absolute numbers do not.

Synthesis: This paper is most useful to DynamICCL as a measurement-grade prior on regime-dependence rather than as a methodology to inherit. It gives concrete evidence that (a) deployment topology must be a state feature, (b) per-collective decisions are the right granularity, (c) end-to-end wall-time reward is necessary to capture host-staging overheads, and (d) there is large, reproducible regret to be reclaimed in non-bare-metal regimes — which is exactly the exploration target where DynamICCL's RL approach should pay off most.