Collective Communication Performance Evaluation for Distributed Deep Learning Training — Detailed Summary
Sookwang Lee (ETRI), Jaehwan Lee (Korea Aerospace University) | Applied Sciences (MDPI), Vol. 14, 5100 | 2024
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points capturing what is in that section, with emphasis on workloads, knobs, and regimes where defaults underperform.
Abstract
- Improper choice of collective communication library degrades DDL training performance because communication time grows superlinearly with cluster size while compute per device is fixed.
- The paper compares MPI (MPICH and OpenMPI, including CUDA-aware variants), GLOO, and NCCL across bare-metal, Singularity, single-Docker, and cross-Docker (multi-container) deployments on a single node.
- Headline numbers: NCCL Broadcast latency is 213% higher in cross-Docker than single-Docker; GLOO Gather latency is 36% lower in single-Docker than cross-Docker; NCCL AllReduce execution time is up to 345% lower than MPI/GLOO in favorable environments.
- The contribution is a decision rubric for picking a library/backend pair given deployment topology and DDL paradigm.
1. Introduction
- DDL has become essential because state-of-the-art models (e.g., GPT-3 at 175B parameters) cannot be trained on a single device within useful budgets.
- Communication is now the dominant bottleneck in DDL: at scale, collective-communication wall time can exceed compute wall time by more than 10x.
- Cloud and HPC operators increasingly run DDL workloads inside containers (Docker, Singularity) for reproducibility, isolation, and scheduling, but these layers introduce communication overheads that have not been systematically benchmarked.
- The paper's gap claim: prior collective-communication evaluation focused on bare-metal HPC and did not isolate the effect of intra-node container topology on NCCL/MPI/GLOO.
- The paper's contributions: (i) library-level micro-benchmarks across four deployment regimes, (ii) a PyTorch end-to-end study using both Parameter-Server and Ring-AllReduce paradigms, (iii) explicit best/worst pairings per primitive.
2. Background
Distributed paradigms covered:
- Parameter Server (PS): centralized aggregator pattern; worker GPUs send gradients, server averages and broadcasts new weights. Implemented with Broadcast + Gather.
- Ring All-Reduce: decentralized; every GPU sends/receives in a ring. It is bandwidth-optimal: for tensor size T on N GPUs, each GPU sends 2(N-1)/N * T in total, which stays below 2T regardless of N.
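The 2(N-1)/N * T figure can be checked with a small pure-Python simulation of the two ring phases (reduce-scatter, then all-gather). This is a minimal sketch for illustration, not the paper's code or NCCL's implementation; the function names are hypothetical, and it assumes the buffer length is divisible by the rank count.

```python
def ring_elements_sent_per_rank(n_ranks, n_elements):
    """Closed form: each rank sends 2(N-1)/N * T elements in total."""
    return 2 * (n_ranks - 1) * n_elements / n_ranks


def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a ring of equal-length per-rank lists.

    Returns (reduced_buffers, elements_sent_per_rank).
    Assumption: len(buffer) is divisible by the number of ranks.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]
    sent = [0] * n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, accumulate):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        outgoing = []
        for r in range(n):
            c = chunk_for_rank(r)
            outgoing.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            for k, i in enumerate(span(c)):
                data[dst][i] = data[dst][i] + payload[k] if accumulate else payload[k]
            sent[r] += chunk

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n.
    for s in range(n - 1):
        exchange(lambda r, s=s: (r - s) % n, accumulate=True)
    # Now rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) mod n.
    for s in range(n - 1):
        exchange(lambda r, s=s: (r + 1 - s) % n, accumulate=False)
    return data, sent


if __name__ == "__main__":
    reduced, sent = ring_allreduce([[r + 1] * 8 for r in range(4)])
    print(reduced[0], sent, ring_elements_sent_per_rank(4, 8))
```

Each rank sends one chunk of T/N elements in each of the 2(N-1) steps, which is where the closed form comes from.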
Five collective primitives summarized: Broadcast, Gather, AllGather, Reduce, AllReduce — with diagrams of the data-flow patterns.
Library landscape:
- MPI (MPICH, OpenMPI): host-memory-oriented; CUDA-aware variants register GPU buffers but still mediate via host where direct paths are absent.
- GLOO: Facebook's PyTorch-native CPU/GPU library; simpler than NCCL, sometimes more stable across virtualization.
- NCCL: NVIDIA's GPU-resident library using CUDA IPC, NVLink, and GPU-direct paths; the de facto choice for AllReduce on NVIDIA GPUs.
3. Architecture Comparison
- Walks through the call-graph of each library in two harnesses: a Linux shell (C++ direct calls) and PyTorch (Python wrappers).
- Highlights cudaMemcpy hot-spots in MPI/GLOO control flow versus NCCL's GPU-resident path that bypasses host staging.
- Explains how Docker's per-container CUDA context forces additional copies in cross-container deployments — the mechanistic root of the cross-Docker penalty observed later.
- The "single-Docker" regime keeps all GPUs under one CUDA context and one IPC namespace; the "cross-Docker" regime fragments these, breaking CUDA IPC fast-paths NCCL relies on.
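One way to check which path NCCL actually takes in a given regime is to turn on its logging before launching the job. The sketch below only builds an environment dictionary; `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are real NCCL environment variables, but the helper name is hypothetical, and using `NCCL_P2P_DISABLE` as a stand-in for the loss of the direct GPU-to-GPU path is an assumption that the paper does not test.

```python
import os


def nccl_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL logging enabled.

    NCCL_DEBUG=INFO makes NCCL log initialization details, including the
    transport it selects between ranks.
    disable_p2p is an optional comparison switch (assumption: useful as a
    rough contrast run; not the paper's methodology).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    # Pass the result as env= to subprocess.run(...) when launching a benchmark.
    print(nccl_debug_env({}, disable_p2p=True))
```

Comparing the logged transports between a single-container and a multi-container launch would show directly whether the IPC path survives.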
4. Experimental Setup
Hardware (Table 1):
| Component | Specification |
|---|---|
| GPUs | 4x NVIDIA GeForce RTX 3080, 12 GiB each |
| GPU interconnect | PCIe Gen3, 16 GB/s bidirectional |
| CPU | Intel Core i9-10900, 10 cores |
| Memory | 32 GB DDR4 @ 2933 MHz |
| Network | Single-node only (no inter-node) |
Software stack:
| Component | Version |
|---|---|
| NVIDIA driver | 515.48 |
| CUDA | 11.3 |
| NCCL | 2.4 |
| OpenMPI | 4.1.4 (and CUDA-aware) |
| MPICH | 3.3 |
| Docker | 20.10.18 |
| Singularity | (version not specified) |
| PyTorch | 2.0.1 |
Deployment regimes:
- Bare metal.
- Singularity (single image, all GPUs).
- Single-Docker (one container, all 4 GPUs).
- Cross-Docker (4 containers, 1 GPU each).
Workload axes:
- Linux-shell micro-benchmarks: 1 GB random tensors; Broadcast, Gather, AllReduce; GPU count swept 1 to 4.
- PyTorch macro-benchmark: ResNet-18 on CIFAR-10, batch size 32, 10 epochs; PS and Ring-AllReduce paradigms.
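A micro-benchmark of this kind needs warm-up iterations and explicit synchronization, because GPU collectives return before the work finishes. The harness below is a minimal sketch and not the paper's measurement code; `time_collective` and its parameters are hypothetical names. With PyTorch, `sync` would typically be `torch.cuda.synchronize`.

```python
import statistics
import time


def time_collective(run, sync=lambda: None, warmup=2, iters=5):
    """Time a callable that launches one collective.

    run  : launches the collective once (hypothetical callable).
    sync : blocks until outstanding device work is done; defaults to a no-op.
    """
    for _ in range(warmup):
        run()  # warm-up runs absorb one-time setup costs
    sync()
    samples = []
    for _ in range(iters):
        sync()  # make sure nothing from the previous run is still in flight
        start = time.perf_counter()
        run()
        sync()  # wait for completion before stopping the clock
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "samples": samples,
    }


if __name__ == "__main__":
    print(time_collective(lambda: sum(range(100_000)))["median_s"])
```

Reporting the median alongside min and max makes the regime-to-regime spread visible without letting a single outlier dominate.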
5. Linux Shell Experiments
Broadcast (1 GB):
- NCCL on bare-metal/single-Docker: lowest latency due to CUDA IPC path.
- NCCL on cross-Docker: latency rises 213% vs. single-Docker — IPC fast-path is broken across container CUDA contexts.
- MPI is more stable across regimes but slower than NCCL when NCCL's fast-path is intact.
- GLOO sits between, with smaller spread across regimes than NCCL.
Gather (1 GB):
- CUDA-aware OpenMPI on bare-metal: 2.22s, beats standard MPI at 2.38–2.56s.
- GLOO Gather in single-Docker: 36% lower latency than cross-Docker.
- For Gather, NCCL's lead over MPI shrinks compared to AllReduce.
AllReduce (1 GB):
- NCCL achieves up to 345% lower execution time than MPI/GLOO on bare-metal and single-Docker.
- NCCL 4-GPU AllReduce: 1.19s on bare-metal, ~2.20s in cross-Docker (~85% slowdown).
- Time breakdown (Table 4): NCCL spends 0% of AllReduce time on host-side cudaMemcpy, whereas MPI spends ~16-20% there. This helps explain the bare-metal gap and why CUDA-aware MPI narrows it.
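The slowdown quoted above, and the bandwidth it implies, can be reproduced with a few lines of arithmetic. This sketch assumes 1 GB means 10^9 bytes and uses the bus-bandwidth convention from NVIDIA's nccl-tests (algorithm bandwidth scaled by 2(N-1)/N for AllReduce); the function names are hypothetical.

```python
def percent_slowdown(baseline_s, measured_s):
    """Relative increase of measured_s over baseline_s, in percent."""
    return (measured_s / baseline_s - 1.0) * 100.0


def algorithm_bandwidth_gbs(message_bytes, seconds):
    """Message size divided by wall time, in GB/s (10^9 bytes per second)."""
    return message_bytes / seconds / 1e9


def allreduce_bus_bandwidth_gbs(message_bytes, seconds, n_ranks):
    """nccl-tests convention: algbw * 2(N-1)/N for AllReduce."""
    factor = 2 * (n_ranks - 1) / n_ranks
    return algorithm_bandwidth_gbs(message_bytes, seconds) * factor


if __name__ == "__main__":
    one_gb = 1_000_000_000  # assumption: decimal gigabyte
    # 1.19 s bare-metal vs 2.20 s cross-Docker, 4 GPUs (figures from the summary)
    print(round(percent_slowdown(1.19, 2.20), 1))       # about 84.9
    print(round(algorithm_bandwidth_gbs(one_gb, 1.19), 2))
    print(round(allreduce_bus_bandwidth_gbs(one_gb, 1.19, 4), 2))
```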
Scaling 1 to 4 GPUs:
- NCCL scales near-ideally on bare-metal/single-Docker.
- In cross-Docker, NCCL's scaling curve flattens — the per-container isolation cost grows with rank count.
6. PyTorch Experiments
Parameter-Server paradigm (ResNet-18, CIFAR-10):
- Implemented as Broadcast (weights) + Gather (gradients).
- MPI is the recommended backend on bare-metal/Singularity because NCCL forces an extra GPU as the "server" rank; that GPU is otherwise idle.
- Cross-Docker hurts both libraries; MPI degrades less.
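The Broadcast + Gather structure of one Parameter-Server step can be sketched in plain Python. This is an illustration of the communication pattern only, with lists standing in for tensors and hypothetical function names; it is not the paper's PyTorch implementation.

```python
def broadcast(value, n_workers):
    """Server -> workers: every worker receives its own copy of the weights."""
    return [list(value) for _ in range(n_workers)]


def gather(per_worker_values):
    """Workers -> server: the server collects one gradient list per worker."""
    return [list(v) for v in per_worker_values]


def ps_step(weights, worker_grad_fns, lr):
    """One PS iteration: broadcast weights, gather gradients, average, update."""
    replicas = broadcast(weights, len(worker_grad_fns))
    grads = gather([fn(w) for fn, w in zip(worker_grad_fns, replicas)])
    n = len(grads)
    avg = [sum(g[i] for g in grads) / n for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]


if __name__ == "__main__":
    # Two hypothetical workers returning fixed gradients.
    workers = [lambda w: [1.0, 1.0], lambda w: [3.0, 5.0]]
    print(ps_step([1.0, 2.0], workers, lr=0.5))
```

The server rank does only the averaging and update, which is why dedicating a GPU to it leaves that device mostly idle.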
Ring-AllReduce paradigm (ResNet-18, CIFAR-10):
- NCCL is consistently fastest on bare-metal and single-Docker.
- In cross-Docker, NCCL's advantage shrinks substantially; GLOO becomes competitive at low GPU count.
- Iteration time is dominated by AllReduce of gradient tensors at the end of each backward pass — exactly the regime where NCCL's bare-metal lead is largest.
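To get a feel for how much gradient data this workload pushes through AllReduce, the arithmetic below estimates the per-iteration payload and the number of iterations. Several inputs are assumptions not stated in the summary: roughly 11.2 million parameters for a 10-class ResNet-18, float32 gradients, the 50,000-image CIFAR-10 training split, and batch size 32 treated as the global batch.

```python
import math


def allreduce_volume(n_params, bytes_per_param, dataset_size, batch_size, epochs):
    """Estimate gradient payload per iteration and total iterations."""
    grad_bytes = n_params * bytes_per_param
    iters_per_epoch = math.ceil(dataset_size / batch_size)
    total_iters = iters_per_epoch * epochs
    return {
        "grad_bytes_per_iter": grad_bytes,
        "iters_per_epoch": iters_per_epoch,
        "total_iters": total_iters,
        "total_grad_bytes": grad_bytes * total_iters,
    }


if __name__ == "__main__":
    est = allreduce_volume(
        n_params=11_200_000,   # assumption: approximate ResNet-18 size
        bytes_per_param=4,     # assumption: float32 gradients
        dataset_size=50_000,   # CIFAR-10 training split
        batch_size=32,         # assumption: global batch size
        epochs=10,
    )
    print(est)
```

Under these assumptions the per-iteration payload is about 45 MB, well below the 1 GB message used in the micro-benchmarks, so the end-to-end gap need not match the micro-benchmark gap.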
Training-time delta:
- The micro-benchmark gap (3.45x AllReduce) translates into measurable end-to-end training-time differences across regimes; the paper presents these as bar charts (Figures 13–18) rather than aggregate tables.
7. Summary of Results
Best/worst pairings (Tables 11–12):
| Primitive | Best (regime, library) | Worst (regime, library) |
|---|---|---|
| Broadcast | Bare-metal NCCL | Cross-Docker NCCL (+213%) |
| Gather | Bare-metal CUDA-aware OpenMPI | Cross-Docker GLOO (high spread) |
| AllReduce | Bare-metal NCCL | Cross-Docker MPI/GLOO |
- Cross-Docker is a uniformly bad regime for collectives whose fast paths depend on CUDA IPC.
- Singularity is closer to bare-metal than Docker because it does not fragment the CUDA context across ranks.
- CUDA-aware MPI partially closes the NCCL gap but never overtakes it on AllReduce.
8. Related Work
- Compares to OSU micro-benchmarks (MPI-only, no NCCL/GLOO, no containers) and Horovod-timeline-based studies (framework-level only, no per-primitive isolation).
- Notes prior NCCL studies have focused on inter-node InfiniBand fabrics (e.g., Demystifying NCCL) and have not characterized intra-node container effects.
9. Conclusions
- Recommendation matrix:
  - PS paradigm + bare-metal/Singularity: use MPI.
  - Ring-AllReduce + bare-metal/single-Docker: use NCCL.
  - Cross-Docker: re-evaluate; NCCL's fast-paths are broken; consider consolidating GPUs into a single container or switching to MPI/GLOO.
- Future work: extend to inter-node InfiniBand/RoCE; study network-topology effects on virtualized layers; cover larger transformer models.
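The recommendation matrix above is small enough to encode as a lookup table. The sketch below transcribes only the pairings listed in this summary; the key names and the function are hypothetical, and combinations the summary does not cover return an explicit "no recommendation" value rather than a guess.

```python
RECOMMENDATIONS = {
    ("parameter_server", "bare_metal"): "mpi",
    ("parameter_server", "singularity"): "mpi",
    ("ring_allreduce", "bare_metal"): "nccl",
    ("ring_allreduce", "single_docker"): "nccl",
}

CROSS_DOCKER_ADVICE = (
    "re-evaluate: consolidate GPUs into one container or try mpi/gloo"
)


def recommend_backend(paradigm, regime):
    """Return the backend suggested in the summary for (paradigm, regime)."""
    if regime == "cross_docker":
        return CROSS_DOCKER_ADVICE
    return RECOMMENDATIONS.get((paradigm, regime), "no recommendation in summary")


if __name__ == "__main__":
    print(recommend_backend("ring_allreduce", "single_docker"))
    print(recommend_backend("parameter_server", "cross_docker"))
```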
Tables and Figures of Interest
- Table 1 — Hardware specification (RTX 3080, PCIe Gen3, 16 GB/s).
- Table 4 — Time breakdown of AllReduce showing NCCL = 0% cudaMemcpy vs. MPI ~16–20% cudaMemcpy.
- Tables 11–12 — Best/worst (library, regime) pairings per primitive.
- Figures 13–18 — Bar charts showing the cross-Docker latency spike for NCCL across Broadcast/Gather/AllReduce.
Configurations and Knobs Varied (compact view)
| Axis | Values |
|---|---|
| Library | MPICH, OpenMPI, CUDA-aware OpenMPI, GLOO, NCCL 2.4 |
| Primitive | Broadcast, Gather, AllReduce |
| GPU count | 1, 2, 3, 4 |
| Message size | 1 GB (micro); ResNet-18 gradients (macro) |
| Deployment regime | Bare-metal, Singularity, Single-Docker, Cross-Docker |
| DDL paradigm | Parameter Server, Ring AllReduce |
| Model | ResNet-18 (CIFAR-10, batch 32, 10 epochs) |
| NCCL knobs exposed | None (defaults only; no sweep of NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, or NCCL_MIN_NCHANNELS/NCCL_MAX_NCHANNELS) |
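The micro-benchmark axes in the table can be enumerated as a cross product to see how large the full grid would be. This is a sketch with hypothetical names; the paper may not have run every combination (for example, some primitives may be unavailable for some libraries).

```python
from itertools import product

AXES = {
    "library": ["mpich", "openmpi", "openmpi_cuda_aware", "gloo", "nccl"],
    "primitive": ["broadcast", "gather", "allreduce"],
    "gpu_count": [1, 2, 3, 4],
    "regime": ["bare_metal", "singularity", "single_docker", "cross_docker"],
}


def experiment_grid(axes):
    """Return one dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


if __name__ == "__main__":
    grid = experiment_grid(AXES)
    print(len(grid))  # 5 * 3 * 4 * 4 = 240 combinations
    print(grid[0])
```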
Relevance to DynamICCL — Mapping Table
DynamICCL is an RL-based NCCL configuration optimizer. Agent-2 picks per-collective (algorithm, protocol, nChannels, numThreads). The ideas in this paper map onto DynamICCL's reward-design and exploration strategy as follows:
| Paper finding | DynamICCL implication |
|---|---|
| Best library/regime pairing flips across deployment topology | Add a topology descriptor (bare-metal / Singularity / single-container / cross-container, plus PCIe/NVLink path) to the state vector; without it, the policy will average over regimes and miss optima. |
| NCCL Broadcast +213% in cross-Docker | Treat container-isolation regime as a high-leverage exploration axis; expect large regret if Agent-2 is trained only on bare-metal traces and deployed in containers. |
| NCCL AllReduce execution time up to 345% lower than MPI/GLOO in favorable regimes (as reported) | The action space may need to include "library family" as a coarse discrete dimension above (algo, proto, nChannels, numThreads), or DynamICCL must explicitly assume NCCL and accept that it is bounded above by NCCL's intrinsic envelope. |
| Per-primitive best library differs (Broadcast/Gather/AllReduce) | Per-collective action selection is the correct granularity; the paper's per-primitive results indicate that a single global config is suboptimal. |
| 1 GB micro-benchmark + ResNet-18 macro both used | DynamICCL's training corpus should include both isolated micro-collectives (for transfer to new collective sizes) and end-to-end DDL traces (to capture overlap with compute). |
| cudaMemcpy time = 0% for NCCL vs. 16-20% for MPI | When DynamICCL's reward uses end-to-end wall time, this overhead is automatically captured; an algorithmic-bandwidth proxy would hide it. Argument for wall-time reward. |
| Defaults under-tested under virtualization | Defaults are not a strong baseline globally; even simple exploration in cross-container regimes is likely to show large gains, which supports the RL-vs-static-default comparison DynamICCL plans to publish. |
| GLOO/MPI degrade more gracefully than NCCL under virtualization | DynamICCL should consider a catastrophic-fallback action: if measured collective time exceeds an envelope, retry with a safer config rather than committing to the worst case. |
| No NCCL knob sweep performed in paper | Direct gap DynamICCL fills: this paper sets the macro library context; DynamICCL operates one level deeper inside NCCL itself, which the paper explicitly does not explore. |
| Single-node only, RTX 3080, NCCL 2.4 | DynamICCL's Chameleon-Cloud bare-metal setup with InfiniBand and modern NCCL (2.18+) operates in a regime this paper does not cover; the cross-container insight transfers, but the absolute numbers do not. |
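The first row of the mapping (topology as a state feature) and the per-collective action tuple can be sketched as follows. Everything here is hypothetical illustration, not DynamICCL's actual design: the regime and interconnect labels, the one-hot encoding, and the candidate nChannels values are assumptions. The algorithm and protocol names are values accepted by NCCL_ALGO and NCCL_PROTO, and the thread counts are the values NCCL_NTHREADS accepts.

```python
from itertools import product

REGIMES = ("bare_metal", "singularity", "single_container", "cross_container")
INTERCONNECTS = ("pcie", "nvlink")


def one_hot(value, choices):
    """One-hot encode value over choices; raises ValueError if unknown."""
    index = choices.index(value)
    return [1 if i == index else 0 for i in range(len(choices))]


def encode_topology(regime, interconnect):
    """Topology descriptor to append to the agent's state vector."""
    return one_hot(regime, REGIMES) + one_hot(interconnect, INTERCONNECTS)


# Hypothetical discrete action space: (algorithm, protocol, nChannels, numThreads).
ACTIONS = list(
    product(
        ("Ring", "Tree"),
        ("LL", "LL128", "Simple"),
        (1, 2, 4, 8),          # assumption: illustrative channel counts
        (64, 128, 256, 512),
    )
)

if __name__ == "__main__":
    print(encode_topology("cross_container", "pcie"))
    print(len(ACTIONS), ACTIONS[0])
```

Keeping the topology descriptor separate from the action tuple lets the same policy be evaluated across regimes without changing the action space.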
Synthesis: This paper is most useful to DynamICCL as a measurement-grade prior on regime-dependence rather than as a methodology to inherit. It gives concrete evidence that (a) deployment topology must be a state feature, (b) per-collective decisions are the right granularity, (c) end-to-end wall-time reward is necessary to capture host-staging overheads, and (d) there is large, reproducible regret to be reclaimed in non-bare-metal regimes — which is exactly the exploration target where DynamICCL's RL approach should pay off most.
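Points (c) and the fallback idea from the mapping table can be combined in a small wrapper: measure wall time for the chosen config, and if it exceeds an envelope, rerun with a safer config. This is a hypothetical sketch rather than DynamICCL's implementation; `run_collective`, the config objects, and the choice to charge both attempts to the reward are all assumptions.

```python
import time


def timed(run_collective, config, clock):
    """Wall time of one collective launched with the given config."""
    start = clock()
    run_collective(config)
    return clock() - start


def run_with_fallback(run_collective, chosen, safe, envelope_s,
                      clock=time.perf_counter):
    """Run with the chosen config; retry with the safe one if too slow.

    Returns (config_used, reward, fell_back). The reward is negative wall
    time; on fallback both attempts are charged (assumption) so the agent
    sees the full cost of a bad choice.
    """
    elapsed = timed(run_collective, chosen, clock)
    if elapsed <= envelope_s:
        return chosen, -elapsed, False
    retry = timed(run_collective, safe, clock)
    return safe, -(elapsed + retry), True


if __name__ == "__main__":
    # Fake clock so the example is deterministic: 5.0 s, then 0.5 s.
    ticks = iter([0.0, 5.0, 5.0, 5.5])
    result = run_with_fallback(lambda cfg: None, "aggressive", "default",
                               envelope_s=1.0, clock=lambda: next(ticks))
    print(result)
```

Because the reward is measured wall time, host-staging overheads such as cudaMemcpy are included without needing to model them separately.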