Horovod: fast and easy distributed deep learning in TensorFlow — Detailed Summary

Alexander Sergeev, Mike Del Balso | Uber Technologies, Inc. | arXiv:1802.05799, February 2018

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction


2. Going distributed


3. Leveraging a different type of algorithm


4. Introducing Horovod


5. Distributing your training job with Horovod


6. Horovod Timeline


7. Tensor Fusion


8. Horovod Benchmarks

Quantitative summary (consolidated)

Benchmark Network Models Scaling efficiency at 128 GPUs
Standard distributed TensorFlow 25 GbE TCP Inception-V3, ResNet-101 ~50% (baseline)
Horovod 25 GbE TCP Inception-V3, ResNet-101 88%
Horovod 25 GbE RDMA (IB/RoCE) Inception-V3, ResNet-101 >90% (+3-4% over TCP)
Horovod 25 GbE RDMA VGG-16 +~30% vs. Horovod over TCP
Optimization Reported benefit
NCCL 2 backend (vs. Baidu's MPI ring-allreduce) enables cross-machine ring-allreduce, topology-aware
Tensor Fusion (default 64 MB buffer) up to 65% speedup on layer-rich models on unoptimized TCP
Broadcast init hook replaces hand-rolled parameter-sync code at step 0

9. Next steps


10. Tables and Figures (consolidated)

# Caption (paraphrased) Key content
Fig 1 Multi-GPU scaling using standard TensorFlow ~50% capacity loss at 128 GPUs, Inception-V3 and ResNet-101
Fig 2 Data-parallel training pattern 4-step loop: read, compute grads, average, update
Fig 3 Parameter-server architecture 1-PS vs. multi-PS configurations
Fig 4 Ring-allreduce algorithm N nodes, 2 ⋅ (N−1) steps, bandwidth-optimal
Fig 5 Horovod Timeline screenshot Chrome about:tracing view of distributed events
Fig 6 Horovod vs. standard distributed TensorFlow on 25 GbE TCP 88% scaling at 128 GPUs (Horovod) vs. ~50% (baseline)
Fig 7 Horovod TCP vs. Horovod RDMA on 25 GbE +3-4% on Inception-V3 / ResNet-101; ~30% on VGG-16
Listing 1 Example TensorFlow program with Horovod shows the 4-line modification budget

11. Cross-Cutting Empirical Take-Aways

Take-away Derived from
Parameter-server is the wrong default for synchronous data-parallel SGD on >=128 GPUs Sec. 2 (~50% efficiency)
Ring-allreduce + MPI launch eliminates almost all distributed-systems boilerplate Sec. 3, 5
NCCL 2 is the right cross-machine ring-allreduce backend (vs. CUDA-aware MPI) Sec. 4 (Step 2)
Many small allreduces are pathological for ring-allreduce Sec. 7 (Tensor Fusion 65% gain)
Communication-heavy models (VGG-16) benefit most from RDMA; compute-heavy models (ResNet/Inception) benefit modestly Sec. 8 (Fig. 7)
A 4-line user API is sufficient if MPI handles process launch Sec. 5 (Listing 1)
Distributed profiling needs a unified cross-rank timeline, not per-rank traces Sec. 6

12. Limitations of the Paper


13. Open Problems Implicitly Surfaced

  1. Adaptive Tensor Fusion. A static 64 MB buffer is unlikely to be optimal across the {fabric, GPU generation, model topology} cross- product; a runtime adaptive policy that picks fusion granularity per step is a natural follow-up.
  2. Heterogeneous-fabric scaling. When some links are NVLink and others are 25 GbE TCP, the optimal ring ordering and chunk size are non-obvious — Horovod delegates this to NCCL but does not expose user- level controls.
  3. Beyond synchronous SGD. Horovod's allreduce abstraction does not help large-batch convergence problems; pairing ring-allreduce with gradient compression, local SGD, or hierarchical synchronization is left to the user.
  4. Model parallelism. Single-replica-too-large is explicitly listed as future work; Horovod itself does not provide pipeline or tensor- parallel primitives.
  5. Operationalizing MPI. Ease of MPI install on production Kubernetes clusters is called out as a real obstacle — orchestration tooling for collective frameworks is itself a research/engineering open problem.

14. Discussion of NCCL

Note on NCCL Tuning

Horovod's Tensor Fusion is a direct example of why the post-fusion message size — rather than the user's per-tensor sizes — is the right input to any NCCL configuration choice. The 65% TCP improvement from fusing many tiny gradients into a single 64 MB buffer establishes that small-message NCCL calls are bandwidth-starved by setup overhead, exactly the regime where LL/LL128 protocols and lower nChannels would help if fusion were not applied. Conversely, once Horovod fuses to a large buffer, NCCL's Ring + Simple at higher channel counts becomes the right choice — so any tuner operating beneath Horovod must read the fused message size, not the original tensor sizes, to pick the correct configuration.