Horovod: fast and easy distributed deep learning in TensorFlow — Detailed Summary

Alexander Sergeev, Mike Del Balso | Uber Technologies, Inc. | arXiv:1802.05799, February 2018

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.

Abstract

Modern deep-learning models — especially those at the frontier of image, speech, and natural-language processing — require very large amounts of GPU compute, and a single GPU is no longer sufficient for production-scale training in reasonable wall-clock time.
Naively scaling TensorFlow to many GPUs introduces two interlocking obstacles: high inter-GPU communication overhead at synchronization points, and intrusive code-restructuring requirements that discourage researchers from migrating their existing single-GPU scripts.
The paper introduces Horovod, an open-source library (Apache 2.0) built on ring-allreduce plus an MPI launcher; the design goal is near-linear scaling efficiency and a four-line code-modification budget for an existing single-GPU TensorFlow program.

1. Introduction

Deep learning has produced dramatic accuracy gains across image recognition, speech recognition, and forecasting; at Uber, these capabilities power self-driving research, ETA / trip forecasting, and fraud-prevention pipelines, motivating an internal investment in faster training.
Uber standardizes on TensorFlow because of its broad popularity, competitive performance, support for high-level APIs (notably Keras), and the ability to drop down to custom CUDA kernels when a model needs it; these properties are taken as fixed constraints for any distributed-training solution.
Horovod was open-sourced in September 2017 as a component of Uber's Michelangelo ML-as-a-service platform, with the explicit aim of making distributed deep-learning training accessible to ML practitioners across the company without requiring distributed-systems expertise.

2. Going distributed

As Uber's models and datasets grew, single-GPU training times stretched from hours to a week or longer, which became incompatible with the iteration cadence the research and engineering teams required.
The team's first approach was the standard distributed TensorFlow package (parameter-server based); experiments revealed two qualitative problems — code complexity and poor scaling — that together motivated a ground-up redesign.
Code complexity: the standard API forced users to learn an unfamiliar vocabulary including tf.Server(), tf.ClusterSpec(), tf.train.SyncReplicasOptimizer(), and tf.train.replica_device_setter(), along with explicit roles ("worker" vs. "parameter server"), which together constituted a steep on-ramp for researchers whose only goal was to speed up an existing model.
These distributed primitives also introduced hard-to-diagnose bugs — a misconfigured cluster spec or a mis-placed device placement could fail silently or hang, so debugging time grew rather than shrank.
Poor scaling: benchmarks of the standard package on 128 GPUs of NVIDIA Pascal-class hardware showed that approximately half of the available compute (~50%) was lost to communication overhead.
Figure 1 plots scaling for Inception-V3 and ResNet-101 on 1, 4, 8, 16, 32, 64, and 128 GPUs and shows that standard distributed TensorFlow diverges sharply from the ideal-linear curve as GPU count grows, losing almost 50% of capacity at 128 GPUs.
Facebook's 2017 result of training ResNet-50 on 256 GPUs in one hour using data-parallel SGD with learning-rate scaling provided concrete evidence that — with the right communication algorithm and hyperparameter schedule — strong scaling well beyond Uber's then-current reach was achievable, motivating the ring-allreduce direction.

3. Leveraging a different type of algorithm

Uber sought a distributed-training pattern aimed specifically at models that fit on a single GPU or a single multi-GPU server — the regime that covers the vast majority of production deep-learning models — and adopted Facebook's data-parallel paradigm.
The data-parallel paradigm replicates the training script on each worker; each worker independently (1) reads a shard of data, (2) computes gradients on its mini-batch, (3) averages gradients across all workers, (4) updates the local model copy, and (5) loops — with the averaging step being the only inter-worker synchronization.
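The loop above can be sketched in a single process. This is a minimal illustration, not code from the paper: the one-parameter model y = w * x, the four simulated workers, and the names local_gradient and data_parallel_sgd are all assumptions made for the example.

```python
import random


def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_sgd(shards, steps=300, lr=0.1):
    """Run synchronous data-parallel SGD; returns one weight per simulated worker."""
    weights = [0.0 for _ in shards]  # every replica starts from the same value
    for _ in range(steps):
        # (1)-(2) each worker reads its own shard and computes a local gradient
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # (3) the only synchronization point: average gradients across workers
        avg = sum(grads) / len(grads)
        # (4) every worker applies the same averaged gradient to its local copy
        weights = [w - lr * avg for w in weights]
        # (5) loop
    return weights


random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(64))]
shards = [data[i::4] for i in range(4)]  # 4 hypothetical workers, 16 samples each
final_weights = data_parallel_sgd(shards)
print(final_weights)
```

Because every replica starts from the same weight and applies the same averaged gradient, the copies never drift apart; this is why only the averaging step needs inter-worker communication.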
Standard TensorFlow implements gradient averaging via a parameter-server (PS) architecture: workers compute gradients, push them to one or more PS processes that hold the master parameters, and pull back updated weights for the next step.
The PS approach exposed two challenges: (a) choosing a workable worker-to-PS ratio is non-obvious — too few PSes saturate the network at the server side, too many creates "all-to-all" cross-traffic that also saturates; (b) it forces users to handle distribution mechanics in user code.
In particular, users had to explicitly start worker and PS processes, thread host/port information through their scripts, and restructure models with multi-GPU "towers" — distractions from the actual modeling work.
In early 2017, Baidu evangelized the ring-allreduce algorithm — drawing on Patarasuk and Yuan (2009) — and shared a draft TensorFlow implementation showing the algorithm's promise as a drop-in replacement for parameter-server gradient averaging.
Ring-allreduce (illustrated in Figure 4) arranges the N workers into a logical ring; each worker communicates only with its two neighbors and performs 2 ⋅ (N−1) communication steps to complete one allreduce. The algorithm is bandwidth-optimal: for a message of total size M, each worker sends a chunk of about M/N bytes per step, so it sends 2 ⋅ (N−1) ⋅ M/N bytes in total, which stays below 2 ⋅ M no matter how large N grows.
The allreduce approach is significantly easier to operationalize than PS because mature MPI implementations (e.g., Open MPI) transparently set up the distributed infrastructure, hostfile, and process launch — eliminating the cluster-spec boilerplate that PS required.
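The ring-allreduce pattern described in this section can be simulated in one process to make the two phases concrete. This is a sketch under simplifying assumptions (the vector length divides evenly by N, workers are list entries rather than processes, the reduction is a sum); the function name ring_allreduce is illustrative and this is not Horovod or NCCL code.

```python
def ring_allreduce(worker_data):
    """Simulate a sum ring-allreduce over equal-length lists, one list per worker.

    Returns (per-worker results, number of communication steps).
    """
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into N chunks"
    size = length // n
    # Each worker holds its own copy of the vector, viewed as N chunks.
    bufs = [[list(d[c * size:(c + 1) * size]) for c in range(n)] for d in worker_data]
    steps = 0

    # Phase 1, scatter-reduce: after N-1 steps worker r owns the full sum of chunk (r+1) % N.
    for s in range(n - 1):
        # Snapshot all outgoing messages first so the sends are simultaneous.
        msgs = [((r - s) % n, bufs[r][(r - s) % n][:]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n  # each worker only sends to its right neighbor
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
        steps += 1

    # Phase 2, allgather: the completed chunks travel around the ring for N-1 more steps.
    for s in range(n - 1):
        msgs = [((r + 1 - s) % n, bufs[r][(r + 1 - s) % n][:]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            bufs[(r + 1) % n][c] = payload
        steps += 1

    flat = [[v for chunk in b for v in chunk] for b in bufs]
    return flat, steps


# Four hypothetical workers, each holding an 8-element gradient vector.
grads = [[float(r * 8 + i) for i in range(8)] for r in range(4)]
reduced, steps = ring_allreduce(grads)
averaged = [[v / len(grads) for v in w] for w in reduced]
print(steps, averaged[0])
```

Each message in the simulation carries one chunk of length/N elements, and the step counter ends at 2 ⋅ (N−1), matching the count given for Figure 4.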

4. Introducing Horovod

Uber built its own implementation on top of Baidu's ring-allreduce draft; the goal was to bring it up to production-grade usability and to improve its raw collective performance for Uber's workloads.
Step 1 — Standalone Python package. Horovod was packaged as a standalone pip-installable Python module (named after a Russian folk dance for grouped synchronized movement), reducing installation from the Baidu draft's roughly one hour of TensorFlow-source patching to minutes, and decoupling Horovod's release cadence from TensorFlow's.
Step 2 — NCCL 2 backend. The original CUDA-aware MPI ring-allreduce was replaced with NVIDIA NCCL 2, which provides a highly optimized ring-allreduce that works across multiple machines (not only intra-node) and exploits NVLink / PCIe / InfiniBand topologies automatically.
Step 3 — Multi-GPU server support. The implementation was extended beyond Baidu's original single-GPU-per-node assumption to support models that fit on a single server with multiple GPUs (e.g., a DGX-style 4- or 8-GPU box), so practitioners could begin scaling within a node before reaching across nodes.
Step 4 — API improvements. A broadcast operation was added so that all workers initialize from the same model parameters at step zero, and the user-visible API was reduced so that an existing single-GPU TensorFlow training script needs only four distinct modifications (init, GPU pinning, optimizer wrap, variable broadcast hook) to become a distributed Horovod program.

5. Distributing your training job with Horovod

Unlike the parameter-server paradigm, which required hundreds of lines of boilerplate, a Horovod-converted training script requires only a handful of lines, demonstrated in Listing 1 of the paper.
The four required modifications are: (1) call hvd.init() at program start, (2) pin each process to one GPU via config.gpu_options.visible_device_list = str(hvd.local_rank()), (3) wrap the user's optimizer with hvd.DistributedOptimizer(opt) so that gradients are averaged via ring-allreduce before they are applied, and (4) install hvd.BroadcastGlobalVariablesHook(0) so that rank-0's initial parameters are broadcast to all other ranks.
Programs are launched via MPI, e.g. mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py, which automatically spawns one process per GPU across the listed hosts — no application-level tf.ClusterSpec() is needed.
Horovod also supports Keras programs through the same four steps, making the same approach viable for users working at the Keras-API level rather than directly in TensorFlow.
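The relationship between the mpirun arguments and the per-process GPU pinning can be sketched as follows. The helper names build_mpirun_command and rank_layout are hypothetical, and the layout assumes MPI fills each host's slots in the listed order before moving to the next host; an actual launcher may be configured to map ranks differently.

```python
def build_mpirun_command(hosts, script="train.py"):
    """Build the launch line from an ordered list of (hostname, gpu_count) pairs."""
    total = sum(count for _, count in hosts)  # one process per GPU
    host_arg = ",".join(f"{name}:{count}" for name, count in hosts)
    return f"mpirun -np {total} -H {host_arg} python {script}"


def rank_layout(hosts):
    """Map each global rank to (hostname, local_rank), assuming hosts fill in order."""
    layout = []
    for name, count in hosts:
        for local_rank in range(count):
            layout.append((name, local_rank))
    return layout


hosts = [("server1", 4), ("server2", 4), ("server3", 4), ("server4", 4)]
command = build_mpirun_command(hosts)
layout = rank_layout(hosts)
print(command)
print(layout[5])  # global rank 5 under the fill-in-order assumption
```

Under this assumption the local rank is the process's index within its own server, which is why pinning to str(hvd.local_rank()) gives each process on a host a different GPU.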

6. Horovod Timeline

Distributed deep-learning is notoriously hard to debug because each rank's local TensorFlow timeline and CUDA Profiler trace must be manually cross-referenced across servers to understand why a step is slow — Uber found this workflow inadequate at scale.
Horovod Timeline is a built-in profiling tool that emits a unified trace consumable by Chrome's about:tracing viewer; it is enabled by setting a single environment variable, and it reveals exactly which rank is doing what at each timestep, including allreduce starts, completions, and idle/wait periods. Figure 5 in the paper depicts a representative trace.
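To illustrate what a unified cross-rank trace looks like, the sketch below merges per-rank event lists into the JSON array form of the Chrome trace event format, using one process row per rank. It is an assumption-laden illustration and not Horovod's actual emitter: the event names, timings, and the function merge_rank_traces are invented for the example. (In Horovod's own documentation the enabling variable is HOROVOD_TIMELINE, set to an output file path; the summary above does not name it.)

```python
import json


def merge_rank_traces(per_rank_events):
    """Merge {rank: [(name, start_us, duration_us), ...]} into one Chrome trace JSON string."""
    trace = []
    for rank, events in sorted(per_rank_events.items()):
        for name, start_us, duration_us in events:
            trace.append({
                "name": name,
                "ph": "X",          # "complete" event: a start time plus a duration
                "ts": start_us,     # timestamps are in microseconds
                "dur": duration_us,
                "pid": rank,        # the viewer groups events by pid, so one row per rank
                "tid": 0,
            })
    trace.sort(key=lambda e: e["ts"])
    return json.dumps(trace, indent=2)


# Hypothetical step: rank 0 finishes its gradients early and idles until rank 1 is ready.
events = {
    0: [("compute_gradients", 0, 900), ("wait_for_other_ranks", 900, 500), ("allreduce", 1400, 300)],
    1: [("compute_gradients", 0, 1400), ("allreduce", 1400, 300)],
}
trace_json = merge_rank_traces(events)
print(trace_json)
```

Viewed side by side, the rows make the straggler visible at a glance: the idle period on rank 0 lines up with the tail of rank 1's gradient computation.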

7. Tensor Fusion

Profiling with Horovod Timeline revealed that models with many layers — ResNet-101 in particular — emit a large number of very small allreduce calls (one per layer-gradient), and ring-allreduce is inefficient on such tiny messages because the per-call setup overhead dominates the actual bandwidth-bound transfer.
The Tensor Fusion algorithm addresses this by fusing many small ready tensors into a single contiguous buffer, performing one ring-allreduce on the buffer, and copying the reduced results back into the original output tensors — amortizing the per-call overhead across all fused tensors.
Tensor Fusion produced up to 65% performance improvement on layer-rich models (e.g., ResNet-101) when running on unoptimized TCP networks (where the per-call overhead is largest).
The Fusion procedure each step is: (a) inspect the queue of pending allreduces, (b) select the subset of tensors that share a dtype and fit within a configurable buffer (default 64 MB), (c) allocate the fusion buffer if not already cached, (d) copy fused tensors in, (e) run a single ring-allreduce, (f) copy results back to each output tensor.
Together with the rest of Michelangelo (data ingest, hyperparameter search, deployment), Tensor Fusion increases efficiency, end-to-end training speed, and ease of use for production teams at Uber.
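The fusion procedure (a) through (f) can be sketched in plain Python. This is a simplified model rather than Horovod's implementation: sizes are counted in elements instead of bytes, the collective is an in-process average standing in for ring-allreduce, and the names select_fusable and fused_allreduce as well as the tensor names are hypothetical.

```python
def select_fusable(pending, buffer_capacity):
    """Pick the leading pending tensors that share a dtype and fit in the buffer."""
    chosen, used = [], 0
    dtype = pending[0]["dtype"]
    for t in pending:
        if t["dtype"] != dtype or used + t["size"] > buffer_capacity:
            break
        chosen.append(t["name"])
        used += t["size"]
    # Assumption of this sketch: a tensor too large for the buffer is reduced on its own.
    return chosen or [pending[0]["name"]]


def fused_allreduce(workers, names):
    """Average the named tensors across workers using a single fused collective."""
    n = len(workers)
    # (d) copy each worker's selected tensors into one contiguous buffer
    buffers = [[v for name in names for v in w[name]] for w in workers]
    # (e) one collective over the whole buffer (stand-in for ring-allreduce)
    reduced = [sum(vals) / n for vals in zip(*buffers)]
    # (f) copy slices of the result back into every worker's output tensors
    for w in workers:
        offset = 0
        for name in names:
            length = len(w[name])
            w[name] = reduced[offset:offset + length]
            offset += length
    return 1  # number of collective calls issued for these tensors


pending = [
    {"name": "conv1/grad", "dtype": "float32", "size": 2},
    {"name": "conv2/grad", "dtype": "float32", "size": 3},
    {"name": "step", "dtype": "int64", "size": 1},
]
workers = [
    {"conv1/grad": [1.0, 2.0], "conv2/grad": [3.0, 4.0, 5.0]},
    {"conv1/grad": [3.0, 4.0], "conv2/grad": [5.0, 6.0, 7.0]},
]
names = select_fusable(pending, buffer_capacity=8)
calls = fused_allreduce(workers, names)
print(names, calls, workers[0])
```

Two gradient tensors are reduced with one collective call instead of two; the int64 tensor is left for a later cycle because it does not share the dtype of the tensors ahead of it.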

8. Horovod Benchmarks

Re-running the official TensorFlow benchmarks with Horovod (Figure 6) shows large scaling improvements: on 25 GbE TCP, Horovod reaches 88% scaling efficiency for Inception-V3 and ResNet-101 on 128 GPUs — approximately 2x faster than the standard distributed TensorFlow parameter-server baseline, which reached only ~50% efficiency.
A second benchmark sweep used 25 GbE RDMA-capable networking (InfiniBand or RoCE) to test whether bypassing the kernel TCP stack could push scaling further.
For Inception-V3 and ResNet-101, RDMA gave a 3-4% speedup over TCP, pushing scaling efficiency past 90% — a modest but real gain.
For VGG-16, RDMA gave a much larger ~30% speedup because VGG-16 has a high parameter count concentrated in fully-connected layers, creating a communication-heavy gradient that bottlenecks on TCP but not on RDMA.
The benchmark conclusion is that Horovod scales well on both TCP and RDMA, with RDMA delivering the largest benefit for parameter-heavy models like VGG-16; for compute-heavy convolutional models (Inception-V3, ResNet-101), TCP already comes within a few percent of RDMA.
Uber expressed intent to keep iterating on performance and to leverage community contributions — the project was deliberately released as open source to enable this.
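The headline numbers in this section can be related with a little arithmetic. The sketch assumes the common definition of scaling efficiency as measured throughput divided by N times single-GPU throughput (the summary does not spell out a formula), and the throughput values are hypothetical, chosen only to reproduce 88%.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Measured throughput on n_gpus relative to perfect linear scaling."""
    return throughput_n / (n_gpus * throughput_1)


def relative_speedup(efficiency_a, efficiency_b):
    """Throughput ratio of two systems on the same GPU count and model."""
    return efficiency_a / efficiency_b


# Hypothetical throughputs: 100 images/s on 1 GPU, 11264 images/s on 128 GPUs.
eff = scaling_efficiency(11264.0, 100.0, 128)
speedup = relative_speedup(0.88, 0.50)
print(eff, speedup)
```

At equal GPU count, 88% versus roughly 50% efficiency corresponds to a throughput ratio of about 1.76, which is the basis for the "approximately 2x" description.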

Quantitative summary (consolidated)

Benchmark	Network	Models	Scaling efficiency at 128 GPUs
Standard distributed TensorFlow	25 GbE TCP	Inception-V3, ResNet-101	~50% (baseline)
Horovod	25 GbE TCP	Inception-V3, ResNet-101	88%
Horovod	25 GbE RDMA (IB/RoCE)	Inception-V3, ResNet-101	>90% (+3-4% over TCP)
Horovod	25 GbE RDMA	VGG-16	+~30% vs. Horovod over TCP

Optimization	Reported benefit
NCCL 2 backend (vs. Baidu's MPI ring-allreduce)	enables cross-machine ring-allreduce, topology-aware
Tensor Fusion (default 64 MB buffer)	up to 65% speedup on layer-rich models on unoptimized TCP
Broadcast init hook	replaces hand-rolled parameter-sync code at step 0

9. Next steps

Future work item 1: simplify MPI installation on production clusters by publishing reference designs and partnering with networking-hardware vendors so that ring-allreduce is easier to deploy outside research labs.
Future work item 2: share Uber's emerging best practices for hyperparameter adjustment at scale — e.g., learning-rate warm-up and scaling rules of the kind Facebook used at 256 GPUs — so that practitioners can preserve accuracy while using the larger global batch sizes that distributed training affords.
Future work item 3: add examples for very large models that span multiple GPUs (model-parallelism cases where a single replica does not fit on one GPU); Uber explicitly invites community testing on such regimes.
The authors close by hoping Horovod's simplicity will let other teams better leverage their compute resources, and by inviting external feedback and contributions — Horovod is positioned as a community artifact, not a closed Uber project.

10. Tables and Figures (consolidated)

#	Caption (paraphrased)	Key content
Fig 1	Multi-GPU scaling using standard TensorFlow	~50% capacity loss at 128 GPUs, Inception-V3 and ResNet-101
Fig 2	Data-parallel training pattern	4-step loop: read, compute grads, average, update
Fig 3	Parameter-server architecture	1-PS vs. multi-PS configurations
Fig 4	Ring-allreduce algorithm	N nodes, 2 ⋅ (N−1) steps, bandwidth-optimal
Fig 5	Horovod Timeline screenshot	Chrome `about:tracing` view of distributed events
Fig 6	Horovod vs. standard distributed TensorFlow on 25 GbE TCP	88% scaling at 128 GPUs (Horovod) vs. ~50% (baseline)
Fig 7	Horovod TCP vs. Horovod RDMA on 25 GbE	+3-4% on Inception-V3 / ResNet-101; ~30% on VGG-16
Listing 1	Example TensorFlow program with Horovod	shows the 4-line modification budget

11. Cross-Cutting Empirical Take-Aways

Take-away	Derived from
Parameter-server is the wrong default for synchronous data-parallel SGD on >=128 GPUs	Sec. 2 (~50% efficiency)
Ring-allreduce + MPI launch eliminates almost all distributed-systems boilerplate	Sec. 3, 5
NCCL 2 is the right cross-machine ring-allreduce backend (vs. CUDA-aware MPI)	Sec. 4 (Step 2)
Many small allreduces are pathological for ring-allreduce	Sec. 7 (Tensor Fusion 65% gain)
Communication-heavy models (VGG-16) benefit most from RDMA; compute-heavy models (ResNet/Inception) benefit modestly	Sec. 8 (Fig. 7)
A 4-line user API is sufficient if MPI handles process launch	Sec. 5 (Listing 1)
Distributed profiling needs a unified cross-rank timeline, not per-rank traces	Sec. 6

12. Limitations of the Paper

Single-network-fabric measurement (25 GbE TCP and 25 GbE RDMA only); no 100 GbE / 100 Gb IB / NVLink-only or NVSwitch numbers are reported, so the conclusions about RDMA speedups are bounded to this fabric class.
All published efficiency numbers are on Pascal-class GPUs; Volta / Ampere / Hopper + NVLink/NVSwitch would shift the compute-vs-comm balance and likely change the regimes where RDMA matters.
Workloads are limited to three vision CNNs (Inception-V3, ResNet-101, VGG-16); no transformer / RNN / mixture-of-experts workloads are evaluated, although the underlying ring-allreduce is workload-agnostic.
Tensor Fusion uses a fixed default 64 MB buffer; the paper does not evaluate sensitivity to buffer size, fusion-eligibility heuristics, or per-tensor-type tuning.
Only synchronous data parallelism is considered; asynchronous, bounded-staleness, and pipeline-parallel regimes are out of scope.
The paper reports headline scaling-efficiency percentages but does not decompose them into compute vs. allreduce vs. data-input components, so the residual ~10% gap to ideal at RDMA is unexplained.
No comparison against alternative collective libraries available at the time (e.g., Gloo, IBM PowerAI DDL); positioning is only against parameter-server TensorFlow.

13. Open Problems Implicitly Surfaced

Adaptive Tensor Fusion. A static 64 MB buffer is unlikely to be optimal across the {fabric, GPU generation, model topology} cross-product; a runtime adaptive policy that picks fusion granularity per step is a natural follow-up.
Heterogeneous-fabric scaling. When some links are NVLink and others are 25 GbE TCP, the optimal ring ordering and chunk size are non-obvious; Horovod delegates this to NCCL but does not expose user-level controls.
Beyond synchronous SGD. Horovod's allreduce abstraction does not help large-batch convergence problems; pairing ring-allreduce with gradient compression, local SGD, or hierarchical synchronization is left to the user.
Model parallelism. Single-replica-too-large is explicitly listed as future work; Horovod itself does not provide pipeline or tensor-parallel primitives.
Operationalizing MPI. The effort of installing MPI on production clusters (Kubernetes-managed ones, for example) is called out as a real obstacle; orchestration tooling for collective frameworks is itself a research/engineering open problem.

14. Discussion of NCCL

NCCL 2 is the underlying transport for Horovod's ring-allreduce: the paper explicitly attributes the cross-machine optimization quality of the allreduce step to NCCL 2's topology-aware implementation rather than to Horovod itself.
The paper does not select NCCL algorithm (Ring vs. Tree) or protocol (LL/LL128/Simple) at the application level — those are NCCL-internal decisions; Horovod's design philosophy is to delegate the wire-level choice to NCCL and focus on the scheduling layer above it (Tensor Fusion, broadcast, timeline).
The Tensor Fusion design is essentially a compensating mechanism for the small-message inefficiency of ring-allreduce: fusion happens before the collective is dispatched to NCCL, so NCCL mostly sees large messages and operates in its bandwidth-bound regime.
This makes Horovod a useful reference point for what NCCL was expected to do well in 2018 (large-message ring-allreduce on RDMA fabrics) and what NCCL was expected to do poorly (many small collectives), informing where higher-layer scheduling is still required even today.
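The small-message inefficiency discussed in this section can be made concrete with a simplified latency-bandwidth cost model. The model is an assumption introduced here, not something the paper states: α is a fixed per-message latency, β is the transfer time per byte, N is the number of workers, and the cost of copying tensors into and out of the fusion buffer is ignored.

```latex
% One ring-allreduce of M bytes: 2(N-1) steps, each moving a chunk of M/N bytes
T_{\text{ring}}(M) = 2(N-1)\left(\alpha + \frac{M}{N}\,\beta\right)

% k separate tensors of m bytes each, reduced one call at a time
T_{\text{unfused}} = k \cdot T_{\text{ring}}(m) = 2(N-1)\left(k\,\alpha + \frac{k\,m}{N}\,\beta\right)

% The same k tensors fused into one buffer of k m bytes
T_{\text{fused}} = T_{\text{ring}}(k\,m) = 2(N-1)\left(\alpha + \frac{k\,m}{N}\,\beta\right)

% Saving from fusion: only the latency term changes
T_{\text{unfused}} - T_{\text{fused}} = 2(N-1)(k-1)\,\alpha
```

Under this model the bandwidth term is identical in both cases, so the entire saving comes from paying the per-message latency once instead of k times; the saving grows with the number of small tensors and with α, consistent with the largest reported gains appearing on unoptimized TCP networks.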

Note on NCCL Tuning

Horovod's Tensor Fusion is a direct example of why the post-fusion message size, rather than the user's per-tensor sizes, is the right input to any NCCL configuration choice. The up-to-65% TCP improvement from fusing many tiny gradients into a single buffer (64 MB by default) indicates that small-message allreduce calls are latency-bound: per-call setup overhead dominates and the link bandwidth goes largely unused, which is exactly the regime where the LL/LL128 protocols and a lower nChannels would help if fusion were not applied. Conversely, once Horovod fuses to a large buffer, NCCL's Ring + Simple at higher channel counts becomes the more suitable choice, so any tuner operating beneath Horovod must read the fused message size, not the original tensor sizes, to pick the correct configuration.