Horovod: fast and easy distributed deep learning in TensorFlow — Detailed Summary
Alexander Sergeev, Mike Del Balso | Uber Technologies, Inc. | arXiv:1802.05799, February 2018
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- Modern deep-learning models — especially those at the frontier of image, speech, and natural-language processing — require very large amounts of GPU compute, and a single GPU is no longer sufficient for production-scale training in reasonable wall-clock time.
- Naively scaling TensorFlow to many GPUs introduces two interlocking obstacles: high inter-GPU communication overhead at synchronization points, and intrusive code-restructuring requirements that discourage researchers from migrating their existing single-GPU scripts.
- The paper introduces Horovod, an open-source library (Apache 2.0) built on ring-allreduce plus an MPI launcher; the design goal is near-linear strong-scaling efficiency and a four-line code-modification budget for an existing single-GPU TensorFlow program.
1. Introduction
- Deep learning has produced dramatic accuracy gains across image recognition, speech recognition, and forecasting; at Uber, these capabilities power self-driving research, ETA / trip forecasting, and fraud-prevention pipelines, motivating an internal investment in faster training.
- Uber standardizes on TensorFlow because of its broad popularity, competitive performance, support for high-level APIs (notably Keras), and the ability to drop down to custom CUDA kernels when a model needs it — these properties are taken as fixed constraints for any distributed- training solution.
- Horovod was open-sourced in September 2017 as a component of Uber's Michelangelo ML-as-a-service platform, with the explicit aim of making distributed deep-learning training accessible to ML practitioners across the company without requiring distributed-systems expertise.
2. Going distributed
- As Uber's models and datasets grew, single-GPU training times stretched from hours to a week or longer, which became incompatible with the iteration cadence the research and engineering teams required.
- The team's first approach was the standard distributed TensorFlow package (parameter-server based); experiments revealed two qualitative problems — code complexity and poor scaling — that together motivated a ground-up redesign.
- Code complexity: the standard API forced users to learn an
unfamiliar vocabulary including
tf.Server(),tf.ClusterSpec(),tf.train.SyncReplicasOptimizer(), andtf.train.replicas_device_setter(), along with explicit roles ("worker" vs. "parameter server"), which together constituted a steep on-ramp for researchers whose only goal was to speed up an existing model. - These distributed primitives also introduced hard-to-diagnose bugs — a misconfigured cluster spec or a mis-placed device placement could fail silently or hang, so debugging time grew rather than shrank.
- Poor scaling: benchmarks of the standard package on 128 GPUs of NVIDIA Pascal-class hardware showed that approximately half of the available compute (~50%) was lost to communication overhead.
- Figure 1 plots scaling for Inception-V3 and ResNet-101 on 1, 4, 8, 16, 32, 64, and 128 GPUs and shows that standard distributed TensorFlow diverges sharply from the ideal-linear curve as GPU count grows, losing almost 50% of capacity at 128 GPUs.
- Facebook's 2017 result of training ResNet-50 on 256 GPUs in one hour using data-parallel SGD with learning-rate scaling provided concrete evidence that — with the right communication algorithm and hyperparameter schedule — strong scaling well beyond Uber's then-current reach was achievable, motivating the ring-allreduce direction.
3. Leveraging a different type of algorithm
- Uber sought a distributed-training pattern aimed specifically at models that fit on a single GPU or a single multi-GPU server — the regime that covers the vast majority of production deep-learning models — and adopted Facebook's data-parallel paradigm.
- The data-parallel paradigm replicates the training script on each worker; each worker independently (1) reads a shard of data, (2) computes gradients on its mini-batch, (3) averages gradients across all workers, (4) updates the local model copy, and (5) loops — with the averaging step being the only inter-worker synchronization.
- Standard TensorFlow implements gradient averaging via a parameter-server (PS) architecture: workers compute gradients, push them to one or more PS processes that hold the master parameters, and pull back updated weights for the next step.
- The PS approach exposed two challenges: (a) choosing a workable worker-to-PS ratio is non-obvious — too few PSes saturate the network at the server side, too many creates "all-to-all" cross-traffic that also saturates; (b) it forces users to handle distribution mechanics in user code.
- In particular, users had to explicitly start worker and PS processes, thread host/port information through their scripts, and restructure models with multi-GPU "towers" — distractions from the actual modeling work.
- In early 2017, Baidu evangelized the ring-allreduce algorithm — drawing on Patarasuk and Yuan (2009) — and shared a draft TensorFlow implementation showing the algorithm's promise as a drop-in replacement for parameter-server gradient averaging.
- Ring-allreduce (illustrated in Figure 4) arranges the N workers into a logical ring; each worker communicates only with its two neighbors and performs 2 ⋅ (N−1) communication steps to complete one allreduce — the algorithm is bandwidth-optimal in that the data volume each link carries is independent of N (each link carries approximately 2 ⋅ M/N bytes of useful traffic per step for a message of total size M).
- The allreduce approach is significantly easier to operationalize than PS because mature MPI implementations (e.g., Open MPI) transparently set up the distributed infrastructure, hostfile, and process launch — eliminating the cluster-spec boilerplate that PS required.
4. Introducing Horovod
- Uber built its own implementation on top of Baidu's ring-allreduce draft; the goal was to bring it up to production-grade usability and to improve its raw collective performance for Uber's workloads.
- Step 1 — Standalone Python package. Horovod was packaged as a standalone pip-installable Python module (named after a Russian folk dance for grouped synchronized movement), reducing installation from the Baidu draft's roughly one hour of TensorFlow-source patching to minutes, and decoupling Horovod's release cadence from TensorFlow's.
- Step 2 — NCCL 2 backend. The original CUDA-aware MPI ring-allreduce was replaced with NVIDIA NCCL 2, which provides a highly optimized ring-allreduce that works across multiple machines (not only intra-node) and exploits NVLink / PCIe / InfiniBand topologies automatically.
- Step 3 — Multi-GPU server support. The implementation was extended beyond Baidu's original single-GPU-per-node assumption to support models that fit on a single server with multiple GPUs (e.g., a DGX-style 4- or 8-GPU box), so practitioners could begin scaling within a node before reaching across nodes.
- Step 4 — API improvements. A broadcast operation was added so that all workers initialize from the same model parameters at step zero, and the user-visible API was reduced so that an existing single-GPU TensorFlow training script needs only four distinct modifications (init, GPU pinning, optimizer wrap, variable broadcast hook) to become a distributed Horovod program.
5. Distributing your training job with Horovod
- Unlike the parameter-server paradigm, which required hundreds of lines of boilerplate, a Horovod-converted training script requires only a handful of lines, demonstrated in Listing 1 of the paper.
- The four required modifications are: (1) call
hvd.init()at program start, (2) pin each process to one GPU viaconfig.gpu_options.visible_device_list = str(hvd.local_rank()), (3) wrap the user's optimizer withhvd.DistributedOptimizer(opt)so that gradients are averaged via ring-allreduce onapply_gradients, and (4) installhvd.BroadcastGlobalVariablesHook(0)so that rank-0's initial parameters are broadcast to all other ranks. - Programs are launched via MPI, e.g.
mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py, which automatically spawns one process per GPU across the listed hosts — no application-leveltf.ClusterSpec()is needed. - Horovod also supports Keras programs through the same four hooks, making the same approach viable for users working at the Keras-API level rather than directly in TensorFlow.
6. Horovod Timeline
- Distributed deep-learning is notoriously hard to debug because each rank's local TensorFlow timeline and CUDA Profiler trace must be manually cross-referenced across servers to understand why a step is slow — Uber found this workflow inadequate at scale.
- Horovod Timeline is a built-in profiling tool that
emits a unified trace consumable by Chrome's
about:tracingviewer; it is enabled by setting a single environment variable, and it reveals exactly which rank is doing what at each timestep, including allreduce starts, completions, and idle/wait periods. Figure 5 in the paper depicts a representative trace.
7. Tensor Fusion
- Profiling with Horovod Timeline revealed that models with many layers — ResNet-101 in particular — emit a large number of very small allreduce calls (one per layer-gradient), and ring-allreduce is inefficient on such tiny messages because the per-call setup overhead dominates the actual bandwidth-bound transfer.
- The Tensor Fusion algorithm addresses this by fusing many small ready tensors into a single contiguous buffer, performing one ring-allreduce on the buffer, and copying the reduced results back into the original output tensors — amortizing the per-call overhead across all fused tensors.
- Tensor Fusion produced up to 65% performance improvement on layer-rich models (e.g., ResNet-101) when running on unoptimized TCP networks (where the per-call overhead is largest).
- The Fusion procedure each step is: (a) inspect the queue of pending allreduces, (b) select the subset of tensors that share a dtype and fit within a configurable buffer (default 64 MB), (c) allocate the fusion buffer if not already cached, (d) copy fused tensors in, (e) run a single ring-allreduce, (f) copy results back to each output tensor.
- Together with the rest of Michelangelo (data ingest, hyperparameter search, deployment), Tensor Fusion increases efficiency, end-to-end training speed, and ease of use for production teams at Uber.
8. Horovod Benchmarks
- Re-running the official TensorFlow benchmarks with Horovod (Figure 6) shows large scaling improvements: on 25 GbE TCP, Horovod reaches 88% scaling efficiency for Inception-V3 and ResNet-101 on 128 GPUs — approximately 2x faster than the standard distributed TensorFlow parameter-server baseline, which reached only ~50% efficiency.
- A second benchmark sweep used 25 GbE RDMA-capable networking (InfiniBand or RoCE) to test whether bypassing the kernel TCP stack could push scaling further.
- For Inception-V3 and ResNet-101, RDMA gave a 3-4% speedup over TCP, pushing scaling efficiency past 90% — a modest but real gain.
- For VGG-16, RDMA gave a much larger ~30% speedup because VGG-16 has a high parameter count concentrated in fully-connected layers, creating a communication-heavy gradient that bottlenecks on TCP but not on RDMA.
- The benchmark conclusion is that Horovod scales well on both TCP and RDMA, with RDMA delivering the largest benefit for parameter-heavy models like VGG-16; for compute-heavy convolutional models (Inception-V3, ResNet-101), TCP is already close to enough.
- Uber expressed intent to keep iterating on performance and to leverage community contributions — the project was deliberately released as open source to enable this.
Quantitative summary (consolidated)
| Benchmark | Network | Models | Scaling efficiency at 128 GPUs |
|---|---|---|---|
| Standard distributed TensorFlow | 25 GbE TCP | Inception-V3, ResNet-101 | ~50% (baseline) |
| Horovod | 25 GbE TCP | Inception-V3, ResNet-101 | 88% |
| Horovod | 25 GbE RDMA (IB/RoCE) | Inception-V3, ResNet-101 | >90% (+3-4% over TCP) |
| Horovod | 25 GbE RDMA | VGG-16 | +~30% vs. Horovod over TCP |
| Optimization | Reported benefit |
|---|---|
| NCCL 2 backend (vs. Baidu's MPI ring-allreduce) | enables cross-machine ring-allreduce, topology-aware |
| Tensor Fusion (default 64 MB buffer) | up to 65% speedup on layer-rich models on unoptimized TCP |
| Broadcast init hook | replaces hand-rolled parameter-sync code at step 0 |
9. Next steps
- Future work item 1: simplify MPI installation on production clusters by publishing reference designs and partnering with networking-hardware vendors so that ring-allreduce is easier to deploy outside research labs.
- Future work item 2: share Uber's emerging best practices for hyperparameter adjustment at scale — e.g., learning-rate warm-up and scaling rules of the kind Facebook used at 256 GPUs — so that practitioners can preserve accuracy while using the larger global batch sizes that distributed training affords.
- Future work item 3: add examples for very large models that span multiple GPUs (model-parallelism cases where a single replica does not fit on one GPU); Uber explicitly invites community testing on such regimes.
- The authors close by hoping Horovod's simplicity will let other teams better leverage their compute resources, and by inviting external feedback and contributions — Horovod is positioned as a community artifact, not a closed Uber project.
10. Tables and Figures (consolidated)
| # | Caption (paraphrased) | Key content |
|---|---|---|
| Fig 1 | Multi-GPU scaling using standard TensorFlow | ~50% capacity loss at 128 GPUs, Inception-V3 and ResNet-101 |
| Fig 2 | Data-parallel training pattern | 4-step loop: read, compute grads, average, update |
| Fig 3 | Parameter-server architecture | 1-PS vs. multi-PS configurations |
| Fig 4 | Ring-allreduce algorithm | N nodes, 2 ⋅ (N−1) steps, bandwidth-optimal |
| Fig 5 | Horovod Timeline screenshot | Chrome about:tracing view of distributed events |
| Fig 6 | Horovod vs. standard distributed TensorFlow on 25 GbE TCP | 88% scaling at 128 GPUs (Horovod) vs. ~50% (baseline) |
| Fig 7 | Horovod TCP vs. Horovod RDMA on 25 GbE | +3-4% on Inception-V3 / ResNet-101; ~30% on VGG-16 |
| Listing 1 | Example TensorFlow program with Horovod | shows the 4-line modification budget |
11. Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| Parameter-server is the wrong default for synchronous data-parallel SGD on >=128 GPUs | Sec. 2 (~50% efficiency) |
| Ring-allreduce + MPI launch eliminates almost all distributed-systems boilerplate | Sec. 3, 5 |
| NCCL 2 is the right cross-machine ring-allreduce backend (vs. CUDA-aware MPI) | Sec. 4 (Step 2) |
| Many small allreduces are pathological for ring-allreduce | Sec. 7 (Tensor Fusion 65% gain) |
| Communication-heavy models (VGG-16) benefit most from RDMA; compute-heavy models (ResNet/Inception) benefit modestly | Sec. 8 (Fig. 7) |
| A 4-line user API is sufficient if MPI handles process launch | Sec. 5 (Listing 1) |
| Distributed profiling needs a unified cross-rank timeline, not per-rank traces | Sec. 6 |
12. Limitations of the Paper
- Single-network-fabric measurement (25 GbE TCP and 25 GbE RDMA only); no 100 GbE / 100 Gb IB / NVLink-only or NVSwitch numbers are reported, so the conclusions about RDMA speedups are bounded to this fabric class.
- All published efficiency numbers are on Pascal-class GPUs; Volta / Ampere / Hopper + NVLink/NVSwitch would shift the compute-vs-comm balance and likely change the regimes where RDMA matters.
- Workloads are limited to three vision CNNs (Inception-V3, ResNet-101, VGG-16); no transformer / RNN / mixture-of-experts workloads are evaluated, although the underlying ring-allreduce is workload-agnostic.
- Tensor Fusion uses a fixed default 64 MB buffer; the paper does not evaluate sensitivity to buffer size, fusion-eligibility heuristics, or per-tensor-type tuning.
- Only synchronous data parallelism is considered; asynchronous, bounded-staleness, and pipeline-parallel regimes are out of scope.
- The paper reports headline scaling-efficiency percentages but does not decompose them into compute vs. allreduce vs. data-input components, so the residual ~10% gap to ideal at RDMA is unexplained.
- No comparison against alternative collective libraries available at the time (e.g., Gloo, IBM PowerAI DDL); positioning is only against parameter-server TensorFlow.
13. Open Problems Implicitly Surfaced
- Adaptive Tensor Fusion. A static 64 MB buffer is unlikely to be optimal across the {fabric, GPU generation, model topology} cross- product; a runtime adaptive policy that picks fusion granularity per step is a natural follow-up.
- Heterogeneous-fabric scaling. When some links are NVLink and others are 25 GbE TCP, the optimal ring ordering and chunk size are non-obvious — Horovod delegates this to NCCL but does not expose user- level controls.
- Beyond synchronous SGD. Horovod's allreduce abstraction does not help large-batch convergence problems; pairing ring-allreduce with gradient compression, local SGD, or hierarchical synchronization is left to the user.
- Model parallelism. Single-replica-too-large is explicitly listed as future work; Horovod itself does not provide pipeline or tensor- parallel primitives.
- Operationalizing MPI. Ease of MPI install on production Kubernetes clusters is called out as a real obstacle — orchestration tooling for collective frameworks is itself a research/engineering open problem.
14. Discussion of NCCL
- NCCL 2 is the underlying transport for Horovod's ring-allreduce: the paper explicitly attributes the cross-machine optimization quality of the allreduce step to NCCL 2's topology-aware implementation rather than to Horovod itself.
- The paper does not select NCCL algorithm (Ring vs. Tree) or protocol (LL/LL128/Simple) at the application level — those are NCCL-internal decisions; Horovod's design philosophy is to delegate the wire-level choice to NCCL and focus on the scheduling layer above it (Tensor Fusion, broadcast, timeline).
- The Tensor Fusion design is essentially a compensating mechanism for NCCL's small-message inefficiency — fusion happens before the collective is dispatched to NCCL, so NCCL always sees a large message and operates in its bandwidth-bound regime.
- This makes Horovod a useful reference point for what NCCL was expected to do well in 2018 (large-message ring-allreduce on RDMA fabrics) and what NCCL was expected to do poorly (many small collectives), informing where higher-layer scheduling is still required even today.
Note on NCCL Tuning
Horovod's Tensor Fusion is a direct example of why the post-fusion message size — rather than the user's per-tensor sizes — is the right input to any NCCL configuration choice. The 65% TCP improvement from fusing many tiny gradients into a single 64 MB buffer establishes that small-message NCCL calls are bandwidth-starved by setup overhead, exactly the regime where LL/LL128 protocols and lower nChannels would help if fusion were not applied. Conversely, once Horovod fuses to a large buffer, NCCL's Ring + Simple at higher channel counts becomes the right choice — so any tuner operating beneath Horovod must read the fused message size, not the original tensor sizes, to pick the correct configuration.