Horovod: fast and easy distributed deep learning in TensorFlow
Alexander Sergeev, Mike Del Balso | Uber Technologies, Inc. | arXiv:1802.05799, February 2018
Problem
Modern deep-learning workloads — image, speech, NLP — outgrow a
single GPU, but naively scaling TensorFlow to many GPUs hits two
simultaneous walls. (1) The standard distributed TensorFlow package is
parameter-server based and lost approximately
half (~50%) of available compute to communication
overhead at 128 Pascal GPUs on Inception-V3 and ResNet-101. (2)
The PS API forces researchers to absorb a steep distributed-systems
vocabulary — tf.Server, tf.ClusterSpec,
tf.train.SyncReplicasOptimizer,
tf.train.replicas_device_setter, plus worker/PS role
bookkeeping — which produces hard-to-diagnose bugs and discourages
migration of existing single-GPU scripts. As Uber's training times grew
from hours to a week or longer, both the performance loss and the
integration friction became blockers.
Core Insight
Replacing parameter-server gradient averaging with bandwidth-optimal ring-allreduce (Patarasuk and Yuan, 2009; popularized for TensorFlow by Baidu in early 2017) and delegating process launch to a mature MPI runtime lets a four-line user-API absorb all distributed-systems boilerplate while doubling end-to-end scaling efficiency. The remaining small-message inefficiency of ring-allreduce is then patched at the scheduling layer above NCCL via Tensor Fusion, rather than inside the collective itself.
Method
Horovod is a standalone pip-installable Python library (Apache 2.0, released September 2017) that replaces TensorFlow's parameter-server gradient averaging with a ring-allreduce backend, layered as follows:
+-----------------------------------------------------------------+
| User TensorFlow / Keras script (4-line modification budget) |
| 1. hvd.init() |
| 2. config.gpu_options.visible_device_list = hvd.local_rank() |
| 3. opt = hvd.DistributedOptimizer(opt) |
| 4. hvd.BroadcastGlobalVariablesHook(0) |
+-----------------------------------------------------------------+
| Horovod scheduling layer |
| - Tensor Fusion (default 64 MB buffer; same dtype) |
| - Broadcast init |
| - Horovod Timeline (Chrome about:tracing) |
+-----------------------------------------------------------------+
| NCCL 2 ring-allreduce (intra-node + inter-node, topology-aware) |
+-----------------------------------------------------------------+
| MPI runtime (Open MPI): mpirun -np N -H host1:k,host2:k ... |
+-----------------------------------------------------------------+
| 25 GbE TCP / 25 GbE RDMA (InfiniBand or RoCE) |
+-----------------------------------------------------------------+
Ring-allreduce arranges N workers into a logical ring, each communicating only with its two neighbors; one allreduce of message size M takes 2 ⋅ (N−1) steps and is bandwidth-optimal (each link carries approximately 2 ⋅ M/N useful bytes per step independent of N). Tensor Fusion concatenates pending small same-dtype tensors into a single buffer (default 64 MB), runs one allreduce on the buffer, and copies results back — amortizing per-call setup over many gradients.
Experimental Setup
| Component | Value |
|---|---|
| GPUs | NVIDIA Pascal-class, scaled to 128 GPUs |
| Inter-node fabric | 25 GbE TCP and 25 GbE RDMA (InfiniBand or RoCE) |
| Framework | TensorFlow (with Keras support) |
| Collective backend | NVIDIA NCCL 2 (replaces Baidu's CUDA-aware MPI) |
| Process launch | Open MPI (mpirun -np ... -H ...) |
| Workloads | Inception-V3, ResNet-101, VGG-16 |
| Baseline | Standard distributed TensorFlow (parameter-server) |
| Metric | Images/sec; scaling efficiency vs. ideal-linear |
| Profiler | Horovod Timeline (Chrome about:tracing) |
Headline Quantitative Results
| Configuration | Network | Models | Scaling efficiency at 128 GPUs |
|---|---|---|---|
| Standard distributed TensorFlow | 25 GbE TCP | Inception-V3, ResNet-101 | ~50% (baseline) |
| Horovod | 25 GbE TCP | Inception-V3, ResNet-101 | 88% (~2x baseline) |
| Horovod | 25 GbE RDMA | Inception-V3, ResNet-101 | >90% (+3-4% over TCP) |
| Horovod | 25 GbE RDMA | VGG-16 | +~30% vs. Horovod over TCP |
Tensor Fusion contributed up to 65% performance improvement on layer-rich models (e.g., ResNet-101) on unoptimized TCP networks, where per-call setup overhead is largest. RDMA's incremental benefit was small (3-4%) for compute-heavy convolutional models but large (~30%) for VGG-16, which is parameter-heavy due to its fully-connected layers and therefore communication-bound on TCP.
Limitations
The benchmarks span a single fabric class (25 GbE TCP and 25 GbE RDMA only) — no 100 GbE, 100 Gb IB, NVLink, or NVSwitch numbers are reported, so RDMA-speedup conclusions are bounded to that class. All efficiency numbers are on Pascal-class GPUs, and the workload set is three vision CNNs (Inception-V3, ResNet-101, VGG-16) — transformers, RNNs, and MoE models are not evaluated. Tensor Fusion uses a fixed 64 MB buffer with no sensitivity sweep. Only synchronous data parallelism is considered; asynchronous, bounded-staleness, and pipeline-parallel regimes are out of scope. Headline efficiencies are not decomposed into compute vs. allreduce vs. data-input, so the residual ~10% gap to ideal at RDMA is unexplained. No comparison is made against contemporaneous collective libraries (e.g., Gloo, IBM PowerAI DDL); positioning is only against parameter-server TensorFlow. Model parallelism is left as future work.
Open Problems
- Adaptive Tensor Fusion — a static 64 MB buffer is unlikely to be optimal across the {fabric, GPU generation, model topology} product; a runtime adaptive fusion-granularity policy is a natural follow-up.
- Heterogeneous-fabric scaling — when some links are NVLink and others are 25 GbE TCP, optimal ring ordering and chunk size are non-obvious; Horovod delegates this to NCCL with no user-level controls.
- Beyond synchronous SGD — pairing ring-allreduce with gradient compression, local SGD, or hierarchical synchronization is left to the user; Horovod's allreduce abstraction does not address large-batch convergence.
- Model parallelism — Horovod provides no pipeline or tensor- parallel primitives; very-large-model support is explicitly future work.
- Operationalizing MPI — ease of MPI installation on production Kubernetes clusters is itself a research/engineering open problem called out by the authors.
Note on NCCL Tuning
Horovod's Tensor Fusion is a direct example of why the post-fusion message size — not the user's per-tensor sizes — is the right input to any NCCL configuration choice. The 65% TCP improvement from fusing many tiny gradients into a single 64 MB buffer establishes that small-message NCCL calls are bandwidth-starved by setup overhead, the regime where LL/LL128 protocols and lower nChannels would help if fusion were not applied. Once Horovod fuses to a large buffer, NCCL's Ring + Simple at higher channel counts becomes the right choice. Any tuner operating beneath Horovod must therefore observe the fused message size, not the original tensor sizes, to pick the correct configuration.