Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev, Mike Del Balso | Uber Technologies, Inc. | arXiv:1802.05799, February 2018

Problem

Modern deep-learning workloads (image, speech, NLP) outgrow a single GPU, but naively scaling TensorFlow to many GPUs runs into two problems at once. (1) The standard distributed TensorFlow package is parameter-server (PS) based and lost roughly half of the available compute to communication overhead at 128 Pascal GPUs on Inception-V3 and ResNet-101. (2) The PS API forces researchers to absorb a steep distributed-systems vocabulary (tf.Server, tf.ClusterSpec, tf.train.SyncReplicasOptimizer, tf.train.replica_device_setter, plus worker/PS role bookkeeping), which produces hard-to-diagnose bugs and discourages migration of existing single-GPU scripts. As Uber's training times grew from hours to a week or longer, both the performance loss and the integration friction became blockers.

Core Insight

Replacing parameter-server gradient averaging with bandwidth-optimal ring-allreduce (Patarasuk and Yuan, 2009; popularized for TensorFlow by Baidu in early 2017) and delegating process launch to a mature MPI runtime lets a four-line user API absorb all distributed-systems boilerplate while raising scaling efficiency at 128 GPUs from roughly 50% to 88% on TCP. The remaining small-message inefficiency of ring-allreduce is then addressed at the scheduling layer above NCCL via Tensor Fusion, rather than inside the collective itself.

Method

Horovod is a standalone pip-installable Python library (Apache 2.0, released September 2017) that replaces TensorFlow's parameter-server gradient averaging with a ring-allreduce backend, layered as follows:

+-----------------------------------------------------------------+
| User TensorFlow / Keras script (4-line modification budget)     |
|   1. hvd.init()                                                 |
|   2. config.gpu_options.visible_device_list = hvd.local_rank()  |
|   3. opt = hvd.DistributedOptimizer(opt)                        |
|   4. hvd.BroadcastGlobalVariablesHook(0)                        |
+-----------------------------------------------------------------+
| Horovod scheduling layer                                        |
|   - Tensor Fusion (default 64 MB buffer; same dtype)            |
|   - Broadcast init                                              |
|   - Horovod Timeline (Chrome about:tracing)                     |
+-----------------------------------------------------------------+
| NCCL 2 ring-allreduce (intra-node + inter-node, topology-aware) |
+-----------------------------------------------------------------+
| MPI runtime (Open MPI): mpirun -np N -H host1:k,host2:k ...     |
+-----------------------------------------------------------------+
| 25 GbE TCP  /  25 GbE RDMA (InfiniBand or RoCE)                 |
+-----------------------------------------------------------------+
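
The four user-facing changes at the top of the stack can be sketched as follows. This is a minimal sketch assuming the TensorFlow 1.x API and the `horovod.tensorflow` module; the helper name `build_horovod_training`, its `loss_fn` argument, and the choice of Adagrad are hypothetical and only illustrate where the four lines go. The imports sit inside the function so that the definition loads even on a machine without TensorFlow or Horovod installed. Note that `visible_device_list` expects a string, so the local rank is wrapped in `str()`.

```python
def build_horovod_training(loss_fn, learning_rate=0.01):
    """Hypothetical helper showing the four Horovod changes to a TF 1.x script.

    loss_fn is an assumed user-supplied callable that builds the model
    and returns a scalar loss tensor.
    """
    import tensorflow as tf            # TensorFlow 1.x API assumed
    import horovod.tensorflow as hvd

    # 1. Initialize Horovod (one process per GPU, started by mpirun).
    hvd.init()

    # 2. Pin this process to a single GPU, chosen by its rank on the host.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    loss = loss_fn()
    opt = tf.train.AdagradOptimizer(learning_rate)

    # 3. Wrap the optimizer so gradients are averaged with allreduce.
    opt = hvd.DistributedOptimizer(opt)
    train_op = opt.minimize(loss)

    # 4. Broadcast initial variables from rank 0 so all workers start equal.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]

    return train_op, config, hooks
```

The returned `config` and `hooks` would then be passed to `tf.train.MonitoredTrainingSession(config=config, hooks=hooks)`, and the script launched with an MPI command of the form shown in the stack, for example `mpirun -np 4 python train.py` (script name and process count are placeholders).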

Ring-allreduce arranges N workers into a logical ring, each communicating only with its two neighbors. One allreduce of message size M takes 2 ⋅ (N−1) steps, and each worker sends M/N bytes per step, so each worker sends 2 ⋅ M ⋅ (N−1)/N bytes in total. That total stays below 2 ⋅ M regardless of N, which is what makes the algorithm bandwidth-optimal. Tensor Fusion concatenates pending small same-dtype tensors into a single buffer (default 64 MB), runs one allreduce on the buffer, and copies the results back, amortizing per-call setup over many gradients.
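
To make the step count and per-worker traffic concrete, here is a small single-process simulation of a sum ring-allreduce. It is an illustrative sketch, not Horovod or NCCL code: workers are plain Python lists, "sending" is a list copy, and the function name `ring_allreduce` and the requirement that the vector length be divisible by the worker count are simplifying assumptions.

```python
def ring_allreduce(worker_data):
    """Simulate a sum ring-allreduce over equal-length vectors, one per worker.

    Returns (result_per_worker, steps, elements_sent_per_worker).
    """
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    data = [list(v) for v in worker_data]  # copy so inputs stay untouched
    steps = 0
    sent = 0

    def span(c):
        # Element indices covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: after N-1 steps worker r holds the fully
    # reduced chunk (r + 1) mod N.
    for s in range(n - 1):
        # Snapshot outgoing chunks first so all sends happen "simultaneously".
        outgoing = [[data[r][i] for i in span((r - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r - s) % n)):
                data[dst][i] += outgoing[r][k]
        steps += 1
        sent += chunk

    # Phase 2, allgather: circulate the reduced chunks around the ring.
    for s in range(n - 1):
        outgoing = [[data[r][i] for i in span((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r + 1 - s) % n)):
                data[dst][i] = outgoing[r][k]
        steps += 1
        sent += chunk

    return data, steps, sent


# Four simulated workers, each holding an 8-element "gradient".
workers = [[(r + 1) * (i + 1) for i in range(8)] for r in range(4)]
result, steps, sent = ring_allreduce(workers)
print("steps:", steps, "elements sent per worker:", sent)
print("worker 0 result:", result[0])
```

With N workers and M elements, the simulation performs 2 ⋅ (N−1) steps and each worker sends 2 ⋅ (N−1) ⋅ M/N elements, matching the counts given above. Horovod averages gradients, which corresponds to dividing the summed result by N.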

Experimental Setup

Component	Value
GPUs	NVIDIA Pascal-class, scaled to 128 GPUs
Inter-node fabric	25 GbE TCP and 25 GbE RDMA (InfiniBand or RoCE)
Framework	TensorFlow (with Keras support)
Collective backend	NVIDIA NCCL 2 (replaces Baidu's CUDA-aware MPI)
Process launch	Open MPI (`mpirun -np ... -H ...`)
Workloads	Inception-V3, ResNet-101, VGG-16
Baseline	Standard distributed TensorFlow (parameter-server)
Metric	Images/sec; scaling efficiency vs. ideal-linear
Profiler	Horovod Timeline (Chrome `about:tracing`)
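
The scaling-efficiency metric in the table can be written out as a short function. This is a sketch of the usual definition (measured throughput divided by ideal linear scaling of single-GPU throughput). The function name and the throughput numbers are hypothetical and were chosen only so that the result lands on 88%; they are not measurements from the paper.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Measured images/sec on n_gpus divided by ideal linear scaling of one GPU."""
    if n_gpus <= 0 or throughput_1 <= 0:
        raise ValueError("n_gpus and throughput_1 must be positive")
    return throughput_n / (n_gpus * throughput_1)


# Hypothetical throughputs (images/sec), not figures from the paper.
single_gpu = 100.0
at_128_gpus = 11264.0
eff = scaling_efficiency(at_128_gpus, single_gpu, 128)
print(f"scaling efficiency at 128 GPUs: {eff:.0%}")
```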

Headline Quantitative Results

Configuration	Network	Models	Scaling efficiency at 128 GPUs
Standard distributed TensorFlow	25 GbE TCP	Inception-V3, ResNet-101	~50% (baseline)
Horovod	25 GbE TCP	Inception-V3, ResNet-101	88% (~2x baseline)
Horovod	25 GbE RDMA	Inception-V3, ResNet-101	>90% (+3-4% over TCP)
Horovod	25 GbE RDMA	VGG-16	~30% faster than Horovod over TCP (relative speedup, not an efficiency)

Tensor Fusion yielded up to a 65% performance improvement on models with many layers (e.g., ResNet-101) on unoptimized TCP networks, where per-call setup overhead is largest. RDMA's incremental benefit was small (3-4%) for compute-heavy convolutional models but large (~30%) for VGG-16, which is parameter-heavy due to its fully-connected layers and therefore communication-bound on TCP.
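
The batching behind Tensor Fusion can be sketched as a greedy packing of ready tensors into a fixed-size buffer, with one allreduce call per batch. The code below is a simplified illustration, not Horovod's scheduler. `ReadyTensor` and `plan_fusion` are hypothetical names, a tensor larger than the buffer is assumed to go out alone, and tensors of a different dtype are simply deferred to a later batch.

```python
from collections import namedtuple

# Hypothetical stand-in for a gradient tensor that is ready to be reduced.
ReadyTensor = namedtuple("ReadyTensor", ["name", "dtype", "nbytes"])

MB = 1024 * 1024
DEFAULT_BUFFER_BYTES = 64 * MB  # default fusion buffer size cited in this summary


def plan_fusion(ready, buffer_bytes=DEFAULT_BUFFER_BYTES):
    """Group ready tensors into batches; each batch maps to one allreduce call."""
    batches = []
    pending = list(ready)
    while pending:
        head = pending[0]
        # The head tensor always starts a batch, even if it exceeds the buffer.
        batch, used, leftover = [head], head.nbytes, []
        for t in pending[1:]:
            if t.dtype == head.dtype and used + t.nbytes <= buffer_bytes:
                batch.append(t)
                used += t.nbytes
            else:
                leftover.append(t)  # wrong dtype or no room: try a later batch
        batches.append(batch)
        pending = leftover
    return batches


grads = [
    ReadyTensor("a", "float32", 16 * MB),
    ReadyTensor("b", "float32", 16 * MB),
    ReadyTensor("h", "float16", 1 * MB),
    ReadyTensor("c", "float32", 16 * MB),
    ReadyTensor("d", "float32", 20 * MB),
    ReadyTensor("e", "float32", 20 * MB),
]
plan = plan_fusion(grads)
print(len(grads), "tensors ->", len(plan), "allreduce calls")
for batch in plan:
    print([t.name for t in batch])
```

Models with many small gradients benefit most because the number of collective calls, and with it the per-call setup cost, drops from one per tensor to one per batch.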

Limitations

The benchmarks span a single fabric class (25 GbE TCP and 25 GbE RDMA only) — no 100 GbE, 100 Gb IB, NVLink, or NVSwitch numbers are reported, so RDMA-speedup conclusions are bounded to that class. All efficiency numbers are on Pascal-class GPUs, and the workload set is three vision CNNs (Inception-V3, ResNet-101, VGG-16) — transformers, RNNs, and MoE models are not evaluated. Tensor Fusion uses a fixed 64 MB buffer with no sensitivity sweep. Only synchronous data parallelism is considered; asynchronous, bounded-staleness, and pipeline-parallel regimes are out of scope. Headline efficiencies are not decomposed into compute vs. allreduce vs. data-input, so the residual ~10% gap to ideal at RDMA is unexplained. No comparison is made against contemporaneous collective libraries (e.g., Gloo, IBM PowerAI DDL); positioning is only against parameter-server TensorFlow. Model parallelism is left as future work.

Open Problems

Adaptive Tensor Fusion — a static 64 MB buffer is unlikely to be optimal across the {fabric, GPU generation, model topology} product; a runtime adaptive fusion-granularity policy is a natural follow-up.
Heterogeneous-fabric scaling — when some links are NVLink and others are 25 GbE TCP, optimal ring ordering and chunk size are non-obvious; Horovod delegates this to NCCL with no user-level controls.
Beyond synchronous SGD — pairing ring-allreduce with gradient compression, local SGD, or hierarchical synchronization is left to the user; Horovod's allreduce abstraction does not address large-batch convergence.
Model parallelism — Horovod provides no pipeline or tensor-parallel primitives; very-large-model support is explicitly future work.
Operationalizing MPI — ease of MPI installation on production Kubernetes clusters is itself a research/engineering open problem called out by the authors.

Note on NCCL Tuning

Horovod's Tensor Fusion is a direct example of why the post-fusion message size, not the user's per-tensor sizes, is the right input to any NCCL configuration choice. The up-to-65% TCP improvement from fusing many small gradients into a single buffer (64 MB by default) indicates that small-message allreduce calls are dominated by per-call setup latency rather than limited by link bandwidth. That is the regime where the LL/LL128 protocols and lower nChannels would help if fusion were not applied. Once Horovod fuses to a large buffer, NCCL's Ring + Simple at higher channel counts becomes the better choice. Any tuner operating beneath Horovod must therefore observe the fused message size, not the original tensor sizes, to pick the correct configuration.
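
A standard latency-bandwidth (alpha-beta) cost model illustrates why the fused size is the relevant quantity. The sketch below is a textbook approximation, not NCCL's tuning model. The function names, the per-step latency `alpha_s`, and the bandwidth value are hypothetical assumptions rather than measurements from the paper.

```python
def ring_allreduce_time(nbytes, n_workers, alpha_s, bandwidth_bytes_per_s):
    """Alpha-beta estimate: 2(N-1) steps, each paying latency plus M/N bytes."""
    steps = 2 * (n_workers - 1)
    per_step = alpha_s + (nbytes / n_workers) / bandwidth_bytes_per_s
    return steps * per_step


def unfused_vs_fused(tensor_bytes, n_workers, alpha_s, bandwidth_bytes_per_s):
    """Compare one allreduce per tensor against one allreduce on the fused buffer."""
    unfused = sum(
        ring_allreduce_time(b, n_workers, alpha_s, bandwidth_bytes_per_s)
        for b in tensor_bytes
    )
    fused = ring_allreduce_time(
        sum(tensor_bytes), n_workers, alpha_s, bandwidth_bytes_per_s
    )
    return unfused, fused


# Hypothetical inputs: 100 gradients of 40 KB each on 8 workers,
# 20 microseconds per-step latency, 3.125e9 bytes/s (about 25 Gb/s).
sizes = [40_000] * 100
unfused, fused = unfused_vs_fused(sizes, 8, 20e-6, 3.125e9)
print(f"unfused: {unfused * 1e3:.2f} ms, fused: {fused * 1e3:.2f} ms")
```

In this model the bandwidth term is identical in both cases, so the entire saving comes from paying the per-step latency once instead of once per tensor. A tuner that saw only the 40 KB per-tensor sizes would place the workload in the latency-bound regime, while the message actually handed to the collective is the 4 MB fused buffer.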