Near-Optimal Sparse Allreduce for Distributed Deep Learning

Shigang Li, Torsten Hoefler | ETH Zurich | PPoPP '22 | DOI: 10.1145/3503221.3508399

Problem

As deep-learning models grow into the billions of parameters, dense allreduce (communication volume ~2n bytes for n parameters) becomes the dominant scaling bottleneck under data-parallel SGD. Gradient sparsification — keeping only the top-k largest-magnitude gradient entries per step (with up to 99.9% of entries dropped at negligible accuracy loss) — should reduce volume to O(k), but in practice it has not delivered wall-clock speedups. Two reasons: (a) existing sparse-allreduce primitives do not preserve the O(k) volume at scale — Allgather-based methods (TopkA) grow as 2k(P-1) linear in worker count P, recursive-doubling methods (TopkDSA / SparCML) suffer "fill-in" where the support set expands round by round and degrades to dense, and tree-based methods (gTopk) carry an extra log P factor; and (b) the top-k selection itself is GPU-hostile (sort is O(n log n), Quickselect is branchy) and prior threshold-based shortcuts (Gaussiank) systematically underestimate k because gradients have heavier-than-Gaussian tails.

Core Insight

A sparse allreduce can be made truly O(k) — bandwidth bounded by 6k(P-1)/P, within a 3x constant of the proven 2k(P-1)/P lower bound and independent of P beyond the (P-1)/P factor — by combining (i) a balanced split of the gradient space across workers based on the empirical top-k coordinate distribution, (ii) a two-phase reduce-then-allgatherv structure with destination rotation and bucketing to avoid hot-spots, and (iii) a temporal- locality threshold reuse that turns top-k selection into a single O(n) GPU elementwise pass. The design explicitly trades bandwidth (O(k)) for latency (O(P)), which is the right trade for large models on bandwidth-bound fabrics.

Method

Ok-Topk has two layers: an O(k) sparse allreduce (Algorithm 1) and an SGD optimizer (Algorithm 2) that wraps it with residual accumulation.

Phase 1 — Split and Reduce
  - Partition the n-dim gradient space among P workers using a
    *balanced* split derived from the empirical top-k coordinate
    distribution (recomputed every τ = 64 iterations).
  - Each worker reduces its assigned region, receiving ~2k/P
    elements per peer.
  - Destination rotation + bucketing prevent endpoint congestion
    and overlap comm with comp.
  - Cost: (P-1)α + 2k(P-1)/P · β

Phase 2 — Balance and Allgatherv
  - Each worker runs a *global* top-k locally on its region.
  - Data-balancing redistributes chunks to equal size.
  - Recursive-doubling allgatherv broadcasts the global top-k.
  - Cost: ≤ (P + 2 log P)α + 4k(P-1)/P · β

Total: ≤ (2P + 2 log P)α + 6k(P-1)/P · β

Top-k selection
  - Reuse last estimated magnitude threshold τ_thr; refresh
    every τ' iterations.
  - O(n) GPU comparisons per step, no sort/Quickselect.

Convergence (Theorem 4.1)
  - Under Assumption 1: bounded gap ξ between local-then-aggregate
    top-k and true top-k, ξ < P observed empirically.
  - With diminishing α, Ok-Topk SGD converges:
      min_t E[||∇f(w_t)||²] → 0.

The lower bound (Theorem 3.1) shows any COO-format sparse allreduce must spend at least 2k(P-1)/P in bandwidth, so Ok-Topk's ≤6k(P-1)/P is within a constant factor — i.e., asymptotically optimal in k.

Experimental Setup

Component	Value
Testbed	CSCS Piz Daint (Cray XC50)
CPU/GPU per node	Intel Xeon E5-2690 / NVIDIA P100 16 GB
Interconnect	Cray Aries Dragonfly
Software	PyTorch + mpi4py over Cray-MPICH 7.7.16
GPU counts	32, 64, 256
Workloads	VGG-16 / CIFAR-10 (14.7 M params); LSTM / AN4 (27.6 M); BERT / Wikipedia (133.5 M)
Optimizers	SGD (VGG, LSTM); Adam (BERT)
Baselines	Dense (Rabenseifner), DenseOvlp, TopkA, TopkDSA, gTopk, Gaussiank
Metric	End-to-end throughput, time-to-accuracy, communication volume

Communication overhead comparison (Table 1):

Algorithm	Bandwidth	Latency
Dense	2n(P-1)/P · β	2(log P)α
TopkA	2k(P-1)β	(log P)α
TopkDSA	[4k(P-1)/P · β, (2k+n)(P-1)/P · β]	(P + 2 log P)α
gTopk	4k(log P)β	2(log P)α
Gaussiank	2k(P-1)β	2(log P)α
Ok-Topk	[2k(P-1)/P · β, 6k(P-1)/P · β]	(2P + 2 log P)α

Headline Quantitative Results

Algorithm-level:

Bandwidth bound: ≤ 6k(P-1)/P · β (constant in k, independent of P beyond the (P-1)/P factor).
Lower bound (Theorem 3.1): 2k(P-1)/P · β. Ok-Topk is within 3x.
Selection deviation from true k: <11% average; 1.4% on BERT.
TopkDSA fill-in: nominal 0.1% density can balloon to 34.5%.

Optimization isolation:

Balanced split-and-reduce: 1.13x – 1.75x speedup over uniform split.
Balanced allgatherv: additional 1.12x – 1.43x speedup.

End-to-end throughput vs. sparse baselines:

Workload	GPUs	Speedup over sparse baselines	Accuracy delta
VGG-16	32	1.51x – 8.83x	matches dense (93%+)
LSTM	64	1.34x – 7.71x	WER 0.309 vs. dense 0.308
BERT	256	3.29x – 12.95x	matches dense loss curve

Wall-clock pre-training:

BERT on 32 GPUs: 150 h → 47 h (>3.2x faster than dense).
BERT parallel efficiency at 256 GPUs (weak scaling): 76.3%.

Critical observation: at 256 GPUs, allgather-based sparse methods (TopkA, Gaussiank) actually exceed dense allreduce in volume — sparsification without the right primitive is a regression, not a win.

Limitations

O(P) latency term. (2P + 2 log P)α is worse than tree-based methods' O(log P). Ok-Topk wins when bandwidth is the binding constraint (large models, large k); on small models or extremely low-latency fabrics the latency cost can flip the balance.
Data-parallelism only. No pipeline/tensor parallelism integration — explicitly future work.
Threshold reuse assumes slow drift. The temporal-locality shortcut for top-k selection holds empirically across SGD/Adam on the three workloads tested, but distributional shift mid-training (e.g., curriculum learning, optimizer switch) is not characterized.
No joint study with quantization. Quantization is noted as orthogonal and stackable but not evaluated alongside Ok-Topk.

Open Problems

Hybrid (data + pipeline + tensor) parallelism with O(k) sparse allreduce at the data-parallel layer.
Adaptive scheduling of density k(t) and refresh interval τ' based on training progress and the observed gap ξ.
Co-design with quantization: how compounding compression errors interact with Assumption 1's bound.

Note on NCCL Tuning

Ok-Topk's Table 1 makes the bandwidth/latency split explicit per algorithm — Dense uses O(log P) latency and O(n) bandwidth, while Ok-Topk inverts this to O(P) latency and O(k) bandwidth. NCCL's algorithm/protocol axes (Tree vs. Ring vs. CollNet; LL vs. LL128 vs. Simple) expose precisely this kind of α/β trade. The paper's BERT-256 result — that allgather-based sparse schemes can exceed dense volume at 256 GPUs — is a concrete instance of an algorithm that is correct at small P and wrong at large P, exactly the phenomenon a per-collective tuner must learn to recognize and avoid.