Near-Optimal Sparse Allreduce for Distributed Deep Learning

Shigang Li, Torsten Hoefler | ETH Zurich | PPoPP '22 | DOI: 10.1145/3503221.3508399


Problem

As deep-learning models grow into the billions of parameters, dense allreduce (communication volume ~2n bytes for n parameters) becomes the dominant scaling bottleneck under data-parallel SGD. Gradient sparsification — keeping only the top-k largest-magnitude gradient entries per step (with up to 99.9% of entries dropped at negligible accuracy loss) — should reduce volume to O(k), but in practice it has not delivered wall-clock speedups. Two reasons: (a) existing sparse-allreduce primitives do not preserve the O(k) volume at scale — Allgather-based methods (TopkA) grow as 2k(P-1) linear in worker count P, recursive-doubling methods (TopkDSA / SparCML) suffer "fill-in" where the support set expands round by round and degrades to dense, and tree-based methods (gTopk) carry an extra log P factor; and (b) the top-k selection itself is GPU-hostile (sort is O(n log n), Quickselect is branchy) and prior threshold-based shortcuts (Gaussiank) systematically underestimate k because gradients have heavier-than-Gaussian tails.


Core Insight

A sparse allreduce can be made truly O(k) — bandwidth bounded by 6k(P-1)/P, within a 3x constant of the proven 2k(P-1)/P lower bound and independent of P beyond the (P-1)/P factor — by combining (i) a balanced split of the gradient space across workers based on the empirical top-k coordinate distribution, (ii) a two-phase reduce-then-allgatherv structure with destination rotation and bucketing to avoid hot-spots, and (iii) a temporal- locality threshold reuse that turns top-k selection into a single O(n) GPU elementwise pass. The design explicitly trades bandwidth (O(k)) for latency (O(P)), which is the right trade for large models on bandwidth-bound fabrics.


Method

Ok-Topk has two layers: an O(k) sparse allreduce (Algorithm 1) and an SGD optimizer (Algorithm 2) that wraps it with residual accumulation.

Phase 1 — Split and Reduce
  - Partition the n-dim gradient space among P workers using a
    *balanced* split derived from the empirical top-k coordinate
    distribution (recomputed every τ = 64 iterations).
  - Each worker reduces its assigned region, receiving ~2k/P
    elements per peer.
  - Destination rotation + bucketing prevent endpoint congestion
    and overlap comm with comp.
  - Cost: (P-1)α + 2k(P-1)/P · β

Phase 2 — Balance and Allgatherv
  - Each worker runs a *global* top-k locally on its region.
  - Data-balancing redistributes chunks to equal size.
  - Recursive-doubling allgatherv broadcasts the global top-k.
  - Cost: ≤ (P + 2 log P)α + 4k(P-1)/P · β

Total: ≤ (2P + 2 log P)α + 6k(P-1)/P · β

Top-k selection
  - Reuse last estimated magnitude threshold τ_thr; refresh
    every τ' iterations.
  - O(n) GPU comparisons per step, no sort/Quickselect.

Convergence (Theorem 4.1)
  - Under Assumption 1: bounded gap ξ between local-then-aggregate
    top-k and true top-k, ξ < P observed empirically.
  - With diminishing α, Ok-Topk SGD converges:
      min_t E[||∇f(w_t)||²] → 0.

The lower bound (Theorem 3.1) shows any COO-format sparse allreduce must spend at least 2k(P-1)/P in bandwidth, so Ok-Topk's ≤6k(P-1)/P is within a constant factor — i.e., asymptotically optimal in k.


Experimental Setup

Component Value
Testbed CSCS Piz Daint (Cray XC50)
CPU/GPU per node Intel Xeon E5-2690 / NVIDIA P100 16 GB
Interconnect Cray Aries Dragonfly
Software PyTorch + mpi4py over Cray-MPICH 7.7.16
GPU counts 32, 64, 256
Workloads VGG-16 / CIFAR-10 (14.7 M params); LSTM / AN4 (27.6 M); BERT / Wikipedia (133.5 M)
Optimizers SGD (VGG, LSTM); Adam (BERT)
Baselines Dense (Rabenseifner), DenseOvlp, TopkA, TopkDSA, gTopk, Gaussiank
Metric End-to-end throughput, time-to-accuracy, communication volume

Communication overhead comparison (Table 1):

Algorithm Bandwidth Latency
Dense 2n(P-1)/P · β 2(log P)α
TopkA 2k(P-1)β (log P)α
TopkDSA [4k(P-1)/P · β, (2k+n)(P-1)/P · β] (P + 2 log P)α
gTopk 4k(log P)β 2(log P)α
Gaussiank 2k(P-1)β 2(log P)α
Ok-Topk [2k(P-1)/P · β, 6k(P-1)/P · β] (2P + 2 log P)α

Headline Quantitative Results

Algorithm-level:

Optimization isolation:

End-to-end throughput vs. sparse baselines:

Workload GPUs Speedup over sparse baselines Accuracy delta
VGG-16 32 1.51x – 8.83x matches dense (93%+)
LSTM 64 1.34x – 7.71x WER 0.309 vs. dense 0.308
BERT 256 3.29x – 12.95x matches dense loss curve

Wall-clock pre-training:

Critical observation: at 256 GPUs, allgather-based sparse methods (TopkA, Gaussiank) actually exceed dense allreduce in volume — sparsification without the right primitive is a regression, not a win.


Limitations


Open Problems

  1. Hybrid (data + pipeline + tensor) parallelism with O(k) sparse allreduce at the data-parallel layer.
  2. Adaptive scheduling of density k(t) and refresh interval τ' based on training progress and the observed gap ξ.
  3. Co-design with quantization: how compounding compression errors interact with Assumption 1's bound.

Note on NCCL Tuning

Ok-Topk's Table 1 makes the bandwidth/latency split explicit per algorithm — Dense uses O(log P) latency and O(n) bandwidth, while Ok-Topk inverts this to O(P) latency and O(k) bandwidth. NCCL's algorithm/protocol axes (Tree vs. Ring vs. CollNet; LL vs. LL128 vs. Simple) expose precisely this kind of α/β trade. The paper's BERT-256 result — that allgather-based sparse schemes can exceed dense volume at 256 GPUs — is a concrete instance of an algorithm that is correct at small P and wrong at large P, exactly the phenomenon a per-collective tuner must learn to recognize and avoid.