Near-Optimal Sparse Allreduce for Distributed Deep Learning
Shigang Li, Torsten Hoefler | ETH Zurich | PPoPP '22 | DOI: 10.1145/3503221.3508399
Problem
As deep-learning models grow into the billions of parameters, dense
allreduce (communication volume ~2n bytes for n parameters) becomes the
dominant scaling bottleneck under data-parallel SGD. Gradient
sparsification — keeping only the top-k largest-magnitude gradient
entries per step (with up to 99.9% of entries dropped at negligible
accuracy loss) — should reduce volume to O(k), but in practice it has
not delivered wall-clock speedups. Two reasons: (a) existing
sparse-allreduce primitives do not preserve the O(k) volume at scale —
Allgather-based methods (TopkA) grow as 2k(P-1) linear in
worker count P, recursive-doubling methods (TopkDSA / SparCML) suffer
"fill-in" where the support set expands round by round and degrades to
dense, and tree-based methods (gTopk) carry an extra log P
factor; and (b) the top-k selection itself is GPU-hostile (sort is O(n
log n), Quickselect is branchy) and prior threshold-based shortcuts
(Gaussiank) systematically underestimate k because gradients have
heavier-than-Gaussian tails.
Core Insight
A sparse allreduce can be made truly O(k) — bandwidth bounded by
6k(P-1)/P, within a 3x constant of the proven
2k(P-1)/P lower bound and independent of P beyond the
(P-1)/P factor — by combining (i) a balanced split of the gradient space
across workers based on the empirical top-k coordinate distribution,
(ii) a two-phase reduce-then-allgatherv structure with destination
rotation and bucketing to avoid hot-spots, and (iii) a temporal-
locality threshold reuse that turns top-k selection into a single O(n)
GPU elementwise pass. The design explicitly trades bandwidth (O(k)) for
latency (O(P)), which is the right trade for large models on
bandwidth-bound fabrics.
Method
Ok-Topk has two layers: an O(k) sparse allreduce (Algorithm 1) and an SGD optimizer (Algorithm 2) that wraps it with residual accumulation.
Phase 1 — Split and Reduce
- Partition the n-dim gradient space among P workers using a
*balanced* split derived from the empirical top-k coordinate
distribution (recomputed every τ = 64 iterations).
- Each worker reduces its assigned region, receiving ~2k/P
elements per peer.
- Destination rotation + bucketing prevent endpoint congestion
and overlap comm with comp.
- Cost: (P-1)α + 2k(P-1)/P · β
Phase 2 — Balance and Allgatherv
- Each worker runs a *global* top-k locally on its region.
- Data-balancing redistributes chunks to equal size.
- Recursive-doubling allgatherv broadcasts the global top-k.
- Cost: ≤ (P + 2 log P)α + 4k(P-1)/P · β
Total: ≤ (2P + 2 log P)α + 6k(P-1)/P · β
Top-k selection
- Reuse last estimated magnitude threshold τ_thr; refresh
every τ' iterations.
- O(n) GPU comparisons per step, no sort/Quickselect.
Convergence (Theorem 4.1)
- Under Assumption 1: bounded gap ξ between local-then-aggregate
top-k and true top-k, ξ < P observed empirically.
- With diminishing α, Ok-Topk SGD converges:
min_t E[||∇f(w_t)||²] → 0.
The lower bound (Theorem 3.1) shows any COO-format sparse allreduce
must spend at least 2k(P-1)/P in bandwidth, so Ok-Topk's
≤6k(P-1)/P is within a constant factor — i.e.,
asymptotically optimal in k.
Experimental Setup
| Component | Value |
|---|---|
| Testbed | CSCS Piz Daint (Cray XC50) |
| CPU/GPU per node | Intel Xeon E5-2690 / NVIDIA P100 16 GB |
| Interconnect | Cray Aries Dragonfly |
| Software | PyTorch + mpi4py over Cray-MPICH 7.7.16 |
| GPU counts | 32, 64, 256 |
| Workloads | VGG-16 / CIFAR-10 (14.7 M params); LSTM / AN4 (27.6 M); BERT / Wikipedia (133.5 M) |
| Optimizers | SGD (VGG, LSTM); Adam (BERT) |
| Baselines | Dense (Rabenseifner), DenseOvlp, TopkA, TopkDSA, gTopk, Gaussiank |
| Metric | End-to-end throughput, time-to-accuracy, communication volume |
Communication overhead comparison (Table 1):
| Algorithm | Bandwidth | Latency |
|---|---|---|
| Dense | 2n(P-1)/P · β | 2(log P)α |
| TopkA | 2k(P-1)β | (log P)α |
| TopkDSA | [4k(P-1)/P · β, (2k+n)(P-1)/P · β] | (P + 2 log P)α |
| gTopk | 4k(log P)β | 2(log P)α |
| Gaussiank | 2k(P-1)β | 2(log P)α |
| Ok-Topk | [2k(P-1)/P · β, 6k(P-1)/P · β] | (2P + 2 log P)α |
Headline Quantitative Results
Algorithm-level:
- Bandwidth bound:
≤ 6k(P-1)/P · β(constant in k, independent of P beyond the (P-1)/P factor). - Lower bound (Theorem 3.1):
2k(P-1)/P · β. Ok-Topk is within 3x. - Selection deviation from true k: <11% average; 1.4% on BERT.
- TopkDSA fill-in: nominal 0.1% density can balloon to 34.5%.
Optimization isolation:
- Balanced split-and-reduce: 1.13x – 1.75x speedup over uniform split.
- Balanced allgatherv: additional 1.12x – 1.43x speedup.
End-to-end throughput vs. sparse baselines:
| Workload | GPUs | Speedup over sparse baselines | Accuracy delta |
|---|---|---|---|
| VGG-16 | 32 | 1.51x – 8.83x | matches dense (93%+) |
| LSTM | 64 | 1.34x – 7.71x | WER 0.309 vs. dense 0.308 |
| BERT | 256 | 3.29x – 12.95x | matches dense loss curve |
Wall-clock pre-training:
- BERT on 32 GPUs: 150 h → 47 h (>3.2x faster than dense).
- BERT parallel efficiency at 256 GPUs (weak scaling): 76.3%.
Critical observation: at 256 GPUs, allgather-based sparse methods (TopkA, Gaussiank) actually exceed dense allreduce in volume — sparsification without the right primitive is a regression, not a win.
Limitations
- O(P) latency term.
(2P + 2 log P)αis worse than tree-based methods'O(log P). Ok-Topk wins when bandwidth is the binding constraint (large models, large k); on small models or extremely low-latency fabrics the latency cost can flip the balance. - Data-parallelism only. No pipeline/tensor parallelism integration — explicitly future work.
- Threshold reuse assumes slow drift. The temporal-locality shortcut for top-k selection holds empirically across SGD/Adam on the three workloads tested, but distributional shift mid-training (e.g., curriculum learning, optimizer switch) is not characterized.
- No joint study with quantization. Quantization is noted as orthogonal and stackable but not evaluated alongside Ok-Topk.
Open Problems
- Hybrid (data + pipeline + tensor) parallelism with O(k) sparse allreduce at the data-parallel layer.
- Adaptive scheduling of density k(t) and refresh interval τ' based on training progress and the observed gap ξ.
- Co-design with quantization: how compounding compression errors interact with Assumption 1's bound.
Note on NCCL Tuning
Ok-Topk's Table 1 makes the bandwidth/latency split explicit per
algorithm — Dense uses O(log P) latency and
O(n) bandwidth, while Ok-Topk inverts this to
O(P) latency and O(k) bandwidth. NCCL's
algorithm/protocol axes (Tree vs. Ring vs. CollNet; LL vs. LL128 vs.
Simple) expose precisely this kind of α/β trade. The paper's BERT-256
result — that allgather-based sparse schemes can exceed dense
volume at 256 GPUs — is a concrete instance of an algorithm that is
correct at small P and wrong at large P, exactly the phenomenon a
per-collective tuner must learn to recognize and avoid.