Near-Optimal Sparse Allreduce for Distributed Deep Learning — Detailed Summary

Shigang Li, Torsten Hoefler | Department of Computer Science, ETH Zurich | PPoPP '22 (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming), April 2-6 2022, Seoul | DOI: 10.1145/3503221.3508399

Per-section summary organized by paper headings. Each section gives paragraph-level bullet coverage and preserves the paper's exact equations, tables, figures, and quantitative results.


Abstract


1. Introduction



3. O(k) Sparse Allreduce

3.1 Algorithm Design

3.1.1 Split and Reduce

3.1.2 Balance and Allgatherv

3.1.3 Efficient Top-k Selection

3.1.4 Pseudocode

3.2 Lower-Bound Analysis


4. Ok-Topk SGD Algorithm

4.1 Convergence Analysis


5. Evaluations

5.1 Validating Assumption 1

5.2 Selection Accuracy

5.3 Effect of Load-Balancing Optimizations

5.4 End-to-End Training

5.4.1 VGG-16 / CIFAR-10

5.4.2 LSTM / AN4 (Speech)

5.4.3 BERT / Wikipedia


6. Conclusion


Equations Catalog (for reference)

  1. Stochastic gradient: G_t(w_t) = (1/b) Σ_{i=0}^{b} ∇ℓ(w_t, x_i, y_i)
  2. Weight update: w_{t+1} = w_t − α G_t(w_t)
  3. Cost model: Cost = α + βL
  4. Eq. 1 — Split-and-reduce overhead: (P − 1)α + 2k(P-1)/P · β
  5. Eq. 2 — Balance-and-allgatherv overhead: ≤ (P + 2 log P)α + 4k(P-1)/P · β
  6. Eq. 3 — Total Ok-Topk allreduce: ≤ (2P + 2 log P)α + 6k(P-1)/P · β
  7. Eq. 4 — Convergence: min_{t∈{1..T}} E[||∇f(w_t)||^2] → 0 as T → ∞
  8. Eq. 5 — Assumption 1: ||Topk((1/P) Σ_i (αG^i_t + ε^i_t)) − Topk((1/P) Σ_i Topk(αG^i_t + ε^i_t))|| ≤ ξ ||αG_t(w_t)||

Named Methods, Algorithms, Theorems

Name Type Statement
Ok-Topk scheme The integrated sparse-allreduce + SGD method; whole-paper contribution
Algorithm 1 algorithm O(k) sparse allreduce — two phases (split-reduce, balance-allgatherv)
Algorithm 2 algorithm Ok-Topk SGD — Algorithm 1 + residual accumulation
Theorem 3.1 theorem Lower bound: any COO sparse allreduce needs ≥ 2k(P-1)/P bandwidth
Theorem 4.1 theorem Convergence: Ok-Topk SGD → stationary point for smooth non-convex f
Assumption 1 assumption Bounded gap ξ between local-then-aggregate top-k and true top-k

Headline Quantitative Results (consolidated)


Limitations


Open Problems Stated by the Authors

  1. Hybrid parallelism. Combining O(k) sparse allreduce with pipeline and tensor parallelism, where the cross-stage and cross-rank patterns differ structurally.
  2. Adaptive density / threshold scheduling. A principled rule for choosing k(t) and the refresh interval τ' as a function of training progress and observed ξ.
  3. Co-design with quantization. The paper notes quantization is orthogonal but does not measure the joint regime; how compounding compression errors interact with Assumption 1 is open.


Note on NCCL Tuning

Ok-Topk demonstrates that the right algorithm can pin bandwidth to O(k) while ceding latency to O(P) — and it makes that trade explicitly visible in Table 1's bandwidth/latency split. NCCL's tuner exposes the same kind of trade through algorithm (Tree vs. Ring vs. CollNet) and protocol (LL/LL128/Simple), each with a different α/β profile. The paper's BERT result — that allgather-based sparse schemes exceed dense volume at 256 GPUs — is a concrete demonstration that the optimal collective at small P can be the wrong collective at large P, which is exactly the phenomenon a per-collective tuner must learn to avoid.