SparCML: High-Performance Sparse Communication for Machine Learning — Detailed Summary

Cedric Renggli (ETH Zurich), Saleh Ashkboos (IST Austria), Mehdi Aghagolzadeh (Microsoft), Dan Alistarh (IST Austria), Torsten Hoefler (ETH Zurich) | SC '19, Denver, CO, November 17-22, 2019 | DOI: 10.1145/3295500.3356222 | Code: gitlab.com/renggli/sparcml

Per-section summary organized by paper headings. Each section includes paragraph-level bullets and exact quantitative results where the paper provides them.


Abstract


1. Introduction and Motivation

Motivation paragraph:

Standard distribution strategy paragraph:

Existing optimization landscape paragraph:

Conceptual contribution paragraph:

Technical contribution paragraph:

Performance/feasibility paragraph:

Targets paragraph:

Challenges paragraph:

Experimental results paragraph:

Large workloads paragraph:


2. Preliminaries

Notation table (Section 2 opener):

Variable Description
P Number of nodes
N Problem dimension
p_i Node i, 1 <= i <= P
H_i Set of non-zero indices node p_i communicates
k max_i
K total nnz in global sum:
d density of nnz: k/N

2.1 Data Parallelism and Communication Costs

Data parallel paragraph:

x_{t+1} = x_t - eta * sum_{i=1..P} grad F_i(x_t)

2.2 Communication-Reduction Techniques

Structured Sparsification paragraph:

v_0 = eps_0^i = 0
for each step t >= 1 do
   acc_t^i  <- eps_{t-1}^i + alpha * grad F_i(v_{t-1})
   eps_t^i  <- acc_t^i - TopK(acc_t^i)              # update error
   g_t^i    <- allreduce( Q(TopK(acc_t^i)), SUM )    # sparse contribution
   v_t^i    <- v_{t-1}^i - g_t^i
end for

Quantization paragraph (orthogonal axis):

x_{t+1} = x_t - sum_{i=1..P} Q( grad F_i(x_t) )

3. Communication-Reduction: A Critical View

Section opener — examines techniques in the supercomputing / cloud regime.

Structured sparsification paragraph (positives):

Figure 1 paragraph (caveat at scale):

Quantization paragraph (opposite trade-off):


4. Communication Reduction in SparCML

Section opener — neither sparsification nor quantization is ideal alone at high node count, so SparCML combines them.

Sparse Quantized Reduction paragraph:

Convergence Proof paragraph:

Theorem 4.1. Consider the SparCML SGD algorithm minimising a smooth, non-convex function f. There exists a learning-rate schedule (alpha_t) such that: min_{t in {1..T}} E[||grad f(x_t)||^2] -> 0 as T -> infinity.

Discussion paragraph (limits of the theorem):


5. Supporting Sparsity in SparCML

5.1 Data Representation: Sparse Streams

Sparse stream definition paragraph:

Vector representations paragraph:

Storage paragraph:

Switching to a Dense Format paragraph:

Efficient Summation paragraph:

5.2 Efficient Collectives on Sparse Streams

Section opener paragraph:

Analytical Model paragraph:

Goal paragraph:

Assumptions paragraph:

5.3 Communication Algorithms

Section opener paragraph:

SSAR vs DSAR paragraph:

Two extremes paragraph (uniform-nnz assumption):

Lemma 5.1 paragraph:

Lemma 5.1. The time T for sparse allreduce is bounded by T >= log_2(P) alpha + (P-1) k beta_d if K = kP, and T >= log_2(P) alpha + 2 ((P-1)/P) k beta_d if K = k and computation for reduction is perfectly parallelized.

Switching paragraph:

5.3.1 The Small Data Case — SSAR_Recursive_double

Latency analysis:

5.3.2 The Large Data Case — SSAR_Split_allgather

5.3.3 The Dynamic Case — Switching to Dense

Analysis paragraph:

Lemma 5.2 paragraph:

Lemma 5.2. Any algorithm solving the DSAR problem needs at least log_2(P) alpha + delta beta_d, where the lower bound on the bandwidth required is at least (1/2) kappa of any bandwidth-optimal fully-dense allreduce algorithm, with kappa = delta/N.

Algorithm paragraph (DSAR_Split_allgather):


6. Supporting Low-Precision Communication


7. Artifact and Additional Features

Interface and Code paragraph:

Non-Blocking Operations paragraph:

MPI-OPT paragraph:

Microsoft Cognitive Toolkit (CNTK) paragraph:


8. Experiments

Setup paragraph:

8.1 Micro-Benchmarks

Setup paragraph:

Figure 3 results paragraph:

8.2 Large-Scale Classification

Goal paragraph:

Datasets — Table 1:

Name # Classes # Samples Dimension
URL [38] 2 2,396,130 3,231,961
Webspam [51] 2 350,000 16,609,143
CIFAR-10 [34] 10 60,000 32 x 32 x 3
ImageNet-1K [43] 1000 1.3M 224 x 224 x 3
ATIS [24] 128 4,978 sentences / 56,590 words
Hansards [41] 948 K sentences / 15,657 K words

Sparsity-driven SGD paragraph:

Table 2 — Distributed optimization with MPI-OPT (full-dataset pass times, communication time in brackets):

System Dataset Model # nodes Baseline (s) Algorithm Algo Time (s) Speedup
Piz Daint Webspam LR 32 24.0 (21.6) SSAR_Recursive_double 6.8 (2.6) 3.53x (6.17x)
SVM 32 16.2 (14.2) 6.5 (4.4) 2.49x (3.23x)
Piz Daint URL LR 32 26.4 (25.8) SSAR_Recursive_double 7.5 (7.0) 3.52x (3.69x)
SVM 32 19.8 (19.3) 5.6 (5.3) 3.54x (3.64x)
Piz Daint Webspam LR 8 46.7 (37.9) SSAR_Split_allgather 25.6 (15.8) 1.82x (2.40x)
URL LR 8 37.7 (35.3) 20.9 (15.0) 1.80x (2.35x)
Greina (IB) Webspam LR 8 65.2 (46.7) SSAR_Split_allgather 36.3 (19.0) 1.80x (2.46x)
URL LR 8 81.4 (44.7) 61.1 (24.9) 1.33x (1.80x)
Greina (GigE) Webspam LR 8 768.0 (759.5) SSAR_Split_allgather 37.9 (29.5) 20.26x (25.75x)
URL LR 8 1045.0 (1004.6) 80.26 (42.2) 12.65x (23.81x)

SCD paragraph:

Apache Spark comparison paragraph:

Discussion paragraph:

8.3 Training Deep Neural Networks

Setup paragraph:

Bandwidth reduction example paragraph:

Accuracy & Speedup paragraph (CIFAR-10):

End-to-end training time paragraph:

8.4 Large Workload Experiments

ImageNet ResNet-50 paragraph:

Wide ResNet (4xResNet18, 4xResNet34) on ImageNet-1K paragraph:

Automated Speech Recognition (ASR) paragraph:

Hyperparameter Tuning paragraph:


Reduced communication techniques paragraph:

Lossless methods paragraph:

Communication frameworks paragraph:

Sparse reduction paragraph:


10. Conclusions and Further Work


Acknowledgments / Funding


Cross-Cutting Empirical Take-Aways

Take-away Derived from
Sparsification + sum across many nodes naturally densifies the result Figure 1 (Sec. 3)
Recursive doubling wins for small data; split-allgather wins for large data Sec. 5.3, Figure 3
Pure sparsity can yield at most (2/kappa)x = 4x speedup vs. dense allreduce Lemma 5.2
Combined sparsification + stochastic quantization preserves convergence Theorem 4.1
Sparse comm benefit is bigger on slower networks Greina GigE vs. Piz Daint (Fig. 3, Table 2)
ASR LSTM scaled to 128 GPUs in 1.78 days vs. 14 days BMUF baseline Sec. 8.4
ResNet-50 doesn't benefit from extreme sparsity at scale (densifies quickly) Sec. 8.4
Wide ResNets (4xResNet18/34) yield ~2x speedup at 64 GPUs with <0.5% Top-5 loss Sec. 8.4
Linear classifier on URL: 31x to 63x speedup vs. Spark, depending on stack Sec. 8.2
GigE small-cluster speedups exceed IB at the same node count (slower link, more comm bound) Table 2

Limitations of the Paper


Notable Relationship to NCCL