SparCML: High-Performance Sparse Communication for Machine Learning — Detailed Summary
Cedric Renggli (ETH Zurich), Saleh Ashkboos (IST Austria), Mehdi Aghagolzadeh (Microsoft), Dan Alistarh (IST Austria), Torsten Hoefler (ETH Zurich) | SC '19, Denver, CO, November 17-22, 2019 | DOI: 10.1145/3295500.3356222 | Code: gitlab.com/renggli/sparcml
Per-section summary organized by paper headings. Each section includes paragraph-level bullets and exact quantitative results where the paper provides them.
Abstract
- Modern ML on large datasets requires highly-scalable algorithms; the standard data-parallel pattern sums every node's gradient via a global allreduce — this allreduce is the single dominant scaling bottleneck.
- Empirical observation: gradient values are frequently zero or close to zero, so allreduce inputs are sparsifiable.
- Contribution: a set of communication-efficient sparse-input collective protocols, plus an MPI-extending library called SparCML that adds non- blocking (asynchronous) and low-precision support.
- The protocols generalize standard collectives so that processes can contribute arbitrary sparse input vectors — i.e. each rank's nnz indices may differ.
- Authors position SparCML as a foundation for future highly scalable ML frameworks; superscript footnote: "SparCML" reads as sparse ML.
1. Introduction and Motivation
Motivation paragraph:
- ML workloads are growing fast: AlexNet trained in days on a 2010-era GPU system; BERT (cited as ref [17]) would take more than a year to train on a single GPU.
- Network parameter counts have grown from a handful (regression) up to ~200 MB for AlexNet up to 340 million parameters / 11 GB at 32-bit precision for the largest BERT.
Standard distribution strategy paragraph:
- Data parallelism partitions the dataset and replicates the model. Global sum is performed by allreduce or a parameter server [35]; this paper focuses on allreduce due to PS scaling limits.
- The allreduce reduction is the bottleneck because its size equals the model size and is not reduced as more nodes are added.
- Concretely: at very large node counts, "hundreds of megabytes must be summed globally every few microseconds."
Existing optimization landscape paragraph:
- All major frameworks chase efficient communication [1,13,33,44,57]; GPU vendors develop specific layers like NCCL [40].
- Research community has proposed quantization [4,44,45], asynchronous training [59], structured sparsification [2,18,46], and large-batch methods [20,56].
- The unexplored angle the authors target: how to exploit sparsity in the global summation itself. Their layer extends MPI primitives.
Conceptual contribution paragraph:
- SparCML reduces communication and synchronization cost by exploiting sparsity and relaxed consistency.
- A node may compute updates against a partially inconsistent view of parameters; the implication is that updates are either naturally sparse [51] or can be sparsified in a principled manner without loss of convergence [2,4,18,46].
Technical contribution paragraph:
- Thesis: sparse-aware communication and compression should be standard but is not available in MPI [19] or in ML communication libraries [40].
- Reason: building general sparse collectives is non-trivial because it adds a new dimension to existing system trade-offs in collective design [48].
Performance/feasibility paragraph:
- For some workload parameters, SparCML can be shown to be within constant factors of optimal in the bandwidth/latency cost model.
- It achieves order-of-magnitude speedups vs. highly optimized dense collectives or naive sparse implementations, both in synthetic and real applications.
- Adds two extra features: efficient reduced-precision support and non- blocking operations. Example: gradient exchange at 4 bits per coordinate, overlapping computation and communication.
Targets paragraph:
- Two large-scale ML targets: deep neural networks and large-scale regularized classification.
- Two system scenarios: (1) supercomputing — high-performance, optimized network, (2) datacenters — relatively slow networks (InfiniBand or Gigabit Ethernet).
Challenges paragraph:
- Implementing allreduce sums where many nodes contribute sparse vectors is hard because non-zero indices overlap unpredictably; the size of the reduced result is unknown a priori.
- An adaptive set of techniques is needed for the various input regimes.
- An additional ML-side challenge: avoiding hyperparameter retuning when enabling sparsity. The paper reports this is achievable with few exceptions.
Experimental results paragraph:
- Three classes of benchmarks: (1) synthetic, (2) academic ML datasets, (3) large-scale image classification (ImageNet ResNet) and automated speech recognition (LSTM, internal Microsoft assistant model).
- Synthetic shows order-of-magnitude speedups vs. dense; limited overhead in the dense case.
- Integrations: CNTK (Microsoft) and MPI-OPT (the authors' own framework).
- Headline supercomputing result: SparCML reduces end-to-end convergence time of a state-of-the-art NLU network by ~6x.
- Headline datacenter result: completes a large-scale URL classification task 31x faster than a Cray MPI-based variant (which doesn't exploit sparsity). Speedups are larger on slower networks.
Large workloads paragraph:
- ImageNet wide-residual networks [58] on 64 GPUs: end-to-end training time reduced by ~2x with negligible accuracy loss (<0.5% Top-5 validation), no extra hyperparameter tuning.
- Gains are negligible for plain ResNet-50 [20,23] (smaller, less amenable to sparsity).
- ASR task: state-of-the-art LSTM training time reduced on 128 GPUs by almost 10x, from 14 days to 1.78 days, no accuracy loss.
2. Preliminaries
Notation table (Section 2 opener):
| Variable | Description |
|---|---|
| P | Number of nodes |
| N | Problem dimension |
| p_i | Node i, 1 <= i <= P |
| H_i | Set of non-zero indices node p_i communicates |
| k | max_i |
| K | total nnz in global sum: |
| d | density of nnz: k/N |
2.1 Data Parallelism and Communication Costs
Data parallel paragraph:
P nodes share a large dataset; each holds its own model copy x_t.
Updates are exchanged either by global averaging or via a parameter server [35]. Each iteration runs SGD: each node computes a stochastic gradient on a randomly sampled mini-batch, then nodes globally sum.
Standard SGD iteration:
x_{t+1} = x_t - eta * sum_{i=1..P} grad F_i(x_t)
- The trade-off is parallelism vs. communication: P nodes give P times more samples per iteration, but require an additional global sum to maintain consistency.
- This communication cost motivates the wave of communication-reduction techniques that follow.
2.2 Communication-Reduction Techniques
Structured Sparsification paragraph:
- Top-k SGD [2,18]: each node communicates only the k largest-magnitude components of its local gradient. k is fixed to some percentage, "even lower than 1%" per [37].
- Forces sparsity at each node, although chosen indices may differ across nodes.
- Components not sent are accumulated (residual / error feedback) into the next iteration's gradient.
- Algorithm 1 pseudocode (paper's Algorithm 1, SparCML Quantized TopK SGD at node i):
v_0 = eps_0^i = 0
for each step t >= 1 do
acc_t^i <- eps_{t-1}^i + alpha * grad F_i(v_{t-1})
eps_t^i <- acc_t^i - TopK(acc_t^i) # update error
g_t^i <- allreduce( Q(TopK(acc_t^i)), SUM ) # sparse contribution
v_t^i <- v_{t-1}^i - g_t^i
end for
- The quantization function Q can be the identity (no quantization) or a stochastic quantizer (Section 6).
Quantization paragraph (orthogonal axis):
- Quantization [4,16,44,52] reduces bits per gradient value:
x_{t+1} = x_t - sum_{i=1..P} Q( grad F_i(x_t) )
- Q is element-wise; reduces precision; preserves convergence as long as Q is zero-mean, but the added variance can slow convergence [4].
3. Communication-Reduction: A Critical View
Section opener — examines techniques in the supercomputing / cloud regime.
Structured sparsification paragraph (positives):
- Sparsification preserves convergence even on non-convex objectives [5]; empirical neural-net experiments allow nodes to send <1% of local gradient without losing convergence [37].
Figure 1 paragraph (caveat at scale):
- High sparsity (>99%) requires careful momentum / learning-rate hyperparameter tuning — error-prone and time-consuming.
- Lower sparsity (5-10% nnz/node) is more stable, but summing across many nodes can cause the reduced vector to become dense — at which point communication is bottlenecked again.
- Figure 1: 3D plot of result density (%) vs. number of nodes N (1-128) and per-node density k (0-30%) on ResNet20 / CIFAR-10 at training epoch 5. Result density rises sharply with both axes. (The same shape holds across training stages and models — ResNet, DenseNet.)
Quantization paragraph (opposite trade-off):
- Quantization compression rate is independent of node count — it does not fill in.
- However, quantization can yield only 4-8x compression before added variance begins to hurt accuracy [4,22].
4. Communication Reduction in SparCML
Section opener — neither sparsification nor quantization is ideal alone at high node count, so SparCML combines them.
Sparse Quantized Reduction paragraph:
- High-level: each node maintains residual error eps^i locally; on each step the residual + new gradient produces an accumulator; the accumulator is truncated to obtain the value g_t^i sent to others; the new residual is updated.
- The allreduce sums truncated values across nodes.
- Because the summed result may become dense, SparCML may quantize that intermediate dense vector with stochastic QSGD quantization [4] to bound bandwidth.
- Novel claim: even though sparsification and stochastic quantization were proposed independently in [2,18,37] and [4,44] respectively, this paper is the first to combine them and prove convergence of the combined method.
Convergence Proof paragraph:
- Theorem 4.1 (proof in appendix, builds on the convergence proof of TopK SGD [5]; the novelty is adding stochastic quantization on top):
Theorem 4.1. Consider the SparCML SGD algorithm minimising a smooth, non-convex function f. There exists a learning-rate schedule (alpha_t) such that: min_{t in {1..T}} E[||grad f(x_t)||^2] -> 0 as T -> infinity.
Discussion paragraph (limits of the theorem):
- The result is ergodic convergence to a stationary point of expected zero gradient — weaker than convergence to a global minimum, but matches the state-of-the-art for non-convex with quantization [36].
- Like most theory results, it does not prescribe a precise hyperparameter set beyond "learning rates should be diminishing."
5. Supporting Sparsity in SparCML
5.1 Data Representation: Sparse Streams
Sparse stream definition paragraph:
- The data type is called a sparse stream: it is designed for efficient computation and communication of sparse vectors.
- Implemented in C++11.
- Discussion focuses on summation as the binary op, but applies to other component-wise ops.
Vector representations paragraph:
- Each H_i is a sparse subset of size k = max_i |H_i| << N (universe N).
- Density d_i = |H_i|/N; d = max_i d_i = k/N.
- Total nnz after reduction: K = |U_{i=1..P} H_i|. So k <= K <= min{N, P*k}.
Storage paragraph:
- Stored as a sequence of (index, value) pairs in a contiguous array.
- Index width yields
isizebytes per non-zero. Single or double precision values supported; lower precision support discussed in Section 7.
Switching to a Dense Format paragraph:
- Even if the input is sparse, the intermediate reduced vector may approach the universe size N — at which point the sparse representation is wasteful.
- Density model: dense format transmits N * isize bytes; sparse format
with nnz non-zeros transmits nnz*(c + isize) bytes, where c >=
ceil(log_2(N)/8) is the index size in bytes. Sparse beats dense iff
nnz <= delta = N*isize / (c + isize). - In practice delta is set even smaller (because summing sparse vectors is computationally more expensive than summing dense vectors).
- Initial nnz is bounded by k, but as P grows, nnz can exceed delta. SparCML prepends a flag byte to each stream that marks dense/sparse — and pre- allocates N * isize bytes so the worst case is bounded.
Efficient Summation paragraph:
- Summing two streams u_1, u_2 — distinguishes: (1) indices of u_1 and u_2 may overlap — must compute the size of |H_1 union H_2|; overlap is upper bounded by |H_1| + |H_2|; tightness depends on the unknown sparsity distribution. (2) indices are disjoint — concatenate.
- Three sub-cases inside (1):
- Both sparse, possibly overlapping: pre-check if union exceeds delta; if yes, switch to dense.
- One sparse, one dense: iterate over the sparse stream and write into the dense buffer.
- Both dense: vectorized dense element-wise addition in either input buffer; no new stream allocated.
5.2 Efficient Collectives on Sparse Streams
Section opener paragraph:
- Collective ops defined over the elements present at every node.
- Targets: allgather and allreduce (per MPI spec [21]).
- Supports arbitrary coordinate-wise associative reductions where a neutral element is defined (e.g., 0 for sum).
Analytical Model paragraph:
- Bidirectional, direct point-to-point cost model: T(L) = alpha + beta * L (Latency-Bandwidth model).
- alpha = latency, beta = transfer time per word.
- For sparse messages: beta_s = transfer time per (index,value) pair, beta_d = transfer time per word, with beta_d < beta_s (sparse pairs cost more per unit because of the index overhead).
Goal paragraph:
- Goal: at every node, compute the element-wise sum over the N dimensions of the allreduce vector while minimizing communication cost in the alpha-beta model.
Assumptions paragraph:
- Each node initially has k elements, |H_i| = k; P is a power of 2, P > 4; N is divisible by P. Relaxations are discussed in supplementary material.
5.3 Communication Algorithms
Section opener paragraph:
- Composes sparse streams into sparse collective communication algorithms.
- Modifies two existing dense algorithms to handle sparse vectors. Each node does not need global information about the amount of nnz across nodes.
- But all algorithms need a rough idea of K — the final size of the result.
SSAR vs DSAR paragraph:
- Static sparse allreduce (SSAR): K stays below delta — never switches to dense.
- Dynamic sparse allreduce (DSAR): K >= delta — starts sparse but switches to dense at some intermediate point.
Two extremes paragraph (uniform-nnz assumption):
- Case (1): no overlap of nonzero indices; final x has kP non-zeros.
- Case (2): all elements overlap; H_i = H_j for all i,j.
- Case (1) could be done with a simple allgather (no computation). Case (2) is equivalent to a dense allreduce of size k.
- General case lies between; communication time bounded above by log_2(P) alpha + (P-1) k beta_d [10]; bounded below by log_2(P) alpha + 2 ((P-1)/P) k beta_d [10] (latter only valid for negligible compute cost).
Lemma 5.1 paragraph:
Lemma 5.1. The time T for sparse allreduce is bounded by T >= log_2(P) alpha + (P-1) k beta_d if K = kP, and T >= log_2(P) alpha + 2 ((P-1)/P) k beta_d if K = k and computation for reduction is perfectly parallelized.
Switching paragraph:
- Allreduce implementations switch between algorithms by message size and process count [48]; SparCML distinguishes small and large message cases.
5.3.1 The Small Data Case — SSAR_Recursive_double
- When the overall reduced data is small, latency dominates the bandwidth term.
- Adopts recursive doubling:
- Round 1: nodes at distance 1 exchange + perform local sparse-stream reduction.
- Round 2: nodes at distance 2 exchange the reduced data.
- Round t: nodes at distance 2^{t-1} exchange the previously reduced 2^{t-1} k items.
- Figure 2 shows the structure on P = 8.
- The same recursive-doubling pattern works for dense allreduce/allgather [29].
Latency analysis:
- Latency: L_1(P) = log_2(P) alpha (latency-optimal, data-independent).
- Bandwidth bounds:
- Lower: L_1(P) + log_2(P) k beta_s (when k items fully overlap each round)
- Upper: L_1(P) + (P-1) k beta_s (when no overlap, summing geometric series yields k(P-1) total items transmitted).
5.3.2 The Large Data Case — SSAR_Split_allgather
- For large data, dense allreduce uses Rabenseifner's algorithm [42]: reduce-scatter (recursive halving) + recursive-doubling allgather; total T_ar_rab = 2 log_2(P) alpha + 2 ((P-1)/P) k beta_s — within 2 of bandwidth lower bound on the latency term.
- SparCML's sparse version splits the problem in two phases:
- Split phase: uniformly partition N into P partitions; each node sends the indices of its sparse vector belonging to partition j directly to the owner of partition j. Each node then reduces what it received into the final result for its partition.
- Sparse Allgather phase: each node's intermediate of size k/P (since final K = k after non-overlap upper bound) is gathered to all nodes via the previously-defined recursive-doubling sparse allgather.
- Bound on split phase: (P-1)alpha + 0beta_s <= T_split <= (P-1)alpha + kbeta_s.
- Bound on sparse_ag (allgather): L_1(P) + ((P-1)/P) k beta_s <= T_sparse_ag <= L_1(P) + (P-1) k beta_s.
- Combined: L_2(P) + ((P-1)/P) k beta_s <= T_ssar_split_ag <= L_2(P) + P k beta_s, where L_2(P) = (P-1)*alpha + L_1(P).
- The latency term is data-independent.
5.3.3 The Dynamic Case — Switching to Dense
Analysis paragraph:
- DSAR — assume final K > delta, so a sparse representation is wasteful. Specifically, K = kappa*N where kappa = delta/N (so sparse format is no longer efficient: about delta items in N entries).
- Bandwidth savings vs. dense are limited to a constant relative to dense.
Lemma 5.2 paragraph:
Lemma 5.2. Any algorithm solving the DSAR problem needs at least log_2(P) alpha + delta beta_d, where the lower bound on the bandwidth required is at least (1/2) kappa of any bandwidth-optimal fully-dense allreduce algorithm, with kappa = delta/N.
- Proof sketch: optimal latency lower bound from [10]. For full dense allreduce with k = N, the bandwidth lower bound is 2 ((P-1)/P) N beta_d. By the same analysis the DSAR problem has minimum bandwidth term delta*beta_d, which yields the (1/2) kappa factor lower bound.
Algorithm paragraph (DSAR_Split_allgather):
- Same first phase as SSAR_Split_allgather (sparse split): each node receives data in sparse format from all others into its partition.
- Each node then switches the representation to dense and performs a dense allgather (reusing highly-optimized MPI dense allgather).
- Combined bound: L_2(P) + k beta_s + ((P-1)/P) N beta_d <= T_dsar_split_ag <= L_2(P) + k beta_s + ((P-1)/P) N beta_d.
- Achievable speedup of sparse allreduce vs. fully dense (when end result cannot be stored sparsely) is at most (2/kappa)x. With kappa = 0.5 this yields a max speedup of 4x compared to dense — pure sparsity exploitation alone cannot do better; further compression (e.g. quantization) is needed to exceed this.
6. Supporting Low-Precision Communication
- Bandwidth cost in the dynamic case is bounded by a constant fraction of dense; experimental evidence (Section 8) confirms this scenario is common in large-scale deployments.
- To shave further bandwidth, SparCML supports lower-precision representations — 2, 4, 8 bits per entry — using stochastic quantization following the QSGD scheme [4]. Convergence is provably preserved.
- Sketch implementation: each dense stream is split into buckets of size B (order of 1024 consecutive entries); each bucket is quantized independently and stochastically; each bucket carries a full-precision scaling factor applied to all entries.
- Low-precision is applied only on the dense second phase of DSAR_Split_allgather (the moment data becomes dense). The bandwidth cost of that phase is reduced by the constant quantization factor.
7. Artifact and Additional Features
Interface and Code paragraph:
- SparCML library interface mirrors standard MPI calls with the caveat that data is assumed to be a sparse stream.
- Changes needed to existing MPI-enabled code to add SparCML are minor.
- Implementation: ~2,000 lines of native C++11 (excluding benchmark/test infrastructure). Adding SparCML to CNTK required ~100 lines.
Non-Blocking Operations paragraph:
- Algorithms also implemented in non-blocking form (per MPI-3 [27,28]).
- A thread can trigger a collective and continue local computation while the collective progresses in the background [26].
MPI-OPT paragraph:
- MPI-OPT is the authors' from-scratch framework for distributed optimization algorithms (SGD, SCD).
- Native C++11; can link external libraries such as SparCML and MPI for communication.
- Implements parallel stochastic optimization (gradient & coordinate descent) on multiple nodes via MPI-IO data ingest.
- Supports parallel data partition, multi-thread per-node compute, parametrized learning-rate schedules; can use SparCML as the comm layer for sparse, dense, synchronous, and asynchronous aggregation.
Microsoft Cognitive Toolkit (CNTK) paragraph:
- Modify CNTK [57] v2.0 to use SparCML as its communication layer.
- CNTK is a directed-computation-graph deep learning platform; supports popular NN architectures.
- Trains via SGD with autodiff. Default communication is MPI-based; SparCML swaps in.
8. Experiments
Setup paragraph:
- Two scenarios: supercomputing and cloud computing.
- Supercomputing: CSCS Piz Daint, Cray XC50 nodes, each with a 12-core HT-enabled Intel Xeon E5-2690 v3, 4 GB GPU RAM, NVIDIA Tesla P100 16 GB GPU. Most powerful supercomputer in Europe at the time; high-performance Aries interconnect, Dragonfly topology.
- Cloud computing: multiple nodes with relatively older NVIDIA K80 GPUs connected via Gigabit Ethernet to simulate a standard cloud deployment; no background traffic.
- Additional cluster called Greina (CX50 nodes) with InfiniBand FDR or Gigabit Ethernet interconnect, plus a production-grade GPU cluster.
- Baseline: MPI allreduce on fully dense vectors. Generally Open MPI
default; on Piz Daint, custom Cray-MPICH (compared against). Index
storage is a 32- bit
unsigned intsince N > 65 K.
8.1 Micro-Benchmarks
Setup paragraph:
- Synthetic data on Piz Daint and Greina (GigE).
- Vary N (data dimension) and d (per-node density). k = d*N indices selected uniformly at random per node, assigned random value.
- Run sparse allreduce variants to validate correctness and analytical ordering. Five experiments with newly generated data; each run 10 times; 50 runtime values yield 25/75-percentile quantiles. Reported on log-log scale.
Figure 3 results paragraph:
- Two plots:
- Piz Daint: Reduction time vs. number of nodes (2 -> 128) at N = 16M, d = 0.781%. Algorithms: MPI_AllReduce, SSAR_Recursive_double, SSAR_Split_allgather, DSAR_Split_allgather, Dense AllReduce Ring.
- Greina (GigE): Reduction time vs. data density (0.195% -> 6.250%) at N = 16M, P = 8.
- Analytical predictions confirmed:
- SSAR_Recursive_double wins for small data (latency dominates).
- At higher P or larger nnz, SSAR_Split_allgather dominates SSAR_Recursive_ double.
- DSAR_Split_allgather dominates SSAR_Split_allgather as nnz grows large relative to N.
- Ring-based MPI dense allreduce is faster than any sparse variant on a small / fast network at low node count, but sparse variants win at high node count even at low density.
- DSAR_Split_allgather provides only constant-factor improvement at very large node counts (matches Lemma 5.2).
- The slower the network, the larger the relative SparCML benefit (Greina GigE shows greater separation than Piz Daint).
8.2 Large-Scale Classification
Goal paragraph:
- Linear classifiers (Logistic Regression, SVM) on large-scale datasets via SGD and stochastic coordinate descent (SCD); examine speedup just by exploiting data sparsity — no quantization or sparsification of the gradient updates.
Datasets — Table 1:
| Name | # Classes | # Samples | Dimension |
|---|---|---|---|
| URL [38] | 2 | 2,396,130 | 3,231,961 |
| Webspam [51] | 2 | 350,000 | 16,609,143 |
| CIFAR-10 [34] | 10 | 60,000 | 32 x 32 x 3 |
| ImageNet-1K [43] | 1000 | 1.3M | 224 x 224 x 3 |
| ATIS [24] | 128 | 4,978 sentences / 56,590 words | — |
| Hansards [41] | — | 948 K sentences / 15,657 K words | — |
Sparsity-driven SGD paragraph:
- For SGD, samples are trigrams — extremely sparse text-based data. Loss function is linear in features, so resulting gradients are sparse.
- Communication is lossless, so convergence is preserved; only communication time changes.
- Large batches per node (1,000 * P) used.
Table 2 — Distributed optimization with MPI-OPT (full-dataset pass times, communication time in brackets):
| System | Dataset | Model | # nodes | Baseline (s) | Algorithm | Algo Time (s) | Speedup |
|---|---|---|---|---|---|---|---|
| Piz Daint | Webspam | LR | 32 | 24.0 (21.6) | SSAR_Recursive_double | 6.8 (2.6) | 3.53x (6.17x) |
| SVM | 32 | 16.2 (14.2) | 6.5 (4.4) | 2.49x (3.23x) | |||
| Piz Daint | URL | LR | 32 | 26.4 (25.8) | SSAR_Recursive_double | 7.5 (7.0) | 3.52x (3.69x) |
| SVM | 32 | 19.8 (19.3) | 5.6 (5.3) | 3.54x (3.64x) | |||
| Piz Daint | Webspam | LR | 8 | 46.7 (37.9) | SSAR_Split_allgather | 25.6 (15.8) | 1.82x (2.40x) |
| URL | LR | 8 | 37.7 (35.3) | 20.9 (15.0) | 1.80x (2.35x) | ||
| Greina (IB) | Webspam | LR | 8 | 65.2 (46.7) | SSAR_Split_allgather | 36.3 (19.0) | 1.80x (2.46x) |
| URL | LR | 8 | 81.4 (44.7) | 61.1 (24.9) | 1.33x (1.80x) | ||
| Greina (GigE) | Webspam | LR | 8 | 768.0 (759.5) | SSAR_Split_allgather | 37.9 (29.5) | 20.26x (25.75x) |
| URL | LR | 8 | 1045.0 (1004.6) | 80.26 (42.2) | 12.65x (23.81x) |
SCD paragraph:
- Run MPI-OPT's SCD implementation following the distributed random block CD algorithm of [53].
- 8 nodes of Piz Daint, logistic regression on URL, 100 coordinates per iteration. Each iteration requires sparse allgather of those coords (compared to dense allgather baseline).
- MPI-OPT with dense allgather: 49 s/epoch, 24 s in communication.
- MPI-OPT with sparse allgather: 26 s/epoch, 4.5 s in communication.
- Overall speedup factor 1.8x, due to 5.3x speedup in communication time.
Apache Spark comparison paragraph:
- Spark v1.6 is officially supported by CSCS [8]. Spark uses its own communication layer that does not exploit sparsity.
- 8 nodes Piz Daint, URL dataset: MPI-OPT vs. Spark — 63x speedup to convergence, due to 185x communication-time speedup: average epoch reduced from 378 s (319 s in comm) to 6 s (1.7 s in comm).
- MPI-OPT with standard Cray-optimized dense allreduce already achieves 31x speedup to convergence vs. Spark, due to 43x speedup in communication time. So Spark itself is the bigger handicap; SparCML doubles the win on top.
- Per-epoch: 13 s on average, 8.6 s in communication.
- 8-node cluster with GigE: Spark = 1,274 s/epoch (1042 s comm). MPI-OPT = 14 s/epoch (12 s comm). 86x epoch / 12x comm vs. Spark dense.
Discussion paragraph:
- Spark comparison should be taken with a grain of salt because Spark implements non-trivial features such as fault tolerance.
- Conclusion: sparse communication delivers significant savings whenever sparsity is naturally present in the data.
8.3 Training Deep Neural Networks
Setup paragraph:
- Distributed CNTK on academic datasets.
- Implement Top-k SGD [2,18,46] with low-precision quantization (Algorithm 1).
- Three task types: image classification on CIFAR-10, NLU on ATIS, machine translation on Hansards.
- For CIFAR-10: ResNet-110 [23]. For NLU/MT: encoder-decoder with two LSTM cells [25].
- Default single-GPU 32-bit hyperparameters from open-source CNTK 2.0 repo [39]. Sparsification + quantization via optimized GPU kernels.
- Communication is layer-wise (non-blocking calls) — overhead on overall computation is < 1%.
Bandwidth reduction example paragraph:
- LSTM on ATIS: ~20 M parameters = ~80 MB per minibatch in full precision.
- Compressed gradient received at every node in SparCML totals less than 0.5 MB.
Accuracy & Speedup paragraph (CIFAR-10):
- For CIFAR-10, end accuracy matches the full-precision baseline when selecting k = 16 of every 512 elements (= ~3% density), and for k = 8/512 the accuracy is 1% above the 32-bit variant (Figure 4a).
- For ATIS / Hansards: training and test losses + BLEU within 1% of full-precision baselines; ATIS shown in Figure 4b. Quantization parameters for ATIS: k = 2, k = 4 entries per bucket of 512 (~0.4% / ~0.8% density).
End-to-end training time paragraph:
- CIFAR-10: speedup 1.12x to full convergence vs. full-precision baseline, on 8 nodes.
- ATIS (20 epochs): 5.99x speedup.
- Hansards (5,200 iterations): 1.5x speedup.
- Variance explained by communication-vs-compute ratio: CIFAR-10 is compute-dominated; ATIS is communication-dominated (so sparsity has much larger impact).
8.4 Large Workload Experiments
ImageNet ResNet-50 paragraph:
- 64 P100 GPU nodes on CSCS Piz Daint; ResNet-50 on ImageNet-1K (~25 M parameters across 50 layers).
- Negative result: Cray MPI baseline = ~2,071 s/epoch; SparCML at ~99% sparsity is only ~6% faster (~1,950 s/epoch).
- Three reasons: (1) gradients become dense during aggregation at this parameter setting (limiting speedup); higher sparsity hurts model convergence even with momentum correction / warmup [37]; (2) overhead of sparsification + densification is non-negligible relative to the small per-layer transmission cost in ResNet-50; (3) Cray's proprietary MPI implementation gets no extra parameter tuning here.
- Implication: sparsification is not a universal solution at very large node counts when the per-node compressed gradient becomes dense post-sum.
Wide ResNet (4xResNet18, 4xResNet34) on ImageNet-1K paragraph:
- Wide residual nets [58] have wider channels per block; achieve similar accuracy with shallower depth, less sensitive to hyperparameters.
- Train with TopK SGD k = 1/512, ~0.2% nnz per node; global batch size 512.
- Convergence: final accuracy of sparse variants <0.9% lower in Top-1 and <0.5% lower in Top-5 vs. dense baseline.
- Speedup: ~2x for 4xResNet18, ~1.85x for 4xResNet34 vs. Cray MPI baseline. <0.4% Top-5 accuracy loss vs. fully-dense baseline.
- Most of the speedup comes from the reduced aggregation time of the last fully-connected layer (>2M parameters).
Automated Speech Recognition (ASR) paragraph:
- Production LSTM model with attention powering a popular Microsoft personal digital assistant. ~60M params, 2.4M in attention layer.
- Dataset: ~30,000 hours (3.5 years) of annotated speech.
- Cluster: 32 servers, each with 4 NVIDIA V100 GPUs = 128 V100 GPUs; InfiniBand interconnect; intra-node aggregation via NCCL on NVLink.
- Baseline: 4-node, 16-GPU block-momentum SGD (BMUF) [11] without sparsity or quantization. Higher node counts for the full-precision variant led to negative scalability and divergence.
- BMUF baseline already does non-trivial communication reduction since it syncs less frequently than minibatch SGD (which is infeasible at this size).
- 16-GPU BMUF baseline: ~14 days to train.
- SparCML setup: Top-k SGD with groups of 512 consecutive coordinates, 4 largest sent per group; fixed global batch size 512 (strong scaling).
- Six full passes over the dataset; record time-to-final-accuracy.
- Figure 6a: error-vs-time. Sparse implementation reaches similar accuracy to full-precision baseline in a fraction of the time. At 32 nodes (128 GPUs), training reduces to <1.78 days.
- Figure 6b: scalability — sub-linear but strong from 16 -> 128 GPUs.
- Word-error-rate (WER) on validation: trained models <1% higher WER than full-precision (and can be up to 1% lower in some splits — sometimes better due to noise regularization).
Hyperparameter Tuning paragraph:
- Authors enforce no hyperparameter retuning under sparsity; in most cases recovered accuracy at standard hyperparameters.
- Two notable exceptions: (i) ResNet50 — high sparsity + large batch induced significant accuracy loss; (ii) ASR — required maintaining a small global batch size to preserve convergence.
- Suggests non-trivial interaction between sparse gradients, batch size, and convergence — left to future work.
9. Related Work
Reduced communication techniques paragraph:
- Seide et al. [44]: 1-bit SGD — first to propose quantization for reducing bandwidth/latency cost of training deep networks.
- Alistarh et al. [4]: QSGD — theoretically justified quantization with a controllable compression-vs-convergence trade-off. SparCML implements QSGD as default quantization.
- Dryden et al. [18], Aji & Heafield [2]: alternative communication-reduction by sparsifying the gradient updates with top-k. Subsequent work [37,46] shows extreme sparsity (<0.1%) can be supported by CNNs and RNNs while preserving accuracy with careful tuning.
- SparCML's contribution complements this: combines stochastic quantization and sparsification with a convergence proof; provides efficient sparsity- and-quantization-aware system support.
Lossless methods paragraph:
- Factorization is a lossless compression effective for fully-connected layers but less so for conv layers [14,54].
- Executing extremely large batches [3,20,55,56] is another lossless method; SparCML's compression methods are orthogonal — they aim to reduce bandwidth at fixed batch size; sparsification can be applied with additional tuning at fixed batch size.
- However, with very large node count, batches grow large and aggregated gradients become dense — sparsity cannot scale linearly with node count. This regime is left for future work; SparCML already implements optimizations such as merging adjacent gradients (tensor fusion) and non-blocking ops [55].
Communication frameworks paragraph:
- NVIDIA's NCCL [40] reduces communication cost when nodes are NVIDIA GPUs with proprietary NVLINK — not the case in supercomputing settings. NCCL also implements only a very restricted set of reduction operations.
- Custom frameworks for specific workloads exist — Livermore Big Artificial Neural Network Toolkit (LBANN [50]), S-Caffe [6] — efficient in specific instances but do not leverage reduced-communication techniques or sparsity.
Sparse reduction paragraph:
- Hofmann & Rünger [30] propose a runlength encoding approach for sparse reductions — SparCML extends this with the observation that data may become dense during reduction, and that an efficient adaptive representation is needed.
- Träff [49] proposes a general approach for sparsity in MPI by ignoring neutral elements in MPI reductions; SparCML's sparse allreduce can be seen as a special case, but it specifies the reduction algorithms and provides performance bounds for small and large message scenarios.
- Kylix [60] considers sparse many-to-many reductions for graph data on community clusters — but assumes knowledge of the data distribution and performs multiple passes; not applicable to SparCML's setting.
- Dryden et al. [18]: a sparse allreduce variant via pairwise
reduce-scatter
- ring allgather. Keeps amount of data constant by re-merging top-k after receiving values; the ability to preserve a local residual is specific to Top-k SGD. SparCML's framework is more general; performance is similar to SSAR_Split_allgather.
10. Conclusions and Further Work
- Described and analyzed SparCML: a high-performance communication framework for sparse and low-precision communication for ML.
- Integrates with existing computational frameworks; provides order-of- magnitude speedups on real applications.
- Future work: (1) other distributed ML applications that benefit from sparsity, (2) interaction between sparsity and other parallelization strategies (e.g., large-batch training).
- Belief: simple but effective sparsity schemes can play a significant role in reducing comm cost in future ML systems.
Acknowledgments / Funding
- ERC Horizon 2020 grant 805223; Swiss National Supercomputing Centre Small Development Project (code d94); ERC Horizon 2020 DAPP grant 678880. CSCS support team (Tal Ben-Nun, Salvatore Di Girolamo) thanked.
Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| Sparsification + sum across many nodes naturally densifies the result | Figure 1 (Sec. 3) |
| Recursive doubling wins for small data; split-allgather wins for large data | Sec. 5.3, Figure 3 |
| Pure sparsity can yield at most (2/kappa)x = 4x speedup vs. dense allreduce | Lemma 5.2 |
| Combined sparsification + stochastic quantization preserves convergence | Theorem 4.1 |
| Sparse comm benefit is bigger on slower networks | Greina GigE vs. Piz Daint (Fig. 3, Table 2) |
| ASR LSTM scaled to 128 GPUs in 1.78 days vs. 14 days BMUF baseline | Sec. 8.4 |
| ResNet-50 doesn't benefit from extreme sparsity at scale (densifies quickly) | Sec. 8.4 |
| Wide ResNets (4xResNet18/34) yield ~2x speedup at 64 GPUs with <0.5% Top-5 loss | Sec. 8.4 |
| Linear classifier on URL: 31x to 63x speedup vs. Spark, depending on stack | Sec. 8.2 |
| GigE small-cluster speedups exceed IB at the same node count (slower link, more comm bound) | Table 2 |
Limitations of the Paper
- Theory only proves ergodic convergence to a stationary point; not to a global minimum. Aligned with state-of-the-art for non-convex training but weaker than per-iteration guarantees.
- Hyperparameter retuning is required for some workloads (ResNet-50 + large batch; ASR with global batch constraint).
- Maximum speedup of pure sparse allreduce vs. dense is bounded by 4x (Lemma 5.2); larger gains require quantization on top.
- Sparsity does not scale linearly with node count: at very high P, summed gradients become dense, eroding savings (ResNet-50 on 64 GPUs). Limits applicability to extreme-scale CNN training.
- Architecture / network specific: gains amplified on slower networks (GigE) and reduced on highly optimized supercomputing fabrics (Cray Aries).
- NCCL-based intra-node speedups not separately measured; SparCML targets inter-node MPI-style fabric; relationship to NCCL collectives only briefly discussed (NCCL has restricted reduction op set; SparCML is more general).
Notable Relationship to NCCL
- The paper explicitly notes NCCL [40] reduces communication cost on NVIDIA GPUs over NVLink, but is restricted to a small set of reduction operations and to NVLink-equipped clusters — not the supercomputing scenario.
- SparCML aims for the orthogonal axis: a general MPI-style sparse collective layer that runs over arbitrary HPC fabrics (Cray Aries, InfiniBand, GigE).
- The ASR experiment uses NCCL for intra-node (NVLink) aggregation and SparCML for inter-node sparse aggregation — i.e. the two layers compose hierarchically, with NCCL handling fast-path within a server and SparCML handling the cross-server sparse allreduce.