SparCML: High-Performance Sparse Communication for Machine Learning — Detailed Summary

Cedric Renggli (ETH Zurich), Saleh Ashkboos (IST Austria), Mehdi Aghagolzadeh (Microsoft), Dan Alistarh (IST Austria), Torsten Hoefler (ETH Zurich) | SC '19, Denver, CO, November 17-22, 2019 | DOI: 10.1145/3295500.3356222 | Code: gitlab.com/renggli/sparcml

Per-section summary organized by paper headings. Each section includes paragraph-level bullets and exact quantitative results where the paper provides them.

Abstract

Modern ML on large datasets requires highly-scalable algorithms; the standard data-parallel pattern sums every node's gradient via a global allreduce — this allreduce is the single dominant scaling bottleneck.
Empirical observation: gradient values are frequently zero or close to zero, so allreduce inputs are sparsifiable.
Contribution: a set of communication-efficient sparse-input collective protocols, plus an MPI-extending library called SparCML that adds non- blocking (asynchronous) and low-precision support.
The protocols generalize standard collectives so that processes can contribute arbitrary sparse input vectors — i.e. each rank's nnz indices may differ.
Authors position SparCML as a foundation for future highly scalable ML frameworks; superscript footnote: "SparCML" reads as sparse ML.

1. Introduction and Motivation

Motivation paragraph:

ML workloads are growing fast: AlexNet trained in days on a 2010-era GPU system; BERT (cited as ref [17]) would take more than a year to train on a single GPU.
Network parameter counts have grown from a handful (regression) up to ~200 MB for AlexNet up to 340 million parameters / 11 GB at 32-bit precision for the largest BERT.

Standard distribution strategy paragraph:

Data parallelism partitions the dataset and replicates the model. Global sum is performed by allreduce or a parameter server [35]; this paper focuses on allreduce due to PS scaling limits.
The allreduce reduction is the bottleneck because its size equals the model size and is not reduced as more nodes are added.
Concretely: at very large node counts, "hundreds of megabytes must be summed globally every few microseconds."

Existing optimization landscape paragraph:

All major frameworks chase efficient communication [1,13,33,44,57]; GPU vendors develop specific layers like NCCL [40].
Research community has proposed quantization [4,44,45], asynchronous training [59], structured sparsification [2,18,46], and large-batch methods [20,56].
The unexplored angle the authors target: how to exploit sparsity in the global summation itself. Their layer extends MPI primitives.

Conceptual contribution paragraph:

SparCML reduces communication and synchronization cost by exploiting sparsity and relaxed consistency.
A node may compute updates against a partially inconsistent view of parameters; the implication is that updates are either naturally sparse [51] or can be sparsified in a principled manner without loss of convergence [2,4,18,46].

Technical contribution paragraph:

Thesis: sparse-aware communication and compression should be standard but is not available in MPI [19] or in ML communication libraries [40].
Reason: building general sparse collectives is non-trivial because it adds a new dimension to existing system trade-offs in collective design [48].

Performance/feasibility paragraph:

For some workload parameters, SparCML can be shown to be within constant factors of optimal in the bandwidth/latency cost model.
It achieves order-of-magnitude speedups vs. highly optimized dense collectives or naive sparse implementations, both in synthetic and real applications.
Adds two extra features: efficient reduced-precision support and non- blocking operations. Example: gradient exchange at 4 bits per coordinate, overlapping computation and communication.

Targets paragraph:

Two large-scale ML targets: deep neural networks and large-scale regularized classification.
Two system scenarios: (1) supercomputing — high-performance, optimized network, (2) datacenters — relatively slow networks (InfiniBand or Gigabit Ethernet).

Challenges paragraph:

Implementing allreduce sums where many nodes contribute sparse vectors is hard because non-zero indices overlap unpredictably; the size of the reduced result is unknown a priori.
An adaptive set of techniques is needed for the various input regimes.
An additional ML-side challenge: avoiding hyperparameter retuning when enabling sparsity. The paper reports this is achievable with few exceptions.

Experimental results paragraph:

Three classes of benchmarks: (1) synthetic, (2) academic ML datasets, (3) large-scale image classification (ImageNet ResNet) and automated speech recognition (LSTM, internal Microsoft assistant model).
Synthetic shows order-of-magnitude speedups vs. dense; limited overhead in the dense case.
Integrations: CNTK (Microsoft) and MPI-OPT (the authors' own framework).
Headline supercomputing result: SparCML reduces end-to-end convergence time of a state-of-the-art NLU network by ~6x.
Headline datacenter result: completes a large-scale URL classification task 31x faster than a Cray MPI-based variant (which doesn't exploit sparsity). Speedups are larger on slower networks.

Large workloads paragraph:

ImageNet wide-residual networks [58] on 64 GPUs: end-to-end training time reduced by ~2x with negligible accuracy loss (<0.5% Top-5 validation), no extra hyperparameter tuning.
Gains are negligible for plain ResNet-50 [20,23] (smaller, less amenable to sparsity).
ASR task: state-of-the-art LSTM training time reduced on 128 GPUs by almost 10x, from 14 days to 1.78 days, no accuracy loss.

2. Preliminaries

Notation table (Section 2 opener):

Variable	Description
P	Number of nodes
N	Problem dimension
p_i	Node i, 1 <= i <= P
H_i	Set of non-zero indices node p_i communicates
k	max_i
K	total nnz in global sum:
d	density of nnz: k/N

2.1 Data Parallelism and Communication Costs

Data parallel paragraph:

P nodes share a large dataset; each holds its own model copy x_t.
Updates are exchanged either by global averaging or via a parameter server [35]. Each iteration runs SGD: each node computes a stochastic gradient on a randomly sampled mini-batch, then nodes globally sum.
Standard SGD iteration:

x_{t+1} = x_t - eta * sum_{i=1..P} grad F_i(x_t)

The trade-off is parallelism vs. communication: P nodes give P times more samples per iteration, but require an additional global sum to maintain consistency.
This communication cost motivates the wave of communication-reduction techniques that follow.

2.2 Communication-Reduction Techniques

Structured Sparsification paragraph:

Top-k SGD [2,18]: each node communicates only the k largest-magnitude components of its local gradient. k is fixed to some percentage, "even lower than 1%" per [37].
Forces sparsity at each node, although chosen indices may differ across nodes.
Components not sent are accumulated (residual / error feedback) into the next iteration's gradient.
Algorithm 1 pseudocode (paper's Algorithm 1, SparCML Quantized TopK SGD at node i):

v_0 = eps_0^i = 0
for each step t >= 1 do
   acc_t^i  <- eps_{t-1}^i + alpha * grad F_i(v_{t-1})
   eps_t^i  <- acc_t^i - TopK(acc_t^i)              # update error
   g_t^i    <- allreduce( Q(TopK(acc_t^i)), SUM )    # sparse contribution
   v_t^i    <- v_{t-1}^i - g_t^i
end for

The quantization function Q can be the identity (no quantization) or a stochastic quantizer (Section 6).

Quantization paragraph (orthogonal axis):

Quantization [4,16,44,52] reduces bits per gradient value:

x_{t+1} = x_t - sum_{i=1..P} Q( grad F_i(x_t) )

Q is element-wise; reduces precision; preserves convergence as long as Q is zero-mean, but the added variance can slow convergence [4].

3. Communication-Reduction: A Critical View

Section opener — examines techniques in the supercomputing / cloud regime.

Structured sparsification paragraph (positives):

Sparsification preserves convergence even on non-convex objectives [5]; empirical neural-net experiments allow nodes to send <1% of local gradient without losing convergence [37].

Figure 1 paragraph (caveat at scale):

High sparsity (>99%) requires careful momentum / learning-rate hyperparameter tuning — error-prone and time-consuming.
Lower sparsity (5-10% nnz/node) is more stable, but summing across many nodes can cause the reduced vector to become dense — at which point communication is bottlenecked again.
Figure 1: 3D plot of result density (%) vs. number of nodes N (1-128) and per-node density k (0-30%) on ResNet20 / CIFAR-10 at training epoch 5. Result density rises sharply with both axes. (The same shape holds across training stages and models — ResNet, DenseNet.)

Quantization paragraph (opposite trade-off):

Quantization compression rate is independent of node count — it does not fill in.
However, quantization can yield only 4-8x compression before added variance begins to hurt accuracy [4,22].

4. Communication Reduction in SparCML

Section opener — neither sparsification nor quantization is ideal alone at high node count, so SparCML combines them.

Sparse Quantized Reduction paragraph:

High-level: each node maintains residual error eps^i locally; on each step the residual + new gradient produces an accumulator; the accumulator is truncated to obtain the value g_t^i sent to others; the new residual is updated.
The allreduce sums truncated values across nodes.
Because the summed result may become dense, SparCML may quantize that intermediate dense vector with stochastic QSGD quantization [4] to bound bandwidth.
Novel claim: even though sparsification and stochastic quantization were proposed independently in [2,18,37] and [4,44] respectively, this paper is the first to combine them and prove convergence of the combined method.

Convergence Proof paragraph:

Theorem 4.1 (proof in appendix, builds on the convergence proof of TopK SGD [5]; the novelty is adding stochastic quantization on top):

Theorem 4.1. Consider the SparCML SGD algorithm minimising a smooth, non-convex function f. There exists a learning-rate schedule (alpha_t) such that: min_{t in {1..T}} E[||grad f(x_t)||^2] -> 0 as T -> infinity.

Discussion paragraph (limits of the theorem):

The result is ergodic convergence to a stationary point of expected zero gradient — weaker than convergence to a global minimum, but matches the state-of-the-art for non-convex with quantization [36].
Like most theory results, it does not prescribe a precise hyperparameter set beyond "learning rates should be diminishing."

5. Supporting Sparsity in SparCML

5.1 Data Representation: Sparse Streams

Sparse stream definition paragraph:

The data type is called a sparse stream: it is designed for efficient computation and communication of sparse vectors.
Implemented in C++11.
Discussion focuses on summation as the binary op, but applies to other component-wise ops.

Vector representations paragraph:

Each H_i is a sparse subset of size k = max_i |H_i| << N (universe N).
Density d_i = |H_i|/N; d = max_i d_i = k/N.
Total nnz after reduction: K = |U_{i=1..P} H_i|. So k <= K <= min{N, P*k}.

Storage paragraph:

Stored as a sequence of (index, value) pairs in a contiguous array.
Index width yields isize bytes per non-zero. Single or double precision values supported; lower precision support discussed in Section 7.

Switching to a Dense Format paragraph:

Even if the input is sparse, the intermediate reduced vector may approach the universe size N — at which point the sparse representation is wasteful.
Density model: dense format transmits N * isize bytes; sparse format with nnz non-zeros transmits nnz*(c + isize) bytes, where c >= ceil(log_2(N)/8) is the index size in bytes. Sparse beats dense iff nnz <= delta = N*isize / (c + isize).
In practice delta is set even smaller (because summing sparse vectors is computationally more expensive than summing dense vectors).
Initial nnz is bounded by k, but as P grows, nnz can exceed delta. SparCML prepends a flag byte to each stream that marks dense/sparse — and pre- allocates N * isize bytes so the worst case is bounded.

Efficient Summation paragraph:

Summing two streams u_1, u_2 — distinguishes: (1) indices of u_1 and u_2 may overlap — must compute the size of |H_1 union H_2|; overlap is upper bounded by |H_1| + |H_2|; tightness depends on the unknown sparsity distribution. (2) indices are disjoint — concatenate.
Three sub-cases inside (1):
- Both sparse, possibly overlapping: pre-check if union exceeds delta; if yes, switch to dense.
- One sparse, one dense: iterate over the sparse stream and write into the dense buffer.
- Both dense: vectorized dense element-wise addition in either input buffer; no new stream allocated.

5.2 Efficient Collectives on Sparse Streams

Section opener paragraph:

Collective ops defined over the elements present at every node.
Targets: allgather and allreduce (per MPI spec [21]).
Supports arbitrary coordinate-wise associative reductions where a neutral element is defined (e.g., 0 for sum).

Analytical Model paragraph:

Bidirectional, direct point-to-point cost model: T(L) = alpha + beta * L (Latency-Bandwidth model).
alpha = latency, beta = transfer time per word.
For sparse messages: beta_s = transfer time per (index,value) pair, beta_d = transfer time per word, with beta_d < beta_s (sparse pairs cost more per unit because of the index overhead).

Goal paragraph:

Goal: at every node, compute the element-wise sum over the N dimensions of the allreduce vector while minimizing communication cost in the alpha-beta model.

Assumptions paragraph:

Each node initially has k elements, |H_i| = k; P is a power of 2, P > 4; N is divisible by P. Relaxations are discussed in supplementary material.

5.3 Communication Algorithms

Section opener paragraph:

Composes sparse streams into sparse collective communication algorithms.
Modifies two existing dense algorithms to handle sparse vectors. Each node does not need global information about the amount of nnz across nodes.
But all algorithms need a rough idea of K — the final size of the result.

SSAR vs DSAR paragraph:

Static sparse allreduce (SSAR): K stays below delta — never switches to dense.
Dynamic sparse allreduce (DSAR): K >= delta — starts sparse but switches to dense at some intermediate point.

Two extremes paragraph (uniform-nnz assumption):

Case (1): no overlap of nonzero indices; final x has kP non-zeros.
Case (2): all elements overlap; H_i = H_j for all i,j.
Case (1) could be done with a simple allgather (no computation). Case (2) is equivalent to a dense allreduce of size k.
General case lies between; communication time bounded above by log_2(P) alpha + (P-1) k beta_d [10]; bounded below by log_2(P) alpha + 2 ((P-1)/P) k beta_d [10] (latter only valid for negligible compute cost).

Lemma 5.1 paragraph:

Lemma 5.1. The time T for sparse allreduce is bounded by T >= log_2(P) alpha + (P-1) k beta_d if K = kP, and T >= log_2(P) alpha + 2 ((P-1)/P) k beta_d if K = k and computation for reduction is perfectly parallelized.

Switching paragraph:

Allreduce implementations switch between algorithms by message size and process count [48]; SparCML distinguishes small and large message cases.

5.3.1 The Small Data Case — SSAR_Recursive_double

When the overall reduced data is small, latency dominates the bandwidth term.
Adopts recursive doubling:
- Round 1: nodes at distance 1 exchange + perform local sparse-stream reduction.
- Round 2: nodes at distance 2 exchange the reduced data.
- Round t: nodes at distance 2^{t-1} exchange the previously reduced 2^{t-1} k items.
Figure 2 shows the structure on P = 8.
The same recursive-doubling pattern works for dense allreduce/allgather [29].

Latency analysis:

Latency: L_1(P) = log_2(P) alpha (latency-optimal, data-independent).
Bandwidth bounds:
- Lower: L_1(P) + log_2(P) k beta_s (when k items fully overlap each round)
- Upper: L_1(P) + (P-1) k beta_s (when no overlap, summing geometric series yields k(P-1) total items transmitted).

5.3.2 The Large Data Case — SSAR_Split_allgather

For large data, dense allreduce uses Rabenseifner's algorithm [42]: reduce-scatter (recursive halving) + recursive-doubling allgather; total T_ar_rab = 2 log_2(P) alpha + 2 ((P-1)/P) k beta_s — within 2 of bandwidth lower bound on the latency term.
SparCML's sparse version splits the problem in two phases:
- Split phase: uniformly partition N into P partitions; each node sends the indices of its sparse vector belonging to partition j directly to the owner of partition j. Each node then reduces what it received into the final result for its partition.
- Sparse Allgather phase: each node's intermediate of size k/P (since final K = k after non-overlap upper bound) is gathered to all nodes via the previously-defined recursive-doubling sparse allgather.
Bound on split phase: (P-1)alpha + 0beta_s <= T_split <= (P-1)alpha + kbeta_s.
Bound on sparse_ag (allgather): L_1(P) + ((P-1)/P) k beta_s <= T_sparse_ag <= L_1(P) + (P-1) k beta_s.
Combined: L_2(P) + ((P-1)/P) k beta_s <= T_ssar_split_ag <= L_2(P) + P k beta_s, where L_2(P) = (P-1)*alpha + L_1(P).
The latency term is data-independent.

5.3.3 The Dynamic Case — Switching to Dense

Analysis paragraph:

DSAR — assume final K > delta, so a sparse representation is wasteful. Specifically, K = kappa*N where kappa = delta/N (so sparse format is no longer efficient: about delta items in N entries).
Bandwidth savings vs. dense are limited to a constant relative to dense.

Lemma 5.2 paragraph:

Lemma 5.2. Any algorithm solving the DSAR problem needs at least log_2(P) alpha + delta beta_d, where the lower bound on the bandwidth required is at least (1/2) kappa of any bandwidth-optimal fully-dense allreduce algorithm, with kappa = delta/N.

Proof sketch: optimal latency lower bound from [10]. For full dense allreduce with k = N, the bandwidth lower bound is 2 ((P-1)/P) N beta_d. By the same analysis the DSAR problem has minimum bandwidth term delta*beta_d, which yields the (1/2) kappa factor lower bound.

Algorithm paragraph (DSAR_Split_allgather):

Same first phase as SSAR_Split_allgather (sparse split): each node receives data in sparse format from all others into its partition.
Each node then switches the representation to dense and performs a dense allgather (reusing highly-optimized MPI dense allgather).
Combined bound: L_2(P) + k beta_s + ((P-1)/P) N beta_d <= T_dsar_split_ag <= L_2(P) + k beta_s + ((P-1)/P) N beta_d.
Achievable speedup of sparse allreduce vs. fully dense (when end result cannot be stored sparsely) is at most (2/kappa)x. With kappa = 0.5 this yields a max speedup of 4x compared to dense — pure sparsity exploitation alone cannot do better; further compression (e.g. quantization) is needed to exceed this.

6. Supporting Low-Precision Communication

Bandwidth cost in the dynamic case is bounded by a constant fraction of dense; experimental evidence (Section 8) confirms this scenario is common in large-scale deployments.
To shave further bandwidth, SparCML supports lower-precision representations — 2, 4, 8 bits per entry — using stochastic quantization following the QSGD scheme [4]. Convergence is provably preserved.
Sketch implementation: each dense stream is split into buckets of size B (order of 1024 consecutive entries); each bucket is quantized independently and stochastically; each bucket carries a full-precision scaling factor applied to all entries.
Low-precision is applied only on the dense second phase of DSAR_Split_allgather (the moment data becomes dense). The bandwidth cost of that phase is reduced by the constant quantization factor.

7. Artifact and Additional Features

Interface and Code paragraph:

SparCML library interface mirrors standard MPI calls with the caveat that data is assumed to be a sparse stream.
Changes needed to existing MPI-enabled code to add SparCML are minor.
Implementation: ~2,000 lines of native C++11 (excluding benchmark/test infrastructure). Adding SparCML to CNTK required ~100 lines.

Non-Blocking Operations paragraph:

Algorithms also implemented in non-blocking form (per MPI-3 [27,28]).
A thread can trigger a collective and continue local computation while the collective progresses in the background [26].

MPI-OPT paragraph:

MPI-OPT is the authors' from-scratch framework for distributed optimization algorithms (SGD, SCD).
Native C++11; can link external libraries such as SparCML and MPI for communication.
Implements parallel stochastic optimization (gradient & coordinate descent) on multiple nodes via MPI-IO data ingest.
Supports parallel data partition, multi-thread per-node compute, parametrized learning-rate schedules; can use SparCML as the comm layer for sparse, dense, synchronous, and asynchronous aggregation.

Microsoft Cognitive Toolkit (CNTK) paragraph:

Modify CNTK [57] v2.0 to use SparCML as its communication layer.
CNTK is a directed-computation-graph deep learning platform; supports popular NN architectures.
Trains via SGD with autodiff. Default communication is MPI-based; SparCML swaps in.

8. Experiments

Setup paragraph:

Two scenarios: supercomputing and cloud computing.
Supercomputing: CSCS Piz Daint, Cray XC50 nodes, each with a 12-core HT-enabled Intel Xeon E5-2690 v3, 4 GB GPU RAM, NVIDIA Tesla P100 16 GB GPU. Most powerful supercomputer in Europe at the time; high-performance Aries interconnect, Dragonfly topology.
Cloud computing: multiple nodes with relatively older NVIDIA K80 GPUs connected via Gigabit Ethernet to simulate a standard cloud deployment; no background traffic.
Additional cluster called Greina (CX50 nodes) with InfiniBand FDR or Gigabit Ethernet interconnect, plus a production-grade GPU cluster.
Baseline: MPI allreduce on fully dense vectors. Generally Open MPI default; on Piz Daint, custom Cray-MPICH (compared against). Index storage is a 32- bit unsigned int since N > 65 K.

8.1 Micro-Benchmarks

Setup paragraph:

Synthetic data on Piz Daint and Greina (GigE).
Vary N (data dimension) and d (per-node density). k = d*N indices selected uniformly at random per node, assigned random value.
Run sparse allreduce variants to validate correctness and analytical ordering. Five experiments with newly generated data; each run 10 times; 50 runtime values yield 25/75-percentile quantiles. Reported on log-log scale.

Figure 3 results paragraph:

Two plots:
- Piz Daint: Reduction time vs. number of nodes (2 -> 128) at N = 16M, d = 0.781%. Algorithms: MPI_AllReduce, SSAR_Recursive_double, SSAR_Split_allgather, DSAR_Split_allgather, Dense AllReduce Ring.
- Greina (GigE): Reduction time vs. data density (0.195% -> 6.250%) at N = 16M, P = 8.
Analytical predictions confirmed:
- SSAR_Recursive_double wins for small data (latency dominates).
- At higher P or larger nnz, SSAR_Split_allgather dominates SSAR_Recursive_ double.
- DSAR_Split_allgather dominates SSAR_Split_allgather as nnz grows large relative to N.
- Ring-based MPI dense allreduce is faster than any sparse variant on a small / fast network at low node count, but sparse variants win at high node count even at low density.
- DSAR_Split_allgather provides only constant-factor improvement at very large node counts (matches Lemma 5.2).
The slower the network, the larger the relative SparCML benefit (Greina GigE shows greater separation than Piz Daint).

8.2 Large-Scale Classification

Goal paragraph:

Linear classifiers (Logistic Regression, SVM) on large-scale datasets via SGD and stochastic coordinate descent (SCD); examine speedup just by exploiting data sparsity — no quantization or sparsification of the gradient updates.

Datasets — Table 1:

Name	# Classes	# Samples	Dimension
URL [38]	2	2,396,130	3,231,961
Webspam [51]	2	350,000	16,609,143
CIFAR-10 [34]	10	60,000	32 x 32 x 3
ImageNet-1K [43]	1000	1.3M	224 x 224 x 3
ATIS [24]	128	4,978 sentences / 56,590 words	—
Hansards [41]	—	948 K sentences / 15,657 K words	—

Sparsity-driven SGD paragraph:

For SGD, samples are trigrams — extremely sparse text-based data. Loss function is linear in features, so resulting gradients are sparse.
Communication is lossless, so convergence is preserved; only communication time changes.
Large batches per node (1,000 * P) used.

Table 2 — Distributed optimization with MPI-OPT (full-dataset pass times, communication time in brackets):

System	Dataset	Model	# nodes	Baseline (s)	Algorithm	Algo Time (s)	Speedup
Piz Daint	Webspam	LR	32	24.0 (21.6)	SSAR_Recursive_double	6.8 (2.6)	3.53x (6.17x)
		SVM	32	16.2 (14.2)		6.5 (4.4)	2.49x (3.23x)
Piz Daint	URL	LR	32	26.4 (25.8)	SSAR_Recursive_double	7.5 (7.0)	3.52x (3.69x)
		SVM	32	19.8 (19.3)		5.6 (5.3)	3.54x (3.64x)
Piz Daint	Webspam	LR	8	46.7 (37.9)	SSAR_Split_allgather	25.6 (15.8)	1.82x (2.40x)
	URL	LR	8	37.7 (35.3)		20.9 (15.0)	1.80x (2.35x)
Greina (IB)	Webspam	LR	8	65.2 (46.7)	SSAR_Split_allgather	36.3 (19.0)	1.80x (2.46x)
	URL	LR	8	81.4 (44.7)		61.1 (24.9)	1.33x (1.80x)
Greina (GigE)	Webspam	LR	8	768.0 (759.5)	SSAR_Split_allgather	37.9 (29.5)	20.26x (25.75x)
	URL	LR	8	1045.0 (1004.6)		80.26 (42.2)	12.65x (23.81x)

SCD paragraph:

Run MPI-OPT's SCD implementation following the distributed random block CD algorithm of [53].
8 nodes of Piz Daint, logistic regression on URL, 100 coordinates per iteration. Each iteration requires sparse allgather of those coords (compared to dense allgather baseline).
MPI-OPT with dense allgather: 49 s/epoch, 24 s in communication.
MPI-OPT with sparse allgather: 26 s/epoch, 4.5 s in communication.
Overall speedup factor 1.8x, due to 5.3x speedup in communication time.

Apache Spark comparison paragraph:

Spark v1.6 is officially supported by CSCS [8]. Spark uses its own communication layer that does not exploit sparsity.
8 nodes Piz Daint, URL dataset: MPI-OPT vs. Spark — 63x speedup to convergence, due to 185x communication-time speedup: average epoch reduced from 378 s (319 s in comm) to 6 s (1.7 s in comm).
MPI-OPT with standard Cray-optimized dense allreduce already achieves 31x speedup to convergence vs. Spark, due to 43x speedup in communication time. So Spark itself is the bigger handicap; SparCML doubles the win on top.
Per-epoch: 13 s on average, 8.6 s in communication.
8-node cluster with GigE: Spark = 1,274 s/epoch (1042 s comm). MPI-OPT = 14 s/epoch (12 s comm). 86x epoch / 12x comm vs. Spark dense.

Discussion paragraph:

Spark comparison should be taken with a grain of salt because Spark implements non-trivial features such as fault tolerance.
Conclusion: sparse communication delivers significant savings whenever sparsity is naturally present in the data.

8.3 Training Deep Neural Networks

Setup paragraph:

Distributed CNTK on academic datasets.
Implement Top-k SGD [2,18,46] with low-precision quantization (Algorithm 1).
Three task types: image classification on CIFAR-10, NLU on ATIS, machine translation on Hansards.
For CIFAR-10: ResNet-110 [23]. For NLU/MT: encoder-decoder with two LSTM cells [25].
Default single-GPU 32-bit hyperparameters from open-source CNTK 2.0 repo [39]. Sparsification + quantization via optimized GPU kernels.
Communication is layer-wise (non-blocking calls) — overhead on overall computation is < 1%.

Bandwidth reduction example paragraph:

LSTM on ATIS: ~20 M parameters = ~80 MB per minibatch in full precision.
Compressed gradient received at every node in SparCML totals less than 0.5 MB.

Accuracy & Speedup paragraph (CIFAR-10):

For CIFAR-10, end accuracy matches the full-precision baseline when selecting k = 16 of every 512 elements (= ~3% density), and for k = 8/512 the accuracy is 1% above the 32-bit variant (Figure 4a).
For ATIS / Hansards: training and test losses + BLEU within 1% of full-precision baselines; ATIS shown in Figure 4b. Quantization parameters for ATIS: k = 2, k = 4 entries per bucket of 512 (~0.4% / ~0.8% density).

End-to-end training time paragraph:

CIFAR-10: speedup 1.12x to full convergence vs. full-precision baseline, on 8 nodes.
ATIS (20 epochs): 5.99x speedup.
Hansards (5,200 iterations): 1.5x speedup.
Variance explained by communication-vs-compute ratio: CIFAR-10 is compute-dominated; ATIS is communication-dominated (so sparsity has much larger impact).

8.4 Large Workload Experiments

ImageNet ResNet-50 paragraph:

64 P100 GPU nodes on CSCS Piz Daint; ResNet-50 on ImageNet-1K (~25 M parameters across 50 layers).
Negative result: Cray MPI baseline = ~2,071 s/epoch; SparCML at ~99% sparsity is only ~6% faster (~1,950 s/epoch).
Three reasons: (1) gradients become dense during aggregation at this parameter setting (limiting speedup); higher sparsity hurts model convergence even with momentum correction / warmup [37]; (2) overhead of sparsification + densification is non-negligible relative to the small per-layer transmission cost in ResNet-50; (3) Cray's proprietary MPI implementation gets no extra parameter tuning here.
Implication: sparsification is not a universal solution at very large node counts when the per-node compressed gradient becomes dense post-sum.

Wide ResNet (4xResNet18, 4xResNet34) on ImageNet-1K paragraph:

Wide residual nets [58] have wider channels per block; achieve similar accuracy with shallower depth, less sensitive to hyperparameters.
Train with TopK SGD k = 1/512, ~0.2% nnz per node; global batch size 512.
Convergence: final accuracy of sparse variants <0.9% lower in Top-1 and <0.5% lower in Top-5 vs. dense baseline.
Speedup: ~2x for 4xResNet18, ~1.85x for 4xResNet34 vs. Cray MPI baseline. <0.4% Top-5 accuracy loss vs. fully-dense baseline.
Most of the speedup comes from the reduced aggregation time of the last fully-connected layer (>2M parameters).

Automated Speech Recognition (ASR) paragraph:

Production LSTM model with attention powering a popular Microsoft personal digital assistant. ~60M params, 2.4M in attention layer.
Dataset: ~30,000 hours (3.5 years) of annotated speech.
Cluster: 32 servers, each with 4 NVIDIA V100 GPUs = 128 V100 GPUs; InfiniBand interconnect; intra-node aggregation via NCCL on NVLink.
Baseline: 4-node, 16-GPU block-momentum SGD (BMUF) [11] without sparsity or quantization. Higher node counts for the full-precision variant led to negative scalability and divergence.
BMUF baseline already does non-trivial communication reduction since it syncs less frequently than minibatch SGD (which is infeasible at this size).
16-GPU BMUF baseline: ~14 days to train.
SparCML setup: Top-k SGD with groups of 512 consecutive coordinates, 4 largest sent per group; fixed global batch size 512 (strong scaling).
Six full passes over the dataset; record time-to-final-accuracy.
Figure 6a: error-vs-time. Sparse implementation reaches similar accuracy to full-precision baseline in a fraction of the time. At 32 nodes (128 GPUs), training reduces to <1.78 days.
Figure 6b: scalability — sub-linear but strong from 16 -> 128 GPUs.
Word-error-rate (WER) on validation: trained models <1% higher WER than full-precision (and can be up to 1% lower in some splits — sometimes better due to noise regularization).

Hyperparameter Tuning paragraph:

Authors enforce no hyperparameter retuning under sparsity; in most cases recovered accuracy at standard hyperparameters.
Two notable exceptions: (i) ResNet50 — high sparsity + large batch induced significant accuracy loss; (ii) ASR — required maintaining a small global batch size to preserve convergence.
Suggests non-trivial interaction between sparse gradients, batch size, and convergence — left to future work.

Reduced communication techniques paragraph:

Seide et al. [44]: 1-bit SGD — first to propose quantization for reducing bandwidth/latency cost of training deep networks.
Alistarh et al. [4]: QSGD — theoretically justified quantization with a controllable compression-vs-convergence trade-off. SparCML implements QSGD as default quantization.
Dryden et al. [18], Aji & Heafield [2]: alternative communication-reduction by sparsifying the gradient updates with top-k. Subsequent work [37,46] shows extreme sparsity (<0.1%) can be supported by CNNs and RNNs while preserving accuracy with careful tuning.
SparCML's contribution complements this: combines stochastic quantization and sparsification with a convergence proof; provides efficient sparsity- and-quantization-aware system support.

Lossless methods paragraph:

Factorization is a lossless compression effective for fully-connected layers but less so for conv layers [14,54].
Executing extremely large batches [3,20,55,56] is another lossless method; SparCML's compression methods are orthogonal — they aim to reduce bandwidth at fixed batch size; sparsification can be applied with additional tuning at fixed batch size.
However, with very large node count, batches grow large and aggregated gradients become dense — sparsity cannot scale linearly with node count. This regime is left for future work; SparCML already implements optimizations such as merging adjacent gradients (tensor fusion) and non-blocking ops [55].

Communication frameworks paragraph:

NVIDIA's NCCL [40] reduces communication cost when nodes are NVIDIA GPUs with proprietary NVLINK — not the case in supercomputing settings. NCCL also implements only a very restricted set of reduction operations.
Custom frameworks for specific workloads exist — Livermore Big Artificial Neural Network Toolkit (LBANN [50]), S-Caffe [6] — efficient in specific instances but do not leverage reduced-communication techniques or sparsity.

Sparse reduction paragraph:

Hofmann & Rünger [30] propose a runlength encoding approach for sparse reductions — SparCML extends this with the observation that data may become dense during reduction, and that an efficient adaptive representation is needed.
Träff [49] proposes a general approach for sparsity in MPI by ignoring neutral elements in MPI reductions; SparCML's sparse allreduce can be seen as a special case, but it specifies the reduction algorithms and provides performance bounds for small and large message scenarios.
Kylix [60] considers sparse many-to-many reductions for graph data on community clusters — but assumes knowledge of the data distribution and performs multiple passes; not applicable to SparCML's setting.
Dryden et al. [18]: a sparse allreduce variant via pairwise reduce-scatter
- ring allgather. Keeps amount of data constant by re-merging top-k after receiving values; the ability to preserve a local residual is specific to Top-k SGD. SparCML's framework is more general; performance is similar to SSAR_Split_allgather.

10. Conclusions and Further Work

Described and analyzed SparCML: a high-performance communication framework for sparse and low-precision communication for ML.
Integrates with existing computational frameworks; provides order-of- magnitude speedups on real applications.
Future work: (1) other distributed ML applications that benefit from sparsity, (2) interaction between sparsity and other parallelization strategies (e.g., large-batch training).
Belief: simple but effective sparsity schemes can play a significant role in reducing comm cost in future ML systems.

Acknowledgments / Funding

ERC Horizon 2020 grant 805223; Swiss National Supercomputing Centre Small Development Project (code d94); ERC Horizon 2020 DAPP grant 678880. CSCS support team (Tal Ben-Nun, Salvatore Di Girolamo) thanked.

Cross-Cutting Empirical Take-Aways

Take-away	Derived from
Sparsification + sum across many nodes naturally densifies the result	Figure 1 (Sec. 3)
Recursive doubling wins for small data; split-allgather wins for large data	Sec. 5.3, Figure 3
Pure sparsity can yield at most (2/kappa)x = 4x speedup vs. dense allreduce	Lemma 5.2
Combined sparsification + stochastic quantization preserves convergence	Theorem 4.1
Sparse comm benefit is bigger on slower networks	Greina GigE vs. Piz Daint (Fig. 3, Table 2)
ASR LSTM scaled to 128 GPUs in 1.78 days vs. 14 days BMUF baseline	Sec. 8.4
ResNet-50 doesn't benefit from extreme sparsity at scale (densifies quickly)	Sec. 8.4
Wide ResNets (4xResNet18/34) yield ~2x speedup at 64 GPUs with <0.5% Top-5 loss	Sec. 8.4
Linear classifier on URL: 31x to 63x speedup vs. Spark, depending on stack	Sec. 8.2
GigE small-cluster speedups exceed IB at the same node count (slower link, more comm bound)	Table 2

Limitations of the Paper

Theory only proves ergodic convergence to a stationary point; not to a global minimum. Aligned with state-of-the-art for non-convex training but weaker than per-iteration guarantees.
Hyperparameter retuning is required for some workloads (ResNet-50 + large batch; ASR with global batch constraint).
Maximum speedup of pure sparse allreduce vs. dense is bounded by 4x (Lemma 5.2); larger gains require quantization on top.
Sparsity does not scale linearly with node count: at very high P, summed gradients become dense, eroding savings (ResNet-50 on 64 GPUs). Limits applicability to extreme-scale CNN training.
Architecture / network specific: gains amplified on slower networks (GigE) and reduced on highly optimized supercomputing fabrics (Cray Aries).
NCCL-based intra-node speedups not separately measured; SparCML targets inter-node MPI-style fabric; relationship to NCCL collectives only briefly discussed (NCCL has restricted reduction op set; SparCML is more general).

Notable Relationship to NCCL

The paper explicitly notes NCCL [40] reduces communication cost on NVIDIA GPUs over NVLink, but is restricted to a small set of reduction operations and to NVLink-equipped clusters — not the supercomputing scenario.
SparCML aims for the orthogonal axis: a general MPI-style sparse collective layer that runs over arbitrary HPC fabrics (Cray Aries, InfiniBand, GigE).
The ASR experiment uses NCCL for intra-node (NVLink) aggregation and SparCML for inter-node sparse aggregation — i.e. the two layers compose hierarchically, with NCCL handling fast-path within a server and SparCML handling the cross-server sparse allreduce.

SparCML: High-Performance Sparse Communication for Machine Learning — Detailed Summary

Abstract

1. Introduction and Motivation

2. Preliminaries

2.1 Data Parallelism and Communication Costs

2.2 Communication-Reduction Techniques

3. Communication-Reduction: A Critical View

4. Communication Reduction in SparCML

5. Supporting Sparsity in SparCML

5.1 Data Representation: Sparse Streams

5.2 Efficient Collectives on Sparse Streams

5.3 Communication Algorithms

5.3.1 The Small Data Case — SSAR_Recursive_double

5.3.2 The Large Data Case — SSAR_Split_allgather

5.3.3 The Dynamic Case — Switching to Dense

6. Supporting Low-Precision Communication

7. Artifact and Additional Features

8. Experiments

8.1 Micro-Benchmarks

8.2 Large-Scale Classification

8.3 Training Deep Neural Networks

8.4 Large Workload Experiments

9. Related Work

10. Conclusions and Further Work

Acknowledgments / Funding

Cross-Cutting Empirical Take-Aways

Limitations of the Paper

Notable Relationship to NCCL