SparCML: High-Performance Sparse Communication for Machine Learning

Cedric Renggli (ETH Zurich), Saleh Ashkboos (IST Austria), Mehdi Aghagolzadeh (Microsoft), Dan Alistarh (IST Austria), Torsten Hoefler (ETH Zurich) | SC '19, Denver, CO, November 17-22, 2019 | DOI: 10.1145/3295500.3356222 | Code: gitlab.com/renggli/sparcml

Problem

Modern data-parallel ML training synchronizes gradients via a global allreduce once per minibatch step. The size of that reduction equals the model size and, crucially, does not shrink as more workers are added — so allreduce becomes the dominant scaling bottleneck for large models (e.g., 340M-parameter BERT-Large = 11 GB at 32-bit precision). Hundreds of megabytes must be summed globally every few microseconds at high node counts. Standard MPI [19] and ML-specific communication libraries (NCCL [40]) provide only dense collective primitives, even though gradient values in many ML workloads are naturally or artificially sparse — implementing efficient sparse-input collectives is non-trivial because the reduced-result size is unknown a priori and the sparsity pattern of summed vectors changes during the reduction itself.

Core Insight

Sparse and low-precision communication for ML can be made first-class via two adaptive sparse-allreduce algorithms (SSAR_Recursive_double for small data, SSAR/DSAR_Split_allgather for large data) plus a sparse stream data type that switches between sparse and dense representations on the fly when the intermediate reduced vector exceeds a density threshold delta. Combining this with stochastic quantization (QSGD) gives a method that provably preserves convergence and achieves order-of-magnitude speedups vs. dense allreduce on real ML workloads.

Method

SparCML extends MPI with sparse-input allgather and allreduce on arbitrary sparsity patterns. It plugs into CNTK and into the authors' MPI-OPT optimization framework as a drop-in communication layer.

+----------------------------------------------------------+
|  SparCML (~2,000 lines C++11; ~100-line CNTK patch)       |
+----------------------------------------------------------+
|  Sparse Stream Data Type                                  |
|    (index,value) pairs; delta = N*isize/(c+isize) switch  |
|    flag byte marks dense/sparse; pre-alloc N*isize bytes  |
+----------------------------------------------------------+
|  Sparse Collective Algorithms                             |
|    SSAR_Recursive_double  - small data, latency-optimal   |
|    SSAR_Split_allgather   - large data, sparse 2-phase    |
|    DSAR_Split_allgather   - large data, sparse->dense     |
+----------------------------------------------------------+
|  Low-Precision Layer                                      |
|    QSGD stochastic quantization, 2/4/8 bits per entry,    |
|    bucket size B ~= 1024 with full-precision scaling      |
|    factor; applied on the dense second phase              |
+----------------------------------------------------------+
|  Non-Blocking (MPI-3) interface for comp/comm overlap     |
+----------------------------------------------------------+

Algorithm 1 (paper) defines SparCML Quantized TopK SGD with error feedback: each node accumulates the residual of components that did not get sent, picks the top-k entries of eps + alpha*grad, optionally quantizes them via Q, and then calls allreduce(SUM) on the sparse result. Theorem 4.1 proves ergodic convergence to a stationary point of the smooth non-convex objective under standard assumptions — the first such proof for the combined sparsification+stochastic-quantization regime. Lemma 5.2 establishes a theoretical ceiling: pure sparsity exploitation alone cannot beat dense allreduce by more than (2/kappa)x, where kappa = delta/N; with kappa=0.5 the maximum speedup is 4x, motivating the addition of low-precision support.

Experimental Setup

Component	Value
Supercomputing testbed	CSCS Piz Daint (Cray XC50, Aries Dragonfly fabric)
Per-node hardware	12-core HT Intel Xeon E5-2690 v3, 4 GB RAM, NVIDIA Tesla P100 16 GB
Cloud testbed	NVIDIA K80 nodes over Gigabit Ethernet (no background traffic)
Additional cluster	Greina (CX50, InfiniBand FDR or Gigabit Ethernet)
Production cluster (ASR)	32 servers x 4 NVIDIA V100 = 128 GPUs, InfiniBand inter-node, NCCL/NVLink intra-node
Index storage	32-bit unsigned int (N > 65 K)
Frameworks	CNTK 2.0 + SparCML; MPI-OPT (authors' C++11)
Baselines	MPI allreduce on dense vectors (Open MPI / Cray-MPICH); Apache Spark v1.6; BMUF
Workloads	URL, Webspam (linear LR/SVM); CIFAR-10 (ResNet-110); ATIS / Hansards (LSTM); ImageNet-1K (ResNet-50, 4xResNet18/34); production ASR LSTM (60M params, 30,000 hours speech)

Microbenchmarks vary N (data dimension) and d (per-node density) on synthetic sparse vectors; 5 fresh datasets x 10 runs each yield 25/75-percentile bands.

Headline Quantitative Results

Network microbenchmarks (Figure 3): the algorithmic ordering predicted by the alpha-beta cost model holds — SSAR_Recursive_double dominates at small data; SSAR_Split_allgather dominates as nnz grows; DSAR_Split_allgather dominates once the result densifies; ring-based dense MPI wins only at low P on a fast network. The slower the link, the larger SparCML's relative win.

Linear classifiers via MPI-OPT (Table 2 — full-pass time, comm time in brackets):

System	Dataset	Model	P	Baseline	SparCML	Speedup
Piz Daint	Webspam	LR	32	24.0 (21.6) s	6.8 (2.6) s	3.53x (6.17x)
Piz Daint	Webspam	SVM	32	16.2 (14.2) s	6.5 (4.4) s	2.49x (3.23x)
Piz Daint	URL	LR	32	26.4 (25.8) s	7.5 (7.0) s	3.52x (3.69x)
Piz Daint	URL	SVM	32	19.8 (19.3) s	5.6 (5.3) s	3.54x (3.64x)
Greina (IB)	Webspam	LR	8	65.2 (46.7) s	36.3 (19.0) s	1.80x (2.46x)
Greina (GigE)	Webspam	LR	8	768.0 (759.5) s	37.9 (29.5) s	20.26x (25.75x)
Greina (GigE)	URL	LR	8	1045.0 (1004.6) s	80.26 (42.2) s	12.65x (23.81x)

SCD on URL (8 nodes Piz Daint): dense allgather = 49 s/epoch (24 s comm); sparse allgather = 26 s/epoch (4.5 s comm). 1.8x end-to-end / 5.3x in communication.

Apache Spark comparison on URL:

Piz Daint, 8 nodes: MPI-OPT + SparCML vs. Spark = 63x to convergence (185x in communication time): epoch 378 s -> 6 s.
8-node GigE: 86x epoch / 12x comm vs. Spark dense.
Note: MPI-OPT with Cray-optimized dense allreduce is already 31x faster than Spark (43x in comm) — so SparCML doubles a baseline that itself trounces Spark.

CNTK + SparCML deep nets (academic):

ATIS NLU LSTM (20 epochs): 5.99x end-to-end speedup. ~80 MB minibatch gradient compresses to <0.5 MB on the wire.
Hansards translation (5,200 iter): 1.5x.
CIFAR-10 ResNet-110: 1.12x (compute-dominated workload).
CIFAR-10 accuracy at k=16/512 matches full precision; at k=8/512 accuracy is 1% above the 32-bit baseline.

Large workloads:

ImageNet wide ResNets, 64 P100 GPUs: ~2x for 4xResNet18, ~1.85x for 4xResNet34 vs. Cray MPI baseline; <0.4% Top-5 accuracy loss; k=1/512 (~0.2% nnz/node); most savings from the >2M-parameter final FC layer.
ImageNet ResNet-50, 64 P100 GPUs: negative result — only ~6% faster (1,950 s/epoch vs. 2,071 s baseline) because gradients densify post-sum and per-layer ResNet-50 transmissions are too small to amortize sparsify /densify overhead.
Production ASR LSTM, 128 V100 GPUs (32 servers, IB + NVLink/NCCL intra- node): training reduced from 14 days (16-GPU BMUF baseline) to <1.78 days at 128 GPUs, ~10x end-to-end. WER stays within 1% of full precision (sometimes slightly better via implicit regularization).

Limitations

Theorem 4.1 proves only ergodic convergence to a stationary point — not per-iteration convergence and not to a global minimum.
Lemma 5.2 caps pure sparse-allreduce speedup at (2/kappa)x = 4x for kappa = 0.5 vs. dense; further wins require stacking quantization.
Sparsity does not scale linearly with node count: at high P, summed gradients densify, eroding savings — demonstrated by the negative ResNet-50 result at 64 GPUs.
Hyperparameter retuning is required for two of the largest workloads: ResNet-50 + large batch (high sparsity hurts convergence) and ASR (must hold global batch size at 512 to preserve convergence).
Wins are network-dependent — large on slow GigE, modest on highly optimized supercomputing fabrics like Cray Aries.
Relationship to NCCL is only briefly discussed; NCCL is used only intra- node in the ASR experiment, and the paper does not measure SparCML speedups against NCCL allreduce on a pure NVLink configuration.

Open Problems Called Out

Interaction between sparse gradients, batch size, and convergence — especially in regimes (ResNet-50 large batch; ASR global batch) where high sparsity degrades convergence.
How sparsity composes with other parallelization strategies — large-batch training, model parallelism — at extreme node counts where summed gradients densify regardless of per-node sparsity.
Application of sparse collectives beyond data-parallel SGD (e.g., pipelined or graph-data workloads).
Whether sparsity can be made adaptive at run time per layer / per iteration, instead of using a single fixed k for the whole network.

Note on NCCL Tuning

The paper notes that NCCL implements only a very restricted set of reduction operations and is most useful inside an NVLink-connected box, while SparCML covers the inter-node MPI fabric where reductions span heterogeneous links; in the production ASR experiment the two layers compose hierarchically (NCCL intra-node, SparCML inter-node). Lemma 5.2's (2/kappa)x ceiling is a useful prior for any tuner working on collectives whose payload size depends on content (e.g., post-sparsification gradients): once kappa exceeds ~0.5 the optimal choice is the dense path, so a configuration tuner should switch algorithm/protocol to the dense-optimized regime rather than continue exploring sparse representations. Empirically the paper also shows that the optimal sparse algorithm flips between recursive-doubling (small data, latency-bound) and split-allgather (large data, bandwidth-bound) — the same small-vs-large message switchover that drives Tree/Ring and LL/Simple selection in NCCL.