SparCML: High-Performance Sparse Communication for Machine Learning

Cedric Renggli (ETH Zurich), Saleh Ashkboos (IST Austria), Mehdi Aghagolzadeh (Microsoft), Dan Alistarh (IST Austria), Torsten Hoefler (ETH Zurich) | SC '19, Denver, CO, November 17-22, 2019 | DOI: 10.1145/3295500.3356222 | Code: gitlab.com/renggli/sparcml


Problem

Modern data-parallel ML training synchronizes gradients via a global allreduce once per minibatch step. The size of that reduction equals the model size and, crucially, does not shrink as more workers are added — so allreduce becomes the dominant scaling bottleneck for large models (e.g., 340M-parameter BERT-Large = 11 GB at 32-bit precision). Hundreds of megabytes must be summed globally every few microseconds at high node counts. Standard MPI [19] and ML-specific communication libraries (NCCL [40]) provide only dense collective primitives, even though gradient values in many ML workloads are naturally or artificially sparse — implementing efficient sparse-input collectives is non-trivial because the reduced-result size is unknown a priori and the sparsity pattern of summed vectors changes during the reduction itself.


Core Insight

Sparse and low-precision communication for ML can be made first-class via two adaptive sparse-allreduce algorithms (SSAR_Recursive_double for small data, SSAR/DSAR_Split_allgather for large data) plus a sparse stream data type that switches between sparse and dense representations on the fly when the intermediate reduced vector exceeds a density threshold delta. Combining this with stochastic quantization (QSGD) gives a method that provably preserves convergence and achieves order-of-magnitude speedups vs. dense allreduce on real ML workloads.


Method

SparCML extends MPI with sparse-input allgather and allreduce on arbitrary sparsity patterns. It plugs into CNTK and into the authors' MPI-OPT optimization framework as a drop-in communication layer.

+----------------------------------------------------------+
|  SparCML (~2,000 lines C++11; ~100-line CNTK patch)       |
+----------------------------------------------------------+
|  Sparse Stream Data Type                                  |
|    (index,value) pairs; delta = N*isize/(c+isize) switch  |
|    flag byte marks dense/sparse; pre-alloc N*isize bytes  |
+----------------------------------------------------------+
|  Sparse Collective Algorithms                             |
|    SSAR_Recursive_double  - small data, latency-optimal   |
|    SSAR_Split_allgather   - large data, sparse 2-phase    |
|    DSAR_Split_allgather   - large data, sparse->dense     |
+----------------------------------------------------------+
|  Low-Precision Layer                                      |
|    QSGD stochastic quantization, 2/4/8 bits per entry,    |
|    bucket size B ~= 1024 with full-precision scaling      |
|    factor; applied on the dense second phase              |
+----------------------------------------------------------+
|  Non-Blocking (MPI-3) interface for comp/comm overlap     |
+----------------------------------------------------------+

Algorithm 1 (paper) defines SparCML Quantized TopK SGD with error feedback: each node accumulates the residual of components that did not get sent, picks the top-k entries of eps + alpha*grad, optionally quantizes them via Q, and then calls allreduce(SUM) on the sparse result. Theorem 4.1 proves ergodic convergence to a stationary point of the smooth non-convex objective under standard assumptions — the first such proof for the combined sparsification+stochastic-quantization regime. Lemma 5.2 establishes a theoretical ceiling: pure sparsity exploitation alone cannot beat dense allreduce by more than (2/kappa)x, where kappa = delta/N; with kappa=0.5 the maximum speedup is 4x, motivating the addition of low-precision support.


Experimental Setup

Component Value
Supercomputing testbed CSCS Piz Daint (Cray XC50, Aries Dragonfly fabric)
Per-node hardware 12-core HT Intel Xeon E5-2690 v3, 4 GB RAM, NVIDIA Tesla P100 16 GB
Cloud testbed NVIDIA K80 nodes over Gigabit Ethernet (no background traffic)
Additional cluster Greina (CX50, InfiniBand FDR or Gigabit Ethernet)
Production cluster (ASR) 32 servers x 4 NVIDIA V100 = 128 GPUs, InfiniBand inter-node, NCCL/NVLink intra-node
Index storage 32-bit unsigned int (N > 65 K)
Frameworks CNTK 2.0 + SparCML; MPI-OPT (authors' C++11)
Baselines MPI allreduce on dense vectors (Open MPI / Cray-MPICH); Apache Spark v1.6; BMUF
Workloads URL, Webspam (linear LR/SVM); CIFAR-10 (ResNet-110); ATIS / Hansards (LSTM); ImageNet-1K (ResNet-50, 4xResNet18/34); production ASR LSTM (60M params, 30,000 hours speech)

Microbenchmarks vary N (data dimension) and d (per-node density) on synthetic sparse vectors; 5 fresh datasets x 10 runs each yield 25/75-percentile bands.


Headline Quantitative Results

Network microbenchmarks (Figure 3): the algorithmic ordering predicted by the alpha-beta cost model holds — SSAR_Recursive_double dominates at small data; SSAR_Split_allgather dominates as nnz grows; DSAR_Split_allgather dominates once the result densifies; ring-based dense MPI wins only at low P on a fast network. The slower the link, the larger SparCML's relative win.

Linear classifiers via MPI-OPT (Table 2 — full-pass time, comm time in brackets):

System Dataset Model P Baseline SparCML Speedup
Piz Daint Webspam LR 32 24.0 (21.6) s 6.8 (2.6) s 3.53x (6.17x)
Piz Daint Webspam SVM 32 16.2 (14.2) s 6.5 (4.4) s 2.49x (3.23x)
Piz Daint URL LR 32 26.4 (25.8) s 7.5 (7.0) s 3.52x (3.69x)
Piz Daint URL SVM 32 19.8 (19.3) s 5.6 (5.3) s 3.54x (3.64x)
Greina (IB) Webspam LR 8 65.2 (46.7) s 36.3 (19.0) s 1.80x (2.46x)
Greina (GigE) Webspam LR 8 768.0 (759.5) s 37.9 (29.5) s 20.26x (25.75x)
Greina (GigE) URL LR 8 1045.0 (1004.6) s 80.26 (42.2) s 12.65x (23.81x)

SCD on URL (8 nodes Piz Daint): dense allgather = 49 s/epoch (24 s comm); sparse allgather = 26 s/epoch (4.5 s comm). 1.8x end-to-end / 5.3x in communication.

Apache Spark comparison on URL:

CNTK + SparCML deep nets (academic):

Large workloads:


Limitations


Open Problems Called Out

  1. Interaction between sparse gradients, batch size, and convergence — especially in regimes (ResNet-50 large batch; ASR global batch) where high sparsity degrades convergence.
  2. How sparsity composes with other parallelization strategies — large-batch training, model parallelism — at extreme node counts where summed gradients densify regardless of per-node sparsity.
  3. Application of sparse collectives beyond data-parallel SGD (e.g., pipelined or graph-data workloads).
  4. Whether sparsity can be made adaptive at run time per layer / per iteration, instead of using a single fixed k for the whole network.

Note on NCCL Tuning

The paper notes that NCCL implements only a very restricted set of reduction operations and is most useful inside an NVLink-connected box, while SparCML covers the inter-node MPI fabric where reductions span heterogeneous links; in the production ASR experiment the two layers compose hierarchically (NCCL intra-node, SparCML inter-node). Lemma 5.2's (2/kappa)x ceiling is a useful prior for any tuner working on collectives whose payload size depends on content (e.g., post-sparsification gradients): once kappa exceeds ~0.5 the optimal choice is the dense path, so a configuration tuner should switch algorithm/protocol to the dense-optimized regime rather than continue exploring sparse representations. Empirically the paper also shows that the optimal sparse algorithm flips between recursive-doubling (small data, latency-bound) and split-allgather (large data, bandwidth-bound) — the same small-vs-large message switchover that drives Tree/Ring and LL/Simple selection in NCCL.