SparCML: High-Performance Sparse Communication for Machine Learning
Cedric Renggli (ETH Zurich), Saleh Ashkboos (IST Austria), Mehdi Aghagolzadeh (Microsoft), Dan Alistarh (IST Austria), Torsten Hoefler (ETH Zurich) | SC '19, Denver, CO, November 17-22, 2019 | DOI: 10.1145/3295500.3356222 | Code: gitlab.com/renggli/sparcml
Problem
Modern data-parallel ML training synchronizes gradients via a global allreduce once per minibatch step. The size of that reduction equals the model size and, crucially, does not shrink as more workers are added — so allreduce becomes the dominant scaling bottleneck for large models (e.g., 340M-parameter BERT-Large = 11 GB at 32-bit precision). Hundreds of megabytes must be summed globally every few microseconds at high node counts. Standard MPI [19] and ML-specific communication libraries (NCCL [40]) provide only dense collective primitives, even though gradient values in many ML workloads are naturally or artificially sparse — implementing efficient sparse-input collectives is non-trivial because the reduced-result size is unknown a priori and the sparsity pattern of summed vectors changes during the reduction itself.
Core Insight
Sparse and low-precision communication for ML can be made first-class via two adaptive sparse-allreduce algorithms (SSAR_Recursive_double for small data, SSAR/DSAR_Split_allgather for large data) plus a sparse stream data type that switches between sparse and dense representations on the fly when the intermediate reduced vector exceeds a density threshold delta. Combining this with stochastic quantization (QSGD) gives a method that provably preserves convergence and achieves order-of-magnitude speedups vs. dense allreduce on real ML workloads.
Method
SparCML extends MPI with sparse-input allgather and allreduce on arbitrary sparsity patterns. It plugs into CNTK and into the authors' MPI-OPT optimization framework as a drop-in communication layer.
+----------------------------------------------------------+
| SparCML (~2,000 lines C++11; ~100-line CNTK patch) |
+----------------------------------------------------------+
| Sparse Stream Data Type |
| (index,value) pairs; delta = N*isize/(c+isize) switch |
| flag byte marks dense/sparse; pre-alloc N*isize bytes |
+----------------------------------------------------------+
| Sparse Collective Algorithms |
| SSAR_Recursive_double - small data, latency-optimal |
| SSAR_Split_allgather - large data, sparse 2-phase |
| DSAR_Split_allgather - large data, sparse->dense |
+----------------------------------------------------------+
| Low-Precision Layer |
| QSGD stochastic quantization, 2/4/8 bits per entry, |
| bucket size B ~= 1024 with full-precision scaling |
| factor; applied on the dense second phase |
+----------------------------------------------------------+
| Non-Blocking (MPI-3) interface for comp/comm overlap |
+----------------------------------------------------------+
Algorithm 1 (paper) defines SparCML Quantized TopK SGD with error
feedback: each node accumulates the residual of components that did not
get sent, picks the top-k entries of eps + alpha*grad,
optionally quantizes them via Q, and then calls
allreduce(SUM) on the sparse result. Theorem 4.1 proves
ergodic convergence to a stationary point of the smooth non-convex
objective under standard assumptions — the first such proof for the
combined sparsification+stochastic-quantization regime. Lemma 5.2
establishes a theoretical ceiling: pure sparsity exploitation alone
cannot beat dense allreduce by more than (2/kappa)x, where kappa =
delta/N; with kappa=0.5 the maximum speedup is 4x, motivating the
addition of low-precision support.
Experimental Setup
| Component | Value |
|---|---|
| Supercomputing testbed | CSCS Piz Daint (Cray XC50, Aries Dragonfly fabric) |
| Per-node hardware | 12-core HT Intel Xeon E5-2690 v3, 4 GB RAM, NVIDIA Tesla P100 16 GB |
| Cloud testbed | NVIDIA K80 nodes over Gigabit Ethernet (no background traffic) |
| Additional cluster | Greina (CX50, InfiniBand FDR or Gigabit Ethernet) |
| Production cluster (ASR) | 32 servers x 4 NVIDIA V100 = 128 GPUs, InfiniBand inter-node, NCCL/NVLink intra-node |
| Index storage | 32-bit unsigned int (N > 65 K) |
| Frameworks | CNTK 2.0 + SparCML; MPI-OPT (authors' C++11) |
| Baselines | MPI allreduce on dense vectors (Open MPI / Cray-MPICH); Apache Spark v1.6; BMUF |
| Workloads | URL, Webspam (linear LR/SVM); CIFAR-10 (ResNet-110); ATIS / Hansards (LSTM); ImageNet-1K (ResNet-50, 4xResNet18/34); production ASR LSTM (60M params, 30,000 hours speech) |
Microbenchmarks vary N (data dimension) and d (per-node density) on synthetic sparse vectors; 5 fresh datasets x 10 runs each yield 25/75-percentile bands.
Headline Quantitative Results
Network microbenchmarks (Figure 3): the algorithmic ordering predicted by the alpha-beta cost model holds — SSAR_Recursive_double dominates at small data; SSAR_Split_allgather dominates as nnz grows; DSAR_Split_allgather dominates once the result densifies; ring-based dense MPI wins only at low P on a fast network. The slower the link, the larger SparCML's relative win.
Linear classifiers via MPI-OPT (Table 2 — full-pass time, comm time in brackets):
| System | Dataset | Model | P | Baseline | SparCML | Speedup |
|---|---|---|---|---|---|---|
| Piz Daint | Webspam | LR | 32 | 24.0 (21.6) s | 6.8 (2.6) s | 3.53x (6.17x) |
| Piz Daint | Webspam | SVM | 32 | 16.2 (14.2) s | 6.5 (4.4) s | 2.49x (3.23x) |
| Piz Daint | URL | LR | 32 | 26.4 (25.8) s | 7.5 (7.0) s | 3.52x (3.69x) |
| Piz Daint | URL | SVM | 32 | 19.8 (19.3) s | 5.6 (5.3) s | 3.54x (3.64x) |
| Greina (IB) | Webspam | LR | 8 | 65.2 (46.7) s | 36.3 (19.0) s | 1.80x (2.46x) |
| Greina (GigE) | Webspam | LR | 8 | 768.0 (759.5) s | 37.9 (29.5) s | 20.26x (25.75x) |
| Greina (GigE) | URL | LR | 8 | 1045.0 (1004.6) s | 80.26 (42.2) s | 12.65x (23.81x) |
SCD on URL (8 nodes Piz Daint): dense allgather = 49 s/epoch (24 s comm); sparse allgather = 26 s/epoch (4.5 s comm). 1.8x end-to-end / 5.3x in communication.
Apache Spark comparison on URL:
- Piz Daint, 8 nodes: MPI-OPT + SparCML vs. Spark = 63x to convergence (185x in communication time): epoch 378 s -> 6 s.
- 8-node GigE: 86x epoch / 12x comm vs. Spark dense.
- Note: MPI-OPT with Cray-optimized dense allreduce is already 31x faster than Spark (43x in comm) — so SparCML doubles a baseline that itself trounces Spark.
CNTK + SparCML deep nets (academic):
- ATIS NLU LSTM (20 epochs): 5.99x end-to-end speedup. ~80 MB minibatch gradient compresses to <0.5 MB on the wire.
- Hansards translation (5,200 iter): 1.5x.
- CIFAR-10 ResNet-110: 1.12x (compute-dominated workload).
- CIFAR-10 accuracy at k=16/512 matches full precision; at k=8/512 accuracy is 1% above the 32-bit baseline.
Large workloads:
- ImageNet wide ResNets, 64 P100 GPUs: ~2x for 4xResNet18, ~1.85x for 4xResNet34 vs. Cray MPI baseline; <0.4% Top-5 accuracy loss; k=1/512 (~0.2% nnz/node); most savings from the >2M-parameter final FC layer.
- ImageNet ResNet-50, 64 P100 GPUs: negative result — only ~6% faster (1,950 s/epoch vs. 2,071 s baseline) because gradients densify post-sum and per-layer ResNet-50 transmissions are too small to amortize sparsify /densify overhead.
- Production ASR LSTM, 128 V100 GPUs (32 servers, IB + NVLink/NCCL intra- node): training reduced from 14 days (16-GPU BMUF baseline) to <1.78 days at 128 GPUs, ~10x end-to-end. WER stays within 1% of full precision (sometimes slightly better via implicit regularization).
Limitations
- Theorem 4.1 proves only ergodic convergence to a stationary point — not per-iteration convergence and not to a global minimum.
- Lemma 5.2 caps pure sparse-allreduce speedup at (2/kappa)x = 4x for kappa = 0.5 vs. dense; further wins require stacking quantization.
- Sparsity does not scale linearly with node count: at high P, summed gradients densify, eroding savings — demonstrated by the negative ResNet-50 result at 64 GPUs.
- Hyperparameter retuning is required for two of the largest workloads: ResNet-50 + large batch (high sparsity hurts convergence) and ASR (must hold global batch size at 512 to preserve convergence).
- Wins are network-dependent — large on slow GigE, modest on highly optimized supercomputing fabrics like Cray Aries.
- Relationship to NCCL is only briefly discussed; NCCL is used only intra- node in the ASR experiment, and the paper does not measure SparCML speedups against NCCL allreduce on a pure NVLink configuration.
Open Problems Called Out
- Interaction between sparse gradients, batch size, and convergence — especially in regimes (ResNet-50 large batch; ASR global batch) where high sparsity degrades convergence.
- How sparsity composes with other parallelization strategies — large-batch training, model parallelism — at extreme node counts where summed gradients densify regardless of per-node sparsity.
- Application of sparse collectives beyond data-parallel SGD (e.g., pipelined or graph-data workloads).
- Whether sparsity can be made adaptive at run time per layer / per iteration, instead of using a single fixed k for the whole network.
Note on NCCL Tuning
The paper notes that NCCL implements only a very restricted set of reduction operations and is most useful inside an NVLink-connected box, while SparCML covers the inter-node MPI fabric where reductions span heterogeneous links; in the production ASR experiment the two layers compose hierarchically (NCCL intra-node, SparCML inter-node). Lemma 5.2's (2/kappa)x ceiling is a useful prior for any tuner working on collectives whose payload size depends on content (e.g., post-sparsification gradients): once kappa exceeds ~0.5 the optimal choice is the dense path, so a configuration tuner should switch algorithm/protocol to the dense-optimized regime rather than continue exploring sparse representations. Empirically the paper also shows that the optimal sparse algorithm flips between recursive-doubling (small data, latency-bound) and split-allgather (large data, bandwidth-bound) — the same small-vs-large message switchover that drives Tree/Ring and LL/Simple selection in NCCL.