1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed — Detailed Summary

Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He | Microsoft / University of Rochester / ETH Zurich | ICML 2021 (PMLR Vol. 139)

Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.


Abstract


1. Introduction

Why communication dominates large-scale training:

Existing compression — and why it stops at SGD:

Contributions:


2. Related Work

Communication-efficient learning:

Error compensation:

Adam variants:


3. Motivation and Insights

3.1 Profiling: communication is the bottleneck

Network     Nodes  GPUs  BS/GPU  Total BS  Grad Accum  Forward (ms)  Allreduce (ms)  Backward Other (ms)  Step (ms)  Allreduce %
Ethernet    16     64    1       64        1           36.65         2205.86         33.63                74.96      94%
Ethernet    16     64    16      1024      1           35.71         2275.43         60.81                75.59      93%
Ethernet    16     64    16      4096      4           137.80        2259.36         243.72               74.92      83%
Ethernet    8      32    16      512       1           37.91         2173.35         60.71                75.63      93%
Ethernet    4      16    16      256       1           36.94         2133.24         62.82                76.85      92%
Ethernet    2      8     16      128       1           34.95         1897.21         61.23                75.26      92%
Ethernet    1      4     16      64        1           35.99         239.76          59.95                74.21      58%
InfiniBand  8      64    1       64        1           25.36         316.18          23.25                58.49      75%
InfiniBand  8      64    16      1024      1           32.81         336.40          59.99                57.79      69%
InfiniBand  8      64    16      4096      4           131.04        339.52          237.92               56.91      44%
InfiniBand  4      32    16      512       1           33.45         297.28          56.81                57.98      67%
InfiniBand  2      16    16      256       1           32.86         183.74          56.49                58.60      55%
InfiniBand  1      8     16      128       1           32.74         28.18           59.73                57.29      16%
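
The Allreduce % column appears to be the allreduce time divided by the sum of the four timing columns. A minimal sketch that recomputes it for four rows of the profiling table, assuming that definition (the row labels and the helper name `allreduce_share` are illustrative, not from the paper):

```python
# Timings copied from the profiling table, in milliseconds:
# (forward, allreduce, backward_other, step).
rows = {
    "Ethernet, 64 GPUs, BS/GPU 1": (36.65, 2205.86, 33.63, 74.96),
    "Ethernet, 4 GPUs, BS/GPU 16": (35.99, 239.76, 59.95, 74.21),
    "InfiniBand, 64 GPUs, BS/GPU 1": (25.36, 316.18, 23.25, 58.49),
    "InfiniBand, 8 GPUs, BS/GPU 16": (32.74, 28.18, 59.73, 57.29),
}


def allreduce_share(forward, allreduce, backward_other, step):
    """Fraction of one iteration spent in allreduce (assumed definition)."""
    total = forward + allreduce + backward_other + step
    return allreduce / total


shares = {name: allreduce_share(*timing) for name, timing in rows.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Rounded to whole percentages, these four shares match the table's last column, which supports reading that column as allreduce time over total iteration time.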

3.2 Why naive 1-bit Adam fails

3.3 Key observation: variance becomes stable


4. The 1-bit Adam Algorithm

4.1 Why error compensation works for SGD

4.2 Why it breaks for Adam

4.3 The 1-bit Adam algorithm (Algorithm 1)

The full pseudocode as printed in the paper:

Algorithm 1: 1-bit Adam
1.  Initialize: x_0; learning rate gamma; initial error delta = 0;
    m_0 = 0; v_0 = 0; total iterations T; warm-up steps T_w;
    Adam decay factors beta_1, beta_2, eta.
2.  Run original Adam for T_w steps; store v_{T_w}.
3.  for t = T_w, ..., T do
4.    (on i-th worker)
5.    Sample zeta_t^(i); compute g_t^(i) = grad F_i(x_t^(i), zeta_t^(i)).
6.    m_t^(i) = beta_1 * m_{t-1} + (1 - beta_1) * g_t^(i).
7.    Compress: hat_m_t^(i) = C_omega[m_t^(i) + delta_{t-1}^(i)];
                 delta_t^(i)  = m_t^(i) + delta_{t-1}^(i) - hat_m_t^(i).
8.    Send hat_m_t^(i) to server.
9.    (on server)
10.   bar_m_t = C_omega[(1/n) * sum_i hat_m_t^(i) + delta_{t-1}];
        delta_t = (1/n) * sum_i hat_m_t^(i) + delta_{t-1} - bar_m_t.
11.   Send bar_m_t to all workers.
12.   (on i-th worker)
13.   m_t = bar_m_t; x_{t+1} = x_t - gamma * m_t / sqrt(v_{T_w}).
14. end for
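
The compression stage (steps 5 to 13) can be simulated in a single process to see how the two error-compensation buffers interact. The sketch below is illustrative only and rests on several assumptions that are not from the paper: a scaled-sign compressor with a mean-absolute-value scale stands in for C_omega, a small `eps` is added to the denominator, `v_frozen` is a constant stand-in for v_{T_w} rather than the result of a real warm-up, and the objective is a toy quadratic. All function and variable names are hypothetical.

```python
import numpy as np


def compress(x):
    """Scaled-sign compressor: one sign per element plus one shared scale.
    Assumed stand-in for C_omega; the mean-|x| scale is an illustrative choice."""
    scale = np.mean(np.abs(x))
    return scale * np.where(x >= 0, 1.0, -1.0)


def one_bit_adam_step(x, grads, m_prev, v_frozen, worker_err, server_err,
                      lr=0.01, beta1=0.9, eps=1e-8):
    """One compression-stage iteration (Algorithm 1, steps 5-13).

    grads and worker_err have shape (n_workers, dim); server_err has shape
    (dim,). Both error buffers are updated in place.
    """
    sent = np.empty_like(grads)
    for i in range(grads.shape[0]):
        m_i = beta1 * m_prev + (1.0 - beta1) * grads[i]   # step 6
        corrected = m_i + worker_err[i]
        sent[i] = compress(corrected)                      # step 7
        worker_err[i] = corrected - sent[i]
    avg = sent.mean(axis=0) + server_err                   # step 10
    m_bar = compress(avg)
    server_err[:] = avg - m_bar
    # Step 13 with the frozen variance; eps is an added safeguard (assumption).
    x_next = x - lr * m_bar / (np.sqrt(v_frozen) + eps)
    return x_next, m_bar


# Toy problem (assumption): worker i minimizes 0.5 * ||x - targets[i]||^2.
rng = np.random.default_rng(0)
n_workers, dim = 4, 8
center = rng.normal(size=dim)
targets = center + 0.1 * rng.normal(size=(n_workers, dim))
optimum = targets.mean(axis=0)


def loss(x):
    return 0.5 * float(np.sum((x - optimum) ** 2))


x = np.zeros(dim)
m = np.zeros(dim)
v_frozen = np.ones(dim)   # stand-in for v_{T_w}; a real run takes it from warm-up
worker_err = np.zeros((n_workers, dim))
server_err = np.zeros(dim)

loss_start = loss(x)
for _ in range(2000):
    grads = x - targets   # exact per-worker gradients, shape (n_workers, dim)
    x, m = one_bit_adam_step(x, grads, m, v_frozen, worker_err, server_err)
loss_end = loss(x)
print(f"loss: {loss_start:.4f} -> {loss_end:.6f}")
```

One property worth noting in this sketch: the compressed momentum plus the total stored error (server error plus the mean worker error) always equals the uncompressed momentum update plus the previous total error, so whatever compression discards in one step is carried into the next rather than lost.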

4.4 Compression-rate arithmetic


5. Theoretical Analysis


6. System Implementation

6.1 Compressed allreduce design

6.2 Two implementations

6.3 Integration with DeepSpeed


7. Experimental Setup

Hardware:

Models / Datasets:

Hyperparameters:

Setting                             Value
BERT pre-training peak LR           4e-4
BERT pre-training LR schedule       linear warmup over 12.5k steps, then 0.99 decay every 520 steps
BERT batch size                     4096 (total)
Adam beta_1, beta_2                 0.9, 0.999
BERT-Base seq128 warmup steps T_w   16k (out of 118k)
BERT-Base seq512 warmup steps T_w   1.5k (out of 22k)
BERT-Large seq128 warmup steps T_w  23k (out of 152k)
BERT-Large seq512 warmup steps T_w  1.5k (out of 10k)
SQuAD batch size                    96 over 32 GPUs
SQuAD LR                            3e-5
SQuAD warmup steps                  400 of 1848 total
ResNet-18 batch size                1024 (8 GPUs)
ResNet-18 LR                        1e-4
ResNet-18 warmup epochs             13 of 200
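
The BERT pre-training learning-rate schedule listed above can be written as a small function. This is a sketch under two assumptions not stated in the table: the linear warm-up starts from zero, and the 0.99 decay is applied as a staircase counted from the end of the warm-up. The function name `bert_lr` and its parameter names are hypothetical.

```python
def bert_lr(step, peak_lr=4e-4, warmup_steps=12_500, decay=0.99, decay_every=520):
    """Learning rate at a given step: linear warm-up, then staircase decay.

    Assumptions: warm-up starts at 0, and the decay counter starts when the
    warm-up ends.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    n_decays = (step - warmup_steps) // decay_every
    return peak_lr * decay ** n_decays


for step in (0, 6_250, 12_500, 13_020, 118_000):
    print(step, bert_lr(step))
```

This learning-rate warm-up (12.5k steps) is a separate quantity from the 1-bit Adam warm-up stage T_w in the same table, which counts the uncompressed Adam steps run before the variance is frozen.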

Auto-detect for T_w:


8. Results

8.1 Convergence parity

BERT pre-training step counts (Table 2):

Model                     Seq 128 total (warmup)  Seq 512 total (warmup)
BERT-Base Adam baseline   118K (N/A)              22K (N/A)
BERT-Base 1-bit Adam      118K (16K)              22K (1.5K)
BERT-Large Adam baseline  152K (N/A)              10K (N/A)
BERT-Large 1-bit Adam     152K (23K)              10K (1.5K)
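
The step counts above also bound how much communication volume the compression stage can save, because the warm-up steps still run uncompressed. The following back-of-the-envelope sketch is not a figure from the paper. It assumes a 16-bit (FP16) baseline, exactly 1 bit per element during the compression stage, identical per-step payload size, and no overhead for scale factors. The helper name `volume_reduction` is hypothetical.

```python
# (total steps, warm-up steps) copied from the step-count table.
runs = {
    "BERT-Base seq128": (118_000, 16_000),
    "BERT-Base seq512": (22_000, 1_500),
    "BERT-Large seq128": (152_000, 23_000),
    "BERT-Large seq512": (10_000, 1_500),
}


def volume_reduction(total_steps, warmup_steps, full_bits=16, compressed_bits=1):
    """Estimated end-to-end communication-volume reduction factor.

    Assumptions: warm-up steps send full_bits per element, all later steps
    send compressed_bits per element, and nothing else is transmitted.
    """
    warm = warmup_steps / total_steps
    relative_volume = warm + (1.0 - warm) * compressed_bits / full_bits
    return 1.0 / relative_volume


reduction = {name: volume_reduction(*steps) for name, steps in runs.items()}
for name, factor in reduction.items():
    print(f"{name}: about {factor:.1f}x less volume")
```

Under these assumptions the warm-up fraction dominates the estimate: a longer warm-up leaves more full-precision traffic, so the end-to-end factor stays far below the 16x per-step ratio.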

GLUE downstream (Table 3, median over 10 fine-tuning seeds):

Model                             RTE   MRPC  CoLA  SST-2  QNLI  QQP   MNLI-(m/mm)
BERT-Base (original)              66.4  84.8  52.1  93.5   90.5  89.2  84.6/83.4
BERT-Base (uncompressed re-run)   68.2  84.8  56.8  91.8   90.9  90.9  83.6/83.5
BERT-Base (1-bit Adam)            69.0  84.8  55.6  91.6   90.8  90.9  83.6/83.9
BERT-Large (original)             70.1  85.4  60.5  94.9   92.7  89.3  86.7/85.9
BERT-Large (uncompressed re-run)  70.3  86.0  60.3  93.1   92.2  91.4  86.1/86.2
BERT-Large (1-bit Adam)           70.4  86.1  62.0  93.8   91.9  91.5  85.7/85.4
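
One way to read the GLUE table is to take per-task differences between the 1-bit Adam row and the uncompressed re-run row. The sketch below does only that arithmetic on the scores copied from the table. The MNLI column is split into its matched and mismatched parts, and the variable names are illustrative.

```python
tasks = ["RTE", "MRPC", "CoLA", "SST-2", "QNLI", "QQP", "MNLI-m", "MNLI-mm"]
scores = {
    "BERT-Base": {
        "uncompressed": [68.2, 84.8, 56.8, 91.8, 90.9, 90.9, 83.6, 83.5],
        "1-bit Adam": [69.0, 84.8, 55.6, 91.6, 90.8, 90.9, 83.6, 83.9],
    },
    "BERT-Large": {
        "uncompressed": [70.3, 86.0, 60.3, 93.1, 92.2, 91.4, 86.1, 86.2],
        "1-bit Adam": [70.4, 86.1, 62.0, 93.8, 91.9, 91.5, 85.7, 85.4],
    },
}


def deltas(model):
    """Per-task score difference: 1-bit Adam minus uncompressed re-run."""
    base = scores[model]["uncompressed"]
    one_bit = scores[model]["1-bit Adam"]
    return {t: round(o - b, 1) for t, b, o in zip(tasks, base, one_bit)}


summary = {}
for model in scores:
    d = deltas(model)
    worst = max(d, key=lambda t: abs(d[t]))   # task with the largest gap
    mean = sum(d.values()) / len(d)
    summary[model] = (mean, worst, d[worst])
    print(f"{model}: mean delta {mean:+.2f}, largest gap {d[worst]:+.1f} on {worst}")
```

On these numbers the largest gap in either direction is on CoLA for both models, and the differences have mixed signs across tasks rather than a consistent loss for the compressed run.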

SQuAD 1.1 fine-tuning:

8.2 Throughput and end-to-end speedup

8.3 Communication-volume reduction

8.4 Cross-network parity

8.5 Comparison to naive baseline

8.6 Robustness across workloads


9. Cited Systems and Prior Art

System / Paper              Technique                                         Headline result
1-bit SGD (Seide 2014)      1-bit gradient quantization with error feedback   First demonstration of 1-bit-compressible SGD
QSGD (Alistarh 2017)        Stochastic quantization with optimal trade-off    Convergence rate analysis
signSGD (Bernstein 2018)    Element-wise sign with majority-vote aggregation  Communication-efficient training
TernGrad (Wen 2017)         Ternary gradient quantization                     32x compression near-baseline accuracy
DGC (Lin 2018)              Top-k sparsification with momentum correction     270-600x compression
Stich et al. 2018           Sparsified SGD with memory                        Convergence proof for biased compressors
Karimireddy et al. 2019     signSGD with error feedback                       Provable EF convergence
NCCL (referenced; v < 2.7)  Allreduce only; no Alltoall, no send/recv         Motivated MPI-based custom collective
MVAPICH2-GDR                CUDA-aware MPI with GPUDirect                     Underlies CUDA-aware compressed allreduce
DeepSpeed                   Microsoft's distributed-training stack            1-bit Adam shipped as DeepSpeed optimizer

10. Limitations


11. Open Problems Implicit in the Paper

  1. A theoretical condition for variance stability. The paper offers only an empirical observation; identifying which model/optimizer/data combinations admit stable v would let practitioners predict whether 1-bit Adam will converge before running an expensive warmup.
  2. Compression for other adaptive optimizers. The same recipe (freeze the non-linear quantity once stable, compress the linear one) might extend to LAMB, Adafactor, or Lion — but the empirical stability check has to be redone per optimizer.
  3. Removing the warmup. Can the variance be initialized or warm-started from a prior run, eliminating the per-job warmup cost entirely?
  4. Scaling beyond 256 GPUs. The compressed-allreduce design uses MPI Alltoall + Allgather; whether this remains competitive with tree/ring algorithms on thousand-GPU clusters is open.
  5. Extension to model/pipeline parallelism. When a model is tensor-sharded or pipeline-parallel, the optimizer state itself is sharded; 1-bit Adam would need to integrate with ZeRO-style optimizer-state partitioning.

12. Cross-Cutting Empirical Take-Aways

Take-away                                                                                   Derived from
Allreduce dominates BERT-Large training: 94% of step time on 64-GPU Ethernet, 75% on IB     Table 1 profiling
Variance term of Adam is approximately constant after a workload-dependent number of steps  Section 3.3, Figure 2
Naive error-compensated 1-bit Adam diverges; freezing variance is the missing ingredient    Section 4.2, Figure 1
Momentum (not the gradient) is the right quantity to compress in 1-bit Adam                 Algorithm 1, line 7
1-bit Adam delivers 3.3x throughput / 3.4x training-time speedup at no accuracy cost        Section 8 results
Compression substitutes for hardware: 40 GbE + 1-bit ≈ 100 Gb IB + FP16 Adam                Section 8.4
NCCL < 2.7 lacks Alltoall and send/recv, forcing MPI-based custom collective                Section 6.1

Note on NCCL Tuning

The paper documents a concrete NCCL constraint relevant to collective configuration: NCCL versions prior to 2.7 expose neither an Alltoall primitive nor point-to-point send/recv, which forced the authors to bypass NCCL and build their compressed allreduce on MPI (Section 6.1). The Table 1 measurement that allreduce consumes 94% of BERT-Large iteration time on 64-GPU Ethernet, versus 75% on InfiniBand, also gives a useful upper bound on what any collective tuner can recover on bandwidth-limited interconnects when the collective payload is large and frequent. NCCL 2.7 and later expose point-to-point send/recv, from which Alltoall can be composed, so the same compressed-allreduce recipe is in principle implementable inside a tuner-plugin path rather than as a parallel stack.