1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed — Detailed Summary
Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He | Microsoft / University of Rochester / ETH Zurich | ICML 2021 (PMLR Vol. 139)
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- Scalable training of large models (BERT, GPT-3) is heavily bottlenecked by inter-GPU and inter-node communication, especially on commodity systems whose interconnect is TCP/Ethernet rather than RDMA.
- Error-compensated 1-bit gradient compression is the standard answer for SGD-class linear optimizers, but the technique fails outright when applied directly to Adam because Adam's variance term is a non-linear function of the gradient.
- The paper's central empirical observation is that during BERT-Large pre-training Adam's variance term (the second moment) becomes effectively stable early in training; from that point on it can be frozen and used as a fixed precondition while only the (linear) momentum term needs to be communicated.
- This insight motivates a two-phase optimizer: a warmup phase running vanilla Adam to stabilize the variance, followed by a compression phase that 1-bit-compresses the momentum updates with error compensation while reusing the frozen variance.
- Headline results: communication volume reduced by up to 5x, identical end-task accuracy versus uncompressed Adam, and up to 3.3x higher throughput for BERT-Large pre-training on a 64-GPU Ethernet cluster.
1. Introduction
Why communication dominates large-scale training:
- Compute density of GPUs (V100, A100) has scaled rapidly while inter-GPU bandwidth has lagged, putting allreduce on the critical path.
- The gap is most acute on commodity clusters with Ethernet interconnects, where TCP-stack overheads further depress effective bandwidth.
Existing compression — and why it stops at SGD:
- Quantization (1-bit SGD, QSGD, TernGrad) and sparsification (DGC) cut gradient volume by 1-2 orders of magnitude.
- Error-compensation techniques (Seide 2014; Stich 2018; Karimireddy 2019) recycle the per-step quantization residual into the next step, restoring asymptotic convergence rates for biased compressors when the underlying optimizer is linear in gradients.
- Adam, however, is the practical optimizer of record for transformer pre-training (BERT, GPT) because vanilla SGD does not converge well on these tasks; the gap the paper targets is therefore the absence of a communication-efficient counterpart for Adam.
Contributions:
- Identification of the variance-stability empirical phenomenon that lets Adam be split into a non-stationary warmup and a stationary compressed phase.
- The 1-bit Adam algorithm, with theoretical convergence guarantees that match distributed SGD's `O(1/sqrt(nT))` linear-speedup rate.
- A custom compressed-allreduce implementation (MPI-based Alltoall + Allgather) addressing the lack of necessary primitives in NCCL < 2.7.
- End-to-end evaluation on BERT-Base, BERT-Large, SQuAD 1.1, ResNet-18 / ResNet-152, and DCGAN.
2. Related Work
Communication-efficient learning:
- Quantization: 1-bit SGD (Seide et al. 2014), QSGD (Alistarh 2017), signSGD (Bernstein 2018), TernGrad (Wen 2017).
- Sparsification: top-k / random-k, Deep Gradient Compression (DGC).
- Sketching: count-sketch and random-projection-based aggregators.
Error compensation:
- Memorize the quantization residual `delta_t` and add it back into the next step's input; allows biased compressors to retain convergence.
- Theoretical tools: Stich et al. 2018 (sparsified SGD with memory), Karimireddy et al. 2019 (signSGD with error feedback).
Adam and related adaptive optimizers:
- Adagrad, RMSprop, Adadelta, and AdaBound all use coordinate-wise adaptive learning rates, and all share Adam's non-linearity barrier when paired with naive error-compensated compression.
3. Motivation and Insights
3.1 Profiling: communication is the bottleneck
- BERT-Large pre-training (sequence length 128) is profiled on two clusters and decomposed into forward, backward-allreduce, backward everything-else, and optimizer-step wall-clock components.
- The result is Table 1, reproduced verbatim:
| Network | Nodes | GPUs | BS/GPU | Total BS | Grad Accum | Forward (ms) | Allreduce (ms) | Backward Other (ms) | Step (ms) | Allreduce % |
|---|---|---|---|---|---|---|---|---|---|---|
| Ethernet | 16 | 64 | 1 | 64 | 1 | 36.65 | 2205.86 | 33.63 | 74.96 | 94% |
| Ethernet | 16 | 64 | 16 | 1024 | 1 | 35.71 | 2275.43 | 60.81 | 75.59 | 93% |
| Ethernet | 16 | 64 | 16 | 4096 | 4 | 137.80 | 2259.36 | 243.72 | 74.92 | 83% |
| Ethernet | 8 | 32 | 16 | 512 | 1 | 37.91 | 2173.35 | 60.71 | 75.63 | 93% |
| Ethernet | 4 | 16 | 16 | 256 | 1 | 36.94 | 2133.24 | 62.82 | 76.85 | 92% |
| Ethernet | 2 | 8 | 16 | 128 | 1 | 34.95 | 1897.21 | 61.23 | 75.26 | 92% |
| Ethernet | 1 | 4 | 16 | 64 | 1 | 35.99 | 239.76 | 59.95 | 74.21 | 58% |
| InfiniBand | 8 | 64 | 1 | 64 | 1 | 25.36 | 316.18 | 23.25 | 58.49 | 75% |
| InfiniBand | 8 | 64 | 16 | 1024 | 1 | 32.81 | 336.40 | 59.99 | 57.79 | 69% |
| InfiniBand | 8 | 64 | 16 | 4096 | 4 | 131.04 | 339.52 | 237.92 | 56.91 | 44% |
| InfiniBand | 4 | 32 | 16 | 512 | 1 | 33.45 | 297.28 | 56.81 | 57.98 | 67% |
| InfiniBand | 2 | 16 | 16 | 256 | 1 | 32.86 | 183.74 | 56.49 | 58.60 | 55% |
| InfiniBand | 1 | 8 | 16 | 128 | 1 | 32.74 | 28.18 | 59.73 | 57.29 | 16% |
- On 64-GPU Ethernet, allreduce consumes up to 94% of the iteration; on 64-GPU InfiniBand it still consumes 75%.
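- Allreduce share grows with node count, falls as per-step compute grows (with gradient accumulation of 4 it drops from 93% to 83% on 64-GPU Ethernet and from 69% to 44% on 64-GPU InfiniBand), and shrinks sharply on a single node (58% on 4-GPU Ethernet, 16% on 8-GPU InfiniBand), where traffic stays on bandwidth-rich local NVLink/PCIe.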
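The Allreduce % column can be recomputed from the four timing columns. The sketch below assumes the percentage is allreduce time divided by the sum of the four measured components; the summary does not state the formula, but this assumption reproduces the printed values for the rows checked. The row labels are descriptive names chosen here.

```python
# Recompute the "Allreduce %" column of Table 1 from its timing columns.
# Each tuple: (label, forward_ms, allreduce_ms, backward_other_ms, step_ms)
rows = [
    ("Ethernet, 64 GPUs, BS/GPU 1", 36.65, 2205.86, 33.63, 74.96),
    ("Ethernet, 64 GPUs, accum 4", 137.80, 2259.36, 243.72, 74.92),
    ("Ethernet, 4 GPUs (1 node)", 35.99, 239.76, 59.95, 74.21),
    ("InfiniBand, 64 GPUs, BS/GPU 1", 25.36, 316.18, 23.25, 58.49),
    ("InfiniBand, 8 GPUs (1 node)", 32.74, 28.18, 59.73, 57.29),
]


def allreduce_share(forward, allreduce, backward_other, step):
    """Fraction of one iteration spent in allreduce (assumed definition)."""
    return allreduce / (forward + allreduce + backward_other + step)


shares = {label: allreduce_share(*timings) for label, *timings in rows}
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

Because the allreduce time barely moves between the 8-GPU and 64-GPU Ethernet rows, the share is driven almost entirely by how much compute sits next to it in the denominator.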
3.2 Why naive 1-bit Adam fails
- Direct port of error-compensated 1-bit compression to Adam: send compressed gradient `\hat{g}_t` to the server, update both momentum `m_t` and variance `v_t` from `\hat{g}_t`, return updated parameters.
- Figure 1 in the paper shows this scheme diverges from baseline Adam in loss curves on BERT.
- Two structural reasons (Section 4.2):
  - The variance update `v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2` is quadratic in g, so the residual term `(delta_{t-1} - delta_t)^2` does not telescope across iterations the way it does for linear SGD updates.
  - The Adam learning rate `gamma / sqrt(v_t + eps)` is itself coordinate-dependent and time-varying, so there is no clean global correction factor that an error-compensation scheme can apply.
3.3 Key observation: variance becomes stable
- Empirical study of `||v_t||_1` during BERT-Large pre-training shows that after roughly 23k steps the L1-norm of the variance plateaus (Figure 2 in the paper).
- This stability is the algorithmic opening: once `v_t` no longer changes meaningfully, freezing it at `v_{T_w}` and using it as a fixed precondition reduces Adam to a preconditioned momentum SGD, which is linear in the gradient and is amenable to error-compensated compression.
- The measured stability ratio used as the auto-detect criterion is `||v_t||_1 / ||v_{t-Delta}||_1 >= 0.96` (Section 7).
4. The 1-bit Adam Algorithm
4.1 Why error compensation works for SGD
- Linear optimizer: `x_{t+1} = x_t - gamma * C[g_t + delta_{t-1}]` where `delta_t = g_t + delta_{t-1} - C[g_t + delta_{t-1}]`.
- Telescoping gives `x_{t+1} = x_0 - gamma * sum_{s<=t} g_s + gamma * delta_t`; the residual error never compounds beyond a single-step gap, which is why convergence is preserved.
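The telescoping identity can be checked numerically. The sketch below is a toy setup, not the paper's experiment: the gradients are random stand-ins, the dimensions and step count are arbitrary, and `one_bit` is an illustrative sign compressor whose scale is the mean absolute value (which equals `||x||_1 / ||sign(x)||_1` when no entry is exactly zero).

```python
import numpy as np


def one_bit(x):
    """Sign compressor scaled so the output keeps the L1 norm of x."""
    return np.sign(x) * np.abs(x).mean()


rng = np.random.default_rng(0)
gamma, dim, steps = 0.1, 8, 50
x0 = rng.normal(size=dim)
x, delta = x0.copy(), np.zeros(dim)
grad_sum = np.zeros(dim)

for _ in range(steps):
    g = rng.normal(size=dim)   # stand-in for a stochastic gradient
    corrected = g + delta      # add back the previous residual
    sent = one_bit(corrected)  # what would actually be communicated
    delta = corrected - sent   # new residual
    x = x - gamma * sent
    grad_sum += g

# Telescoped form: the uncompressed trajectory plus one residual term.
telescoped = x0 - gamma * grad_sum + gamma * delta
gap = np.max(np.abs(x - telescoped))
print(f"max |x - telescoped| = {gap:.2e}")
```

The gap sits at floating-point rounding level no matter how lossy the compressor is, because every residual except the last one cancels in the sum.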
4.2 Why it breaks for Adam
- The non-linear v-update introduces cross terms `(delta_{t-1} - delta_t)^2` that don't cancel.
- The time-varying coordinate-wise learning rate `gamma / sqrt(v_t + eps)` makes the natural correction factor itself a moving target.
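One visible symptom of the non-linearity can be shown on a toy example. This is an illustration built for this summary rather than the paper's argument, and the per-coordinate gradient scales are made up. Every entry of a 1-bit-compressed vector has the same magnitude, so a second moment accumulated from compressed gradients is flat across coordinates and loses the per-coordinate scale that Adam's preconditioner depends on.

```python
import numpy as np


def one_bit(x):
    """Sign compressor with a single shared scale (mean absolute value)."""
    return np.sign(x) * np.abs(x).mean()


rng = np.random.default_rng(0)
beta2, steps = 0.999, 2000
coord_scale = np.array([0.01, 0.1, 1.0, 10.0])  # hypothetical gradient scales

delta = np.zeros(4)
v_true = np.zeros(4)   # second moment from uncompressed gradients
v_naive = np.zeros(4)  # second moment from compressed gradients
for _ in range(steps):
    g = coord_scale * rng.normal(size=4)
    corrected = g + delta
    g_hat = one_bit(corrected)
    delta = corrected - g_hat
    v_true = beta2 * v_true + (1 - beta2) * g**2
    v_naive = beta2 * v_naive + (1 - beta2) * g_hat**2

spread_true = v_true.max() / v_true.min()
spread_naive = v_naive.max() / v_naive.min()
print(f"true spread: {spread_true:.1e}, naive spread: {spread_naive:.1f}")
```

Error feedback keeps the running sum of `g_hat` close to the running sum of `g`, but it does nothing for the running sum of squares, which is the quantity `v_t` tracks.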
4.3 The 1-bit Adam algorithm (Algorithm 1)
The full pseudocode as printed in the paper:
Algorithm 1: 1-bit Adam
1. Initialize: x_0; learning rate gamma; initial error delta = 0;
m_0 = 0; v_0 = 0; total iterations T; warm-up steps T_w;
Adam decay factors beta_1, beta_2, eta.
2. Run original Adam for T_w steps; store v_{T_w}.
3. for t = T_w, ..., T do
4. (on i-th worker)
5. Sample zeta_t^(i); compute g_t^(i) = grad F_i(x_t^(i), zeta_t^(i)).
6. m_t^(i) = beta_1 * m_{t-1} + (1 - beta_1) * g_t^(i).
7. Compress: hat_m_t^(i) = C_omega[m_t^(i) + delta_{t-1}^(i)];
delta_t^(i) = m_t^(i) + delta_{t-1}^(i) - hat_m_t^(i).
8. Send hat_m_t^(i) to server.
9. (on server)
10. bar_m_t = C_omega[(1/n) * sum_i hat_m_t^(i) + delta_{t-1}];
delta_t = (1/n) * sum_i hat_m_t^(i) + delta_{t-1} - bar_m_t.
11. Send bar_m_t to all workers.
12. (on i-th worker)
13. m_t = bar_m_t; x_{t+1} = x_t - gamma * m_t / sqrt(v_{T_w}).
14. end for
- Phase 1 (warmup, steps `0..T_w-1`): standard Adam runs unchanged so variance can stabilize.
- Phase 2 (compression, steps `T_w..T`):
  - Each worker maintains a local momentum `m_t^(i)` (not gradient); momentum is the quantity actually compressed. This is a key design choice; momentum has lower variance than g and quantizes more cleanly.
  - 1-bit quantization is the sign function, with a per-block scaling factor `||m + delta||_1 / ||sign(m + delta)||_1` that preserves L1 magnitude.
  - Error compensation operates on the momentum residual, not the gradient residual.
  - The frozen variance `v_{T_w}` is used as the (now constant) precondition in the parameter update.
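The two phases can be simulated in a single process. The following is a minimal sketch under stated assumptions, not the authors' implementation: the objective is a toy quadratic, all hyperparameters are illustrative, warmup uses Adam without bias correction, a small `eps` is added to the denominator, and each tensor is compressed as one block. Comments refer to the numbered lines of Algorithm 1.

```python
import numpy as np


def one_bit(x):
    """sign(x) times a scale that preserves ||x||_1 (one block per tensor)."""
    return np.sign(x) * np.abs(x).mean()


def run_one_bit_adam(n_workers=4, dim=16, warmup=100, total=600,
                     lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    # Toy problem: minimise 0.5 * ||x - target||^2 from noisy gradients.
    target = rng.choice([-1.0, 1.0], size=dim) * rng.uniform(1.0, 2.0, size=dim)
    x = np.zeros(dim)
    m, v = np.zeros(dim), np.zeros(dim)
    worker_err = np.zeros((n_workers, dim))  # delta_t^(i)
    server_err = np.zeros(dim)               # server-side delta_t
    losses = []

    for t in range(total):
        grads = (x - target) + 0.1 * rng.normal(size=(n_workers, dim))
        if t < warmup:
            # Phase 1: uncompressed Adam on the averaged gradient.
            g = grads.mean(axis=0)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
        else:
            # Phase 2: v stays frozen; only momentum is communicated.
            local_m = beta1 * m + (1 - beta1) * grads         # line 6
            corrected = local_m + worker_err
            sent = np.stack([one_bit(c) for c in corrected])  # line 7
            worker_err = corrected - sent
            avg = sent.mean(axis=0) + server_err              # line 10
            m = one_bit(avg)
            server_err = avg - m
        x = x - lr * m / (np.sqrt(v) + eps)                   # line 13
        losses.append(0.5 * np.sum((x - target) ** 2))
    return np.array(losses)


losses = run_one_bit_adam()
print(f"loss at step 0: {losses[0]:.3f}, final loss: {losses[-1]:.3f}")
```

The toy targets are kept away from zero on purpose: a coordinate whose frozen `v` is close to zero would receive a very large effective step, which is a known hazard of freezing the preconditioner.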
4.4 Compression-rate arithmetic
- FP32 baseline: 32 bits per coordinate uncompressed versus 1 bit per coordinate compressed, with the scale factor amortized over the block. Bit-volume reduction: `1 - 1/32 = 96.875%`, rounded by the paper to 97% reduction (32× compression).
- FP16 baseline: `1 - 1/16 = 93.75%`, rounded to 94% reduction (16× compression).
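The effect of the amortized scale factor on these percentages can be made explicit. In the sketch below the block size of one million values and the 32-bit scale are hypothetical choices for illustration; the summary does not give the actual block size.

```python
def compressed_fraction(bits_per_value, block_size, scale_bits=32):
    """Compressed size / uncompressed size: one sign bit per value plus one scale per block."""
    compressed_bits = block_size + scale_bits
    return compressed_bits / (block_size * bits_per_value)


ideal_fp32 = 1 - 1 / 32  # ignoring the scale factor
ideal_fp16 = 1 - 1 / 16
with_scale_fp32 = 1 - compressed_fraction(32, block_size=1_000_000)

print(f"FP32 ideal: {ideal_fp32:.5%}, with scale: {with_scale_fp32:.5%}")
print(f"FP16 ideal: {ideal_fp16:.5%}")
```

For any block of more than a few thousand values the scale overhead is far below the rounding already applied in the 97% and 94% figures.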
5. Theoretical Analysis
- The paper proves that 1-bit Adam achieves the same asymptotic convergence rate as distributed SGD: `O(1/sqrt(nT))` for n workers and T iterations, i.e. linear speedup in the number of workers.
- Assumptions (standard for compressed-distributed-optimization analyses):
  - Lipschitz-continuous gradient.
  - Bounded gradient variance.
  - Bounded compression error: `||C[x] - x||^2 <= omega * ||x||^2` for compression factor `omega < 1`.
- Proof sketch (deferred to the appendix in the paper): once variance is frozen, the dynamics reduce to preconditioned momentum SGD with biased compression and error feedback, a regime for which sublinear convergence has prior precedent (Karimireddy et al. 2019).
6. System Implementation
6.1 Compressed allreduce design
- The optimizer-level operation needed is a sum-reduction across workers followed by a broadcast — i.e. allreduce — but on compressed momenta with their own scale factors, not on raw FP gradients.
- NCCL allreduce (at the time of writing, NCCL < 2.7) only supports sum/min/max on uncompressed tensors and exposes neither Alltoall nor point-to-point send/recv primitives, so the authors cannot inject per-rank quantization/dequantization between the reduce stages.
- They instead implement compressed allreduce as a three-phase MPI-based collective:
  - Gather (MPI_Alltoall): each worker partitions its local compressed momentum into n chunks and exchanges so each rank ends up with the i-th chunk from every worker.
  - Local average: each rank dequantizes its received chunks, averages them, then re-quantizes (with new error feedback).
  - Scatter (MPI_Allgather): the per-rank averaged chunks are gathered back so all workers hold the full averaged momentum.
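The data movement of the three phases can be simulated with array reshapes in one process. This sketch is not the authors' MPI code: `compressed_allreduce` is a hypothetical name, no real communication happens, each worker uses one scale for its whole tensor, and the tensor length is assumed divisible by the number of workers.

```python
import numpy as np


def one_bit(x):
    return np.sign(x) * np.abs(x).mean()


def compressed_allreduce(local, worker_err, server_err):
    """Simulate the three phases for n workers.

    local, worker_err: (n, dim) arrays, one row per worker.
    server_err: (n, dim // n) array, one residual per rank-owned chunk.
    Both error arrays are updated in place. Returns the (dim,) result
    that every worker would hold after the final gather.
    """
    n, dim = local.shape
    # Worker-side compression with error feedback.
    corrected = local + worker_err
    sent = np.stack([one_bit(c) for c in corrected])
    worker_err[:] = corrected - sent
    # Phase 1 (Alltoall): rank r receives chunk r from every worker.
    chunks = sent.reshape(n, n, dim // n)     # [worker, chunk, element]
    received = chunks.transpose(1, 0, 2)      # [rank, worker, element]
    # Phase 2: average locally, then re-quantize with server-side feedback.
    avg = received.mean(axis=1) + server_err  # [rank, element]
    requant = np.stack([one_bit(a) for a in avg])
    server_err[:] = avg - requant
    # Phase 3 (Allgather): concatenate the chunks owned by each rank.
    return requant.reshape(dim)


rng = np.random.default_rng(0)
n, dim = 4, 16
local = rng.normal(size=(n, dim))
w_err = np.zeros((n, dim))
s_err = np.zeros((n, dim // n))
out = compressed_allreduce(local, w_err, s_err)
```

Starting from zero residuals, `out` plus the server residual plus the mean worker residual equals the exact average of `local`, so the information dropped by quantization is held back for the next call rather than lost.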
6.2 Two implementations
- CUDA-aware version: uses MVAPICH2-GDR (GPUDirect RDMA) so MPI calls operate directly on GPU buffers, avoiding host staging — the path used on InfiniBand clusters.
- Basic version: generic MPI implementation that copies tensors between GPU and CPU buffers, used on TCP/Ethernet clusters where GPUDirect is unavailable.
6.3 Integration with DeepSpeed
- The 1-bit Adam optimizer is shipped as a DeepSpeed plugin so it is drop-in compatible with the existing distributed-training frontend (PyTorch + DeepSpeed) — only the optimizer class needs to change.
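A configuration along these lines selects the optimizer in DeepSpeed. The key names below follow DeepSpeed's public 1-bit Adam tutorial as best recalled and should be checked against the installed version. The numeric values are taken from this summary's hyperparameter table for BERT-Large seq128, and pairing the MPI backend with `cuda_aware: False` is an assumption matching the basic (Ethernet) implementation described above.

```python
import json

# Illustrative config; verify key names against the DeepSpeed docs in use.
ds_config = {
    "train_batch_size": 4096,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 4e-4,
            "freeze_step": 23000,        # warmup steps T_w before compression
            "cuda_aware": False,         # False for the host-staged MPI path
            "comm_backend_name": "mpi",
        },
    },
    "fp16": {"enabled": True},
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The `freeze_step` parameter plays the role of `T_w`: before it the optimizer behaves as ordinary Adam, after it the variance is frozen and momentum is exchanged through the compressed collective.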
7. Experimental Setup
Hardware:
- Ethernet cluster: 4 NVIDIA V100 GPUs per node, 40 GbE TCP — measured effective bandwidth approximately 4.1 Gbps per link.
- InfiniBand cluster: 8 V100 GPUs per node, 100 Gbps EDR IB.
- Scaling tested up to 256 GPUs.
Models / Datasets:
- BERT-Base (110M params, 12 layers): sequence-length-128 phase + seq-512 phase.
- BERT-Large (340M params, 24 layers): seq-128 + seq-512 phase.
- SQuAD 1.1 fine-tuning on BERT-Large.
- ResNet-18 on CIFAR-10; ResNet-152 on ImageNet.
- DCGAN (qualitative robustness test).
Hyperparameters:
| Setting | Value |
|---|---|
| BERT pre-training peak LR | 4e-4 |
| BERT pre-training LR schedule | linear warmup over 12.5k steps, then 0.99 decay every 520 steps |
| BERT batch size | 4096 (total) |
| Adam beta_1, beta_2 | 0.9, 0.999 |
| BERT-Base seq128 warmup steps T_w | 16k (out of 118k) |
| BERT-Base seq512 warmup steps T_w | 1.5k (out of 22k) |
| BERT-Large seq128 warmup steps T_w | 23k (out of 152k) |
| BERT-Large seq512 warmup steps T_w | 1.5k (out of 10k) |
| SQuAD batch size | 96 over 32 GPUs |
| SQuAD LR | 3e-5 |
| SQuAD warmup steps | 400 of 1848 total |
| ResNet-18 batch size | 1024 (8 GPUs) |
| ResNet-18 LR | 1e-4 |
| ResNet-18 warmup epochs | 13 of 200 |
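The BERT learning-rate schedule in the table can be written as a small function. This sketch assumes the warmup starts from zero and that the decay is a staircase (one multiplication by 0.99 per completed 520-step interval after warmup); `bert_lr` is a hypothetical name. Note that this 12.5k-step learning-rate warmup is separate from the optimizer warmup `T_w`.

```python
import math


def bert_lr(step, peak=4e-4, warmup_steps=12_500, decay=0.99, decay_every=520):
    """Linear warmup to `peak`, then multiply by `decay` every `decay_every` steps."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    completed_intervals = (step - warmup_steps) // decay_every
    return peak * decay ** completed_intervals


for step in (0, 6_250, 12_500, 13_020, 23_000):
    print(step, f"{bert_lr(step):.3e}")
```

With these values the learning rate has already been decaying for about 10k steps by the time BERT-Large seq128 switches to the compression phase at step 23k.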
Auto-detect for T_w:
- The paper proposes a stability ratio `||v_t||_1 / ||v_{t-Delta}||_1`; warmup ends when this ratio first exceeds 0.96.
- For BERT-Large seq128 the auto-detect produces approximately 22,173 steps, closely matching the manually chosen 23k.
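The stopping rule can be sketched as a scan over logged variance norms. The function name `find_freeze_step`, the input format (one L1 norm per step), and the sample numbers are assumptions for illustration; the summary does not state the value of Delta, so it is left as a parameter.

```python
import numpy as np


def find_freeze_step(variance_l1_norms, delta, threshold=0.96):
    """First step t with ||v_t||_1 / ||v_{t-delta}||_1 >= threshold, else None."""
    norms = np.asarray(variance_l1_norms, dtype=float)
    for t in range(delta, len(norms)):
        previous = norms[t - delta]
        if previous > 0 and norms[t] / previous >= threshold:
            return t
    return None


# Made-up norms that shrink quickly, then level off.
sample_norms = [100.0, 50.0, 30.0, 25.0, 24.5, 24.4]
freeze_at = find_freeze_step(sample_norms, delta=1)
never = find_freeze_step([64.0, 32.0, 16.0, 8.0], delta=1)
print(freeze_at, never)
```

A ratio test like this only looks at the aggregate norm, so it can report stability even if individual coordinates are still moving; that is a limitation of the heuristic as described, not something this sketch fixes.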
8. Results
8.1 Convergence parity
BERT pre-training step counts (Table 2):
| Model | Seq 128 total (warmup) | Seq 512 total (warmup) |
|---|---|---|
| BERT-Base Adam baseline | 118K (N/A) | 22K (N/A) |
| BERT-Base 1-bit Adam | 118K (16K) | 22K (1.5K) |
| BERT-Large Adam baseline | 152K (N/A) | 10K (N/A) |
| BERT-Large 1-bit Adam | 152K (23K) | 10K (1.5K) |
- Total step count is identical to baseline for both model sizes — 1-bit Adam reaches equivalent loss in equivalent steps.
GLUE downstream (Table 3, median over 10 fine-tuning seeds):
| Model | RTE | MRPC | CoLA | SST-2 | QNLI | QQP | MNLI-(m/mm) |
|---|---|---|---|---|---|---|---|
| BERT-Base (original) | 66.4 | 84.8 | 52.1 | 93.5 | 90.5 | 89.2 | 84.6/83.4 |
| BERT-Base (uncompressed re-run) | 68.2 | 84.8 | 56.8 | 91.8 | 90.9 | 90.9 | 83.6/83.5 |
| BERT-Base (1-bit Adam) | 69.0 | 84.8 | 55.6 | 91.6 | 90.8 | 90.9 | 83.6/83.9 |
| BERT-Large (original) | 70.1 | 85.4 | 60.5 | 94.9 | 92.7 | 89.3 | 86.7/85.9 |
| BERT-Large (uncompressed re-run) | 70.3 | 86.0 | 60.3 | 93.1 | 92.2 | 91.4 | 86.1/86.2 |
| BERT-Large (1-bit Adam) | 70.4 | 86.1 | 62.0 | 93.8 | 91.9 | 91.5 | 85.7/85.4 |
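- 1-bit Adam stays within 1.7 points of the uncompressed re-run on every task, and the differences go in both directions: the largest gap is CoLA on BERT-Large (62.0 vs 60.3, in 1-bit Adam's favor), and the largest deficit is CoLA on BERT-Base (55.6 vs 56.8).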
SQuAD 1.1 fine-tuning:
- Baseline F1 = 93.33; 1-bit Adam F1 = 93.32, a difference of 0.01.
8.2 Throughput and end-to-end speedup
- BERT-Large seq128, 64-GPU Ethernet cluster: 1-bit Adam delivers up to 3.3x higher throughput than vanilla Adam.
- BERT-Large total wall-clock training: baseline = 174.3 hours, 1-bit Adam = 51.5 hours — a 3.4x end-to-end training-time reduction.
- SQuAD fine-tuning: up to 2.9x higher throughput.
- Compression-stage-only speedup: 5.48x for BERT-Large and 6.17x for SQuAD — these isolate the post-warmup phase from the warmup phase that runs vanilla Adam at baseline speed.
8.3 Communication-volume reduction
- During the compression phase, the payload per allreduce is about 6% of an FP16 baseline (1/16 = 6.25%, i.e. 16x compression) and about 3% of an FP32 baseline (1/32 = 3.125%, i.e. 32x compression).
- Including the warmup phase, end-to-end communication volume drops by approximately 5x for typical BERT runs.
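The end-to-end figures can be cross-checked against the per-phase numbers with a weighted harmonic mean: warmup steps run at baseline cost and the remaining steps at the compressed cost. This is a consistency check made for this summary, not the paper's derivation. It assumes every step within a phase costs the same, uses the seq128 step counts only, and `end_to_end_gain` is a hypothetical helper name.

```python
def end_to_end_gain(warmup_steps, total_steps, compressed_gain):
    """Overall gain when warmup runs at 1x and the rest at `compressed_gain`."""
    warm = warmup_steps / total_steps
    return 1.0 / (warm + (1.0 - warm) / compressed_gain)


# Throughput: compression-stage speedups from Section 8.2.
bert_speedup = end_to_end_gain(23_000, 152_000, 5.48)
squad_speedup = end_to_end_gain(400, 1_848, 6.17)

# Communication volume: 16x payload reduction against an FP16 baseline.
bert_volume = end_to_end_gain(23_000, 152_000, 16)

print(f"BERT-Large seq128 speedup: {bert_speedup:.2f}x")
print(f"SQuAD speedup:             {squad_speedup:.2f}x")
print(f"BERT-Large volume cut:     {bert_volume:.2f}x")
```

Under these assumptions the results round to 3.3x, 2.9x and 4.9x, in line with the reported throughput gains and the approximately 5x volume reduction, which suggests the warmup fraction is what caps the end-to-end benefit.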
8.4 Cross-network parity
- 1-bit Adam on 40 GbE Ethernet (~4.1 Gbps effective) achieves throughput comparable to vanilla FP16 Adam on ~100 Gbps InfiniBand — a striking demonstration that compression can substitute for expensive interconnect hardware.
8.5 Comparison to naive baseline
- "Adam (1-bit Naive)": apply error-compensated 1-bit compression to gradient and run standard Adam updates on the compressed gradient.
- Figure 1 (loss curves) shows naive 1-bit Adam fails to converge for BERT pre-training — it diverges from the baseline curve and never recovers. Same step budget, fundamentally different loss.
8.6 Robustness across workloads
- ResNet-18 on CIFAR-10 (Figure 6): identical loss / accuracy curve to baseline Adam (sample-wise).
- ResNet-152 on ImageNet: validated as a sanity check on a larger CNN.
- DCGAN on faces: 20% warmup; loss and generated-image quality match vanilla Adam (Figure 8).
9. Cited Systems and Prior Art
| System / Paper | Technique | Headline result |
|---|---|---|
| 1-bit SGD (Seide 2014) | 1-bit gradient quantization with error feedback | First demonstration of 1-bit-compressible SGD |
| QSGD (Alistarh 2017) | Stochastic quantization with optimal trade-off | Convergence rate analysis |
| signSGD (Bernstein 2018) | Element-wise sign with majority-vote aggregation | Communication-efficient training |
| TernGrad (Wen 2017) | Ternary gradient quantization | 32x compression near-baseline accuracy |
| DGC (Lin 2018) | Top-k sparsification with momentum correction | 270-600x compression |
| Stich et al. 2018 | Sparsified SGD with memory | Convergence proof for biased compressors |
| Karimireddy et al. 2019 | signSGD with error feedback | Provable EF convergence |
| NCCL (referenced; v < 2.7) | Allreduce only; no Alltoall, no send/recv | Motivated MPI-based custom collective |
| MVAPICH2-GDR | CUDA-aware MPI with GPUDirect | Underlies CUDA-aware compressed allreduce |
| DeepSpeed | Microsoft's distributed-training stack | 1-bit Adam shipped as DeepSpeed optimizer |
10. Limitations
- Two-phase design requires a warmup phase running uncompressed Adam; for short jobs the warmup can be a meaningful share of total steps (e.g. 400 of 1848 steps, about 22%, for SQuAD fine-tuning, and about 20% for DCGAN).
- The variance-stability assumption is empirical and observed for the studied models; the paper does not provide a sufficient condition for when variance will stabilize, only the auto-detect heuristic.
- Custom MPI allreduce path means deployment requires MPI + the 1-bit collective implementation; cannot piggyback on a stock NCCL allreduce.
- Evaluation is restricted to data-parallel training; pipeline / tensor / ZeRO-style sharded optimizers are not measured.
- Scaling tested up to 256 GPUs; behavior at thousand-GPU scale is not reported.
- Auto-detect threshold (0.96 stability ratio) is workload-validated for BERT but not stress-tested across optimizers (e.g. AdamW, LAMB).
11. Open Problems Implicit in the Paper
- A theoretical condition for variance stability. The paper offers only an empirical observation; identifying which model/optimizer/data combinations admit stable v would let practitioners predict whether 1-bit Adam will converge before running an expensive warmup.
- Compression for other adaptive optimizers. The same recipe (freeze the non-linear quantity once stable, compress the linear one) might extend to LAMB, Adafactor, or Lion — but the empirical stability check has to be redone per optimizer.
- Removing the warmup. Can the variance be initialized or warm-started from a prior run, eliminating the per-job warmup cost entirely?
- Scaling beyond 256 GPUs. The compressed-allreduce design uses MPI Alltoall + Allgather; whether this remains competitive with tree/ring algorithms on thousand-GPU clusters is open.
- Extension to model/pipeline parallelism. When a model is tensor-sharded or pipeline-parallel, the optimizer state itself is sharded; 1-bit Adam would need to integrate with ZeRO-style optimizer-state partitioning.
12. Cross-Cutting Empirical Take-Aways
| Take-away | Derived from |
|---|---|
| Allreduce dominates BERT-Large training: 94% of step on 64-GPU Ethernet, 75% on IB | Table 1 profiling |
| Variance term of Adam is approximately constant after a workload-dependent number of steps | Section 3.3, Figure 2 |
| Naive error-compensated 1-bit Adam diverges; freezing variance is the missing ingredient | Section 4.2, Figure 1 |
| Compressing momentum (not gradient) is the right quantity to compress in 1-bit Adam | Algorithm 1, line 7 |
| 1-bit Adam delivers 3.3x throughput / 3.4x training-time speedup at no accuracy cost | Section 8 results |
| Compression substitutes for hardware: 40 GbE + 1-bit ≈ 100 Gb IB + FP16 Adam | Section 8.4 |
| NCCL < 2.7 lacks Alltoall and send/recv, forcing MPI-based custom collective | Section 6.1 |
Note on NCCL Tuning
The paper documents a concrete NCCL constraint relevant to collective configuration: NCCL versions prior to 2.7 expose only sum/min/max allreduce on uncompressed tensors, with no Alltoall and no send/recv, which forced the authors to bypass NCCL entirely and build their compressed allreduce on MPI (Section 6.1). The Table 1 measurement that allreduce consumes 94% of BERT-Large iteration time on 64-GPU Ethernet versus 75% on InfiniBand is also a useful upper bound on what any collective tuner can recover on bandwidth-limited interconnects when the collective payload is large and frequent. Modern NCCL exposes the missing primitives, so the same compressed-allreduce recipe is now implementable inside a tuner-plugin path rather than as a parallel stack.