1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He | Microsoft / University of Rochester / ETH Zurich | ICML 2021 (PMLR Vol. 139)
Problem
Large-model pre-training (BERT, GPT) is dominated by the inter-GPU
allreduce of full-precision gradients. On a 64-GPU 40 GbE
cluster, profiling of BERT-Large shows allreduce consuming up to 94% of
each iteration; even on 64 GPUs with 100 Gb InfiniBand it consumes 75%.
Error-compensated 1-bit compression already exists for SGD-class linear
optimizers, but Adam — the de-facto optimizer for transformers, since
vanilla SGD does not converge on these models — resists naive
compression because its variance term
v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2 is non-linear
in the gradient. Squaring an error-compensated gradient
g_t + delta_{t-1} - delta_t leaves a residual
(delta_{t-1} - delta_t)^2 term that does not telescope, and the
coordinate-wise, time-varying learning rate
gamma / sqrt(v_t + eps) admits no clean global correction
factor. As a result, direct error-compensated 1-bit Adam diverges
on BERT (paper's Figure 1).
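The telescoping argument can be checked numerically. The sketch below is an illustration rather than code from the paper: it assumes a whole-vector sign compressor with an L1-preserving scale (the helper name `sign_compress` is hypothetical) and uses random Gaussian vectors as stand-ins for gradients. It accumulates both the compressed values and their squares, so the linear sum can be compared with the quadratic one.

```python
import numpy as np

def sign_compress(x):
    """Hypothetical 1-bit compressor: sign(x) times a scale that preserves the L1 norm."""
    scale = np.abs(x).sum() / x.size
    return scale * np.where(x >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
dim, steps = 8, 50
delta = np.zeros(dim)                      # error-feedback buffer, delta_0 = 0
sum_g, sum_c = np.zeros(dim), np.zeros(dim)
sum_g2, sum_c2 = np.zeros(dim), np.zeros(dim)

for _ in range(steps):
    g = rng.normal(size=dim)               # stand-in for a stochastic gradient
    c = sign_compress(g + delta)           # what would actually be communicated
    delta = g + delta - c                  # residual carried to the next step
    sum_g += g
    sum_c += c
    sum_g2 += g ** 2
    sum_c2 += c ** 2

# Linear accumulation telescopes: sum(c) == sum(g) - delta_T up to rounding.
linear_gap = np.abs(sum_c - (sum_g - delta)).max()
# Squared accumulation (what Adam's v_t would see) has no such identity.
square_gap = np.abs(sum_c2 - sum_g2).max()
print(f"linear gap: {linear_gap:.2e}, squared gap: {square_gap:.2e}")
```

The linear gap stays at floating-point rounding level because each residual is added back exactly once, while the squared sums drift apart, which is the obstacle the variance term poses.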
Core Insight
During BERT-Large pre-training the L1 norm of Adam's variance term
v_t becomes approximately stable after roughly 23k steps
(paper's Figure 2). Once v_t is frozen at v_{T_w}, Adam reduces to
preconditioned momentum SGD, which is linear in the gradient and
therefore amenable to error-compensated 1-bit compression of the
momentum term itself.
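A small numerical check makes the linearity claim concrete. This is a minimal sketch under stated assumptions: the function names are hypothetical, bias correction is omitted, eps is placed inside the square root as in the Problem section, and the Adam comparison starts from v = 0 to make the non-linearity easy to see.

```python
import numpy as np

def frozen_variance_update(g, m_prev, v_frozen, lr=1e-3, beta1=0.9, eps=1e-8):
    """Momentum step preconditioned by a fixed v; returns (momentum, parameter delta)."""
    m = beta1 * m_prev + (1.0 - beta1) * g
    return m, -lr * m / np.sqrt(v_frozen + eps)

def adam_update(g, m_prev, v_prev, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam step (no bias correction); v is updated from g**2."""
    m = beta1 * m_prev + (1.0 - beta1) * g
    v = beta2 * v_prev + (1.0 - beta2) * g ** 2
    return m, v, -lr * m / np.sqrt(v + eps)

rng = np.random.default_rng(0)
g1, g2 = rng.normal(size=4), rng.normal(size=4)
zeros = np.zeros(4)
v_frozen = rng.uniform(0.5, 1.5, size=4)

# Frozen variance: the update for g1 + g2 is the sum of the individual updates.
_, d1 = frozen_variance_update(g1, zeros, v_frozen)
_, d2 = frozen_variance_update(g2, zeros, v_frozen)
_, d12 = frozen_variance_update(g1 + g2, zeros, v_frozen)
frozen_is_additive = bool(np.allclose(d12, d1 + d2))

# Live variance: the same test fails because v depends on g**2.
_, _, a1 = adam_update(g1, zeros, zeros)
_, _, a2 = adam_update(g2, zeros, zeros)
_, _, a12 = adam_update(g1 + g2, zeros, zeros)
adam_is_additive = bool(np.allclose(a12, a1 + a2))

print(frozen_is_additive, adam_is_additive)
```

Additivity is what lets an error-feedback residual be added back on a later step without distorting the update.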
Method
The 1-bit Adam algorithm (Algorithm 1 in the paper) has two phases:
Phase 1, warmup (steps 0..T_w-1): run vanilla Adam unchanged so the variance can stabilize, then store v_{T_w}.
Phase 2, compression (steps T_w..T):
  On worker i:
    g_t^(i) = grad F_i(x_t^(i), zeta_t^(i))
    m_t^(i) = beta_1 * m_{t-1} + (1 - beta_1) * g_t^(i)
    hat_m_t^(i) = C_omega[m_t^(i) + delta_{t-1}^(i)]   # 1-bit compress
    delta_t^(i) = m_t^(i) + delta_{t-1}^(i) - hat_m_t^(i)
    send hat_m_t^(i) to server
  On server:
    bar_m_t = C_omega[(1/n) * sum_i hat_m_t^(i) + delta_{t-1}]
    delta_t = (1/n) * sum_i hat_m_t^(i) + delta_{t-1} - bar_m_t
    broadcast bar_m_t
  On worker i:
    x_{t+1} = x_t - gamma * bar_m_t / sqrt(v_{T_w})
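The compression phase above can be sketched as a single-process NumPy simulation. This is an illustrative sketch, not the DeepSpeed implementation, and it rests on several assumptions: all n workers live in one array; `compress` is a hypothetical whole-vector sign compressor; m_{t-1} is taken to be the bar_m broadcast on the previous step, which the listing does not state explicitly; and a small eps is added under the square root for numerical safety.

```python
import numpy as np

def compress(x):
    """Hypothetical C_omega: sign(x) scaled to preserve the L1 norm."""
    scale = np.abs(x).sum() / x.size
    return scale * np.where(x >= 0, 1.0, -1.0)

def compression_phase_step(x, grads, m_prev, worker_err, server_err, v_frozen,
                           lr=1e-3, beta1=0.9, eps=1e-8):
    """One Phase 2 step. grads and worker_err are (n, d); the rest are (d,)."""
    # Workers: local momentum, error-compensated 1-bit compression.
    local_m = beta1 * m_prev + (1.0 - beta1) * grads
    corrected = local_m + worker_err
    hat_m = np.stack([compress(row) for row in corrected])
    worker_err = corrected - hat_m
    # Server: average, add its own residual, compress again.
    averaged = hat_m.mean(axis=0) + server_err
    bar_m = compress(averaged)
    server_err = averaged - bar_m
    # Workers: preconditioned update with the frozen variance.
    x = x - lr * bar_m / np.sqrt(v_frozen + eps)
    return x, bar_m, worker_err, server_err

# Toy usage: each worker i minimizes 0.5 * ||x - target_i||^2.
rng = np.random.default_rng(0)
n, d = 4, 6
x = rng.normal(size=d)
targets = rng.normal(size=(n, d))
m = np.zeros(d)
worker_err = np.zeros((n, d))
server_err = np.zeros(d)
v_frozen = np.ones(d)                      # stand-in for v_{T_w}
for _ in range(100):
    grads = x - targets                    # per-worker gradients, shape (n, d)
    x, m, worker_err, server_err = compression_phase_step(
        x, grads, m, worker_err, server_err, v_frozen, lr=0.05)
print(np.linalg.norm(x - targets.mean(axis=0)))
```

Both residual buffers persist across steps, so whatever one round of compression discards is re-injected on the next.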
Key design choices:
- Compress momentum, not raw gradient — momentum has lower variance and quantizes more cleanly.
- Sign-based 1-bit quantization with per-block scale factor ||m + delta||_1 / ||sign(m + delta)||_1 to preserve L1 magnitude.
- Auto-detect for T_w via stability ratio ||v_t||_1 / ||v_{t-Delta}||_1 >= 0.96.
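The quantizer and the auto-detect rule from the list above can be sketched as follows. The function names, the block size, and the requirement that the tensor length be a multiple of the block size are assumptions made for illustration; the ratio and the 0.96 threshold are taken directly from the text.

```python
import numpy as np

def one_bit_quantize(x, block_size=4):
    """Split x into blocks; return per-element signs and one scale per block.

    scale = ||block||_1 / ||sign(block)||_1, so each block keeps its L1 norm.
    Assumes len(x) is a multiple of block_size.
    """
    blocks = x.reshape(-1, block_size)
    signs = np.sign(blocks)
    nonzero = np.abs(signs).sum(axis=1)                 # ||sign(block)||_1
    scales = np.abs(blocks).sum(axis=1) / np.maximum(nonzero, 1)
    return signs, scales

def dequantize(signs, scales):
    """Rebuild a dense vector from signs and per-block scales."""
    return (signs * scales[:, None]).reshape(-1)

def variance_is_stable(v_now, v_past, threshold=0.96):
    """Auto-detect rule: ||v_t||_1 / ||v_{t-Delta}||_1 >= threshold."""
    return bool(np.abs(v_now).sum() / np.abs(v_past).sum() >= threshold)

x = np.array([1.0, -2.0, 3.0, -4.0, 0.5, 0.5, -0.5, 0.5])
signs, scales = one_bit_quantize(x)
x_hat = dequantize(signs, scales)
print(scales, np.abs(x_hat).sum(), np.abs(x).sum())
```

Only the signs and one scale per block need to be transmitted, which is where the roughly 1 bit per element comes from.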
The system implements compressed allreduce as a three-phase MPI collective (MPI_Alltoall → local average + re-quantize → MPI_Allgather) because NCCL < 2.7 exposed neither Alltoall nor send/recv, and its allreduce only supported sum/min/max on uncompressed tensors. CUDA-aware (MVAPICH2-GDR) and basic (CPU-staged) variants ship as a DeepSpeed optimizer plugin.
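The data movement of the three-phase collective can be mimicked in a single process without any MPI calls. The sketch below is a simulation under assumptions, not the shipped plugin: array reshapes stand in for MPI_Alltoall and MPI_Allgather, each worker owns one equal-sized chunk of the tensor, and the hypothetical `compress` helper uses one scale per worker tensor in the first phase and one scale per chunk in the re-quantization phase.

```python
import numpy as np

def compress(x):
    """Hypothetical 1-bit compressor: sign(x) scaled to preserve the L1 norm."""
    scale = np.abs(x).sum() / x.size
    return scale * np.where(x >= 0, 1.0, -1.0)

def simulated_compressed_allreduce(buffers, worker_err, server_err):
    """buffers, worker_err: (n, d); server_err: (n, d // n). Assumes n divides d."""
    n, d = buffers.shape
    chunk = d // n
    # Local step: error-compensated compression on every worker.
    corrected = buffers + worker_err
    sent = np.stack([compress(row) for row in corrected])
    new_worker_err = corrected - sent
    # Phase 1 (all-to-all): worker j receives chunk j from every worker i.
    received = sent.reshape(n, n, chunk).transpose(1, 0, 2)   # [j, i, :]
    # Phase 2: each worker averages its chunk, adds its residual, re-quantizes.
    averaged = received.mean(axis=1) + server_err             # (n, chunk)
    requantized = np.stack([compress(row) for row in averaged])
    new_server_err = averaged - requantized
    # Phase 3 (allgather): every worker ends up with all re-quantized chunks.
    result = requantized.reshape(-1)
    return result, new_worker_err, new_server_err

rng = np.random.default_rng(0)
n, d = 4, 8
buffers = rng.normal(size=(n, d))
result, w_err, s_err = simulated_compressed_allreduce(
    buffers, np.zeros((n, d)), np.zeros((n, d // n)))
print(result)
```

In this layout every worker acts as the "server" for its own chunk, so the averaging work is spread evenly across workers.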
Convergence is proven to match distributed SGD's
O(1/sqrt(nT)) linear-speedup rate under standard
assumptions (Lipschitz gradient, bounded variance, bounded compression
error).
Experimental Setup
| Component | Value |
|---|---|
| Ethernet cluster | 4 V100 GPUs/node, 40 GbE TCP (~4.1 Gbps effective) |
| InfiniBand cluster | 8 V100 GPUs/node, 100 Gbps EDR IB |
| Max GPUs tested | 256 |
| Models | BERT-Base (110M), BERT-Large (340M), ResNet-18 (CIFAR-10), ResNet-152 (ImageNet), DCGAN |
| Fine-tuning | SQuAD 1.1, GLUE |
| Frameworks | PyTorch + DeepSpeed |
| MPI | MVAPICH2-GDR (CUDA-aware) / generic MPI (basic) |
| BERT pre-training peak LR | 4e-4 |
| BERT LR schedule | linear warmup over 12.5k steps, then 0.99 decay every 520 steps |
| BERT total batch size | 4096 |
| Adam beta_1, beta_2 | 0.9, 0.999 |
| Warmup T_w (BERT-Large seq128) | 23k of 152k steps |
| Warmup T_w (BERT-Large seq512) | 1.5k of 10k steps |
| SQuAD batch / LR / warmup | 96 / 3e-5 / 400 of 1848 steps |
| ResNet-18 batch / LR / warmup | 1024 / 1e-4 / 13 of 200 epochs |
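The BERT learning-rate schedule in the table can be written as a small function. This is one reading of the table row, with assumptions flagged: the function name is hypothetical, warmup is taken to start from zero, the decay is applied as a staircase, and decay intervals are counted from the end of warmup.

```python
def bert_lr(step, peak_lr=4e-4, warmup_steps=12_500, decay=0.99, decay_every=520):
    """Linear warmup to peak_lr, then multiply by `decay` every `decay_every` steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Assumption: staircase decay, counted from the end of warmup.
    intervals = (step - warmup_steps) // decay_every
    return peak_lr * decay ** intervals

for s in (0, 6_250, 12_500, 13_020, 23_000):
    print(s, bert_lr(s))
```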
Headline Quantitative Results
Communication-share profiling (Table 1, BERT-Large seq128):
- Ethernet 64 GPUs, BS=64: allreduce = 94% of step (2205.86 ms allreduce vs. 36.65 ms forward).
- InfiniBand 64 GPUs, BS=64: allreduce = 75% of step.
- InfiniBand 8 GPUs (single node): allreduce = 16% — local NVLink/PCIe is bandwidth-rich.
Throughput / training-time:
- BERT-Large seq128, 64-GPU Ethernet: up to 3.3× higher throughput.
- BERT-Large total wall-clock: 174.3 h baseline → 51.5 h 1-bit Adam = 3.4× end-to-end.
- SQuAD fine-tuning: up to 2.9× higher throughput.
- Compression-stage-only speedup: 5.48× (BERT-Large) and 6.17× (SQuAD).
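A rough Amdahl-style estimate shows how the compression-stage speedup and the warmup fraction combine into the end-to-end figure. This is a back-of-envelope sketch, not a calculation from the paper: it assumes every step costs the same at baseline, uses only the seq128 warmup fraction (23k of 152k steps), and applies the 5.48× figure uniformly to all post-warmup steps.

```python
def end_to_end_speedup(warmup_fraction, compression_stage_speedup):
    """Amdahl-style estimate: warmup runs at baseline speed, the rest is accelerated."""
    accelerated = (1.0 - warmup_fraction) / compression_stage_speedup
    return 1.0 / (warmup_fraction + accelerated)

measured = 174.3 / 51.5                      # reported wall-clock ratio
estimate = end_to_end_speedup(23_000 / 152_000, 5.48)
print(f"measured: {measured:.2f}x, estimate: {estimate:.2f}x")
```

The estimate lands in the low-3× range, broadly in line with the reported end-to-end number. The uncompressed warmup, not the compressed phase, is what limits the overall gain.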
Communication-volume reduction:
- FP32 baseline → 1-bit: 97% reduction (32× compression).
- FP16 baseline → 1-bit: 94% reduction (16× compression).
- End-to-end (warmup included) ≈ 5× reduction.
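The percentages above follow from simple bit counting, and the end-to-end figure can be approximated from the warmup fractions in the setup table. The sketch below is an approximation under assumptions: it ignores the per-block scale factors, takes FP16 as the baseline for the end-to-end ratio, and treats every step as moving the same number of elements.

```python
def volume_reduction(bits_before, bits_after=1):
    """Fraction of communication volume removed by going to `bits_after` bits."""
    return 1.0 - bits_after / bits_before

def end_to_end_compression(warmup_fraction, baseline_bits=16, compressed_bits=1):
    """Overall volume ratio when warmup steps stay at baseline precision."""
    remaining = warmup_fraction + (1.0 - warmup_fraction) * compressed_bits / baseline_bits
    return 1.0 / remaining

print(f"FP32 -> 1-bit: {volume_reduction(32):.1%}")
print(f"FP16 -> 1-bit: {volume_reduction(16):.1%}")
print(f"seq128 end-to-end: {end_to_end_compression(23_000 / 152_000):.2f}x")
print(f"seq512 end-to-end: {end_to_end_compression(1_500 / 10_000):.2f}x")
```

Both sequence-length phases come out just under 5×, consistent with the rounded end-to-end figure.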
Cross-network parity: 1-bit Adam on 40 GbE matches vanilla FP16 Adam on ~100 Gb InfiniBand in throughput.
Convergence parity (Table 2): total step counts identical to baseline: BERT-Base 118K seq128 + 22K seq512; BERT-Large 152K seq128 + 10K seq512.
GLUE (Table 3, median over 10 seeds): within-noise on every task. E.g., BERT-Large MNLI-(m/mm): baseline uncompressed 86.1/86.2 vs. 1-bit Adam 85.7/85.4. SQuAD F1: baseline 93.33 vs. 1-bit 93.32.
Naive 1-bit Adam baseline: diverges on BERT (Figure 1) — the freezing of variance is the missing ingredient.
Limitations
- Warmup phase runs uncompressed Adam, so for short jobs the warmup fraction can be ~20% (e.g. 400 of 1848 steps for SQuAD fine-tuning; DCGAN is similar).
- Variance stability is empirically observed for the studied models; no sufficient theoretical condition is given, only the 0.96 auto-detect heuristic.
- Custom MPI compressed-allreduce path required because NCCL < 2.7 lacked Alltoall and send/recv; cannot run on stock NCCL allreduce alone.
- Evaluation is data-parallel only; ZeRO / pipeline / tensor parallelism not measured.
- Scaling capped at 256 GPUs; thousand-GPU regime is open.
- Auto-detect threshold validated for BERT only; not stress-tested across AdamW, LAMB, etc.
Open Problems
- A theoretical sufficient condition for variance stability — would let practitioners predict 1-bit Adam viability without running warmup.
- Generalizing the recipe (freeze the non-linear state once stable, compress the linear state) to LAMB, Adafactor, Lion.
- Eliminating the warmup phase by warm-starting v_{T_w} from a prior run.
- Compressed-allreduce scaling beyond 256 GPUs: whether MPI Alltoall + Allgather remains competitive with ring/tree at thousand-GPU scale.
- Integration with model-, pipeline-, and ZeRO-style sharded optimizer states where Adam's state itself is partitioned.
Note on NCCL Tuning
The paper documents that NCCL versions prior to 2.7 supported only sum/min/max allreduce on uncompressed tensors and exposed neither Alltoall nor send/recv (Section 6.1), forcing the authors to build their compressed allreduce on MPI rather than NCCL. Any NCCL-tuner work targeting compressed-collective workloads should be aware of this capability gap. The Table 1 finding that allreduce consumes 94% of BERT-Large iteration time on 64-GPU Ethernet versus 75% on InfiniBand also bounds what algorithm/protocol selection alone can achieve when payloads are large and frequent on bandwidth-constrained interconnects.